planetf1 closed 4 years ago
The issue is simply that the assemblies are not marked as dependent on some modules, including those containing the samples, CIM import etc. This means the assembly gets built BEFORE these extra samples/utils, so they are not included ON THE FIRST RUN. On subsequent runs, i.e. in a developer's environment, they will be picked up on a second run, albeit as an old version.
Of course the official build is clean on every run, so the extras will always be missing.
The fix should be to add the correct dependencies.
This brings up an interesting issue -- how exactly to express the right dependencies to ensure everything needed is built ready for assembly, and that other modules are then built using this assembly.
Most of our code is in open-metadata-implementation. However we also have some components under open-metadata-resources, such as open-metadata-archives (utilities for the CIM information model etc.), lab notebooks in open-metadata-labs & a series of samples in open-metadata-samples, including for example the coco pharma security provider.
With the main assembly built, our docker image inserts this into the image. The docker images themselves are then used by our helm charts and docker-compose .. yet these modules themselves are under open-metadata-resources/open-metadata-deployment
Currently the assembly is not really dependent on anything, but happens to work. Simply adding dependencies on the obvious 'parent' pom is actually not sufficient - for example see https://stackoverflow.com/questions/11131575/build-order-of-maven-multimodule-project. The files we include are referenced by relative path to the assemblies directory (defeating the dependency mechanism).
One approach is to ensure that WHENEVER we include a file in the assembly, we must include the exact artifact that it originates in -- not a parent, but the actual artifact as a dependency. It appears though that the assembly descriptor already effectively supports this - see https://maven.apache.org/plugins/maven-assembly-plugin/examples/multimodule/module-binary-inclusion-simple.html - we should be replacing file paths with module dependencies.
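As a rough sketch of this idea, the fragment below declares the originating artifact directly in the assembly module's pom and then selects it by coordinates in the assembly descriptor rather than by relative path. It uses a `dependencySet` (a close relative of the module-based inclusion in the linked example). The coordinates `org.example:example-sample` and the `samples` output directory are hypothetical placeholders, not the real Egeria names.

```xml
<!-- pom.xml of the assembly module (fragment). Coordinates are hypothetical. -->
<!-- Declaring the exact artifact lets the reactor order it before the assembly. -->
<dependencies>
  <dependency>
    <groupId>org.example</groupId>
    <artifactId>example-sample</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>

<!-- Assembly descriptor (fragment): select the artifact by coordinates, not by file path. -->
<dependencySets>
  <dependencySet>
    <!-- Hypothetical target directory inside the distribution. -->
    <outputDirectory>samples</outputDirectory>
    <!-- Do not add the assembly module's own artifact. -->
    <useProjectArtifact>false</useProjectArtifact>
    <!-- Only the named artifact, not everything it depends on. -->
    <useTransitiveDependencies>false</useTransitiveDependencies>
    <includes>
      <include>org.example:example-sample</include>
    </includes>
  </dependencySet>
</dependencySets>
```

Because the dependency is on the concrete module rather than its parent pom, Maven should be able to work out that the sample must be built first.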
A slightly cleaner way may be to assume most things we build could contribute, and restructure so that a parent project does the assembly (making use of useAllReactorProjects as per https://maven.apache.org/plugins/maven-assembly-plugin/examples/multimodule/module-binary-inclusion-simple.html), with all our implementation (+ samples etc.) as submodules. Docker images/charts would be out of that tree, but this is a significant rejigging of the directory structure. Even this has some challenges as per https://maven.apache.org/plugins/maven-assembly-plugin/faq.html#module-binaries
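For illustration only, a module-based descriptor along the lines of the linked example might look like the fragment below. The `moduleSet`, `useAllReactorProjects` and `binaries` elements are part of the assembly descriptor format; the coordinates and the `samples` directory are hypothetical placeholders.

```xml
<!-- Assembly descriptor (fragment) for a project that performs the assembly. -->
<moduleSets>
  <moduleSet>
    <!-- Consider every project in the current reactor build, not only child modules. -->
    <useAllReactorProjects>true</useAllReactorProjects>
    <includes>
      <!-- Hypothetical coordinates: list each module that contributes binaries. -->
      <include>org.example:example-sample</include>
    </includes>
    <binaries>
      <!-- Hypothetical target directory inside the distribution. -->
      <outputDirectory>samples</outputDirectory>
      <!-- Copy the module's jar as-is rather than unpacking it. -->
      <unpack>false</unpack>
    </binaries>
  </moduleSet>
</moduleSets>
```

The listed modules still have to be built before the assembly runs, which is the ordering problem the FAQ entry above describes.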
I will look at refactoring the assembly to
I hope this will be the 'correct' way to let maven properly resolve dependencies and include just what is needed (without getting into circles unnecessarily). An additional thing to check is that our FVTs (which depend on open-metadata-assemblies) continue to be launched at the correct time, since they are dependent on the assembly, just as our docker image is.
Unfortunately our structure is now complex enough that the initial incorrect usage is now breaking.
Note - have split out corrections/improvements to the assembly we use for jupyter (notebooks) and the node based ui/presentation server to separate issues as there are other factors to look at.
The changes for this issue will focus on the core egeria distribution.
An update on where this work is at so far
However there are a few non trivial issues
How do we correct this? I think we need to look again at project dependencies. A few ways:
At this point the server seems to be working correctly.
In essence I think this is doing things closer to 'the right way' and is actually exposing hidden issues we currently have
At this point I'm looking for feedback on the general direction. We need to consider whether these changes make sense as a direction. Then, if so, whether we can integrate now & work on mitigations later or if we need to hold back until more of these issues are addressed
See https://github.com/odpi/egeria/pull/3316 for the current PR that implements these changes (just finalizing a few build errors)
On investigation it appears we are getting every dependency pulled in regardless of what modules are put in any directory of the distribution. Following up with maven team -> https://lists.apache.org/thread.html/r131e0699cb3d60763d0904425491924097168bb2de68da24bb5ddcb5%40%3Cusers.maven.apache.org%3E
Opened up another issue with maven: https://issues.apache.org/jira/browse/MASSEMBLY-940 - the dependencies being loaded are wrong. They should be based off a moduleSet (as best I can determine), NOT the project.
Looking for alternatives.
I've looked again at how we might improve the maven based assembly, particularly with regard to any shorter-term solutions.
Without getting deep into maven and fixing/forking the assembly plugin or building a custom plugin, the assembly build is limited.
We can only pull in all assemblies from the POM level (of the assembly) or none. We cannot do moduleSet-specific including (i.e. dependencies of connectors in one place, clients in another, utilities in a third etc.)
The maven download, dependency, build-helper plugins have some overlapping capabilities, but only work on a single module that is being built, there's no real way of doing aggregate changes/merges across a set of modules.
The 'provided' scope can help when building a single module as we can say 'that's ok - it's already in the platform'. However, since our modules can get used in different ways and pulled in transitively, it is tricky to specify this correctly without extensive testing/validation of different components, at least in terms of assembly structure & dependencies.
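For reference, a 'provided' dependency looks like the fragment below. The coordinates are hypothetical and stand in for whatever the platform already supplies at runtime.

```xml
<!-- pom.xml fragment of a pluggable module. Coordinates are hypothetical. -->
<dependency>
  <groupId>org.example</groupId>
  <artifactId>example-platform-services</artifactId>
  <version>${project.version}</version>
  <!-- Available on the compile and test classpaths, but expected to be supplied
       by the runtime (here, the platform); it is not passed on transitively. -->
  <scope>provided</scope>
</dependency>
```

The difficulty described above is that the same module may also be used somewhere the platform is not present, in which case this declaration would leave the dependency missing.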
Ideally for example under server/lib we might want
Clients on the other hand need to be standalone so we probably want
We might conceive of other structures or common dirs, & building aggregate jars, but this flexibility doesn't seem possible.
There seem to be few options that can be done simply as a near-term measure:
a) Always create & use 'uber jars' for anything that is pluggable (such as connectors), used in an unknown way (clients), or standalone (utilities etc.). This probably has the least impact. We already build these jars; we just need to include them in artifacts, and then the assembly. Very little work, and reliable, at the expense of size/time.
b) Have a general 'lib' directory into which we add ALL dependencies, but this will be > 400 including duplication of server components. Again only a minor change to the build/server to ensure the correct LOADER_PATH is specified, but it would mean client/utility usage having to specify the classpath correctly or still using some uber jars. Overall not as simple. There is also a notable additional impact on build time (5-10 minutes) in evaluating all the dependencies together due to their complexity.
I plan to close this out pursuing option a) if no comments follow so we can fix the initial issue
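A minimal sketch of what option a) could look like in the assembly module follows. It assumes the uber jar is attached under the classifier `jar-with-dependencies` (the classifier produced by the assembly plugin's predefined descriptor of that name); the thread does not say which classifier the project actually uses, and the coordinates are hypothetical.

```xml
<!-- pom.xml of the assembly module (fragment). Coordinates and classifier are assumptions. -->
<dependency>
  <groupId>org.example</groupId>
  <artifactId>example-connector</artifactId>
  <version>${project.version}</version>
  <!-- Depend on the self-contained uber jar rather than the thin jar. -->
  <classifier>jar-with-dependencies</classifier>
</dependency>

<!-- Assembly descriptor (fragment): copy only the uber jar into the distribution. -->
<dependencySet>
  <outputDirectory>server/lib</outputDirectory>
  <!-- The uber jar already contains its dependencies, so skip the transitive ones. -->
  <useTransitiveDependencies>false</useTransitiveDependencies>
  <includes>
    <!-- Pattern form: groupId:artifactId:type:classifier -->
    <include>org.example:example-connector:jar:jar-with-dependencies</include>
  </includes>
</dependencySet>
```

This sidesteps the per-directory dependency resolution problem at the cost of larger artifacts, as noted in option a).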
A possible medium-term measure is to revisit packaging & dependency management across the product so that we focus more on standard 'packages' (we have some already, but don't use them, and they may be incomplete), and have a very clear perspective on which modules make up different 'platforms'. We then shift our dependencies away from fine-grained to this coarser-grained level. We can then more easily make use of 'provided' scope for either common bundles or the server platform, and reduce the list of extra dependencies or the increase in size of uber jars. We might wish to do this alongside reviewing our overall build/packaging approach, as it's likely to need more investment in experimentation and flexibility in our build (see also odpi/egeria#3371). (Note also that any changes here currently increase build time noticeably.)
Or we may break the direct 1:1 mapping between the jars we build and our maven artifacts outside maven itself so that the 320+ do not get published - we won't have them. Instead we keep the 320+ source jars and assemble them on-demand as part of a build process into the more select 'package' type bundles that make sense. This also could address other process challenges but would depend on a radical change like odpi/egeria#3371
So really, in the medium term, more review of this area is needed, along with elaboration of the use cases and a broader look across all artifacts (i.e. including docs) ie
I think that is best served by a fresh issue & analysis, being less concerned with any current technical constraints of the build process & more concerned with the end users. We can fix the near term issue here.
As per odpi/egeria#3458 we will
This has now been implemented. Closing.
The docker image should contain the full distribution archive, yet:
docker inspect of the image shows a few entries like:
These show the images are current
The impact of this will be that samples and utilities are missing, which affects the ability to extend the current tutorials, including fixing labs like odpi/egeria-jupyter-notebooks#34. However, it is not affecting the remainder of the labs at this point.