Closed cmgrote closed 3 years ago
I thought it worth setting the scene of where we are at with our repos.
It's perhaps worth considering what structure we want moving forward - of course this may change, but deciding at the top level of repos (vs smaller details of builds, artifacts, modules) is significant
My initial reaction is that, on the very specific question of whether we should have a repo that aids end users with code samples, I would absolutely agree, as it's a unit they can fork, clone, modify and build.
Taking just one example, like sample metadata or helm charts there are many levels
However the list above goes a lot beyond this, so I think we need to take a step back:
Current Repos
We have a number of projects for, or used by, egeria
Some have a different scope within ODPi but have some relevance to us
Whilst others relate to other activities within the ODPi - either past or present:
For the purpose of this discussion I suggest focussing on the first set only.
It's also worth noting there's only a partial naming convention - and moving forward we should have such a convention
It's worth noting we have numerous other outlets (and arguably repositories) for our work, including
But for this discussion I will focus on the core artifacts under source code management
Users
Our artifacts that we are trying to manage are used by different communities. Also the level of 'read' vs 'write' will differ - for example
For a particular role, having to deal with too many repositories (especially if outside the core team) isn't desirable - there needs to be very distinct separations
Types of artifact
We have all kinds of artifacts in our repositories/systems. For example (not a complete list!)
Additionally some artifacts relate to open source components, whilst some are proprietary
Other considerations
Github Repo
Within a single github (and hence git) repo:
Whilst some of these can span repos, this becomes harder to coordinate. Simple cases might include having a separate repo for build or test.
Git does support the concept of a sub-project where another project can be pulled in/updated on demand (a little like updating a dependency version) - but this adds complexity
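To make the sub-project idea concrete: in git this is the submodule mechanism. The sketch below only assembles the two commands involved, one to register another repository as a sub-project and one to move it to the latest upstream commit. The helper names and the `samples` path are hypothetical, and nothing here is something the project has adopted.

```python
import subprocess


def submodule_add_cmd(url: str, path: str) -> list[str]:
    """Command that registers the repository at `url` as a submodule at `path`."""
    return ["git", "submodule", "add", url, path]


def submodule_update_cmd(path: str) -> list[str]:
    """Command that moves the submodule at `path` to the latest commit of its
    tracked remote branch (roughly 'bump the dependency version')."""
    return ["git", "submodule", "update", "--remote", path]


def run(cmd: list[str], repo_dir: str) -> None:
    """Run a git command inside `repo_dir`, raising if git reports an error."""
    subprocess.run(cmd, cwd=repo_dir, check=True)


# Hypothetical usage inside a local clone of the parent repository:
#   run(submodule_add_cmd("https://github.com/odpi/egeria-samples", "samples"), ".")
#   run(submodule_update_cmd("samples"), ".")
```

Part of the added complexity is that the parent repo only records a pointer to a specific commit of the sub-project, so each update has to be committed in the parent as well, and anyone cloning has to remember to initialise the submodules.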
Problems noted recently
Some of the third party code we pull in is somewhat dirty in terms of security scans/quality & results may contaminate egeria's position, especially when those areas are used only by a small minority of users. Ranger/Hadoop for example, also potentially Gaian
Code samples require pulling the main repo. The samples are written to build as part of the overall project. This is different to what someone consuming egeria probably wants, which is an example of how they would build their own project that consumes egeria. It's also more complex to explain
Trying to build a map of these requirements
It's probably time to consider the mantra 'less is more'. But does that mean less in a repository (clear, targeted) or fewer repositories?
........ NEXT POST to follow
I should add that any conclusion we reach might be good to document (but where... !!)
Right here on this issue, for starters 😄
Excuse the long post - a bit of a brain dump.
First point of note - anything in these repos needs to be suitably general and not super specific to one particular environment. If we have artifacts of that nature they belong with the org that created them, OR we have a 'contrib' git repo which has minimal rules/overhead on contributions
My view here is that fundamentally, keeping our assets together makes them easier to manage, support, release. As such the default is that we should continue using a single main repository ie egeria
Third party code
Some assets are quite different from core egeria - as per some of the factors discussed above
Candidates for this include
igc & datastage connectors - IGC is a proprietary component, so using the connector requires that component (not our connector) to be licensed. Specific ansible or other scripts that combine igc/datastage/egeria belong here
hadoop integrations - Whilst both Atlas and Ranger are independent Apache projects, they are most commonly used as part of a broader hadoop environment, alongside hive & others. This includes the Atlas connector, as well as the ranger governance engine plugin. It would also include the docker builds for atlas and ranger
cassandra - connectors for a) omas integration to capture metadata about cassandra resources and b) using cassandra as a backing store for JanusGraph, either for egeria's repository or open lineage services
Would need a build and test pipeline. Could be quite varied by component, but it is probably unfeasible to have a repo for each distinct connector
It may be that we keep anything small/manageable in the base, but split out those of a more complex nature. So we could stick with the existing repo for igc/datastage, add a new one for hadoop (as it has many moving parts), but keep cassandra in the core (less invasive, open source)
Sample projects
No need to publish any artifacts but instead assume a developer will just clone.
This looks like an ideal candidate for a new repo - though it could continue to be managed in core egeria if we were to create maven archetypes/write standalone poms. On balance though I'd split it off.
egeria
egeria-deployment
We have various artifacts that are closely related to core egeria, but don't technically form part of it
But this is closely related to core egeria, there is also some interdependency when we do testing. So I am not certain we should do this split - and if not the above belong in egeria itself
I believe strongly that if we do this (and even if we don't) we do not have any dependencies on other projects - so we do not include tests around hadoop, igc, or indeed the helm charts around vdc.
I'm 50/50 on this one
egeria-vdc
We have a few pieces left over
gaian
docker images for atlas, Ranger
helm chart for vdc (and related artifacts) - since this has ranger, and ibm content-
Do we roll these into third party?
Do we create a new project for scenarios?
They are an uncomfortable fit in base egeria already - for example the vdc chart depends on ranger (which we're moving out) and the ibm connectors (which have already moved). That being said, it's not a build time issue... yet.
Or are they so special that we put them in their own repo - which can then depend on others (logically)?
To return back to Chris's proposal:
Existing sample data and metadata (eg. under open-metadata-resources/open-metadata-deployment/sample-data)?
Existing Ansible playbook-based loading / clearing (also under open-metadata-resources/open-metadata-deployment/sample-data/...)?
Existing Helm chart-based deployment (under open-metadata-resources/open-metadata-deployment/charts)?
Existing Compose-based deployment (under open-metadata-resources/open-metadata-deployment/compose)?
Existing Docker image builds (under open-metadata-resources/open-metadata-deployment/docker)?
Existing tutorials (under open-metadata-resources/open-metadata-tutorials)?
Sounds reasonable, just a few grey areas that come to mind that leave me undecided:
My gut feeling is that we may need to create a more well-defined plan for what these various samples are, where they overlap, and therefore where this commonality is kept vs. the sample-by-sample variation -- I think conceptually the same as you're suggesting on the third party vs not split above -- but I'm less certain this will fall strictly along third party vs not lines (as the VDC example sort of exposes)...
In favour of splitting some of these areas out from core (or indeed any currently-existing Java-based repo) entirely, another point to consider is the build process itself: currently any commits against these areas (sample files, scripts (whether for deployment or tutorials: Ansible, Docker, Helm, Python), etc) all require a full build of the core itself to be completed before they can be considered and merged -- yet none of them directly influence the Java code being built by the PR / merge process. As the community broadens, and our Java code-base continues to expand (and thus the builds go from 15-20 minutes to longer), could we be unnecessarily delaying contribution and evolution of these tutorials / deployment / samples areas by forcing them to exist in a Java code tree?
Could we also be missing out on more optimal approaches to the linting, static code analysis, vulnerability detection, building, automated testing, etc of these areas (eg. given current complete reliance on Maven for build and Sonar's Java profile for scanning)?
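One way to picture the build-time point above is a path filter that decides whether a change needs the full Java build at all. This is only a sketch under stated assumptions: the function names are hypothetical, the two directory prefixes are the ones named in this issue, and it assumes nothing under them feeds the Java build, which is not entirely true today (the Jupyter docker image build discussed later in this thread assembles notebooks during the merge build).

```python
from pathlib import PurePosixPath

# Directories named in this thread as holding samples, deployment scripts and
# tutorials rather than Java code. Treating them as build-independent is an
# assumption for illustration, not a statement about the current CI setup.
NON_JAVA_PREFIXES = (
    "open-metadata-resources/open-metadata-deployment",
    "open-metadata-resources/open-metadata-tutorials",
)


def is_under(path: str, prefix: str) -> bool:
    """Return True if `path` is the directory `prefix` or lies inside it."""
    candidate = PurePosixPath(path)
    root = PurePosixPath(prefix)
    return candidate == root or root in candidate.parents


def needs_java_build(changed_paths: list[str]) -> bool:
    """Return True if at least one changed path falls outside the non-Java trees."""
    return any(
        not any(is_under(path, prefix) for prefix in NON_JAVA_PREFIXES)
        for path in changed_paths
    )
```

A PR touching only a Helm chart or a tutorial would return `False` here, while any change to a Java module or the top-level pom would return `True`. Splitting these trees into their own repo gets the same effect without maintaining such a filter.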
Users
Our artifacts that we are trying to manage are used by different communities. Also the level of 'read' vs 'write' will differ - for example
- Core development team
- Additional developers wishing to contribute occasionally, or aspire to do more a) New b) experienced
- Those who just wish to use and/or understand our code from an external perspective
- Those who want to read about what we're doing & follow the project
For a particular role, having to deal with too many repositories (especially if outside the core team) isn't desirable - there needs to be very distinct separations
Hello everybody, I don't know whether this comment could be meaningful about the quoted problem or not but I hope so.
I have been following Egeria for a couple of months and I am a huge fan of it: it addresses many of the problems I have to deal with almost every day.
I'd like to be able to keep up with the developments and to contribute somehow, but it's a VERY broad project with many complex architectural components. Basically I feel a little bit afraid of stepping in because of the complexity of the entire project (and also because I could currently contribute to it only in my spare time).
What I'd like from Egeria is help for people like me to understand how to get comfortable with all of its modules in an incremental way... although this is a very complex and time-consuming task as well.
@fpompermaier thanks for the post. Initially this is probably more general than repos (though ensuring the structure makes following easier, not harder, is relevant). Can we continue this via slack (slack.odpi.org, try channel #egeria-discussions)? If that doesn't work, perhaps a separate issue - just to help keep this one focussed on the repos specifically? Definitely value your thoughts and let's pursue!
One option we could take both for samples and connectors that are
would be to stop having a top level pom that tries to build everything as a single multi-module project, but instead contains a few independent file trees which may not even be pom based
needs some more thinking about...
Revisiting this since at the moment vdc is an awkward fit. We're not focussed on it, we have many issues open, it has dependencies on other repos (connectors)
Core egeria currently has docker build support for ranger & atlas (from source) as well as gaian (from a tar). These are not really part of egeria itself and IMO should be outside egeria. Not only that, but the integration is incomplete and not a current focus. This of course does not apply to the core egeria docker image, which is absolutely part of egeria
The ranger/atlas aspects can be moved to the egeria hadoop repository
We have a vdc helm chart which makes use of core egeria, the hadoop content, & the ibm content. Mostly it works well for ibm igc & egeria, but as above the hadoop components are not complete. We could move the vdc chart there and cut it down JUST to an igc/egeria helm chart. In parallel the lab chart would remain in core egeria, as would new work on operators. At some point the igc chart could be refactored to pull in a subchart based around egeria/operator or directly consume the operator
The vdc chart could also be stripped of IGC and placed in the hadoop repository though with atlas not currently working, and many components unconfigured this isn't necessarily that useful until we put more effort into supporting that environment - which I don't see in the short term
An alternative would be to create a new repo just for the helm chart (as it depends on egeria+ibm+hadoop) - this may be easier in the short term
Further breakdown is possible - like moving out the docker egeria image, lab chart, operator, samples, but I don't think we're yet at the point of this being compellingly obvious
So to clarify my proposal
The result is
Further long term refactoring may be needed as we figure out where we are going with 'vdc', virtualization, operators/containers/charts, igc
^^ @cmgrote @mandy-chessell
It's a good temporary solution
I'm not sure I follow. An entire repository just for the VDC chart seems like a lot of overhead (?) Why not a repository for all assets that don't have any build requirement (all Helm charts, sample data, Ansible automation, etc)?
Fine with moving the build stuff for Ranger, etc to the other repo -- thought we'd already agreed that long ago...
Ranger - all good there. just checking
The non-build content? The reason I hadn't yet suggested all was primarily that we still have our tutorials for docker-compose, and helm in the base repo. We know they've been successful with people using them as part of getting up and running with egeria. The question is whether splitting them away from the core is helpful or not.
The 'lab' environment is much more focussed on core egeria, so has less external dependencies/noise - though it does include jupyter, and in future may need ldap, whilst vdc has much more 3rd party content (hadoop & ibm)
For the purest split I'd agree with you, but the rationale for just moving the vdc chart for now is that it's a particular pain point, across dependencies, build, future direction. At least if it's a small and self-contained repo we can easily move it again or rename it. We can also have a longer brainstorm about what really is best.
So I still think we should sort out the vdc chart, ranger repo first
Discussed again on today's call -- suggestion is to move the following to start with, and then we can get into the detail of any others as needed later on (name the repo egeria-samples):
Phase 1:
Phase 2 (more interdependencies):
Phase 3:
New repo here: https://github.com/odpi/egeria-samples
Phase 1 should be complete with the merge of #3631 -- we may want to leave this open to track subsequent phases, though?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
Keeping this open -- we should move the coco pharma samples, for example the labs, in the not too distant future
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
Still work in progress
@mandy-chessell As discussed, I think we can move the coco/samples out of base egeria to the samples repo. Note that the docker build for jupyter assembles our notebooks into the docker image, so this needs to be done together or we will break the merge build (not the PR build, so easy to miss)
We can also move the helm charts (certainly for coco - there is another more basic one)
The docker image for egeria itself should stay as it forms part of the deliverable of the base repo, but the other docker images should move.
@mandy-chessell is looking at the initial refactoring
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
We've discussed on a number of calls for a while now the potential for creating a separate code repository for our "samples", but there did not seem to be an explicit issue for this decision (hence adding this one).
Part of what we need to decide is what should be included under this umbrella term of "samples":
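- Existing sample data and metadata (eg. under open-metadata-resources/open-metadata-deployment/sample-data)?
- Existing Ansible playbook-based loading / clearing (also under open-metadata-resources/open-metadata-deployment/sample-data/...)?
- Existing Helm chart-based deployment (under open-metadata-resources/open-metadata-deployment/charts)?
- Existing Compose-based deployment (under open-metadata-resources/open-metadata-deployment/compose)?
- Existing Docker image builds (under open-metadata-resources/open-metadata-deployment/docker)?
- Existing tutorials (under open-metadata-resources/open-metadata-tutorials)?

Related: #1643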
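
As a starting point for deciding what falls under "samples", a small script could report which of the candidate directories listed above exist in a local clone and how many files each holds. This is only a sketch: the function name `inventory` is hypothetical, the directory list is copied from this issue (the elided `sample-data/...` sub-path is left out since it sits under `sample-data` anyway), and file counts are a crude proxy for how much there is to move.

```python
from pathlib import Path
from typing import Optional

# Candidate locations copied from the issue description above.
CANDIDATE_DIRS = (
    "open-metadata-resources/open-metadata-deployment/sample-data",
    "open-metadata-resources/open-metadata-deployment/charts",
    "open-metadata-resources/open-metadata-deployment/compose",
    "open-metadata-resources/open-metadata-deployment/docker",
    "open-metadata-resources/open-metadata-tutorials",
)


def inventory(repo_root: str) -> dict[str, Optional[int]]:
    """Map each candidate directory to its file count, or None if it is absent."""
    root = Path(repo_root)
    result: dict[str, Optional[int]] = {}
    for rel in CANDIDATE_DIRS:
        directory = root / rel
        if directory.is_dir():
            # Count regular files at any depth below the candidate directory.
            result[rel] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            result[rel] = None
    return result


# Hypothetical usage from the root of a local clone:
#   for path, count in inventory(".").items():
#       print(path, "missing" if count is None else f"{count} files")
```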