New "samples" repository?

cmgrote commented 4 years ago

We've discussed on a number of calls for a while now the potential for creating a separate code repository for our "samples", but there did not seem to be an explicit issue for this decision (hence adding this one).

Part of what we need to decide is what should be included under this umbrella term of "samples":

Existing sample data and metadata (eg. under open-metadata-resources/open-metadata-deployment/sample-data)?
Existing Ansible playbook-based loading / clearing (also under open-metadata-resources/open-metadata-deployment/sample-data/...)?
Existing Helm chart-based deployment (under open-metadata-resources/open-metadata-deployment/charts)?
Existing Compose-based deployment (under open-metadata-resources/open-metadata-deployment/compose)?
Existing Docker image builds (under open-metadata-resources/open-metadata-deployment/docker)?
Existing tutorials (under open-metadata-resources/open-metadata-tutorials)?
Others?

Related: #1643

planetf1 commented 4 years ago

I thought it worth setting the scene of where we are at with our repos.

It's perhaps worth considering what structure we want moving forward - of course this may change, but deciding at the top level of repos (vs smaller details of builds, artifacts, modules) is significant

My initial reaction is that on the very specific question of whether we should have a repo that aids end users with code samples, I would absolutely agree, as it's a unit they can fork, clone, modify, and build.

Taking just one example, like sample metadata or helm charts, there are many levels:

data and charts may focus only on core, pure egeria function
They may support testing
They may relate to external components (including connectors)

However the list above goes a lot beyond this, so I think we need to take a step back:

Current Repos

We have a number of projects for or used by egeria

egeria
egeria-connector-apache-atlas
egeria-connector-ibm-information-server
data-governance
egeria-dev-projects
egeria-palisade

Some have a different scope within ODPi but have some relevance to us

tsc
artwork

Whilst others relate to other activities within the ODPi - either past or present:

website
ci-management
OpenDS4All
bi-ai
specs
bigtop
self-certification-reports
security-guide
ambari
hive
hadoop

For the purpose of this discussion I suggest focussing on the first set only.

It's also worth noting there's only a partial naming convention - and moving forward we should have such a convention

It's worth noting we have numerous other outlets (and arguably repositories) for our work, including

Blog sites (including odpi)
Azure dev pipelines
maven central
Artifactory, (+ bintray, JCenter)
docker.io
Slack

But for this discussion I will focus on the core artifacts under source code management

Users

Our artifacts that we are trying to manage are used by different communities. Also the level of 'read' vs 'write' will differ - for example

Core development team
Additional developers wishing to contribute occasionally, or aspire to do more a) New b) experienced
Those who just wish to use and/or understand our code from an external perspective
Those who want to read about what we're doing & follow the project

For a particular role, having to deal with too many repositories (especially if outside the core team) isn't desirable - there need to be very distinct separations

Types of artifact

We have all kinds of artifacts in our repositories/systems. For example (not a complete list!)

Core egeria infrastructure - frameworks, OMRS
Connectors (many types)
OMASs (could be many)
communities
Tools (ie archives)
Coco pharma sample data (users, files, ldap/login)
tutorials (inc. notebooks) linked with above
Clients
Docker build
Helm charts & Docker-compose definitions

Additionally some artifacts relate to open source components, whilst some are proprietary

Other considerations

Size (to clone)
Quality (especially in terms of security scanning)
Popularity
Lifecycle of artifacts inc. rate of change
read vs write
Likelihood of being closely linked with other artifacts - within a repo a change can be made atomically.
Too many repos means harder to manage, harder to find
Dependencies on other components
How general purpose
cleanliness/quality
Audience (as above)

Github Repo

Within a single github (and hence git) repo:

Branch protection rules
Github actions like dependabot, build triggers
Build, CI/CD
scanning
Release versioning / branching / tags

Whilst some of these can span repos, this becomes harder to coordinate. Simple cases might include having a separate repo for build or test.

Git does support the concept of a sub-project where another project can be pulled in/updated on demand (a little like updating a dependency version) - but this adds complexity

Problems noted recently

Some of the third party code we pull in is somewhat dirty in terms of security scans/quality & results may contaminate egeria's position, especially when those areas are used only by a small minority of users. Ranger/Hadoop for example, also potentially Gaian
Code samples require pulling the main repo. The samples are written to build as part of the overall project. This is different to what someone consuming egeria probably wants, which is an example of how they would build their own project that consumes egeria. It's also more complex to explain

Trying to build a map of these requirements

It's probably time to consider the mantra 'less is more'. But does that mean less in a repository (clear, targeted) or fewer repositories?

........ NEXT POST to follow

planetf1 commented 4 years ago

I should add that any conclusion we reach might be good to document (but where... !!)

cmgrote commented 4 years ago

I should add that any conclusion we reach might be good to document (but where... !!)

Right here on this issue, for starters 😄

planetf1 commented 4 years ago

Excuse the long post - a bit of a brain dump.

First point of note - anything in these repos needs to be suitably general and not super specific to one particular environment. If we have artifacts of that nature they belong with the org that created them, OR we have a 'contrib' git repo which has minimal rules/overhead on contributions

My view here is that fundamentally, keeping our assets together makes them easier to manage, support, release. As such the default is that we should continue using a single main repository ie egeria

Third party code

Some assets are quite different from core egeria - as per some of the factors discussed above

Candidates for this include

igc & datastage connectors - IGC is a proprietary component, so using it requires that component to be licensed (not our connector). Specific ansible or other scripts that combine igc/datastage/egeria belong here
hadoop integrations - Whilst both Atlas and Ranger are independent Apache projects, they are most commonly used as part of a broader hadoop environment, alongside hive & others. This includes the Atlas connector, as well as the ranger governance engine plugin. It would also include the docker builds for atlas and ranger
cassandra - connectors for a) omas integration to capture metadata about cassandra resources and b) using cassandra as a backing store for JanusGraph, either for egeria's repository or open lineage services

Would need a build and test pipeline. Could be quite varied by component, but it is probably unfeasible to have a repo for each distinct connector

It may be that we keep anything small/manageable in the base, but keep those of a more complex nature separate. So we could stick with the existing repo for igc/datastage, add a new one for hadoop (as it has many moving parts), but keep cassandra in the core (less invasive, open source)

Sample projects

Simple code samples demonstrating how to use our core APIs
Not complex scenarios
Must only depend on limited repos - like the core egeria code repository, and possibly a deployment-related repo
Probably a structure of directories but without a top level parent pom, rather a series of mini projects with their own poms which a developer could just take and use
Will typically just make use of egeria release artifacts via maven repos
Probably needs version aligning with main code
May benefit from a build/test process to ensure the samples are correct/work. Likely light
No need to publish any artifacts but instead assume a developer will just clone.
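As an illustration of the "mini projects with their own poms" idea, here is a minimal sketch of a light check that each sample directory could be lifted out and built on its own. It assumes a hypothetical layout in which every sample is an immediate subdirectory containing a pom.xml, and it treats "standalone" as simply "declares no <parent> element"; neither is an agreed project convention.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def standalone_samples(root):
    """Map each immediate subdirectory of root that holds a pom.xml to a bool.

    True means the pom declares no <parent>, so the directory does not rely
    on a top-level parent pom (the assumed meaning of 'standalone' here).
    """
    results = {}
    for pom in sorted(Path(root).glob("*/pom.xml")):
        project = ET.parse(pom).getroot()
        # Compare local tag names so poms with or without the Maven XML
        # namespace are both handled.
        has_parent = any(child.tag.split("}")[-1] == "parent" for child in project)
        results[pom.parent.name] = not has_parent
    return results
```

A check like this could run as part of the light build/test process mentioned above, flagging any sample that has quietly picked up a dependency on a parent pom.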

This looks like an ideal candidate for a new repo - though could continue to be managed in core egeria if we were to create maven archetypes/write standalone poms. On balance though I'd split off.

egeria

Core code including unit tests
Egeria docker build (as this is 100% aligned with the code build)
FVT (closely related in lifecycle to core code?)

egeria-deployment

We have various artifacts that are closely related to core egeria, but don't technically form part of it

deployment of core egeria
docker image of jupyter etc (but not atlas, ranger, gaian) - enough to run our core demo/test environment
egeria tutorials, coco pharma samples
ansible script samples - but only for core egeria components

But this is closely related to core egeria, and there is also some interdependency when we do testing. So I am not certain we should do this split - and if not, the above belong in egeria itself

I believe strongly that if we do this (and even if we don't) we should not have any dependencies on other projects - so we do not include tests around hadoop, igc, or indeed the helm charts around vdc.

I'm 50/50 on this one

egeria-vdc

We have a few pieces left over

gaian
docker images for atlas, Ranger
helm chart for vdc (and related artifacts) - since this has ranger, and ibm content-
Do we roll these into third party?
Do we create a new project for scenarios?

They are an uncomfortable fit in base egeria already - for example the vdc chart depends on ranger (which we're moving out) and the ibm connectors (which already have). That being said, it's not a build time issue. Yet.

Or is it so special we put them in their own repo - which can then depend on others (logically)?

To return back to Chris's proposal:

Existing sample data and metadata (eg. under open-metadata-resources/open-metadata-deployment/sample-data)?

This is mostly vdc (and now lineage) sample data, so it belongs wherever we decide (above) that content belongs. We should definitely not move anything needed by our core tutorial environment any further away from egeria itself (see discussion above)

Existing Ansible playbook-based loading / clearing (also under open-metadata-resources/open-metadata-deployment/sample-data/...)?

I think ansible scripts should be aligned with the data or components they require. If they can be used just for core components then we leave them in egeria. If they load igc data we put them there.

Existing Helm chart-based deployment (under open-metadata-resources/open-metadata-deployment/charts)?

I think the lab & vdc are very different. Lab is focussed on egeria code. We use it to aid in verification of egeria, and teach users about egeria without bringing in the complexity of other integrations/scenarios. VDC on the other hand, I agree, could be moved

Existing Compose-based deployment (under open-metadata-resources/open-metadata-deployment/compose)?

I would leave with egeria for the same reasons

Existing Docker image builds (under open-metadata-resources/open-metadata-deployment/docker)?

Split by usage. Docker is only a vehicle. The core egeria docker image should be a 1st class deliverable of egeria itself. It should stay there. If we keep coco notebooks in egeria then jupyter can stay there too. But I agree with evicting ranger/atlas/gaian

Existing tutorials (under open-metadata-resources/open-metadata-tutorials)?

See comments on egeria above

cmgrote commented 4 years ago

Sounds reasonable, just a few grey areas that come to mind that leave me undecided:

Docker image builds (I think?) currently follow one build mechanism as they are today within the core. This has taken us some time to put in place, particularly with the appropriate version tagging -- ie. it is complex. If we move some of these images out, we then need to coordinate keeping this complex build and image release process in sync across multiple repositories (a fix in one repository no longer fixes that build / tag / etc issue for all the images).
The samples we have today for Coco Pharmaceuticals make use of some sample data files, as well as (optionally) allowing these files to be used to populate databases that would hold sample data. While the sample data files themselves make sense to keep in core (no third party dependency or linkage), what about the scripts / logic that loads them to various databases for use as sample databases (given that these databases themselves are third party)? Would be overkill to create a separate area for each of postgres, maria, db2, etc; yet if we split strictly on core vs third party lines these don't really fit in core, either...
Similarly, while we can split for example the Ansible playbooks or other scripts across repositories depending on what they configure, I suspect there will be some core Egeria scripts and some third party (eg. IGC scripts) that would both be needed to setup a complex scenario (eg. the equivalent of VDC, without containers). If these are split across multiple repositories, this is likely to make the process of someone consuming them significantly more complicated (not to mention keeping them all in sync / tested together).

My gut feeling is that we may need to create a more well-defined plan for what these various samples are, where they overlap, and therefore where this commonality is kept vs. the sample-by-sample variation -- I think conceptually the same as you're suggesting on the third party vs not split above -- but I'm less certain this will fall strictly along third party vs not lines (as the VDC example sort of exposes)...

In favour of splitting some of these areas out from core (or indeed any currently-existing Java-based repo) entirely, another point to consider is the build process itself: currently any commits against these areas (sample files, scripts (whether for deployment or tutorials: Ansible, Docker, Helm, Python), etc) all require a full build of the core itself to be completed before they can be considered and merged -- yet none of them directly influence the Java code being built by the PR / merge process. As the community broadens, and our Java code-base continues to expand (and thus the builds go from 15-20 minutes to longer), could we be unnecessarily delaying contribution and evolution of these tutorials / deployment / samples areas by forcing them to exist in a Java code tree?

Could we also be missing out on more optimal approaches to the linting, static code analysis, vulnerability detection, building, automated testing, etc of these areas (eg. given current complete reliance on Maven for build and Sonar's Java profile for scanning)?
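To make the build-gating point concrete, here is a minimal sketch of how a CI step might decide whether a change needs the full Java build at all. The two directory prefixes are taken from the paths listed at the top of this issue; treating them as containing no Java code is an assumption for illustration, and the function name is hypothetical.

```python
# Areas assumed (for illustration only) to hold no Java code. The paths come
# from this issue; whether they are truly Java-free would need checking.
NON_JAVA_PREFIXES = (
    "open-metadata-resources/open-metadata-deployment",
    "open-metadata-resources/open-metadata-tutorials",
)


def needs_java_build(changed_files):
    """Return True if any changed file lies outside the non-Java areas."""
    for name in changed_files:
        # Append "/" so a sibling directory sharing the prefix is not matched.
        if not any(name.startswith(prefix + "/") for prefix in NON_JAVA_PREFIXES):
            return True
    return False
```

A PR touching only Helm charts or notebooks would return False and could skip straight to lighter, tool-specific checks. Moving these areas to a separate repository achieves the same effect without needing any such filter.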

fpompermaier commented 4 years ago

Users

Our artifacts that we are trying to manage are used by different communities. Also the level of 'read' vs 'write' will differ - for example

Core development team

Additional developers wishing to contribute occasionally, or aspire to do more a) New b) experienced

Those who just wish to use and/or understand our code from an external perspective

Those who want to read about what we're doing & follow the project

For a particular role, having to deal with too many repositories (especially if outside the core team) isn't desirable - there need to be very distinct separations

Hello everybody, I don't know whether this comment could be meaningful about the quoted problem or not but I hope so.

I have been following Egeria for a couple of months and I am a huge fan of it: it addresses many of the problems I have to deal with almost every day.

I'd like to be able to keep up with the developments and to contribute somehow, but it's a VERY broad project with many complex architectural components: basically I feel a little bit afraid of stepping in because of the complexity of the entire project (and also because I could currently contribute to it only in my spare time).

What I'd like from Egeria is to help people like me to understand how to get comfortable with all of its modules in an incremental way... although this is a very complex and time-consuming task as well.

planetf1 commented 4 years ago

@fpompermaier thanks for the post. Initially this is probably more general than repos (though ensuring the structure makes following easier, not harder, is relevant). Can we continue this via Slack (slack.odpi.org - try channel #egeria-discussions)? If that doesn't work, perhaps a separate issue - just to help keep this one focussed on the repo specifically? Def value your thoughts and let's pursue!

planetf1 commented 4 years ago

One option we could take both for samples and connectors that are

liable to be picked up as standalone components
have highly varying dependencies and deployment approaches
non java based

would be to stop having a top level pom that tries to build everything as a single multi-module project, but instead contains a few independent file trees which may not even be pom based

needs some more thinking about...
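One way to picture a repo of independent file trees is a small script that works out, per top-level directory, which build tool applies, so each tree can be handled on its own rather than through one top-level pom. This is only a sketch: the marker file names are the standard ones for each tool, but which trees such a repo would contain, and the function name, are assumptions.

```python
from pathlib import Path

# Standard marker file for each tool -> label used in the result.
MARKERS = {
    "pom.xml": "maven",
    "Chart.yaml": "helm",
    "docker-compose.yml": "compose",
    "Dockerfile": "docker",
}


def classify_trees(root):
    """Map each top-level directory under root to the build tools detected in it.

    Only the top level of each tree is inspected; a tree with no marker file
    (for example plain sample data) maps to an empty list.
    """
    trees = {}
    for child in sorted(Path(root).iterdir()):
        if child.is_dir():
            trees[child.name] = sorted(
                tool for marker, tool in MARKERS.items() if (child / marker).is_file()
            )
    return trees
```

Trees that map to an empty list are the ones with no build requirement at all, which is the case where a separate lightweight repo looks most attractive.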

planetf1 commented 4 years ago

Revisiting this since at the moment vdc is an awkward fit. We're not focussed on it, we have many issues open, it has dependencies on other repos (connectors)

Core egeria currently has docker build support for ranger & atlas (from source) as well as gaian (from a tar). These are not really part of egeria itself and IMO should be outside egeria. Not only that but the integration is incomplete and not a current focus. This of course does not apply to the core egeria docker image, which is absolutely part of egeria
The ranger/atlas aspects can be moved to the egeria hadoop repository
We have a vdc helm chart which makes use of core egeria, the hadoop content, & the ibm content. Mostly it works well for ibm igc & egeria, but as above the hadoop components are not complete. We could move the vdc chart there and cut it down JUST to an igc/egeria helm chart. In parallel the lab chart would remain in core egeria, as would new work on operators. At some point the igc chart could be refactored to pull in a subchart based around egeria/operator or directly consume the operator
The vdc chart could also be stripped of IGC and placed in the hadoop repository though with atlas not currently working, and many components unconfigured this isn't necessarily that useful until we put more effort into supporting that environment - which I don't see in the short term
An alternative would be to create a new repo just for the helm chart (as it depends on egeria+ibm+hadoop) - this may be easier in the short term
Further breakdown is possible - like moving out the docker egeria image, lab chart, operator, samples, but I don't think we're yet at the point of this being compellingly obvious

planetf1 commented 4 years ago

So to clarify my proposal

create a new repo for vdc chart & move chart across (it's just a few simple files/trees & no build process is needed)
move code for atlas/ranger (and also gaian!) docker images to our hadoop repo as part of our larger migration - this will in theory require a build process which can be copied from egeria, but given the images aren't changing isn't immediately urgent
move issues tagged vdc to one of these as appropriate

The result is

vdc chart will still work
clutter is removed from egeria repo

Further long term refactoring may be needed as we figure out where we are going with 'vdc', virtualization, operators/containers/charts, igc

planetf1 commented 4 years ago

^^ @cmgrote @mandy-chessell

mandy-chessell commented 4 years ago

It's a good temporary solution

cmgrote commented 4 years ago

I'm not sure I follow. An entire repository just for the VDC chart seems like a lot of overhead (?) Why not a repository for all assets that don't have any build requirement (all Helm charts, sample data, Ansible automation, etc)?

Fine with moving the build stuff for Ranger, etc to the other repo -- thought we'd already agreed that long ago...

planetf1 commented 4 years ago

Ranger - all good there. just checking

The non-build content? The reason I hadn't yet suggested all was primarily that we still have our tutorials for docker-compose, and helm in the base repo. We know they've been successful with people using them as part of getting up and running with egeria. The question is whether splitting them away from the core is helpful or not.

The 'lab' environment is much more focussed on core egeria, so has fewer external dependencies and less noise - though it does include jupyter, and in future may need ldap, whilst vdc has much more 3rd party content (hadoop & ibm)

For the purest split I'd agree with you, but the rationale for just moving the vdc chart for now is that it's a particular pain point, across dependencies, build, future direction. At least if it's a small and self contained repo we can easily move again or rename. We can also have a longer brainstorm about what really is best.

So I still think we should sort out the vdc chart, ranger repo first

cmgrote commented 3 years ago

Discussed again on today's call -- suggestion is to move the following to start with, and then we can get into the detail of any others as-needed later on (name the repo egeria-samples):

Phase 1:

Minimal samples (used for lineage)
Ansible playbooks for deployment
VDC charts

Phase 2 (more interdependencies):

Coco Pharmaceuticals samples
Egeria tutorials (reliant on Coco Pharma samples)

Phase 3:

others...

cmgrote commented 3 years ago

New repo here: https://github.com/odpi/egeria-samples

cmgrote commented 3 years ago

Phase 1 should be complete with the merge of #3631 -- we may want to leave this open to track subsequent phases, though?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

planetf1 commented 3 years ago

Keeping this open -- we should move the coco pharma samples, for example the labs, in the not too distant future

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

mandy-chessell commented 3 years ago

Still work in progress

planetf1 commented 3 years ago

@mandy-chessell As discussed, I think we can move the coco/samples out of base egeria to the samples repo. Note that the docker build for jupyter assembles our notebooks into the docker image, so this needs to be done together or we will break the merge build (not the PR build... so easy to miss)

We can also move the helm charts (certainly for coco - there is another more basic one)

The docker image for egeria itself should stay as it forms part of the deliverable of the base repo, but the other docker images should move.

planetf1 commented 3 years ago

@mandy-chessell is looking at the initial refactoring

odpi / egeria

New "samples" repository? #2739