odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

New "samples" repository? #2739

Closed: cmgrote closed this issue 3 years ago

cmgrote commented 4 years ago

For a while now, on a number of calls, we've discussed the potential for creating a separate code repository for our "samples", but there does not seem to be an explicit issue for this decision (hence adding this one).

Part of what we need to decide is what should be included under this umbrella term of "samples":

Related: #1643

planetf1 commented 4 years ago

I thought it worth setting the scene of where we are at with our repos.

It's perhaps worth considering what structure we want moving forward - of course this may change, but deciding at the top level of repos (vs smaller details of builds, artifacts, modules) is significant

My initial reaction is that on the very specific question of whether we should have a repo that aids end users with code samples, I would absolutely agree, as it's a unit they can fork, clone, modify, and build.

Taking just one example, like sample metadata or helm charts, there are many levels

However the list above goes a lot beyond this, so I think we need to take a step back:

Current Repos

We have a number of projects for, or used by, egeria

Some have a different scope within ODPi but have some relevance to us

Whilst others relate to other activities within the ODPi - either past or present:

For the purpose of this discussion I suggest focussing on the first set only.

It's also worth noting there's only a partial naming convention - and moving forward we should have such a convention

It's worth noting we have numerous other outlets (and arguably repositories) for our work, including

But for this discussion I will focus on the core artifacts under source code management

Users

Our artifacts that we are trying to manage are used by different communities. Also the level of 'read' vs 'write' will differ - for example

For a particular role, having to deal with too many repositories (especially if outside the core team) isn't desirable - there need to be very distinct separations

Types of artifact

We have all kinds of artifacts in our repositories/systems. For example (not a complete list!)

Additionally some artifacts relate to open source components, whilst some are proprietary

Other considerations

Github Repo

Within a single github (and hence git) repo:

Whilst some of these can span repos, this becomes harder to coordinate. Simple cases might include having a separate repo for build or test.

Git does support the concept of a sub-project where another project can be pulled in/updated on demand (a little like updating a dependency version) - but this adds complexity

Problems noted recently

Trying to build a map of these requirements

It's probably time to consider the mantra 'less is more'. But does that mean less in a repository (clear, targeted) or fewer repositories?

........ NEXT POST to follow

planetf1 commented 4 years ago

I should add that any conclusion we reach might be good to document (but where... !!)

cmgrote commented 4 years ago

I should add that any conclusion we reach might be good to document (but where... !!)

Right here on this issue, for starters 😄

planetf1 commented 4 years ago

Excuse the long post - a bit of a brain dump.

First point of note - anything in these repos needs to be suitably general and not super specific to one particular environment. If we have artifacts of that nature, they belong with the org that created them, OR we have a 'contrib' git repo which has minimal rules/overhead on contributions.

My view here is that fundamentally, keeping our assets together makes them easier to manage, support, and release. As such, the default is that we should continue using a single main repository, i.e. egeria.

Third party code

Some assets are quite different from core egeria - as per some of the factors discussed above

Candidates for this include

These would need a build and test pipeline. It could vary quite a bit by component, but it is probably infeasible to have a repo for each distinct connector.

It may be that we keep anything small/manageable in the base, and keep those with a more complex nature separate. So we could stick with the existing repo for igc/datastage, add a new one for hadoop (as it has many moving parts), but keep cassandra in the core (less invasive, open source).

Sample projects

egeria

egeria-deployment

We have various artifacts that are closely related to core egeria, but don't technically form part of it

But this is closely related to core egeria, and there is also some interdependency when we do testing. So I am not certain we should do this split - and if not, the above belong in egeria itself.

I believe strongly that if we do this (and even if we don't) we should not have any dependencies on other projects - so we do not include tests around hadoop, igc, or indeed the helm charts around vdc.

I'm 50/50 on this one

egeria-vdc

We have a few pieces left over

They are an uncomfortable fit in base egeria already - for example the vdc chart depends on ranger (which we're moving out) and the ibm connectors (which have already moved). That being said, it's not a build-time issue... yet.

Or are they so special that we put them in their own repo - which can then depend on others (logically)?

To return to Chris's proposal:

Existing sample data and metadata (eg. under open-metadata-resources/open-metadata-deployment/sample-data)?

Existing Ansible playbook-based loading / clearing (also under open-metadata-resources/open-metadata-deployment/sample-data/...)?

Existing Helm chart-based deployment (under open-metadata-resources/open-metadata-deployment/charts)?

Existing Compose-based deployment (under open-metadata-resources/open-metadata-deployment/compose)?

Existing Docker image builds (under open-metadata-resources/open-metadata-deployment/docker)?

Existing tutorials (under open-metadata-resources/open-metadata-tutorials)?

cmgrote commented 4 years ago

Sounds reasonable, just a few grey areas that come to mind that leave me undecided:

My gut feeling is that we may need to create a more well-defined plan for what these various samples are, where they overlap, and therefore where this commonality is kept vs. the sample-by-sample variation -- I think conceptually the same as you're suggesting on the third party vs not split above -- but I'm less certain this will fall strictly along third party vs not lines (as the VDC example sort of exposes)...

In favour of splitting some of these areas out from core (or indeed any currently-existing Java-based repo) entirely, another point to consider is the build process itself: currently any commits against these areas (sample files, scripts (whether for deployment or tutorials: Ansible, Docker, Helm, Python), etc) all require a full build of the core itself to be completed before they can be considered and merged -- yet none of them directly influence the Java code being built by the PR / merge process. As the community broadens, and our Java code-base continues to expand (and thus the builds go from 15-20 minutes to longer), could we be unnecessarily delaying contribution and evolution of these tutorials / deployment / samples areas by forcing them to exist in a Java code tree?

Could we also be missing out on more optimal approaches to the linting, static code analysis, vulnerability detection, building, automated testing, etc of these areas (eg. given current complete reliance on Maven for build and Sonar's Java profile for scanning)?
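The build-time concern above can be illustrated with a minimal sketch, assuming a hypothetical helper that a CI job could call to decide whether a set of changed files touches anything outside the non-Java trees. The two path prefixes are the directories named earlier in this thread; the function names `is_non_java_asset` and `needs_java_build` are made up for illustration and are not part of Egeria or of any CI product, and whether those trees really have no effect on the Java build is itself an assumption.

```python
from pathlib import PurePosixPath

# Assumption: these trees hold only samples, deployment scripts and tutorials.
NON_JAVA_PREFIXES = (
    "open-metadata-resources/open-metadata-deployment",
    "open-metadata-resources/open-metadata-tutorials",
)


def is_non_java_asset(path: str) -> bool:
    """Return True if the path lies inside one of the assumed non-Java trees."""
    parts = PurePosixPath(path).parts
    for prefix in NON_JAVA_PREFIXES:
        prefix_parts = PurePosixPath(prefix).parts
        # Compare whole path components so that a sibling directory such as
        # "open-metadata-deployment-extra" is not matched by accident.
        if parts[: len(prefix_parts)] == prefix_parts:
            return True
    return False


def needs_java_build(changed_paths: list[str]) -> bool:
    """A full build is needed if any changed file is outside the non-Java trees."""
    return any(not is_non_java_asset(p) for p in changed_paths)
```

A change set containing only files under the deployment or tutorials trees would return `False`, while one that also touches, say, a `pom.xml` would return `True`. Splitting these assets into their own repository achieves the same effect structurally, without needing a filter like this at all.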

fpompermaier commented 4 years ago

Users

Our artifacts that we are trying to manage are used by different communities. Also the level of 'read' vs 'write' will differ - for example

  • Core development team
  • Additional developers wishing to contribute occasionally, or aspire to do more a) New b) experienced
  • Those who just wish to use and/or understand our code from an external perspective
  • Those who want to read about what we're doing & follow the project

For a particular role, having to deal with too many repositories (especially if outside the core team) isn't desirable - there need to be very distinct separations

Hello everybody, I don't know whether this comment could be meaningful about the quoted problem or not but I hope so.

I have been following Egeria for a couple of months and I am a huge fan of it: it addresses many of the problems I have to deal with almost every day.

I'd like to be able to keep up with the developments and to contribute somehow, but it's a VERY broad project (with many complex architectural components). Basically, I feel a little bit afraid of stepping in because of the complexity of the entire project (and also because I could currently contribute to it only in my spare time).

What I'd like from Egeria is help for people like me to understand how to get comfortable with all of its modules in an incremental way... although this is a very complex and time-consuming task as well.

planetf1 commented 4 years ago

@fpompermaier thanks for the post. Initially this is probably more general than repos (though we should ensure the structure makes following easier, not harder). Can we continue this via Slack (slack.odpi.org, try channel #egeria-discussions)? If that doesn't work, perhaps a separate issue - just to help keep this one focussed on the repo specifically? Definitely value your thoughts and let's pursue!

planetf1 commented 4 years ago

One option we could take both for samples and connectors that are

would be to stop having a top level pom that tries to build everything as a single multi-module project, but instead contains a few independent file trees which may not even be pom based

needs some more thinking about...

planetf1 commented 4 years ago

Revisiting this since at the moment vdc is an awkward fit. We're not focussed on it, we have many issues open, it has dependencies on other repos (connectors)

planetf1 commented 4 years ago

So to clarify my proposal

The result is

Further long term refactoring may be needed as we figure out where we are going with 'vdc', virtualization, operators/containers/charts, igc

planetf1 commented 4 years ago

^^ @cmgrote @mandy-chessell

mandy-chessell commented 4 years ago

It's a good temporary solution

cmgrote commented 4 years ago

I'm not sure I follow. An entire repository just for the VDC chart seems like a lot of overhead (?) Why not a repository for all assets that don't have any build requirement (all Helm charts, sample data, Ansible automation, etc)?

Fine with moving the build stuff for Ranger, etc to the other repo -- thought we'd already agreed that long ago...

planetf1 commented 4 years ago

Ranger - all good there. Just checking.

The non-build content? The reason I hadn't yet suggested all was primarily that we still have our tutorials for docker-compose, and helm in the base repo. We know they've been successful with people using them as part of getting up and running with egeria. The question is whether splitting them away from the core is helpful or not.

The 'lab' environment is much more focussed on core egeria, so has fewer external dependencies/less noise - though it does include jupyter, and in future may need ldap, whilst vdc has much more 3rd party content (hadoop & ibm).

For the purest split I'd agree with you, but the rationale for just moving the vdc chart for now is that it's a particular pain point, across dependencies, build, and future direction. At least if it's a small and self-contained repo we can easily move it again or rename it. We can also have a longer brainstorm about what really is best.

So I still think we should sort out the vdc chart, ranger repo first

cmgrote commented 3 years ago

Discussed again on today's call -- suggestion is to move the following to start with, and then we can get into the detail of any others as-needed later on (name the repo egeria-samples):

Phase 1:

Phase 2 (more interdependencies):

Phase 3:

cmgrote commented 3 years ago

New repo here: https://github.com/odpi/egeria-samples

cmgrote commented 3 years ago

Phase 1 should be complete with the merge of #3631 -- we may want to leave this open to track subsequent phases, though?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

planetf1 commented 3 years ago

Keeping this open -- we should move the coco pharma samples, for example the labs, in the not too distant future

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

mandy-chessell commented 3 years ago

Still work in progress

planetf1 commented 3 years ago

@mandy-chessell As discussed, I think we can move the coco/samples out of base egeria to the samples repo. Note that the docker build for jupyter assembles our notebooks into the docker image, so this needs to be done together or we will break the merge build (not the PR build, so easy to miss).

We can also move the helm charts (certainly for coco - there is another more basic one)

The docker image for egeria itself should stay as it forms part of the deliverable of the base repo, but the other docker images should move.

planetf1 commented 3 years ago

@mandy-chessell is looking at the initial refactoring

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.