Proposal To Simplify Connector Repositories Integration

odpi / egeria

Egeria core

https://egeria-project.org

Apache License 2.0

774 stars 259 forks source link

Proposal To Simplify Connector Repositories Integration #5943

Closed wbittles closed 2 years ago

wbittles commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

There has been a lot of dicsussions around git repositories lately and this issue is intended to propose a solution to the concerns raised.

Currently multiple Egeria connectors are grouped into single git repositories based on a logical grouping such as "database connectors". These git repos are essentially multi tenant git repositories with each connector an individual project , with it's own dependencies , builds, tests, documents and other assorted project artefacts. It's not just organisational issues that arise from these multi tenant git repos, there also concerns around when code is contributed and maintained.

Expected Behavior

Propose we move to a single use github repo model, where each connector is developed in it's own github repo, responsible for it's own life cycle and maintenance. Egeria could then include these connectors as dependencies into value add projects as dependencies, just the git sub module feature. For instance an Egeria project that wanted to develop a K8s / Juypter notebook demonstartion for a particular set of connectors could separate out the demo platform from connector development. It would also be possible to link connectors directly with the main egeria project again as sub modules , which would give the egeria project control over how the connector was distributed and deployed.

Alternatives

No response

Any Further Information?

No response

Would you be prepared to be assigned this issue to work on?

[X] I can work on this

mandy-chessell commented 2 years ago

Originally we only had one git repository (egeria.git) and all of the connectors along with the egeria runtime and UIs were in that repository. This meant that all security vulnerabilities found in the connector and UI dependencies were associated with Egeria. We have been in a process of moving the most troublesome connectors to their own repositories so that the associated vulnerabilities are isolated with the connector and deployers can make choices on which connectors to include in their platform knowing which are contributing vulnerabilities.

This has worked well, but at a cost because the monthly release process is getting more resource intensive with each new repository. We need a compromise solution.

I did a quick count and there are 75 connectors left in the egeria.git repository. The majority only have dependencies on Java and/or the Egeria runtime and are not worth moving.

I believe that the 2 JanusGraph connectors and the 2 Kafka connectors are the next important to move out. The TSC discussed which repository to put them in last week and whether they should each be in their own repository.

The compromise was that some connectors would continue with their own repos - eg the SAS Viya, XTDB - and we would also create a new repository called egeria-connectors that would house the majority of the connectors with external dependencies. They would build independently and create their own jar. This process separates the vulnerabilities from the connectors from the egeria runtime which is pretty clean. It does not separate the vulnerabilities from each of the connectors. Connectors with troublesome dependencies will still need to be pushed to their own git repository.

However, to go much further with this we need some automation in the release process that works across the repositories since we are already struggling with the 18 repositories we have already.

planetf1 commented 2 years ago

I can see significant attraction in having a unique repository per-connector as proposed by @wbittles

it's very clear what the repo contains (for someone working on the connector)
the ci/cd for build, test & release can be simple & appropriate for that connector
builds, clones, scans are quick & simple

We can though still have some isolation within a single repository - as you describe the multi-tenant repo
- keeping the individual projects distinct (ie no common dependencies/maven/gradle definitions)
- having seperate ci/cd actions that only concentrate on one tree
- But it is a compromise
Whether in a single repo, or multiple, each connector should contribute artifacts such as
- a zip archive or jar -> maven central currently (could post elsewhere in future)
- build attachments (for for debug/verification)
- a docker image (ie a customized egeria image to aid testing)
So far our main 'aggregate' projects depend on these build outputs - in fact all we currently have really is our egeria-charts repo. This depends on docker images in general (charts could also pull maven artifacts to retrieve connectors .. or pull from another location). I think of this as a binary dependency

We have talked about an 'uber-assembly' but I think this is most likely to also use maven to gather binary artifacts rather than source

Java code generally is more likely to depend on egeria for build, rather than connectors (though they are needed at runtime). There maven artifacts are probably suitable.

There could be some more source-led repos where git submodules are an appropriate technique, though I don't think we have any currently.

The challenges to multi repos are

management
ensuring suitable consistency in process
release cycle (as Mandy mentioned)
findability
monitoring, scanning (security, code quality)
whether getting on for 100 is simply too many

Overall I like the idea of distinct repos, probably not fully due quantity, but only when we can address the consistency/management
- Guideliines / responsibilities for repo owners etc could & should be written up (I agree we don't have enough yet) - this is relatively fixable with a little time
- we need cross-repo automation & common scripts across all repos, with 'suitable' (which we need to define) configurability. Early in the project we were using build support from the LF team, and they did this by having a common CI project for common artifacts, docs, processes

It's this that may take substantial time - it's a good idea, but with our current team size I think we'll struggle to do quickly. Though cross-repo management is already an issue, and is only going to get worse.

I would end by suggesting again it's a good end goal... and my questions are really about how we could make it happen even to the 19s of repos:

the LF support team may be able to offer some ideas of how we can scale?
we could look around at any comparable projects and how they do it?
we could have a design discussion to brainstorm? I can talk through the repo setup, maint, release aspects
Next time we have a new repo to setup perhaps you could do the setup @wbittles and we can transcribe that into better docs & guidelines and potentially automation. Similarly for release . Maybe this is a first step & would help spread the workload? If we could get 3 people involved we might have a critical mass?

All the above is applicable whether we just add a 'few' or 'lots'

wbittles commented 2 years ago

@mandy-chessell

It's not obvious why the reporting of vulnerabilities at a repository level is that useful, it just means the user has to calculate the attack vectors of the deployed system, however with each connector in it's own repository this requirement of vulnerability isolation would be satisfied.

Would it possible to get the details of the following point please

"This has worked well, but at a cost because the monthly release process is getting more resource intensive with each new repository. We need a compromise solution. "

To my knowledge the Egeria release process doesn't interact with the connector repositories, they are totally independent once the dependency jars have been consumed by the build. This is one of the benefits of this proposal Egeria would get the "bring them all home" ability ( well what had been choosen to be there )

This proposal is that everything should be in it's own repository , so it's really only concerned with the database and apache multi tenant repositories, both of which need refactoring to support multiple tenants.

@planetf1
It's not obvious what benefits are being gained from the multi tenant approach and it would be helpful to get a feel for what value it's providing.

Another benefit of this proposal , ontop of the ones mentioned above is that we would have the ability separate out the core connector development and the value add work of jupyter notebooks , K8S etc. Currently this is all lumped into the connector development.

Using git sub modules provides support for commits made in each project, letting the value add project make their own fixes., where maven dependencies require the fix to be made in the connector project.

I'm unable to find another instance of a project using this multi tenant approach, it would help if you could reference where this approach has been documented or discussed ? How much effort is required to deliver support for multi tenant repositories. ?

I don't understand the 100 reference would even be considered too much , The Docker hub has a git repository for every image, so what is the limiting factors with the number of git repositories ?

It's not clear to me what consistency management actually entais or what responsibilities will be passed to repository owners, again more clarity would be appreciated.

I'm unable to find another instance of a github project taking this approach as I said earlier evern docker hub uses a git repo per image and that alone would give rise to confusion and raise the need to socialize the entire process from scratch. I only have a limited view of the what these repositories are helping with and they look like loose informal taxonomies and I'm struggling with the concept of creating a repository as opposed to starting a project to achieve a goal.

I could easily refactor the repository for databases connectors which will enable that project to move forward.

please feel free to point to any relevant issues , docs etc.

cmgrote commented 2 years ago

I'm unable to find another instance of a project using this multi tenant approach, it would help if you could reference where this approach has been documented or discussed ? How much effort is required to deliver support for multi tenant repositories. ?

If I understood the suggestion for the multi-tenant repositories, I think there are many other examples of projects focused on the integration space (and therefore having 10's - 100's of connectors) that take this approach -- a quick list:

Airbyte has 100+ connectors in a single repository (see: https://github.com/airbytehq/airbyte)
Apache Airflow appears to have its core (~60) "providers" all in the same repository (https://github.com/apache/airflow/tree/main/airflow/providers)
Same with Dagster (https://github.com/dagster-io/dagster/tree/master/python_modules/libraries)

But indeed not all take this approach... For example:

DBT appears to use separate repositories for each of the (10-ish?) data platforms it operates on (https://github.com/orgs/dbt-labs/repositories)

So I don't think there's a definite "winner" either way, but certainly the "separate-repository-for-every-connector" approach cannot be categorically considered "the norm" in this domain...

wbittles commented 2 years ago

@cmgrote Thanks Chris, in the first example Airbyte , that looks like it comparable with the all connectors in Egeria model in that that git repository is building a single distribution and all those connectors are sub components of that distribution.

Our multi tenant repositories are different in that they are producing distinctly separate jars with no connection to each other other than they live in the same repository , it feels like we are missing a top level project to house them as a a unit. I don't know enough python to understand how the others examples are built.

It's like we need a "database connection pak" to bring these independent jars together.

cmgrote commented 2 years ago

True, they're all Python-based examples and thus have a different build / distribution mechanism than our Java-based code. Nonetheless, they do seem to include the connector code itself for 10s-100s of different connectors all in a single repository -- they don't refer out to other repositories that have that connector source code.

wbittles commented 2 years ago

@cmgrote My understanding is that the egeria build/release process was struggling to reach that kind of scalability and hence the need to try and split out some of the CVE's along with build and assembly etc. I believe it was the maven processing that was the bottleneck but @planetf1 will have suffered the pain and can probably provide more details.

cmgrote commented 2 years ago

Ah, I think the issue was slightly different: my impression of the connectors is that most of them are very small and individually probably take seconds to build via Maven. The concern around build time was therefore the reverse: that for such small components, waiting for the entire core of Egeria (+ FVTs) to build via Maven means that there's an extreme time penalty (1 hour+?) against the relatively tiny connectors when you want to make a change / update to a connector.

I would imagine that with each connector being (entirely?) independent from each other, they could also make use of extreme parallelism in the build process e.g. via Gradle or other means that we're not (yet?) able to fully take advantage of in the Egeria core due to its relatively more-intra-dependent nature.

So from a build-timing perspective I suspect this isn't a main driver for choosing one vs multiple repositories.

(I would also imagine, but would look to @planetf1 to confirm, that we could even have different "releases" of individual connectors from a single repository if they are following different development timescales / velocity, similar to what we're doing today in the Charts repository with the different charts?)

I think the other significant concern has been around the vulnerabilities that the connectors' dependencies may bring into the equation: i.e. that connectors like those in the Hadoop ecosystem may still rely on Java v8 and / or relatively old versions of libraries that have since had various high severity vulnerabilities identified within them. We don't want to have the entire project marked as having critical vulnerabilities when it's only a single connector that actually uses such a particular library, as a given adopter may not even have any reason to use that particular connector (or if they do, may be willing to accept the risk associated).

I have been assuming this is the larger driver that would make separate repositories attractive: that way we don't have a superset of all of these dependencies appearing to impact the entire repository...

But maybe there's an even simpler compromise / hybrid solution to that problem:

One GitHub repository, but no top-level pom at the root of the repository.
So no superset dependencyManagement section that tries to cover all dependencies across all connectors.
Instead, we have sub-directories within the single GitHub repository, where each sub-directory is a connector and contains a "top-level pom" for that given connector (with only its own dependencies).

We then have a single GitHub repository, but different Maven artifacts, build processes, etc within it (one per sub-directory).

(Though perhaps the only advantage of this for adopters to be able to see all the connectors in one place without needing to guess which repository to go to -- i.e. the build process may still be as complex as having separate repositories? 🤷 )

planetf1 commented 2 years ago

This discussion supports the view that this is a difficult balance to strike....

We already have some emerging repos (for samples, and dev-projects) which are an amalgam of mini projects, and the hadoop repo may go in that direction with a new atlas integration connector. We've previously discussed the same with connectors, where we do not have a root parent pom to keep build simple. It's an option - requires care when loading into IDE (ie select subdirectory) - and @wbittles did mention this in the first post. Keeping build seperate is fine - different scripts, but it does also add clutter.

I think where we have a lot of uniformity - in process, release cycle, community we should bundle Where we have specific needs due to complexity, dependencies, usage, release cycle then we don't -- typically the larger pieces of work We should also look at how to improve uniformity -- to the point it makes sense across repo - so for example resuable scripts/build fragments, version definitions. This may be where a git subproject actually could help, but it's just a light use (it's actually how the LF team originally setup our build to be more consistent across teams, but our needs diverged too much & we abandoned it)

As such maybe we need to continue making the call on a repo by repo basis depending on the needs of that project, and try to develop guidelines / rules for repo owners, as well as consider more automation for cross-repo tasks on 'signed up' (for that operation) repos - such as branch/release.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.