opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0

Research: Configure full importOssDirectory Pipeline in Data Warehouse #711

Closed ravenac95 closed 9 months ago

ravenac95 commented 10 months ago

Currently this is a research task due to some complications in actually attempting to implement this using the assumed-to-work pipeline of tools (airbyte, dagster, dbt); see here. There's a now-deprecated integration that made some of this work. Some additional planning is needed to get this to function properly. Once that's complete we can/should complete the documentation for the architecture and the developer docs as we hope to see them. However, as it is, the current limitations would complicate collaboration as the number of necessary tools increases.

This research should be time-boxed for another day or so.

ravenac95 commented 10 months ago

OK, I spent the majority of today researching alternatives. First, this issue wasn't clear about why we would want any of this behavior. The main thing we should be looking to design is an as-code data collaboration platform that is easy to collaborate on. However, what I found in attempting to go from using the UI to turning this into code is that not all of the tools have a way to properly configure everything as code. Namely, this problem exists between dagster and airbyte. Airbyte is seemingly going through a massive refactoring of its source/destination/connector management that will either break features that we could use to fill the gaps left by what the current tools have deprecated (it's not yet clear whether it will) or address all of the associated problems. However, things are currently half-baked at airbyte, so it's a bit of a risk. As it stands, it's not possible for us to write custom connectors for airbyte and also define those sources/destinations as code without writing our own integration. I wouldn't mind writing that integration if the API we would need to depend on weren't being "removed" in early 2024.

That being said, unlike the under-engineering in the previous iteration of OSO, I'd like to make sure we've looked at as much as possible of what is actually available in this space. I've considered the following options

ravenac95 commented 9 months ago

Completed and merged