Open maroshmka opened 5 years ago
Hey @maroshmka, thanks for reaching out to us about this!
So... if I understand it correctly, you want to build a visualization of how each of our databases is defined (schemas) and what the status of each one is (quality, etc.)? But that would be done manually as of right now.
What would be the interactions you are mentioning? What's the main scope of this? Just to have a listing of all our databases and their data quality? What about monitoring the flow of the data somehow? Does it make sense?
There are some features that we could definitely share. We actually thought about some checks that would ensure that the different representations of the same object are consistent across all our services. For now we wanted to focus on the API level, but it might be interesting to do it at the DB level too.
I see that there are some overlapping features with what we are developing and it could be interesting to join efforts on this, but I'd like to see a more formal specification, with the specific requirements that you would need to develop.
Ok, I'm gonna try to answer your questions and then sum it up somehow.
if I understand it correctly, you want to build a visualization of how each of our databases is defined (schemas) and what the status of each one is (quality, etc.)?
More or less, yes. We wanna add convenient search, unify the model across db types (BigQuery, Postgres, Redis...), show data quality, and show the owner (we must discuss what ownership means). We also wanna add data lineage, which should be one of the main points.
What would be the interactions you are mentioning?
I meant that the system doesn't need to be a strictly read-only web. It can allow you, for example, to set up Slack notifications on data-quality drops, create a data-quality check (if you have perms, of course), edit descriptions for business/dev clarification (again, if you're supposed to), etc.
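To make the notification idea a bit more concrete, here is a minimal sketch of how a data-quality drop could be turned into a Slack message. Everything in it is an assumption for illustration: the function name `quality_drop_alert`, the 0-1 quality score and the 5-point threshold are hypothetical, not taken from Databook or The Zoo. The only outside fact it relies on is that Slack incoming webhooks accept a JSON body with a `text` field; actually sending the payload is left out.

```python
from typing import Optional


def quality_drop_alert(
    table: str,
    previous: float,
    current: float,
    threshold: float = 0.05,  # hypothetical: alert on a drop of 5 points or more
) -> Optional[dict]:
    """Return a Slack-style payload if quality dropped by at least `threshold`.

    Scores are assumed to be fractions between 0 and 1.
    Returns None when the drop is too small to report.
    """
    drop = previous - current
    if drop < threshold:
        return None
    return {
        "text": (
            f"Data quality for {table} dropped "
            f"from {previous:.0%} to {current:.0%}"
        )
    }


# Example with a made-up table name and made-up scores.
alert = quality_drop_alert("booking.orders", previous=0.98, current=0.90)
no_alert = quality_drop_alert("booking.orders", previous=0.98, current=0.97)
```

Here `alert` holds a payload that could be posted to a webhook, while `no_alert` is `None` because the drop stays under the threshold.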
What's the main scope of this?
I'm not sure what you mean. The main use-case? Or how big it is? As for the scope, it seems it will be a bigger project. As for the use-cases, some of them could be:
What about monitoring the flow of the data somehow? Does it make sense?
Yes, edges in the graph would be created by ETL. Right now we have Airflow, but in the future it could be some other tool that connects the dots. That means we will see the overall dataflow.
More formal specification - I can't give you one now. This discussion should be exactly about that - does it make sense to start developing this as one project? If yes, let's gather a more formal specification. For now it's about the vision. Vasek Dorazil is currently the Product Manager on this one, so maybe he has a more formal specification than me that he can share.
Examples of such projects:
Hmm, I see. I definitely think that having this together in The Zoo would open up many possibilities. Even if we don't merge it, I'd at least like to integrate it with that service somehow. In the end it's about the resources that our services are consuming.
https://github.com/lyft/amundsen This one actually looks pretty nice. Can I ask what would prevent us from just using this one or building on top of it?
I meant that the system doesn't need to be a strictly read-only web. It can allow you, for example, to set up Slack notifications on data-quality drops, create a data-quality check (if you have perms, of course), edit descriptions for business/dev clarification (again, if you're supposed to), etc.
Actually, AppSec is working on sending the results of the checks to Slack. I'm not sure how your data-quality checks would be defined, but I think they could be built on top of our code checks, although it seems more like an SLO type of metric than something more complex.
Overall I must say that I love the idea, I just want to make sure that building this on top of The Zoo will be useful for you and it won't compromise anything for us. So far I don't see that happening, as this will most probably be built as a new package inside of our Django app, but we'll definitely need to modify some of our core features to allow you to extend it easily.
Have you taken a look at our code? If not, please do 🙂
Btw, let's have a quick call next week about this, maybe with Vasek too, and agree on a proposal?
Hello guys,
we have a project called Databook. Conceptually it is the same thing as the Zoo, except it tries to manage metadata about the internal data world - databases, ETLs and reports.
The architecture should be pretty similar. Imagine we're building a graph/map of data in kiwi.com - the Nodes are filled in by crawlers (I believe you call them scanners) and then someone creates the Edges (ETL in our case). So, we have crawlers for Postgres, BigQuery etc. that fill in the metadata about tables/schemas/settings, then we take it and visualise it / allow some interaction on the web.
I was wondering whether we should continue developing a separate system for it or whether we could merge it with the Zoo. That way, we would have a system that should be able to map and interconnect the overall infrastructure inside the company. You would just put in credentials (gitlab, postgres, google..) for the crawlers and you would have a data lineage visualisation from "source system" (e.g. booking) -> "revenue report". Plus a lot of other features as well, e.g. data-quality reports, REST API best practices etc.
We may share some internal code for the logic, maybe only the FrontEnd part, maybe the deployment part, or yeah, maybe nothing.
What is your opinion on this cooperation? Would this be viable?
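To illustrate the graph idea, here is a rough sketch of nodes filled in by crawlers, edges added by ETL, and a lineage lookup from a source system to a report. This is not Databook or Zoo code: the `Node` and `DataGraph` classes and the table names are hypothetical, made up only to show the shape of the model.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A data asset a crawler would discover (hypothetical model)."""
    name: str
    kind: str  # e.g. "postgres", "bigquery", "report"


class DataGraph:
    """Nodes come from crawlers, directed edges come from ETL jobs."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)  # source name -> set of target names

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, source, target):
        # An ETL job reads from `source` and writes to `target`.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of an edge must be known nodes")
        self.edges[source].add(target)

    def downstream(self, start):
        """Return names of all nodes reachable from `start`, breadth-first."""
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in sorted(self.edges.get(current, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


# Made-up example: booking database -> warehouse table -> revenue report.
graph = DataGraph()
graph.add_node(Node("booking.orders", "postgres"))
graph.add_node(Node("warehouse.orders", "bigquery"))
graph.add_node(Node("revenue_report", "report"))
graph.add_edge("booking.orders", "warehouse.orders")
graph.add_edge("warehouse.orders", "revenue_report")
lineage = graph.downstream("booking.orders")
```

With this setup, `lineage` lists everything that depends on the booking table, which is the "source system" -> "revenue report" view described above.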