rsyi / whale

🐳 The stupidly simple CLI workspace for your data warehouse.
https://rsyi.gitbook.io/whale
GNU General Public License v3.0

Plans for Apache Atlas support/integration #51

Open nevi-me opened 4 years ago

nevi-me commented 4 years ago

Hi,

This is related to #3. Are there plans to support Apache Atlas (https://atlas.apache.org)? It's a metadata store that'll include other things like business catalogs and glossaries. There's some integration with Amundsen, where the latter can store data on Atlas instead of Neo4j. In that case, supporting the Amundsen API might be one way to support Atlas.

rsyi commented 4 years ago

Aha! I knew there were more of you. :) I'm super interested in building this out, but I still need to scope it out - largely, I haven't looked at the amundsen metadata library or the apache atlas API enough to be able to tell. I can take a look today and let you know ASAP how feasible it would be.

nevi-me commented 4 years ago

No worries, no need to do it ASAP. Atlas' API is quite involved (at least from my experience), but there's https://github.com/jpoullet2000/atlasclient/tree/master/atlasclient which many people seem to be using.

I'm tempted to write an Atlas client in Rust, but for now I'm forced to work in Java and Python; plus I can't justify bringing in JNI or FFI for just a REST client :(

rsyi commented 4 years ago

Ah! Yeah, Java and Python are still much more widely used these days. A Rust Atlas API would be amazing, though.

Even without the Rust Atlas API, though, it actually doesn't seem too difficult -- this Python client seems pretty reasonable. Let me give it a stab and I'll get back to you (it might be a little wait until I can get to this though).

FYI, my current thinking is to periodically scrape from Atlas with the registered whale cron job or the GitHub Actions script, rather than hitting the API in realtime. Does that feel acceptable to you? Updates wouldn't propagate in realtime, but if the API is performant enough, the scrapes could be quite frequent.

nevi-me commented 4 years ago

Hey, I think the most involved work with a Rust client could be entity mappings. Atlas has an inheritance model where certain entities share the same core properties, but differ a lot depending on which entity typedef has been created. I don't imagine the REST API part to be a lot of work.
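To make the inheritance point concrete, here is a minimal sketch of resolving the full attribute set for an entity type by walking its supertypes. The typedefs below are hand-written and heavily simplified: the `superTypes` and `attributeDefs` field names follow the Atlas v2 typedef JSON, but the attribute lists are illustrative rather than complete, and `resolve_attributes` is a hypothetical helper, not part of any existing client.

```python
# Simplified, hand-written typedefs (assumption: real Atlas typedefs
# carry many more fields and attributes than shown here).
TYPEDEFS = {
    "Referenceable": {
        "superTypes": [],
        "attributeDefs": [{"name": "qualifiedName"}],
    },
    "Asset": {
        "superTypes": ["Referenceable"],
        "attributeDefs": [{"name": "name"}, {"name": "description"}, {"name": "owner"}],
    },
    "DataSet": {"superTypes": ["Asset"], "attributeDefs": []},
    "hive_table": {
        "superTypes": ["DataSet"],
        "attributeDefs": [{"name": "tableType"}, {"name": "columns"}],
    },
}


def resolve_attributes(type_name, typedefs):
    """Return attribute names for type_name, inherited ones first."""
    seen = []

    def visit(name):
        typedef = typedefs[name]
        # Visit parents first so core properties come before specific ones.
        for parent in typedef["superTypes"]:
            visit(parent)
        for attr in typedef["attributeDefs"]:
            if attr["name"] not in seen:
                seen.append(attr["name"])

    visit(type_name)
    return seen


print(resolve_attributes("hive_table", TYPEDEFS))
# → ['qualifiedName', 'name', 'description', 'owner', 'tableType', 'columns']
```

A typed client (in Rust or otherwise) would need some equivalent of this walk to decide which fields every entity struct shares and which are specific to one typedef.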

That said, it seems like Whale only uses Rust for the CLI, so perhaps writing a Rust client might be a tangent, as you could use the Python client. If it's something that you'd be interested in, I could help out with the Rust client. I might end up writing one either way in the next 2 weeks if the work that I'm doing on harvesting Spark lineage ends up requiring this path. I opened https://issues.apache.org/jira/browse/ATLAS-4004 because I can't use the Atlas Java client with Spark; so either way, I might need to write a Java client (or fork Atlas for their client).

rsyi commented 4 years ago

Hm. It is a bit of a tangent, but it is absolutely worth considering. I'll think about it more over the weekend. And definitely let me know as soon as you get to a point where you start building the Rust Atlas client.

I think the big question for me is what the best architectural choice is. The options in my head right now are just:

  1. Directly ping Atlas for search and data. This gives the freshest data, but the CLI search will be massively slower, which I do not like.
  2. Query the API periodically with the Python Atlas client to get a list of all tables, but then directly ping Atlas when rendering the preview. The latency when viewing the table info will feel a little bad, but this is offset by the fact that you basically always have fresh data.
  3. Use the Python client to extract the metadata periodically. This has the disadvantage of being a bit resource-hungry against Atlas, but if the load's not that bad and the freshness isn't a huge concern (it generally isn't), then this is probably a reasonable option.
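As a rough illustration of option 3, the sketch below extracts every table in one pass, caches it on disk, and searches only the cache. Everything here is an assumption for illustration: `extract_to_disk`, `search_local`, the `qualified_name` key, and the one-JSON-file-per-table layout are hypothetical, and `fake_fetch_tables` stands in for a real Atlas query rather than calling any actual client.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def extract_to_disk(fetch_tables: Callable[[], Iterable[Dict]], out_dir: Path) -> int:
    """Option 3: pull all table metadata in one pass and cache it locally."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for table in fetch_tables():
        # One JSON file per table, named after its qualified name.
        path = out_dir / (table["qualified_name"].replace("/", "_") + ".json")
        path.write_text(json.dumps(table))
        count += 1
    return count


def search_local(out_dir: Path, term: str) -> List[str]:
    """Search only the cached files, so the CLI never waits on the API."""
    hits = []
    for path in sorted(out_dir.glob("*.json")):
        table = json.loads(path.read_text())
        if term.lower() in table["qualified_name"].lower():
            hits.append(table["qualified_name"])
    return hits


def fake_fetch_tables():
    # Stand-in for a real Atlas query; these records are made up.
    return [
        {"qualified_name": "sales.orders", "description": "One row per order"},
        {"qualified_name": "sales.customers", "description": "Customer dimension"},
        {"qualified_name": "ops.deploys", "description": "Deploy log"},
    ]


cache_dir = Path(tempfile.mkdtemp()) / "atlas_cache"
extract_to_disk(fake_fetch_tables, cache_dir)
print(search_local(cache_dir, "sales"))
# → ['sales.customers', 'sales.orders']
```

Under option 2, the search step would stay the same and only the preview step would call the API for the selected table, trading some latency there for fresher details.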

Feels like 3 is the easiest, but if you end up creating a Rust Atlas client, 2 could be a more elegant workaround. I'll take a look to see how flexible/fully-featured/performant the atlasclient library is first, though. If it's pretty solid, there might be no need for you to build a whole new interface.

Also that spark lineage bit sounds SUPER interesting. Would love to know more about it :)

prakharcode commented 4 years ago

I have seen Atlas at work and can say that the API is performant enough if there are enough text-based optimizations around it (NLP et al.). I believe the 3rd option should be easy to go with and should serve most purposes, considering Atlas is also working to improve their search over time.

A Rust client would be a good first step.

rsyi commented 4 years ago

Thanks, @prakharcode! Yeah let's go with this for now. I'll post here if I can get to this at some point, but in the meantime, either of you should feel free to post and take this if you're feeling ambitious. :)