gouline / dbt-metabase

dbt + Metabase integration
https://pypi.org/project/dbt-metabase/
MIT License

Parsing of manifest.json as well as better yaml parsing support / alias support #19

Closed · z3z1ma closed this 3 years ago

z3z1ma commented 3 years ago

Picking up the ball where fernandobrito left off. I have a major enhancement here (one of two in the works): lots of cleaned-up code, bug fixes from the other fork, personally tested on my own production replicas and internally at my company, etc. There may be some more changes/commits I add (to the README, mostly), but I wanted to post it now to get your eyes on it and raise awareness. The other PR will involve auto-documenting all exposures revealed in Metabase through questions/dashboards and constructing a YML that allows us to fully sync a DAG with Metabase, including click-throughs from dbt docs exposures directly to dashboards.

fernandobrito commented 3 years ago

Hi @z3z1ma, great that you picked up the work of merging my branch, thanks! I was away for a month on vacation, and now that I'm back there are so many other priorities at my company that I'm not sure when I would be able to come back to my contribution here. Let me know if you or @gouline have any questions about any of the code I wrote.

fernandobrito commented 3 years ago

On a side note, I'm very curious about your next PR, where you mention "auto-documenting all exposures revealed in Metabase". Over the next two weeks, I will open-source an internal Python project from my company that does exactly the same, but for Tableau dashboards. I use the Tableau Metadata API to extract custom SQL and dependencies from reports to tables in the database, then I read all dbt models and sources from the manifest.json and try to match Tableau dashboards to dbt models/sources. After that, I write those dependencies back into the manifest.json as exposures (the matching step is sketched at the end of this comment).

I'm naming the Python package "exposures-crawler" because I want, at some point in the future, to include Metabase and other clients as well. I will write here once the package is public, but I already wrote a tiny bit about it on the dbt Slack community: https://getdbt.slack.com/archives/C0VLNUUTZ/p1617012523148700. If the author of dbt-metabase considers your exposures automation out of scope, we could discuss teaming up and merging our two projects.

> constructing a YML that allows us to fully sync a DAG with Metabase, including click-throughs from dbt docs exposures directly to dashboards.

That last part I didn't really understand.
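
As a minimal sketch of the matching step, assuming a standard dbt manifest.json layout (the function names and the dashboard input are hypothetical, not code from exposures-crawler):

```python
import json

def build_relation_index(manifest_path: str) -> dict:
    """Map "schema.name" to its dbt unique_id for every model and source."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    index = {}
    for unique_id, node in {**manifest["nodes"], **manifest["sources"]}.items():
        if node.get("resource_type") in ("model", "source"):
            index[f"{node['schema']}.{node['name']}".lower()] = unique_id
    return index

def match_dashboard(tables: list[str], index: dict) -> list[str]:
    """Resolve the tables a dashboard touches to dbt unique_ids."""
    return [index[t.lower()] for t in tables if t.lower() in index]

# A matched dashboard can then be written back as an exposure whose
# depends_on lists the resolved unique_ids.
```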

z3z1ma commented 3 years ago

> constructing a YML that allows us to fully sync a DAG with Metabase, including click-throughs from dbt docs exposures directly to dashboards.

That last part I didn't really understand.

Hey @fernandobrito

To clarify, I mean that the dbt docs can also serve as an index which links users directly to the exposures that have been parsed: essentially the reverse of what you did by appending the dbt docs URL to the Metabase description. There is actually a button (see the screencap below) whose link is derived from the exposures YAML.

[screenshot: the exposure link button in dbt docs]

> On a side note, I'm very curious about your next PR, where you mention "auto-documenting all exposures revealed in Metabase". Over the next two weeks, I will open-source an internal Python project from my company that does exactly the same, but for Tableau dashboards. I use the Tableau Metadata API to extract custom SQL and dependencies from reports to tables in the database, then I read all dbt models and sources from the manifest.json and try to match Tableau dashboards to dbt models/sources. After that, I write those dependencies back into the manifest.json as exposures.

Are you using sqlparse? I've used just regex myself to extract (table-level) exposures from native SQL questions in Metabase, with pretty good success actually. I keep native SQL use in Metabase low, opting instead for better models that allow users to answer questions using just the GUI and tools. In a previous go-around I allowed myself and my users a lot of native SQL in Metabase, but it's definitely not ideal, and dbt has been a game-changer there in how quickly I was able to simplify how we extract value from the underlying data.
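
For reference, a naive sketch of that regex approach (my own illustration, not this PR's code); it deliberately ignores CTE shadowing, quoted identifiers, and dialect quirks that a real parser like sqlparse would handle better:

```python
import re

# Candidate table references: the identifier (optionally schema- or
# database-qualified) that follows a FROM or JOIN keyword.
TABLE_RE = re.compile(
    r"\b(?:from|join)\s+([a-zA-Z_][\w$]*(?:\.[a-zA-Z_][\w$]*){0,2})",
    re.IGNORECASE,
)

def extract_tables(sql: str) -> set[str]:
    return {match.lower() for match in TABLE_RE.findall(sql)}

print(extract_tables(
    "select * from analytics.orders o join analytics.customers c on o.id = c.order_id"
))
# tables found: analytics.orders, analytics.customers
```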

> I'm naming the Python package "exposures-crawler" because I want, at some point in the future, to include Metabase and other clients as well. I will write here once the package is public, but I already wrote a tiny bit about it on the dbt Slack community: https://getdbt.slack.com/archives/C0VLNUUTZ/p1617012523148700. If the author of dbt-metabase considers your exposures automation out of scope, we could discuss teaming up and merging our two projects.

I'm curious about yours too. My implementation involves constructing a YAML (call it metabase-exposures.yml in the root model dir, for example) rather than manipulating manifest.json (though I will certainly read from the manifest). I'm taking this approach because it doesn't involve continuously executing a script, or an additional CI/CD step, every single time dbt is invoked and manifest.json is reconstructed. Instead, a YAML is very readable and introspectable, and all subsequent dbt runs will add the exposures to the manifest implicitly, without a post-run dependency. It also lets us revision the YAML and commit it to source control, for what that's worth. I'd guarantee there's an intersection of our interests and implementations that would be beneficial, though, so keep me posted. The end result of both of our methods is (or should be) a manifest/doc site with perfectly documented exposures.
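
As an illustration of the shape such a file could take (the dashboard details and file path are hypothetical; the structure follows dbt's documented exposure schema, so subsequent dbt runs pick it up with no post-run step):

```python
import yaml  # PyYAML

exposures = {
    "version": 2,
    "exposures": [
        {
            "name": "orders_dashboard",  # hypothetical Metabase dashboard
            "type": "dashboard",
            "url": "https://metabase.example.com/dashboard/7",
            "maturity": "medium",
            "owner": {"name": "Jane Analyst", "email": "jane@example.com"},
            "depends_on": ["ref('orders')", "ref('customers')"],
        }
    ],
}

# Written once into the model directory; dbt compiles it into the
# manifest on every run without any extra CI/CD step.
with open("models/metabase-exposures.yml", "w") as f:
    yaml.dump(exposures, f, sort_keys=False)
```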

gouline commented 3 years ago

@z3z1ma Just as an FYI, you can run all validation locally with `make lint type test` without having to push each commit.

z3z1ma commented 3 years ago

@gouline

Thanks for the heads up and sorry for blowing up your inbox then lol. Appreciate it.

z3z1ma commented 3 years ago

@gouline

Let me know the next steps.

@fernandobrito & @gouline

I have the next PR about ready, once and if master is updated. See the examples below. All analyses, dashboards, and even native SQL exposures in Metabase are automatically documented and compiled into a dbt YML. This includes the creator/owner of the analysis, click-throughs from the exposure documentation to the exposure, etc. I'm getting a lot of positive feedback from the dbt community on this one.

[screenshots: Metabase analyses, dashboards, and native SQL exposures auto-documented in dbt docs]

gouline commented 3 years ago

@z3z1ma The remaining step for this PR is tests. Then @fernandobrito and I can both review the whole thing, and I'll merge it when everyone's happy.

z3z1ma commented 3 years ago

Sounds good. So with the structure you have in place, it's just a matter of providing mock data, right? Since it passed everything else. Let me know if there are any other considerations, and thanks for your guidance. I appreciate you. If I'm on the right track, I'll knock it out tonight.

gouline commented 3 years ago

Mock data for each reader and assertions for all the edge cases. The objective isn't so much to test dbt schema and manifest files (they do that themselves) as to ensure that your code handles all the use cases properly, and that when future PRs come through and change that code, we're not breaking anything you intended.
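
Something like the following pattern, roughly (the helper and inline fixture here are toy stand-ins, not the project's actual reader API):

```python
import unittest

def extract_model_names(manifest: dict) -> list:
    """Toy reader: collect model names from a manifest's nodes section."""
    return sorted(
        node["name"]
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    )

class ManifestFixtureTestCase(unittest.TestCase):
    def setUp(self):
        # In the real suite this would be a trimmed fixture file checked
        # into the repo rather than an inline dict.
        self.manifest = {
            "nodes": {
                "model.jaffle.orders": {"resource_type": "model", "name": "orders"},
                "test.jaffle.not_null": {"resource_type": "test", "name": "not_null"},
            }
        }

    def test_only_models_are_returned(self):
        # Edge case: non-model nodes (tests, seeds, ...) must be skipped.
        self.assertEqual(extract_model_names(self.manifest), ["orders"])

if __name__ == "__main__":
    unittest.main()
```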

z3z1ma commented 3 years ago

@gouline

First pass at unit tests is complete. I think they look really good: uniform in expected output. The Metabase client test is going to be a little more involved (though I haven't changed it functionally). Wondering if you have any ideas there to make the test simpler. Otherwise, my only additional comment is that I have a PR I want to push down the pipeline after this one, which adds a function to the Metabase client (exposure extraction) and uses the existing functions of our two parsers to cross-reference models, generating YAML for the DAG. I wouldn't mind diving more deeply into the Metabase client tests in that PR. Either way is okay, just let me know what you think direction-wise. Thanks again!

gouline commented 3 years ago

Yeah, don't bother with Metabase client tests in this PR.

If this is ready for review, @fernandobrito can you please review the changes first, since it's based on your original code? When done, I'll do a review too.

fernandobrito commented 3 years ago

Ehehe, it ended up being quite a long PR. I didn't have time to actually clone this locally and run it against the real dbt and Metabase instances where I work, but I went through the code and added a few comments/questions.

Something that would be nice to have is some sort of plan to deprecate the YAML parsing feature in a couple of releases. That could help reduce the codebase.

Thanks again @z3z1ma for picking this up. Next week I am busy, but I would also be available to address comments or write more tests afterward if needed.

z3z1ma commented 3 years ago

@gouline

I think the ball is in your court now.

Note that I already addressed the empty strings referred to in #23. We only pass descriptions in two places: exporting columns and exporting tables. For tables, it's a simple logical check, since without a description there is nothing to propagate. And for columns, a simple ternary statement checks for the empty string. I think in the future @fernandobrito's approach would be slick:

> It's a descriptor (https://realpython.com/python-descriptors/) to be used in class attributes that automatically converts empty strings to None.

but for now we are ok.
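
For the record, a minimal sketch of that descriptor idea (illustrative only, not code from either fork):

```python
class NullIfEmpty:
    """Class attribute that converts empty strings to None on assignment."""

    def __set_name__(self, owner, name):
        self._name = "_" + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self._name, None)

    def __set__(self, obj, value):
        # "" is falsy, so empty descriptions are stored as None.
        setattr(obj, self._name, value or None)

class Column:
    description = NullIfEmpty()

col = Column()
col.description = ""
print(col.description)  # None: an empty dbt description never
                        # overwrites what is already set in Metabase
```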

I think the other issue raised in #24 should be examined, and integrated if it addresses the issue, after this PR, so we don't step on each other's toes. Once that's done, we can integrate the changes easily.

gouline commented 3 years ago

You renamed it in the wrong direction. Master has requirements-test.txt in the filename, setup, and CI config.

z3z1ma commented 3 years ago

> You renamed it in the wrong direction. Master has requirements-test.txt in the filename, setup, and CI config.

Thanks, correcting now.

gouline commented 3 years ago

Looks like you're still editing. Let me know when it's ready for review.

z3z1ma commented 3 years ago

It's good to go, @gouline.

I freshly ran this branch on my production instances of Metabase/dbt too, with both the manifest and YML parsers. Super easy. Good stuff. We are fully integrated into the CD flow: whenever we push any changes to the dbt repo, our Metabase is synchronized. On to posting my exposure parser PR once this is merged.

gouline commented 3 years ago

I see you pushed another commit after your response, but I'll assume that's it.

Thank you so much @z3z1ma and @fernandobrito for your efforts. I'm merging this and keeping it in master while we test it and merge other pull requests as they come along (there's one open now with a fix for an issue in master). Once that's all done, I will push a release to PyPI.

fernandobrito commented 3 years ago

Amazing work!

I will also try to find some time at work tomorrow to run what is now on master against our production dbt, since so far I have been running my fork in our CD pipeline.

fernandobrito commented 3 years ago

I was able to swap my production setup from using my fork to using master and had no issues, with the only exception that on master either schema or schema_excludes is required. I wrote about it here: https://github.com/gouline/dbt-metabase/pull/28/files#diff-399311141fa635ad8842613b2e8d964a08594d589911c6d1c0d5dc7d39bc44e9L72