nicholasyager / dbt-loom

A dbt-core plugin to weave together multi-project dbt-core deployments
The Unlicense

Support pulling in multiple manifests from single bucket #31

Open akromish opened 6 months ago

akromish commented 6 months ago

Currently, dbt-loom supports pulling in a manifest from cloud storage using bucket name + object name.

However, for organizations with n dbt-core projects that need to peer with each other, adding an entry for every upstream project to each repo becomes hard to maintain. I propose that in the S3 and GCP clients, we add a method that allows for specifying just the bucket name. From there, dbt-loom would iterate through all the manifests in the bucket and add them to the project.

I could take a first stab at implementing the S3 version.

Edit: I would actually prefer trying this in Artifactory first, if this is something we want to do. I can implement single- and multi-manifest JSON pulls from Artifactory.

nicholasyager commented 6 months ago

Hi @akromish! Thanks for making this issue.

Admittedly, I've not put too much thought into how dbt-loom ought to operate for large mesh topologies, particularly large meshes with a high degree of connectivity. Based on your comment around n projects needing to be added to multiple downstream configs, it makes me think of something like this (taken to an extreme, of course!)

flowchart
  a --> x
  b --> x
  c --> x

  a --> y
  b --> y
  c --> y

  a --> z
  b --> z
  c --> z

In this sort of paradigm, it would definitely make sense to move away from one-off ManifestReference declarations towards an approach that expects the reference in the ManifestReference to return one or more manifest files. For a path type, this could include glob support. For S3 and GCP, this could be a bucket and object key, or a bucket, prefix, and suffix.

In any case, I'd love to better understand what your project topology looks like, and whether this thinking is aligned with your needs.
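
To make the glob idea for path-type references concrete, here is a minimal sketch. The helper name `resolve_manifest_paths` is hypothetical and not part of dbt-loom; it only assumes that a path reference could be treated as a glob pattern that expands to zero or more manifest files.

```python
import glob
from pathlib import Path
from typing import List


def resolve_manifest_paths(pattern: str) -> List[Path]:
    """Expand a glob pattern into a sorted list of existing manifest files.

    Hypothetical helper for illustration only; dbt-loom does not expose it.
    """
    # recursive=True lets "**" match nested directories.
    matches = (Path(p) for p in glob.glob(pattern, recursive=True))
    # Keep regular files only, sorted so the load order is deterministic.
    return sorted(p for p in matches if p.is_file())
```

A pattern such as `manifests/*/manifest.json` would then yield one path per upstream project directory, and a pattern without wildcards still resolves to a single file, so existing single-manifest configs would keep working under this assumption.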

akromish commented 6 months ago

Hey, learned today that you can add diagrams to github comments lol!

So I see two cases where you might want to have multiple manifests pulled in:

1) as you diagrammed, where there are top level projects, and then projects that import those top level projects

This is the use case I'm interested in. For some context, what I want to achieve by doing this is to have one dbt repo on which I can use [`metricflow(mf)`](https://docs.getdbt.com/docs/build/metricflow-commands#metricflow) to query any metric in the data org

```mermaid
flowchart TB
  a --> x
  b --> x
  c --> x
  d --> x 

  mf --> |query| x
  linkStyle 4 stroke-width:2px,fill:none,stroke-dasharray: 5 5;
```

2) use case where every repo is a sister repo

This might be an unsupported use case, as I don't know how dbt would handle circular imports

```mermaid
flowchart TB
    a --> b
    b --> a

```

As you said, we would want ManifestReference to pull multiple files, or have a collection of ManifestReferences. I think bucket and prefix make sense, but do you think suffix will be needed as I think we can fetch only .json from the dbt-loom side. Same question for glob.

Thanks!

nicholasyager commented 6 months ago

@akromish Thanks for the diagrams! 😍

Use case one definitely makes sense, and is really quite clever for bringing multiple projects' semantic models into one project. I, too, am a little hesitant about use case two. I believe (will have to confirm) that dbt-core 1.7.x allows circular dependencies at a project level, but not a model level (1.6.x did not allow circular project deps), so this should be doable. Edit: I was able to confirm that 1.7.x, as of the time of writing, does not allow circular project-level dependencies.
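
Since circular project-level dependencies are not allowed, a multi-manifest loader could check for them before loading anything. The sketch below is only an illustration and is not how dbt-core or dbt-loom detects cycles; the function name `find_cycle` and the input shape (a mapping from each project to the projects it references) are assumptions.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of project names, or None if acyclic.

    Hypothetical helper: `deps` maps a project to the projects it references.
    """
    in_progress, done = set(), set()
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        in_progress.add(node)
        stack.append(node)
        for nxt in deps.get(node, []):
            if nxt in in_progress:
                # Back edge: the cycle runs from nxt's position to the top.
                return stack[stack.index(nxt):] + [nxt]
            if nxt not in done:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for project in deps:
        if project not in done:
            found = visit(project)
            if found:
                return found
    return None
```

For the sister-repo diagram, `find_cycle({"a": ["b"], "b": ["a"]})` reports the loop, while the fan-in topology of use case one returns `None`.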

You've swayed me that this is useful functionality!

> I think bucket and prefix make sense, but do you think suffix will be needed as I think we can fetch only .json from the dbt-loom side. Same question for glob.

This is totally fair! My mind went to a scenario where people might modify the name of their manifest files. It can be added later if we need it.
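
As a rough illustration of the bucket-plus-prefix idea with a default `.json` filter, the sketch below selects manifest keys from a listing of object keys. The function name `select_manifest_keys` and its parameters are hypothetical; in practice the keys would come from the storage client's listing call (for S3, something like boto3's `list_objects_v2` paginator), which is left out here so the example stays self-contained.

```python
from typing import Iterable, List


def select_manifest_keys(
    keys: Iterable[str], prefix: str = "", suffix: str = ".json"
) -> List[str]:
    """Return the object keys that look like manifests, sorted by name.

    Hypothetical helper: the suffix defaults to ".json" so callers only
    need to pass a prefix, but it can be overridden for renamed manifests.
    """
    return sorted(
        key for key in keys if key.startswith(prefix) and key.endswith(suffix)
    )
```

Keeping the suffix as an optional parameter with a default means it costs nothing now and covers the renamed-manifest scenario if it ever comes up.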

If you're still up for it, I'd love to see what you come up with. I'm not particularly familiar with artifactory, but I'd be open to a contribution that provides support.

geoHeil commented 6 months ago

I intend to use dbt-loom in the context of Dagster, dbt-core, and branch deployments: https://docs.dagster.io/dagster-cloud/managing-deployments/branch-deployments

Individual domains will have their own dbt projects, and each one would have a main/feature-xxx branch.

It would be neat if such branching could be supported natively. For now, the consuming project needs to know the exact branch/key prefix when pulling in data from a feature branch of a still unfinished source/reference model, e.g. during a testing phase.

Here, too, bringing everything into one bucket would be needed, plus the additional branching logic.
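
One way the branching logic might look, as a sketch only: prefer the manifest published for the current feature branch and fall back to the default branch when none exists. The key layout `<project>/<branch>/manifest.json` and the function name `pick_manifest_key` are assumptions made for illustration; nothing in dbt-loom prescribes them.

```python
from typing import Iterable, Optional


def pick_manifest_key(
    keys: Iterable[str],
    project: str,
    branch: str,
    default_branch: str = "main",
    filename: str = "manifest.json",
) -> Optional[str]:
    """Pick the branch-specific manifest key, falling back to the default.

    Assumes (hypothetically) keys shaped like "<project>/<branch>/<filename>".
    """
    available = set(keys)
    for candidate_branch in (branch, default_branch):
        candidate = f"{project}/{candidate_branch}/{filename}"
        if candidate in available:
            return candidate
    # Neither the feature branch nor the default branch has a manifest.
    return None
```

With this shape, a consuming project only needs to know the upstream project name and its own branch name, not the exact key prefix.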

nicholasyager commented 5 months ago

Hi @akromish 👋🏻 Just checking in to see if you've run into any snags on this. Let me know if you'd like another set of 👀

akromish commented 5 months ago

Hey, sorry got tied up with some other things, let me try to get a PR out next week.