nicholasyager / dbt-loom

A dbt-core plugin to weave together multi-project dbt-core deployments
The Unlicense
104 stars 19 forks source link

Protected seeds in upstream project breaking dbt runs in downstream project #52

Closed theodotdot closed 4 months ago

theodotdot commented 4 months ago

Describe the bug Running any dbt command that requires compiling from Project B using the manifest from Project A throws the following error:

08:25:16  Running with dbt=1.7.14
08:25:17  dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
08:25:17  dbt-loom: Loading manifest for `project_a` from `file`
08:25:17  Registered adapter: bigquery=1.7.7
08:25:27  dbt-loom: Injecting nodes
08:25:28  Encountered an error:
Compilation Error
  'model.project_a.stg_my_seed_file' depends on 'seed.project_a.seed_my_seed_file' which is not in the graph!

This error refers to a seed and model that are not upstream of the model I was trying to run. In Project A, there are no errors when running or compiling.

I tried running with --no-partial-parse as well as running dbt clean before the commands to no avail. Deleting the seed and related model only gave the same error with another seed and model pair.

I have looked at the manifest.json from Project A and I can see the seed node, however, when looking at the manifest generated from the run in Project B, I can only see the seed node as a dependency from the staging model but doesn't seem to be injected as a node itself.

To Reproduce Steps to reproduce the behavior:

  1. Project A has protected seeds that are upstream of some other protected models
  2. Project A has a public model that is downstream of one or more seeds.
  3. Project A compiles with no errors
  4. Project B references a public model from the upstream model
  5. Run any dbt command that requires compiling (e.g. dbt run)
  6. Get the error above

Expected behavior Project B should also compile without any errors.

Additional context Thanks to smilingthax (don't want to tag to spam emails) for pointing out that this error with seeds that I mentioned in another issue was talked already in another issue (that I missed when researching).

I haven't had time to rebuild this with a toy project at the moment, if that is needed I will take some time later in the week as I have been pushing at work to give dbt-loom work so I am invested hah. Thanks for all the efforts and for creating and maintaining this amazing project!!

nicholasyager commented 4 months ago

I knew I forgot something from yesterday's issue thread 😅 Thanks for creating this issue, @theodotdot. As you mentioned, I did resolve a previous defect w/ cross-project seed references in 0.5.1, so there is a precedent here for edge cases with seeds.

I'm keen to see your toy example if you're able to put one together. Otherwise, I may benefit from seeing the relevant portions of your projects' manifest files.

theodotdot commented 4 months ago

No worries, I will try to take some time to tinker with that and see if I can replicate in an env that you can also use on your machine and I will update here! I will update you either way later this week

theodotdot commented 4 months ago

Good news! (or bad depending on how you see it)

@nicholasyager I was able to replicate it using the test projects in the dbt-loom repo.

To Reproduce Steps to reproduce the behavior: (running dbt-core 1.7.14)

12:17:16  Running with dbt=1.7.14
12:17:16  dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
12:17:16  dbt-loom: Loading manifest for `revenue` from `file`
12:17:16  Registered adapter: duckdb=1.7.4
12:17:16  dbt-loom: Injecting nodes
12:17:16  [WARNING]: Model orders.v1 has passed its deprecation date of 2024-01-01T00:00:00+01:00. This model should be disabled or removed.
12:17:16  Encountered an error:
Compilation Error
  'model.revenue.stg_accounts' depends on 'seed.revenue.accounts' which is not in the graph!

I tested additional scenarios:

So it seems the issue comes from having the staging model reference the seed, regardless of the access defined. Now this is where my knowledge ends, I tried to have a look at how the node injection works but I have to say I don't really understand and my python is super rusty.

Let me know if you'd like me to test other things, I hope that helps!

nicholasyager commented 4 months ago

@theodotdot Thanks for the information! I've been able to replicate the defect, so I should be able to track down the issue shortly. My current hunch is that it is another instance of unique_ids (the IDs for nodes internal to dbt-core) not matching correctly when dbt-core maps between Nodes native to a project and ModelArgNodes injected into the project.

nicholasyager commented 4 months ago

I figured it out! dbt-core's implementation of ModelArgNodes hard-codes model. as the prefix for all nodes injected into a project, which creates a mismatch in unique_ids.

flowchart LR

    subgraph project_a
        seed.project_a.accounts --> model.project_a.stg_accounts
    end

    model.project_a.stg_accounts2

    subgraph project_b

      subgraph injected_project_a
          seed.project_a.accounts2[model.project_a.accounts]
          model.project_a.stg_accounts2[model.project_a.stg_accounts]
      end

      model.project_a.stg_accounts2 --> accounts
    end

    seed.project_a.accounts -.-> seed.project_a.accounts2
    model.project_a.stg_accounts -.-> model.project_a.stg_accounts2

This isn't a problem when a model in a downstream project references a seed directly in an upstream project, since the ref macro on-to-fly resolves the seed to the injected model. unique_id.

flowchart LR

    subgraph project_a
        seed.project_a.accounts 
    end

    model.project_a.stg_accounts2

    subgraph project_b

      subgraph injected_project_a
          seed.project_a.accounts2[model.project_a.accounts]

      end
      seed.project_a.accounts2 --> model.project_a.stg_accounts2[model.project_a.stg_accounts]
      model.project_a.stg_accounts2 --> accounts
    end

    seed.project_a.accounts -.-> seed.project_a.accounts2

I see four options forward.

  1. When injecting models, remove parents that are seeds from the depends_on list. This will NOT break the DAG, since the downstream project does not need to know about the upstream project's implementation, but viewing the docs site would fail to render the dependency.
  2. When injecting models, rewrite the unique_id for each node's dependencies, such that all upstream dependencies are "models". This will NOT break the DAG, and the downstream docs site will be preserved, but now we're introducing false information into the DAG to work around dbt-core's implementation.
  3. Submit an issue/PR to dbt-core to include an optional resource_type field in the ModelArgsNode class, allowing for true replication of complete upstream DAGs into a downstream project.
  4. Monkeypatch dbt-core to include an optional resource_type field in the ModelArgsNode class, allowing for true replication of complete upstream DAGs into a downstream project.

At this point in time, I'm leaning towards option one -- strategically, downstream projects really shouldn't know or care about nodes upstream of the nodes being used. More so if the nodes are protected or private to a given project. This throws a wrench in for people using dbt docs for cross-project documentation, which isn't great. On the other hand, it keeps dbt-loom compatible with dbt Cloud's cross-project lineage functionality, which is ideal.

My fallback approach would be option two -- We're already benefiting from some unique_id rewriting (as seen in the example), so perhaps this is an opportunity to have a better experience than that offered by dbt Cloud 🤷🏻

@theodotdot do you have any preferences for the path forward?

theodotdot commented 4 months ago

Wow that is great, thanks for finding the actual issue here!

I agree with your point regarding option 1: downstream projects shouldn't need to know and while it would be great to have the docs, I can say that it wouldn't be a deal breaker.

From a usage/strategy point of view, I think that keeping dbt-loom and the dbt cloud offering compatible is good for the plugin as it enables users to switch easily but I also feel like there is usually a good reason to not use dbt cloud so I can mostly see the switch being from cloud to core (and loom) vs the opposite.

I wouldn't mind dbt-loom being even better than the dbt cloud alternative considering I know that my team and I won't be moving to Cloud (probably) ever.

Option 3 would be nice as it's probably a net positive to dbt to have that addition distinction and would be the "cleaner" and more comprehensive way of going about solving the issue.

--> I honestly feel like option 1 is the simplest to implement and for my use cases slightly broken docs doesn't matter but if you're a perfectionist I can also understand the other options, you know better and this is your project, I am just happy I will get to use dbt-loom and multi-project :smile:

nicholasyager commented 4 months ago

Thanks for the insight! I'm going to go ahead with Option 1, removing non-model dependencies from each Node's depends_on list. This will resolve the immediate issue.

theodotdot commented 4 months ago

Thanks again for the fix on this, I will test this on my production project once 0.5.3 is out!

EDIT: Tested it and it works flawlessly now, thanks!!!