Cache run exports are propagating to outputs

vyasr commented 1 month ago

I'm not sure if this is a bug or just me misreading the documentation. The documentation on the experimental cache feature states that

If the cache has an "ignore run exports" section, than we apply those filters at the cache level.

Since the cache does not itself produce any outputs, I assumed that the above statement meant that ignoring run exports in the cache would propagate to any outputs. However, that does not appear to be the case. Given the recipe below:

recipe:
  version: 1.0.0

source:
  path: .

cache:
  build:
    script:
      content: "echo foo"
  requirements:
    ignore_run_exports:
      from_package:
        - numpy
    host:
      - numpy

outputs:
  - package:
      name: example
    build:
      script:
        content: "echo bar"
    requirements:
      # Why is this necessary?
      #ignore_run_exports:
      #  from_package:
      #    - numpy
      host:
        - python

I observe the following:

$ rattler-build build --experimental --no-build-id
$ cd output/linux-64
$ cph x example-1.0.0-hb0f4dca_0.conda
$ cat example-1.0.0-hb0f4dca_0/info/index.json
{
  "arch": "x86_64",
  "build": "hb0f4dca_0",
  "build_number": 0,
  "depends": [
    "numpy >=1.19,<3",
    "python_abi 3.12.* *_cp312"
  ],
  "name": "example",
  "platform": "linux",
  "subdir": "linux-64",
  "timestamp": 1721520871303,
  "version": "1.0.0"
}%

If I include the commented out ignore run export in the package output section, then the numpy dependency is removed as I intended. Is that supposed to be necessary? If so, the documentation on the cache should be updated. However, I'm not sure what purpose ignoring run exports in the cache would serve if those run exports are propagated to the outputs.

vyasr commented 1 month ago

I should note that I'm testing with the latest released version 0.18.1, not the trunk. I checked the current diff with main and didn't see anything relevant, but wanted to mention it in case I was missing something.

wolfv commented 1 month ago

Hi @vyasr - thanks for filing this bug. The intended behavior is the one mentioned in the docs, so it does seem like you found a bug!

I would also be curious if you have thoughts in general on this? The previous multi-output / cache thing from conda-build didn't really specify any of this as far as I can tell.

I'll try to get to this bug this week!

vyasr commented 1 month ago

The documented behavior is definitely what I would expect. You are right that the conda-build implementation of multiple outputs didn't specify any of this cleanly, and also has a lot of sharp edges that we've hit in different scenarios, so I'm happy to see this being redesigned from the ground up in rattler-build and in the new spec (I see that at some point it was being discussed but was removed from CEP 14 for now).

In general for a multi-output recipe there's no way for rattler-build to know what cache dependencies to propagate to which outputs, so it needs extra information from somewhere. The ability to specify ignoring run exports at both the cache and output level seems like a pretty good way to handle this. Another option would be to introduce a new syntax for just the cache to indicate which exports should propagate to which outputs, but I don't really like that since it introduces new and more complex syntax with little benefit.

The more extreme option that I could see being viable is for run exports to not propagate out of the cache at all. This would require more work on the part of recipe maintainers to specify everything on a per-output basis. I expect that for the majority of recipes that have a small number of outputs this would be much more painful, but for the small set of recipes that have many outputs it could be better. It would typically result in much more minimal dependency trees though since it reduces the risk of accidental dependency propagation. That would be one reason to consider it.

wolfv commented 1 month ago

One idea I had was to trace if an output depends (exactly?) on another output that already had the run exports applied, and if that's the case not re-depend on the same run exports.

But it's just a small optimization that mainly decreases the size of the repodata and probably shouldn't really affect the solver speed etc.

Curious about your thoughts :)

vyasr commented 1 month ago

tl;dr that seems like a good idea and I could get on board with that.

On one hand, that seems like a bit of code smell to me. It has the same flavor of failing to "include what you use" in compiled languages, or in Python relying on a transitive package dependency to express a direct dependency. If package foo depends on both bar and baz, and bar depends on baz, then foo doesn't have to specify baz in the package metadata to have it be present in any environment that it installs. The risk is that bar stops needing baz, and all of a sudden foo breaks because an expected dependency is missing.

On the other hand, in this case the dependencies are much more tightly coupled. Since it is between outputs in a given recipe, if the dependency in one output was removed then the dependents of that output would immediately see that change and add them, so there's no risk of getting out of sync. Therefore I think it's probably fine to do this.

The case where it could improve the solver speed is if there are outputs that are not directly dependent on each other, but that's probably pretty rare.

wolfv commented 1 month ago

You could try again 0.19.0 which just landed on conda-forge.

vyasr commented 3 weeks ago

Was on holiday the past couple of weeks. Just verified that this works now, thanks!

wolfv commented 3 weeks ago

Awesome!

prefix-dev / rattler-build

Cache run exports are propagating to outputs #977