rapidsai / dependency-file-generator

https://pypi.org/project/rapids-dependency-file-generator/
Apache License 2.0

feat: Generate meta.yaml dependencies #45

Closed. vyasr closed this 6 months ago.

vyasr commented 1 year ago

This PR is an alternative to #28 that is based on generating separate dependency files instead of modifying meta.yaml in place, and it is a much simpler changeset as a result. I considered putting the data into the conda_build_config.yaml file instead of using separate YAML files for each dependency set, but I rejected that option: dfg is built around writing out a separate file for every include list, and the only way to map that onto a single conda_build_config.yaml is a lot of in-place overwriting, which we have generally decided to avoid unless absolutely necessary, i.e. for pyproject.toml (otherwise we could just do the same for meta.yaml). Willing to be convinced otherwise, though.

In order to support split recipes, this approach requires writing separate files for every section in every output, which may be too verbose for our liking. Curious to hear opinions there.
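As a rough illustration of that concern (file names and package lists here are purely hypothetical, not what this PR emits today), a split recipe with librmm and rmm outputs might end up with one generated file per requirements section, e.g.:

# librmm_build_requirements.yaml (hypothetical)
dependencies:
    - cmake>=3.23
    - ninja

# rmm_run_requirements.yaml (hypothetical)
dependencies:
    - numba>=0.56
    - numpy>=1.21

with each output's requirements block in meta.yaml then being populated from the corresponding file.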

vyasr commented 1 year ago

A partial example of this can be seen in https://github.com/rapidsai/cudf/pull/13022

vyasr commented 1 year ago

For versions that need to be in conda_build_config.yaml, one thought I had: I'm fairly certain there's no specific reason you have to encode only a version in that file; you could also include the package names themselves. For instance, in this first example I believe the following would be equally valid:

python:
    - python 2.7
    - python 3.5

with

package:
    name: compiled-code
    version: 1.0

requirements:
    build:
        - {{ python }}
    run:
        - {{ python }}
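Assuming conda-build accepts package names in the variant values like that, each entry in the list still produces its own build variant, and {{ python }} just renders to the full spec; for the first variant the requirements would come out as

requirements:
    build:
        - python 2.7
    run:
        - python 2.7

so the only change is where the package name lives, not how the variants are generated.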

If that works then we ought to be able to generate the entire config file with dfg. Not sure if we want to go that route or if that is in scope for this PR, though.

vyasr commented 1 year ago

We should also keep in mind the alternative proposed by @jakirkham in https://github.com/rapidsai/rmm/pull/1220. I've already outlined my thoughts on the pros and cons there, so just restarting that conversation now.

ajschmidt8 commented 1 year ago

After reading about the compiler() function in the two sections below, I don't think we need to use it at all.

It seems like that function is mostly used for cross-compiling, which we don't do (we natively compile on amd64 and arm64).

From these conda-forge docs, it seems that compiler('cuda') is only responsible for adding nvcc to the build requirements and also adding cudatoolkit to the run requirements. But we generally ignore those run requirements anyway (1,2) since we manage that ourselves.

So I'm assuming we can just replace compiler('cuda') with nvcc.
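For reference, the substitution suggested above would look roughly like this in a recipe (only a sketch; version pinning omitted):

requirements:
    build:
        - {{ compiler('cuda') }}

would become

requirements:
    build:
        - nvcc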

@jakirkham, can you corroborate all this?

vyasr commented 1 year ago

> Overall this seems like a rather trivial change and I would have no major objections. But I'm wondering whether it is too trivial, considering that a conda_meta file can be fairly complex.

To be clear, we are not generating a full recipe anymore, just simple YAML files containing dependency lists. That's basically a subset of what we already need for conda environment.yaml files (same thing, but without the channels). That's why this is so simple right now.
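For instance (contents purely illustrative), where a generated conda environment file would contain

channels:
    - rapidsai
    - conda-forge
dependencies:
    - cmake>=3.23
    - ninja

the corresponding conda-meta file would contain only

dependencies:
    - cmake>=3.23
    - ninja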

> Can you add at least one valid example input and expected output (and thus test)?

I will when I get a chance. This work has temporarily been deprioritized since we don't want to make this kind of change until after we have CUDA 12 compatible recipes, but once that's done I'll be getting back to this and will add some then.

> How does this mesh with tools like Grayskull?

Grayskull's goals are a bit different. IIUC it's for generating an entire recipe from a PyPI package or a repo. However, the resulting recipe will be too simple for our purposes since it won't encode all of the extra information we put in our recipes around things like run_exports, nor will it support complex features like split outputs. This PR is an approach that lets us share dependencies with the rest of our infrastructure while still having the flexibility to fill out the non-dependency parts of the recipe, which strikes the best balance for us.

csadorf commented 1 year ago

> Overall this seems like a rather trivial change and I would have no major objections. But I'm wondering whether it is too trivial, considering that a conda_meta file can be fairly complex.

> To be clear, we are not generating a full recipe anymore, just simple YAML files containing dependency lists. That's basically a subset of what we already need for conda environment.yaml files (same thing, but without the channels). That's why this is so simple right now.

So the conda-meta semantics here means "this will be needed for a conda recipe" rather than "we are generating a conda-meta file"?

> How does this mesh with tools like Grayskull?

> Grayskull's goals are a bit different. IIUC it's for generating an entire recipe from a PyPI package or a repo. However, the resulting recipe will be too simple for our purposes since it won't encode all of the extra information we put in our recipes around things like run_exports, nor will it support complex features like split outputs. This PR is an approach that lets us share dependencies with the rest of our infrastructure while still having the flexibility to fill out the non-dependency parts of the recipe, which strikes the best balance for us.

Sure, but I would be surprised if some of our downstream processing tools wouldn't share at least some of the logic. Or on the flipside, maybe a Grayskull generated recipe could be further processed. Or we add missing features to Grayskull. Maybe none of this is appropriate, but I wanted to mention this to make sure it has at least been considered.

vyasr commented 1 year ago

> So the conda-meta semantics here means "this will be needed for a conda recipe" rather than "we are generating a conda-meta file"?

Correct. Happy to rename if you think there is a better way to convey this.

> Sure, but I would be surprised if some of our downstream processing tools wouldn't share at least some of the logic. Or on the flipside, maybe a Grayskull generated recipe could be further processed. Or we add missing features to Grayskull. Maybe none of this is appropriate, but I wanted to mention this to make sure it has at least been considered.

These are good thoughts. My opinion here is that conda recipes are fairly complex configuration files, and we don't want to be in the business of dealing with them directly. dfg is purely about dependencies.

In the case of pyproject.toml files, we can modify in place largely because a package exists (tomlkit) that supports format-preserving in-place modification. Since conda recipes are a superset of YAML (they also support Jinja, etc.), we have no way to do the same without rolling our own parser. Moreover, the complexity of split recipes means that dfg would need a lot of machinery to handle writing to subsections that live inside split outputs versus at the top level of a recipe. As a result, we chose to generate separate files for each dependency set.

Even if we were able to modify in place, however, we still wouldn't want to touch any section aside from the dependency lists. So I would expect the only overlap with Grayskull to be its ability to generate the dependency sections of meta.yaml from either setup.py or pyproject.toml (I haven't looked at how it does this), and I'm not sure how much work it's worth to try to find that common ground in order to share code/logic.

vyasr commented 6 months ago

I'm going to close this for now. A decent bit has changed since this initial implementation. Also, depending on whether we move forward with https://github.com/rapidsai/build-planning/issues/47, there may be much better ways to accomplish the goals of this PR (i.e. writing out meta.yaml from a template directly). We should still address this issue, but the code in this PR is probably no longer the best way to do so.