nextstrain / conda-base

Conda package build for nextstrain-base
https://anaconda.org/Nextstrain/nextstrain-base

rethinkdb required for seasonal flu builds to work #3

Closed huddlej closed 1 year ago

huddlej commented 1 year ago

Current behavior

Using the Nextstrain CLI with the managed conda environment, I tried to download data from fauna (the first step of the seasonal flu builds) like so:

nextstrain build --conda . --forceall --configfile profiles/nextflu-private.yaml -p data/h1n1pdm/who_cell_hi_titers.tsv

This command failed with a module not found error because fauna needs the rethinkdb module and it is not installed in the managed conda environment. In contrast, the following command works as expected:

nextstrain build --docker . --forceall --configfile profiles/nextflu-private.yaml -p data/h1n1pdm/who_cell_hi_titers.tsv
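One quick way to confirm the missing dependency from inside either runtime is to ask Python whether it can locate the module. This is a minimal sketch, assuming the bindings are importable under the name `rethinkdb`; the helper name `module_available` is hypothetical and is not part of fauna or the Nextstrain CLI.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if Python can locate a top-level module called `name`."""
    # find_spec returns None when no importable top-level module matches.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "rethinkdb" is assumed to be the import name that fauna needs.
    status = "found" if module_available("rethinkdb") else "missing"
    print(f"rethinkdb: {status}")
```

Running this inside each runtime should show which one lacks the module.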

Expected behavior

The managed conda environment should behave like the Docker environment.

How to reproduce

Steps to reproduce the current behavior:

  1. Install the CLI from the trs/conda/nextstrain-base-package branch of the GitHub repo
  2. Set up the managed conda environment (NEXTSTRAIN_CONDA_CHANNEL=nextstrain/label/pull-1 nextstrain setup conda, NEXTSTRAIN_CONDA_CHANNEL=nextstrain/label/branch-initial nextstrain update conda, and nextstrain setup --set-default conda)
  3. Clone the seasonal flu repo and git checkout refactor-workflow
  4. From the seasonal flu repo, run nextstrain build --conda . --forceall --configfile profiles/nextflu-private.yaml -p data/h1n1pdm/who_cell_hi_titers.tsv
  5. See error.

Possible solution

Since I have fauna installed in the parent directory of my workflow (where the workflow expects to find it), installing rethinkdb should be enough to fix this issue. We should not need to install fauna.

huddlej commented 1 year ago

I looked into what we'd actually need to include rethinkdb in this environment. Some important factors are:

To include the specific rethinkdb version we need for fauna in this base environment, it looks like we need to create our own conda package for this version. We could host it in the Nextstrain channel. Bioconda is not an appropriate place for it and I don't think we want to support a conda-forge package for rethinkdb (or give the impression we are responsible for rethinkdb).

@tsibley Does this summary make sense? I'd love to learn how to use our Nextstrain channel through Anaconda, so I could try setting up the rethinkdb package.

tsibley commented 1 year ago

Thanks for the great description here, @huddlej! I was wondering how fast

https://github.com/nextstrain/conda-base/blob/c0c6f67cc89e946abbe391721a8fe43a6f7967ff/src/recipe.yaml#L39

was going to come back to haunt me. Pretty fast it turns out! :upside_down_face:

Agreed that the thing to do here given the constraints you outlined is to produce our own Conda package for the Python RethinkDB bindings and host it in our channel. I think this should be relatively straightforward (if involving some minor tedium), and I'd be happy to help guide you through it.

Some things to consider:

huddlej commented 1 year ago

Cool, let's do it. Maybe end of next week? This feels like a nice Friday afternoon kind of task...

I like nextstrain-fauna-rethinkdb for its specificity. I don't think we need to package fauna; like you said, most workflows that still use it expect to find fauna as a sibling directory of the workflow directory.

tsibley commented 1 year ago

It's a plan!

> …most workflows that still use it expect to find fauna as a sibling directory of the workflow directory.

Yeah, the Docker runtime intentionally locates Fauna's source at /nextstrain/fauna so it's a sibling of /nextstrain/build (the working dir into which the host dir is mounted). I wonder if the Conda runtime could help arrange a similar sibling dir somehow too during nextstrain build. Will idly ponder it.
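For illustration, the sibling-directory convention can be sketched in Python as below. This is only an assumption about how a workflow or runtime might look for fauna, not how the Nextstrain CLI or any existing workflow implements it; the function name `find_sibling_fauna` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_sibling_fauna(workflow_dir: str) -> Optional[Path]:
    """Return the path to a `fauna` directory next to `workflow_dir`, if any."""
    # Resolve first so that "." maps to a real directory with a parent.
    candidate = Path(workflow_dir).resolve().parent / "fauna"
    return candidate if candidate.is_dir() else None


if __name__ == "__main__":
    fauna = find_sibling_fauna(".")
    print(fauna if fauna is not None else "no sibling fauna directory found")
```

With the Docker layout described above, calling `find_sibling_fauna("/nextstrain/build")` would check for `/nextstrain/fauna`.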

huddlej commented 1 year ago

@tsibley Based on our other in-person discussion yesterday, I wonder if we should focus instead on migrating the remaining fauna-based workflows to our S3-hosted data approach. This is always the issue of deciding how long to keep supporting a legacy system that everyone relies on, but if I had to choose between a) running the seasonal flu workflow as it is with the managed conda environment and b) running the seasonal flu workflow with S3-hosted data, I would pick the latter.

tsibley commented 1 year ago

@huddlej Ah, indeed, my preference would be to advocate for (b) too, but I guess I don't see this as having to be either-or. I don't think it'll take very long to make the nextstrain-fauna-rethinkdb package. I'd be happy to Just Do It (and stop if it turns out not to be easy), but also wanted to support your interest in learning about using our Conda channel. :-)

huddlej commented 1 year ago

I'd rather push for the S3-hosted data, instead, unless anyone else on the team has a strong desire to run fauna-based builds with the managed Conda environment right now. This might just be @joverlee521 and @j23414 right now?

joverlee521 commented 1 year ago

I only run the fauna-based builds with Docker, so I have no desire to include fauna/rethinkdb here. I would be happy to push for (b).

tsibley commented 1 year ago

Closing this as won't fix. We can re-open if (b) doesn't come to pass in a reasonable time and we decide to just make nextstrain-fauna-rethinkdb.

corneliusroemer commented 1 month ago

Came here from https://github.com/nextstrain/docker-base/issues/222

A few thoughts:

> rethinkdb does not have an official conda package for the Python bindings. There is an unofficial package for version 2.0.0. Even if fauna works with this older version of rethinkdb (I haven't checked), I don't think we should rely on channels from individual users. To include the specific rethinkdb version we need for fauna in this base environment, it looks like we need to create our own conda package for this version. We could host it in the Nextstrain channel. Bioconda is not an appropriate place for it and I don't think we want to support a conda-forge package for rethinkdb (or give the impression we are responsible for rethinkdb).

I don't see why one couldn't create a conda-forge package and include this particular version, as well as more recent versions.

Creating a conda-forge package in no way implies responsibility for underlying software.

With rethinkdb only releasing new versions very sporadically, it wouldn't be hard to keep it up to date - there's also no obligation to do so.

So in principle I don't see why one couldn't add fauna to bioconda and the dependencies to conda-forge/bioconda as required.