iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

`dvc exp run --queue` should pass environment variables to queued tasks #10464

Open igordertigor opened 1 week ago

igordertigor commented 1 week ago

I like to structure my projects with two source folders of the form

src/
  scripts/
  shared/

where scripts contains entry points for dvc stages that can be run as python src/scripts/myscript.py. Typically, those scripts would import more global properties from src/shared as

from shared import xxx

This requires setting PYTHONPATH=src and works fine with dvc repro, but it fails with dvc exp run --queue because the queued experiment doesn't receive the PYTHONPATH variable. I would like either environment variables to be copied in general, or a setting similar to tox's passenv that allows specifying which variables to pass. It feels like experiments are basically unusable without this feature.
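To make the request concrete, here is a minimal sketch of what a passenv-style filter could look like. This is not DVC code: the function name `filter_env` and the pattern list are hypothetical, and the glob matching is only assumed to behave roughly like tox's.

```python
import fnmatch


def filter_env(environ, passenv):
    """Keep only the variables whose names match one of the passenv patterns."""
    return {
        name: value
        for name, value in environ.items()
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in passenv)
    }


# Hypothetical example: forward PYTHONPATH and any CUDA_* variable to a queued task.
parent_env = {"PYTHONPATH": "src", "CUDA_VISIBLE_DEVICES": "0", "EDITOR": "vim"}
task_env = filter_env(parent_env, ["PYTHONPATH", "CUDA_*"])
```

With a setting like this, only the listed variables would reach the queued task, and everything else would stay out of the experiment's environment.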

shcheklein commented 1 week ago

@igordertigor how do you set it up for dvc repro? export or something like PYTHONPATH=src dvc repro? what is the exact value you pass to it? could you try to dump the os.environ in the script to see what values are being picked up and what not?

can you also do relative imports instead - even with dvc repro arguably it would be easier ...
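The os.environ dump asked for above could be as small as the following sketch, placed at the top of the stage script. The helper name `snapshot_environ` and the choice of variables are assumptions for illustration. It writes to stderr so that it does not mix with the script's regular output.

```python
import json
import os
import sys


def snapshot_environ(names):
    """Return the requested variables; a missing variable maps to None."""
    return {name: os.environ.get(name) for name in names}


if __name__ == "__main__":
    # Print the variables of interest so the two runs can be compared.
    json.dump(snapshot_environ(["PYTHONPATH", "PATH"]), sys.stderr, indent=2)
    sys.stderr.write("\n")
```

Running the same script once through dvc repro and once through a queued experiment and comparing the two dumps should show whether PYTHONPATH is missing or only has a different value.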

igordertigor commented 1 week ago

I'm setting it with export. My current workaround is to also set it in the scripts via sys.path.append, which is kind of awkward because I need to run that part before all the other imports.
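For reference, the workaround looks roughly like the sketch below. The helper name `add_src_to_path` is made up for this example, and it assumes the script lives at src/scripts/myscript.py, so that one directory above the script's folder is src/.

```python
import sys
from pathlib import Path


def add_src_to_path(script_file, levels_up=1):
    """Append the src/ directory that contains the script's folder to sys.path."""
    src_dir = str(Path(script_file).resolve().parents[levels_up])
    if src_dir not in sys.path:
        sys.path.append(src_dir)
    return src_dir


# In src/scripts/myscript.py this has to run before the project imports:
#     add_src_to_path(__file__)
#     from shared import xxx
```

The awkward part is visible in the comment: the helper has to be defined in the script itself and called before any import from shared.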

Regarding the os.environ dump, I would need to wait a bit. That would probably take until next week.

Is your relative imports suggestion something like this? In dvc.yaml:

stages:
  mystage:
    cmd: python -m src.scripts.myscript
    deps:
      - src/scripts/myscript.py
      - src/shared/mymodule.py

In src/scripts/myscript.py:

from ..shared import mymodule
...

If so, I think this should be mentioned in the documentation. Running scripts via python -m isn't necessarily the first thing I would think of.
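A self-contained way to check this layout is sketched below. It builds a throwaway copy of the tree in a temporary directory, without any __init__.py files (so it relies on namespace packages, which is an assumption about the project), and runs the script both with python -m and as a plain file path. The function name `run_both_ways` is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run a relative-import script as a module and as a file; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "src" / "scripts").mkdir(parents=True)
        (root / "src" / "shared").mkdir()
        (root / "src" / "shared" / "mymodule.py").write_text("VALUE = 42\n")
        (root / "src" / "scripts" / "myscript.py").write_text(
            "from ..shared import mymodule\nprint(mymodule.VALUE)\n"
        )
        # Module form: Python knows the parent package, so "..shared" can resolve.
        as_module = subprocess.run(
            [sys.executable, "-m", "src.scripts.myscript"],
            cwd=root, capture_output=True, text=True,
        )
        # File form: the script runs as __main__ without a parent package.
        as_file = subprocess.run(
            [sys.executable, "src/scripts/myscript.py"],
            cwd=root, capture_output=True, text=True,
        )
        return as_module, as_file
```

Comparing the return codes and stderr of the two runs shows whether the relative import depends on the python -m form of the command.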

shcheklein commented 1 week ago

Thanks for the details. I mean more or less something like from ..shared import mymodule. Probably you could also avoid using -m and do python path-to-file (?)

how / where do you run the queue then? is it happening in a different terminal?

if you do export PYTHONPATH=src it should be picked up by the dvc queue start as far as I can tell.
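The reasoning here is ordinary process inheritance: a child process only sees the variables its parent had when it was launched. The sketch below shows that in plain Python. It says nothing about how the DVC queue worker is actually started, which is an open question in this thread, and the helper name `child_sees` is made up.

```python
import os
import subprocess
import sys

# The child prints its own view of PYTHONPATH.
CHILD = "import os; print(os.environ.get('PYTHONPATH', '<unset>'))"


def child_sees(env):
    """Start a child interpreter with the given environment and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Start from the current environment with PYTHONPATH removed.
base = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
with_var = child_sees({**base, "PYTHONPATH": "src"})
without_var = child_sees(base)
```

If the queue was started from a different terminal, or before the export was run, the worker would be in the second situation.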

Regarding the os.environ dump, I would need to wait a bit. That would probably take until next week.

thanks