iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

dvc stage: params section with variable #10528

Open ermolaev94 opened 4 weeks ago

ermolaev94 commented 4 weeks ago

Bug Report

Description

I have a DVC pipeline with the following stage:

stages:
  process:
    foreach: ${datasets}
    do:
      cmd: >-
        python ${GEN_SCRIPTS_ROOT}/process_ds.py
        --ds-root ${DS_ROOT}/${item}/h5-corrected/
        --out ${PPL_PTH}/processed/${item}
        --config ${PPL_PTH}/config.yaml
        --num-workers 4
        --buffer-size 4
        --force
      deps:
        - ${GEN_SCRIPTS_ROOT}/process_ds.py
        - ${DS_ROOT}/${item}/h5-corrected/train/
        - ${DS_ROOT}/${item}/h5-corrected/val/
        - ${DS_ROOT}/${item}/h5-corrected/test/
      params:
        - ${PPL_PTH}/config.yaml:
            - processing
      outs:
        - ${PPL_PTH}/processed/${item}/train/
        - ${PPL_PTH}/processed/${item}/val/
        - ${PPL_PTH}/processed/${item}/test/
        - ${PPL_PTH}/processed/${item}/log.txt
      wdir: ${WDIR}

Its compiled version in dvc.lock for one of the datasets is:

schema: '2.0'
stages:
  process@fractures_0124_seg:
    cmd: python ds_gen//process_ds.py --ds-root data/full_datasets//fractures_0124_seg/h5-corrected/
      --out pipelines/02_seg//processed/fractures_0124_seg --config pipelines/02_seg//config.yaml
      --num-workers 4 --buffer-size 4 --force
    deps:
    - path: data/full_datasets//fractures_0124_seg/h5-corrected/test/
      hash: md5
      md5: 7ceeec622eff202ebfd336857c49f6c8.dir
      size: 1032293048
      nfiles: 4
    - path: data/full_datasets//fractures_0124_seg/h5-corrected/train/
      hash: md5
      md5: 8d889e2240ac8522df681d291d4fe9b1.dir
      size: 9504569880
      nfiles: 4
    - path: data/full_datasets//fractures_0124_seg/h5-corrected/val/
      hash: md5
      md5: f2dddba6856f9f9b4f6ac07b3c4c3052.dir
      size: 925129464
      nfiles: 4
    - path: ds_gen//process_ds.py
      hash: md5
      md5: 243575ee6a8718300cb33c54b7f8ddff
      size: 1967
    params:
      pipelines/02_seg/config.yaml:
        processing:
          Resize:
            voxel_size:
              k: 2
          SpatialResize:
            shape:
            - 160
            - 160
            - -1
    outs:
    - path: pipelines/02_seg//processed/fractures_0124_seg/log.txt
      hash: md5
      md5: 3ecea9ba483e94e36cd8ac96b5d6ae89
      size: 16182
    - path: pipelines/02_seg//processed/fractures_0124_seg/test/
      hash: md5
      md5: 2ec2b5f1e794abe96e6f2c49f0dc3785.dir
      size: 126610024
      nfiles: 4
    - path: pipelines/02_seg//processed/fractures_0124_seg/train/
      hash: md5
      md5: ca53028b1b457add2ba51edd9ad4174e.dir
      size: 873577784
      nfiles: 4
    - path: pipelines/02_seg//processed/fractures_0124_seg/val/
      hash: md5
      md5: 721589477a653f1803a83010f379dd90.dir
      size: 104532792
      nfiles: 4

As you can see, the variables were correctly replaced by their real values. But there is a problem:

$ dvc status dvc.yaml:process
process@fractures_0124_seg:
        changed deps:
                new:                config.yaml

and dvc commit --force doesn't help:

$ dvc commit dvc.yaml:process --force
(venv) ermolaev@df783b0a927d:~/projects/radml/cvl-cvisionrad-ml/ribs/pipelines/02_seg$ dvc status dvc.yaml:process
process@fractures_0124_seg:
        changed deps:
                new:                config.yaml

But if I replace

      params:
        - ${PPL_PTH}/config.yaml:
            - processing

with

      params:
        - pipelines/02_seg/config.yaml:
            - processing

then everything is fine. Note that variables in the deps section cause no such problem.

Reproduce

Create a synthetic pipeline with a template variable in the path to a params file.
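A minimal sketch of such a synthetic project, written as a small Python script that scaffolds the files. The file contents are hypothetical: the `echo` command, the `processing` values, and in particular the trailing slash in `PPL_PTH` (inferred from the `//` in the compiled lock file) are assumptions, and this layout is not confirmed to trigger the behaviour.

```python
import tempfile
from pathlib import Path

# Hypothetical value mirroring the report; the trailing slash is an
# assumption inferred from the "//" seen in the compiled dvc.lock.
PARAMS_YAML = """\
PPL_PTH: pipelines/02_seg/
"""

# One stage whose params file path is built from a template variable.
DVC_YAML = """\
stages:
  process:
    cmd: echo ${PPL_PTH}/config.yaml
    params:
      - ${PPL_PTH}/config.yaml:
          - processing
"""

CONFIG_YAML = """\
processing:
  Resize:
    voxel_size:
      k: 2
"""


def scaffold(root: Path) -> list[Path]:
    """Write the three files of the synthetic project under root."""
    files = {
        root / "params.yaml": PARAMS_YAML,
        root / "dvc.yaml": DVC_YAML,
        root / "pipelines" / "02_seg" / "config.yaml": CONFIG_YAML,
    }
    for path, text in files.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
    return sorted(files)


if __name__ == "__main__":
    project = Path(tempfile.mkdtemp(prefix="dvc-10528-"))
    for written in scaffold(project):
        print(written)
    # Next, inside that directory: `dvc init --no-scm`, `dvc repro`,
    # then `dvc status` to check whether config.yaml is reported as new.
```

Running the script prints the paths it created; the DVC commands in the trailing comment are then run by hand in that directory.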

Expected

I think DVC should build and compare paths with the same logic for the deps and params sections. It looks like DVC doesn't recognize that the templated path in dvc.yaml refers to the same file as the path recorded in dvc.lock.
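One detail in the lock file above may be relevant: the deps and outs entries keep the `//` produced by substitution, while the params entry is recorded with a single slash. The sketch below only illustrates how a plain string comparison between those two forms would fail and how normalization would reconcile them. It is a guess at the mechanism, not a description of DVC's internals, and the value of `PPL_PTH` is an assumption.

```python
import posixpath

# Hypothetical value: a trailing slash in PPL_PTH would explain the "//"
# seen in the compiled dvc.lock.
PPL_PTH = "pipelines/02_seg/"

# Path produced by plain substitution into ${PPL_PTH}/config.yaml.
templated = f"{PPL_PTH}/config.yaml"

# Path as it appears under `params:` in the reported dvc.lock.
locked = "pipelines/02_seg/config.yaml"

print(templated)                                # pipelines/02_seg//config.yaml
print(templated == locked)                      # False
print(posixpath.normpath(templated) == locked)  # True
```

If this is the cause, removing the trailing slash from the variable would be a possible workaround, though that is untested here.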

Environment information

Ubuntu

Output of dvc doctor:

$ dvc doctor
DVC version: 3.53.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.8.0-35-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.6
Supports:
        gdrive (pydrive2 = 1.19.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.6.1, boto3 = 1.34.131)
Config:
        Global: /home/ermolaev/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/sdc1
Caches: local
Remotes: gdrive, gdrive, gdrive, s3
Workspace directory: ext4 on /dev/sdb1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/7205a6ce3131e59a2db7211a94dd5faa

Additional Information (if any):

dberenbaum commented 3 weeks ago

So the problem is that ${PPL_PTH}/config.yaml in params is not getting expanded to pipelines/02_seg/config.yaml in dvc status, correct? Is it a problem for other commands as far as you know? Where is ${PPL_PTH} defined?

ermolaev94 commented 3 weeks ago

So the problem is that ${PPL_PTH}/config.yaml in params is not getting expanded to pipelines/02_seg/config.yaml in dvc status, correct? Is it a problem for other commands as far as you know? Where is ${PPL_PTH} defined?

I think so. dvc commit -f also fails to detect that no changes are necessary. The ${PPL_PTH} parameter is defined in params.yaml. As far as I remember, defining it in the vars section doesn't help either.

dberenbaum commented 3 weeks ago

~@skshetry will know better how the internals work here, but I think DVC loads the parameters first to then fill variables from the dvc.yaml template. So I think it becomes circular to use those variables to read the path to the parameters file. Maybe we should note in the docs that variables cannot be in the params section.~

Sorry, the above looks to be incorrect on further inspection. Can you share the params.yaml? So far, I can't seem to reproduce the issue.