mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Two copies of `criteo_resnet_pytorch` exist in .github/workflows/regression_tests_variants.yml #704

Open tfaod opened 5 months ago

tfaod commented 5 months ago

Two copies of `criteo_resnet_pytorch` exist in `.github/workflows/regression_tests_variants.yml`.

Is this intentional? If not, which is the correct version? Thanks!!

```yaml
  criteo_resnet_pytorch:
    runs-on: self-hosted
    needs: build_and_push_pytorch_docker_image
    steps:
    - uses: actions/checkout@v2
    - name: Run containerized workload
      run: |
        docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }}
        docker run  -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }} -d criteo1tb -f pytorch -s reference_algorithms/paper_baselines/adamw/pytorch/submission.py -w criteo1tb_resnet -t reference_algorithms/paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false
  criteo_resnet_pytorch:
    runs-on: self-hosted
    needs: build_and_push_pytorch_docker_image
    steps:
    - uses: actions/checkout@v2
    - name: Run containerized workload
      run: |
        docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }}
        docker run  -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }} -d criteo1tb -f pytorch -s reference_algorithms/paper_baselines/adamw/pytorch/submission.py -w criteo1tb_embed_init -t reference_algorithms/paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false
```

priyakasimbeg commented 5 months ago

Nope, that is not intentional. The regression tests for the variants are a work in progress. It looks like the bottom one actually tests `criteo1tb_embed_init` and not `criteo1tb_resnet`, so the top one is the correct version: it runs the intended workload.
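
Duplicate job ids like this are easy to miss by eye, so a small check can help. The following is a minimal sketch, not part of the repository: it uses a plain indentation heuristic rather than a YAML parser, and it assumes job ids are the mapping keys indented by exactly two spaces, as in the snippet above. The function name `duplicate_job_ids` and the sample job id `criteo_embed_init_pytorch` are hypothetical, chosen for illustration only.

```python
import re
from collections import Counter

# Heuristic (assumption): a job id is a mapping key indented by exactly two
# spaces with nothing after the colon, e.g. "  criteo_resnet_pytorch:".
JOB_KEY = re.compile(r"^ {2}([A-Za-z_][\w-]*):\s*(#.*)?$")


def duplicate_job_ids(workflow_text):
    """Return the job ids that appear more than once, in first-seen order."""
    counts = Counter()
    for line in workflow_text.splitlines():
        match = JOB_KEY.match(line)
        if match:
            counts[match.group(1)] += 1
    return [job_id for job_id, n in counts.items() if n > 1]


# Shortened stand-in for the workflow file; the third job id is hypothetical.
sample = """\
jobs:
  criteo_resnet_pytorch:
    runs-on: self-hosted
  criteo_resnet_pytorch:
    runs-on: self-hosted
  criteo_embed_init_pytorch:
    runs-on: self-hosted
"""

print(duplicate_job_ids(sample))  # ['criteo_resnet_pytorch']
```

Because the check only looks at indentation, it would also count two-space keys in other top-level sections (for example under `on:`), so treat a hit as a prompt to look at the file rather than as a definitive result.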