mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Containerized traindiff tests are failing. #643

Closed: priyakasimbeg closed this issue 6 months ago

priyakasimbeg commented 7 months ago

Fast repro in containerized env (a consolidated sketch follows the list):

  1. Pull the docker image:
     docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_dev
  2. Run the container and bash into it:
     docker run -t -d us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_dev --keep_container_alive true
     docker exec -it <container_id> /bin/bash
  3. Run the test:
     cd algorithmic-efficiency
     python3 tests/test_traindiffs.py
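
The same steps as a single shell sketch; the image path and flags come from the steps above, while capturing the container id via command substitution is my assumption about how to wire them together:

```bash
# Sketch of the fast repro; image path and flags are from the steps above.
IMAGE=us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_dev

docker pull "$IMAGE"

# `docker run -d` prints the new container's id; capture it for the exec step.
CONTAINER_ID=$(docker run -t -d "$IMAGE" --keep_container_alive true)

# Run the traindiff test inside the container.
docker exec -it "$CONTAINER_ID" /bin/bash -c \
  "cd algorithmic-efficiency && python3 tests/test_traindiffs.py"
```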

Full repro in test env (a consolidated sketch follows the list):

  1. Make sure the GH actions runner is installed and your machine has been added to our self-hosted runners.
  2. Start the actions runner: from $HOME/actions-runner, run ./run.sh.
  3. Make a new branch and disable the other regression tests so you don't have to wait for all of them to finish. I would just temporarily remove .github/workflows/regression_test*.yml.
  4. Make a dummy PR into the main branch to trigger the automated tests.
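
The same steps as a shell sketch; the branch name, checkout path, and commit message are placeholders I made up, everything else comes from the steps above:

```bash
# Start the self-hosted runner (step 2 above).
cd "$HOME/actions-runner" && ./run.sh &

# Steps 3-4: on a new branch, temporarily remove the other regression tests.
cd path/to/algorithmic-efficiency                      # placeholder path
git checkout -b disable-regression-tests               # hypothetical branch name
git rm .github/workflows/regression_test*.yml
git commit -m "Temporarily disable regression tests"   # placeholder message
git push origin disable-regression-tests
# Then open a dummy PR against main to trigger the automated tests.
```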
priyakasimbeg commented 6 months ago

Fixed in https://github.com/mlcommons/algorithmic-efficiency/pull/649.

priyakasimbeg commented 6 months ago

Reopening to track FastMRI traindiff test failure.

priyakasimbeg commented 6 months ago

@runame @chandramouli-sastry The FastMRI traindiff test looks fine in the docker containerized test:

==================================================================Testing fastmri===================================================================
|        Iter        |     Eval (jax)     |    Eval (torch)    |  Grad Norm (jax)   | Grad Norm (torch)  |  Train Loss (jax)  | Train Loss (torch) |
====================================================================================================================================================
|         0          |      1.11668       |      1.11668       |      1.09795       |      1.09796       |      1.11789       |      1.11789       |
|         1          |      1.11548       |      1.11548       |     1.0957099      |      1.09571       |      1.11668       |      1.11668       |
|         2          |      1.11429       |      1.11429       |      1.09367       |      1.09368       |      1.11548       |      1.11548       |
|         3          |       1.1131       |       1.1131       |      1.09162       |      1.09162       |      1.11429       |      1.11429       |
|         4          |      1.11191       |      1.11191       |      1.08919       |      1.08919       |     1.1130999      |       1.1131       |
|         5          |      1.11073       |      1.11073       |      1.08695       |      1.08695       |      1.11191       |      1.11191       |
|         6          |      1.10956       |      1.10956       |      1.08499       |      1.08475       |     1.1107299      |      1.11073       |
|         7          |      1.10839       |      1.10839       |      1.08248       |      1.08248       |      1.10956       |      1.10956       |
|         8          |      1.10722       |      1.10722       |      1.08032       |      1.08031       |      1.10839       |      1.10839       |
|         9          |      1.10606       |      1.10606       |      1.07817       |      1.07817       |     1.1072199      |      1.10722       |
====================================================================================================================================================
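
For context on what a table like this is checking: below is a minimal sketch of the kind of per-step JAX-vs-PyTorch comparison the test output reflects. This is not the actual tests/test_traindiffs.py implementation; the function, tolerance, and data layout are assumptions for illustration only.

```python
import math

# Hypothetical per-step metrics shaped like the table above:
# (eval, grad norm, train loss) per framework, one tuple per iteration.
jax_metrics = [(1.11668, 1.09795, 1.11789), (1.11548, 1.0957099, 1.11668)]
torch_metrics = [(1.11668, 1.09796, 1.11789), (1.11548, 1.09571, 1.11668)]

def check_traindiff(jax_rows, torch_rows, rel_tol=1e-4):
    """Assert JAX and PyTorch metrics agree step by step (assumed tolerance)."""
    for step, (j_row, t_row) in enumerate(zip(jax_rows, torch_rows)):
        for j_val, t_val in zip(j_row, t_row):
            assert math.isclose(j_val, t_val, rel_tol=rel_tol), (
                f"Mismatch at step {step}: {j_val} vs {t_val}")

check_traindiff(jax_metrics, torch_metrics)  # passes on the sample rows above
```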