priyakasimbeg closed 6 months ago
Reopening to track FastMRI traindiff test failure.
@runame @chandramouli-sastry The FastMRI traindiff tests look fine in the Docker containerized test:
==================================================================Testing fastmri===================================================================
| Iter | Eval (jax) | Eval (torch) | Grad Norm (jax) | Grad Norm (torch) | Train Loss (jax) | Train Loss (torch) |
====================================================================================================================================================
| 0 | 1.11668 | 1.11668 | 1.09795 | 1.09796 | 1.11789 | 1.11789 |
| 1 | 1.11548 | 1.11548 | 1.0957099 | 1.09571 | 1.11668 | 1.11668 |
| 2 | 1.11429 | 1.11429 | 1.09367 | 1.09368 | 1.11548 | 1.11548 |
| 3 | 1.1131 | 1.1131 | 1.09162 | 1.09162 | 1.11429 | 1.11429 |
| 4 | 1.11191 | 1.11191 | 1.08919 | 1.08919 | 1.1130999 | 1.1131 |
| 5 | 1.11073 | 1.11073 | 1.08695 | 1.08695 | 1.11191 | 1.11191 |
| 6 | 1.10956 | 1.10956 | 1.08499 | 1.08475 | 1.1107299 | 1.11073 |
| 7 | 1.10839 | 1.10839 | 1.08248 | 1.08248 | 1.10956 | 1.10956 |
| 8 | 1.10722 | 1.10722 | 1.08032 | 1.08031 | 1.10839 | 1.10839 |
| 9 | 1.10606 | 1.10606 | 1.07817 | 1.07817 | 1.1072199 | 1.10722 |
====================================================================================================================================================
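To make "looks fine" concrete, here is a rough sketch that compares the JAX and PyTorch grad-norm columns from the table above under a relative tolerance. The tolerance values and helper names are my own assumptions for illustration; they are not taken from tests/test_traindiffs.py, which may use a different comparison.

```python
import math

# Grad-norm columns (jax, torch) copied from the table above, iterations 0-9.
GRAD_NORMS = [
    (1.09795, 1.09796),
    (1.0957099, 1.09571),
    (1.09367, 1.09368),
    (1.09162, 1.09162),
    (1.08919, 1.08919),
    (1.08695, 1.08695),
    (1.08499, 1.08475),
    (1.08248, 1.08248),
    (1.08032, 1.08031),
    (1.07817, 1.07817),
]


def relative_diffs(pairs):
    """Relative difference of each (jax, torch) pair, scaled by the larger value."""
    return [abs(a - b) / max(abs(a), abs(b)) for a, b in pairs]


def all_close(pairs, rel_tol):
    """True if every pair agrees within rel_tol (hypothetical helper, not the repo's check)."""
    return all(math.isclose(a, b, rel_tol=rel_tol) for a, b in pairs)


diffs = relative_diffs(GRAD_NORMS)
# Index of the iteration where the two frameworks disagree the most.
worst_iter = max(range(len(diffs)), key=diffs.__getitem__)

print(f"largest relative diff at iter {worst_iter}: {diffs[worst_iter]:.2e}")
print("all close at rel_tol=1e-3:", all_close(GRAD_NORMS, rel_tol=1e-3))
print("all close at rel_tol=1e-4:", all_close(GRAD_NORMS, rel_tol=1e-4))
```

With these numbers the largest gap is at iteration 6 (1.08499 vs 1.08475), so whether the run passes depends on the tolerance chosen: an assumed 1e-3 accepts every row, while a tighter 1e-4 would flag that one.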
Fast repro in containerized env:
docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_dev
docker run -t -d us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_dev --keep_container_alive true
docker exec -it <container_id> /bin/bash
cd algorithmic-efficiency
python3 tests/test_traindiffs.py
Full repro in test env:
./run.sh