mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Changed Dockerfile for image classification to use specific versions of Python modules, made a change in resnet_run_loop.py #455

Closed: sanjaychari closed this 1 year ago

sanjaychari commented 3 years ago

I faced multiple errors while trying to build an image from the existing Dockerfile for image classification.

First, I faced “RuntimeError: Python version >= 3.7 required.” because the latest versions of numpy and scipy require Python >= 3.7.

I changed the numpy and scipy versions accordingly and tried tensorflow-gpu 1.12.0, but ran into an issue similar to https://github.com/tensorflow/tensorflow/issues/16478, so I added futures==3.1.1 to the Dockerfile. I also had to use tensorflow-gpu 1.10.0 instead of tensorflow-gpu 1.12.0 in order to fix https://github.com/mlcommons/training/issues/204.

I also hit the problem described in https://bbs.archlinux.org/viewtopic.php?id=261412, which is why I pinned h5py==2.10.0 in the Dockerfile.
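Taken together, the pinning described above would look roughly like this in the Dockerfile (a sketch only, not the exact diff from this PR; the numpy and scipy versions shown are assumed placeholders, since the exact downgraded versions aren't named here):

```dockerfile
# Sketch of the kind of version pinning described above, not the exact PR diff.
# The numpy/scipy pins are hypothetical placeholders chosen to be compatible
# with tensorflow-gpu 1.10.0; the exact versions used in the PR are not stated
# in the comment.
RUN pip install --no-cache-dir \
    numpy==1.14.5 \
    scipy==1.1.0 \
    futures==3.1.1 \
    h5py==2.10.0 \
    tensorflow-gpu==1.10.0
```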

official/requirements.txt was missing the line mlperf_compliance==0.0.6, so I had to add it there.
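For reference, the addition to official/requirements.txt is just the pinned compliance package:

```text
# appended to official/requirements.txt
mlperf_compliance==0.0.6
```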

After successfully building the Docker image, I hit https://github.com/mlcommons/training/issues/223 and followed the solution suggested in that thread to fix it.

Finally, I had to download NCCL2 manually, because the runtime image did not appear to provide it.
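A minimal sketch of what a manual NCCL2 install could look like in the Dockerfile, assuming the NVIDIA apt repository is already configured in the base image; the package version and CUDA suffix are assumptions (NCCL2 2.5.6-1 is the version mentioned later in this thread) and may not match the PR exactly:

```dockerfile
# Hypothetical manual NCCL2 install; assumes the NVIDIA apt repo is already
# set up in the base image. The 2.5.6-1+cuda10.0 version string is an
# assumption and may differ from what the PR actually uses.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages \
        libnccl2=2.5.6-1+cuda10.0 \
        libnccl-dev=2.5.6-1+cuda10.0 && \
    rm -rf /var/lib/apt/lists/*
```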

After making these changes, I was able to start the run successfully.

github-actions[bot] commented 3 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

johntran-nv commented 3 years ago

@sgpyc could you review this since it's an rn50 change?

sgpyc commented 3 years ago

I don't know whether we should still keep the old RN50 code: PR443 has a newer version using TF2, and the RCPs are from the new version. I'm not against updating the TF1 model, but the stack it is based on is pretty outdated (e.g. tf 1.10.0 dates from Aug 2018). It's close to impossible to validate a model based on such an old stack, particularly with distributed settings.

The best I can do is build the image and run it on a single-GPU machine, without testing distributed settings; I also can't confirm that this is the stack the original model was intended to run on (e.g. NCCL2 2.5.6-1 was released in Nov 2019 and is unlikely to be the intended version for tf 1.10.0 or 1.12.0).

johntran-nv commented 3 years ago

@sgpyc it sounds like you're voting to just close this one without merging, right?

Are others ok with that? @bitfort , anyone else?

matthew-frank commented 1 year ago

This PR provides changes to the old, retired TF1 version of the image_classification benchmark. That benchmark has been replaced with a new TF2 version that may have slightly different semantics.

In an effort to do a better job maintaining this repo, we're closing PRs for retired benchmarks. The old benchmark code still exists, but has been moved to https://github.com/mlcommons/training/tree/master/retired_benchmarks/resnet-tf1.

If you think there is useful cleanup to be done to the retired_benchmarks subtree, please submit a new PR.