mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Cannot Build PyTorch Docker Image #717

Closed hjmshi closed 5 months ago

hjmshi commented 5 months ago

We are not able to use the provided Dockerfile to successfully build a PyTorch docker image.

Description

When trying to build the PyTorch docker image from the provided Dockerfile, we encounter the following error:

141.9 Looking in links: https://download.pytorch.org/whl/cu121                                                                                                   
141.9 Obtaining file:///algorithmic-efficiency                                                                                                                   
142.7 Requirement already satisfied: absl-py==1.4.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (1.4.0)                       
142.7 Requirement already satisfied: clu==0.0.7 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (0.0.7)
142.8 Requirement already satisfied: docker==7.0.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (7.0.0)
142.8 Requirement already satisfied: gputil==1.4.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (1.4.0)                        
142.8 Requirement already satisfied: matplotlib>=3.7.2 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (3.7.5)                    
142.8 Requirement already satisfied: numpy>=1.23 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (1.24.4)
142.8 Requirement already satisfied: pandas>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (2.0.3)                        
142.9 Requirement already satisfied: psutil==5.9.5 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (5.9.5)                        
142.9 Requirement already satisfied: tabulate==0.9.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (0.9.0)
142.9 Requirement already satisfied: tensorflow-addons==0.20.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (0.20.0)           
143.0 Requirement already satisfied: tensorflow-datasets==4.9.2 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (4.9.2)           
143.2 Requirement already satisfied: tensorflow-probability==0.20.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (0.20.0)      
143.2 Requirement already satisfied: tensorflow==2.12.0 in /usr/local/lib/python3.8/dist-packages (from algorithmic-efficiency==0.1.0) (2.12.0)
144.1 ERROR: Could not find a version that satisfies the requirement torch==2.1.0+cu118 (from algorithmic-efficiency==0.1.0) (from versions: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1)
144.1 ERROR: No matching distribution found for torch==2.1.0+cu118 (from algorithmic-efficiency==0.1.0)

We are not sure why it looks for a PyTorch 2.1.0 + CUDA 11.8 distribution when the build is pointed at the cu121 wheel index.

cc @tsunghsienlee @anana10c @mikerabbat @shintaro-iwasaki @tfaod @adefazio @yuchenhao

Steps to Reproduce

Run docker build -t <docker_image_name> . --build-arg framework=pytorch.

Source or Possible Fix

We were able to fix this by changing line 48 (and similar lines) of the Dockerfile to:

&& pip install torch==2.1.0 torchvision==0.16.0 -f 'https://download.pytorch.org/whl/cu121'; \

Thanks in advance!

priyakasimbeg commented 5 months ago

Hi! Is your code synced to main? I don't think we have torch==2.1.0+cu118 in our setup.cfg on main anymore.
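One quick way to check whether a local checkout still carries the old pin is to search setup.cfg for it. The snippet below is a minimal sketch: the helper name find_torch_pins is made up for illustration, and it assumes the pin appears as a literal torch==... line in the file.

```shell
# Hypothetical helper: list every line that pins torch to an exact version.
# Usage: find_torch_pins path/to/setup.cfg
find_torch_pins() {
  # -n prefixes each match with its line number, e.g. "12:  torch==2.1.0+cu118".
  # The pattern requires "torch" immediately followed by "==", so it does not
  # match torchvision pins.
  grep -n 'torch==' "$1"
}
```

Running find_torch_pins setup.cfg in the repository root should print any remaining pins. If a +cu118 entry shows up, the checkout is probably behind main.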

priyakasimbeg commented 5 months ago

Maybe try building with the --no-cache flag, e.g. docker build --no-cache ...

tsunghsienlee commented 5 months ago

Hi @priyakasimbeg, thanks for the suggestion. Building with --no-cache fixed the issue on our side.
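For anyone landing here later, the full rebuild command can be sketched as below. It combines the command from Steps to Reproduce with the --no-cache flag. The image tag is a placeholder, and the likely explanation (that a cached layer held an older copy of the repository with the cu118 pin) is an assumption, not something confirmed in this thread.

```shell
# Placeholder image tag (an assumption); substitute your own.
IMAGE_NAME="algoperf_pytorch"

# Compose the rebuild command. --no-cache makes Docker re-run every layer
# instead of reusing layers from an earlier build.
BUILD_CMD="docker build --no-cache -t $IMAGE_NAME . --build-arg framework=pytorch"

# Print the command for review; run it from the repository root with:
#   eval "$BUILD_CMD"
echo "$BUILD_CMD"
```

Printing the command first keeps the sketch safe to paste. The build only starts once the eval line is run explicitly.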