Unable to reproduce the results using the official training script

Hello,

Unfortunately, I have been unable to reproduce the reported results despite considerable effort.

Installation

conda create -y -n ffcv python=3.9 cupy pkg-config compilers libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge

Because PyTorch 1.13 has been released and supports only CUDA 11.6 and 11.7, the command above installed the CPU build of PyTorch 1.13. I therefore had to reinstall PyTorch 1.12.1 for compatibility with CUDA 11.3:

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
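
To confirm which build actually ended up in the environment, a quick check along these lines can help. It is only a sketch: `describe_torch_build` is a hypothetical helper name, and it relies on `torch.version.cuda` being `None` for CPU-only builds.

```python
import importlib.util


def describe_torch_build():
    """Report which PyTorch build is importable in the current environment."""
    # Hypothetical helper for illustration; not part of FFCV or PyTorch.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # None for CPU-only builds, a string such as "11.3" for CUDA builds.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


for key, value in describe_torch_build().items():
    print(f"{key}: {value}")
```

After the reinstall, `cuda_build` should no longer be `None` and `gpu_count` should match the number of GPUs in the machine.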

Then I installed the dependencies:

conda activate ffcv
pip install torchmetrics

Create dataset

git clone https://github.com/libffcv/ffcv-imagenet.git
cd ffcv-imagenet
export IMAGENET_DIR=$HOME/data/imagenet
export WRITE_DIR=$HOME/data/imagenet_ffcv/jpg50
bash write_imagenet.sh 500 0.50 90

Results:

(ffcv) bash write_imagenet.sh 500 0.50 90
Writing ImageNet train dataset to /home/data/imagenet_ffcv/jpg50/train_500_0.50_90.ffcv
┌ Arguments defined────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ Parameter                │ Value                                                                           │
├──────────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ cfg.dataset              │ imagenet                                                                        │
│ cfg.split                │ train                                                                           │
│ cfg.data_dir             │ /home/data/imagenet/train                                                       │
│ cfg.write_path           │ /home/data/imagenet_ffcv/jpg50/train_500_0.50_90.ffcv                           │
│ cfg.write_mode           │ jpg                                                                             │
│ cfg.max_resolution       │ 500                                                                             │
│ cfg.num_workers          │ 64                                                                              │
│ cfg.chunk_size           │ 100                                                                             │
│ cfg.jpeg_quality         │ 90.0                                                                            │
│ cfg.subset               │ -1                                                                              │
│ cfg.compress_probability │ 0.5                                                                             │
└──────────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1281167/1281167 [3:38:32<00:00, 97.70it/s]
Writing ImageNet val dataset to /home/data/imagenet_ffcv/jpg50/val_500_0.50_90.ffcv
┌ Arguments defined────────┬───────────────────────────────────────────────────────────────────────────────┐
│ Parameter                │ Value                                                                         │
├──────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ cfg.dataset              │ imagenet                                                                      │
│ cfg.split                │ val                                                                           │
│ cfg.data_dir             │ /home/data/imagenet/val                                                       │
│ cfg.write_path           │ /home/data/imagenet_ffcv/jpg50/val_500_0.50_90.ffcv                           │
│ cfg.write_mode           │ jpg                                                                           │
│ cfg.max_resolution       │ 500                                                                           │
│ cfg.num_workers          │ 64                                                                            │
│ cfg.chunk_size           │ 100                                                                           │
│ cfg.jpeg_quality         │ 90.0                                                                          │
│ cfg.subset               │ -1                                                                            │
│ cfg.compress_probability │ 0.5                                                                           │
└──────────────────────────┴───────────────────────────────────────────────────────────────────────────────┘
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50000/50000 [09:28<00:00, 87.95it/s]

Training

I trained ResNet-50 for 16 epochs and expected training to complete in about 15 minutes, as reported.

(ffcv) python train_imagenet.py --config-file rn50_configs/rn50_16_epochs.yaml \                                                                                                                 
>     --data.train_dataset=$HOME/data/imagenet_ffcv/jpg50/train_500_0.50_90.ffcv \                                                                                                               
>     --data.val_dataset=$HOME/data/imagenet_ffcv/jpg50/val_500_0.50_90.ffcv \                                                                                                                   
>     --data.num_workers=8 --data.in_memory=1 \                                                                                                                                                  
>     --logging.folder=$HOME/experiments/ffcv                                                                                                                                                    
┌ Arguments defined────────┬─────────────────────────────────────────────────────────────────────────────────┐                                                                                   
│ Parameter                │ Value                                                                           │                                                                                   
├──────────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤                                                                                   
│ model.arch               │ resnet50                                                                        │                                                                                   
│ model.pretrained         │ 0                                                                               │                                                                                   
│ resolution.min_res       │ 160                                                                             │                                                                                   
│ resolution.max_res       │ 192                                                                             │                                                                                   
│ resolution.end_ramp      │ 13                                                                              │                                                                                   
│ resolution.start_ramp    │ 11                                                                              │                                                                                   
│ data.train_dataset       │ /home/data/imagenet_ffcv/jpg50/train_500_0.50_90.ffcv                           │                                                                                   
│ data.val_dataset         │ /home/data/imagenet_ffcv/jpg50/val_500_0.50_90.ffcv                             │                                                                                   
│ data.num_workers         │ 8                                                                               │                                                                                   
│ data.in_memory           │ 1                                                                               │                                                                                   
│ lr.step_ratio            │ 0.1                                                                             │                                                                                   
│ lr.step_length           │ 30                                                                              │                                                                                   
│ lr.lr_schedule_type      │ cyclic                                                                          │                                                                                   
│ lr.lr                    │ 1.7                                                                             │                                                                                   
│ lr.lr_peak_epoch         │ 2                                                                               │                                                                                   
│ logging.folder           │ /home/experiments/ffcv                                                          │                                                                                   
│ logging.log_level        │ 1                                                                               │                                                                                   
│ validation.batch_size    │ 512                                                                             │                                                                                   
│ validation.resolution    │ 256                                                                             │                                                                                   
│ validation.lr_tta        │ 1                                                                               │                                                                                   
│ training.eval_only       │ 0                                                                               │                                                                                   
│ training.batch_size      │ 512                                                                             │                                                                                   
│ training.optimizer       │ sgd                                                                             │                                                                                   
│ training.momentum        │ 0.9                                                                             │                                                                                   
│ training.weight_decay    │ 0.0001                                                                          │                                                                                   
│ training.epochs          │ 16                                                                              │                                                                                   
│ training.label_smoothing │ 0.1                                                                             │                                                                                   
│ training.distributed     │ 1                                                                               │                                                                                   
│ training.use_blurpool    │ 1                                                                               │                                                                                   
│ dist.world_size          │ 8                                                                               │                                                                                   
│ dist.address             │ localhost                                                                       │                                                                                   
│ dist.port                │ 12355                                                                           │                                                                                   
└──────────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘                                                                                   
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
/home/.conda/envs/ffcv/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
/home/.conda/envs/ffcv/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called 
`full_state_update` that has
                not been set for this class (MeanScalarMetric). The property determines if `update` by
                default needs access to the full metric state. If this is not the case, significant speedups can be
                achieved and we recommend setting this to `False`.
                We provide an checking function
                `from torchmetrics.utilities import check_forward_full_state_property`
                that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
                default for now) or if `full_state_update=False` can be used safely.
=> Logging in /home/experiments/ffcv/69f9f7b3-1f39-48b6-be91-2242639ef094
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  4.53s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
ep=0, iter=311, shape=(512, 3, 160, 160), lrs=['0.847', '0.847']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [28:08<00:00,  5.41s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:17<00:00,  1.31s/it]
=> Log: {'current_lr': 0.8473609134615385, 'top_1': 0.06341999769210815, 'top_5': 0.17452000081539154, 'val_time': 17.07068157196045, 'train_loss': None, 'epoch': 0}                            
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|███████████████████████████████████████████████████████████████████████████████████████▋| 311/312 [06:18<00:01,  1.50s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
ep=1, iter=311, shape=(512, 3, 160, 160), lrs=['1.697', '1.697']: 100%|████████████████████████████████████████████████████████████████████████████████████████| 312/312 [06:18<00:00,  1.21s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.03it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.03it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.02it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.02it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.02it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.02it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.02it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:12<00:00,  1.01it/s]
=> Log: {'current_lr': 1.6972759134615385, 'top_1': 0.14985999464988708, 'top_5': 0.35117998719215393, 'val_time': 12.889710664749146, 'train_loss': None, 'epoch': 1}
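
To put the timings in the log above into perspective, here is a rough throughput calculation. It assumes `training.batch_size` is the per-GPU batch size, which seems consistent with the log: 312 iterations x 512 x 8 GPUs = 1,277,952, close to the 1,281,167 training images. The function name is mine, and the target ignores validation time and the resolution ramp in later epochs.

```python
def images_per_second(iterations, per_gpu_batch, world_size, seconds):
    """Global training throughput for one epoch."""
    return iterations * per_gpu_batch * world_size / seconds


# Values copied from the log above.
ITERATIONS = 312      # progress bar: 312/312
PER_GPU_BATCH = 512   # training.batch_size (assumed to be per GPU)
WORLD_SIZE = 8        # dist.world_size
EPOCHS = 16           # training.epochs

epoch0 = images_per_second(ITERATIONS, PER_GPU_BATCH, WORLD_SIZE, 28 * 60 + 8)
epoch1 = images_per_second(ITERATIONS, PER_GPU_BATCH, WORLD_SIZE, 6 * 60 + 18)

# Throughput needed to finish all 16 epochs in roughly 15 minutes.
target = images_per_second(ITERATIONS, PER_GPU_BATCH, WORLD_SIZE, 15 * 60 / EPOCHS)

print(f"epoch 0: {epoch0:8.0f} img/s")
print(f"epoch 1: {epoch1:8.0f} img/s")
print(f"target : {target:8.0f} img/s ({target / epoch1:.1f}x faster than epoch 1)")
```

Under these assumptions, epoch 0 ran at roughly 760 img/s and epoch 1 at roughly 3,400 img/s, while a 15-minute run would need about 22,700 img/s, i.e. about 6.7x faster than epoch 1.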

As you can see, epoch 0 took 28 minutes to complete. Subsequent epochs took a little over 6 minutes each, which is still far too slow. I checked GPU utilization with nvidia-smi: all 8 GPUs were in use, but several were clearly under-utilized (GPU-Util ranged from 1% to 100% in the snapshot below):

$ nvidia-smi
Sun Nov 13 00:40:48 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07    Driver Version: 515.65.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   41C    P0   160W / 400W |  22562MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   43C    P0   160W / 400W |  16599MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   37C    P0   346W / 400W |  16705MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   40C    P0    94W / 400W |  16599MiB / 81920MiB |     74%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   34C    P0    77W / 400W |  16599MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   32C    P0    67W / 400W |  16599MiB / 81920MiB |     47%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   33C    P0    83W / 400W |  16599MiB / 81920MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   32C    P0    79W / 400W |  16455MiB / 81920MiB |      1%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    806252      C   ...onda/envs/ffcv/bin/python    16449MiB |
|    0   N/A  N/A    806253      C   ...onda/envs/ffcv/bin/python      867MiB |
|    0   N/A  N/A    806254      C   ...onda/envs/ffcv/bin/python      867MiB |
|    0   N/A  N/A    806255      C   ...onda/envs/ffcv/bin/python      867MiB |
|    0   N/A  N/A    806256      C   ...onda/envs/ffcv/bin/python      867MiB |
|    0   N/A  N/A    806257      C   ...onda/envs/ffcv/bin/python      867MiB |
|    0   N/A  N/A    806258      C   ...onda/envs/ffcv/bin/python      867MiB |
|    0   N/A  N/A    806259      C   ...onda/envs/ffcv/bin/python      867MiB |
|    1   N/A  N/A    806253      C   ...onda/envs/ffcv/bin/python    16593MiB |
|    2   N/A  N/A     18785      G   /usr/libexec/Xorg                  63MiB |
|    2   N/A  N/A     18848      G   /usr/bin/gnome-shell               41MiB |
|    2   N/A  N/A    806254      C   ...onda/envs/ffcv/bin/python    16593MiB |
|    3   N/A  N/A    806255      C   ...onda/envs/ffcv/bin/python    16593MiB |
|    4   N/A  N/A    806256      C   ...onda/envs/ffcv/bin/python    16593MiB |
|    5   N/A  N/A    806257      C   ...onda/envs/ffcv/bin/python    16593MiB |
|    6   N/A  N/A    806258      C   ...onda/envs/ffcv/bin/python    16593MiB |
|    7   N/A  N/A    806259      C   ...onda/envs/ffcv/bin/python    16449MiB |
+-----------------------------------------------------------------------------+
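
A single nvidia-smi snapshot can be misleading, so averaging utilization over a short window may give a clearer picture. The sketch below uses the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the function names and the 30-sample window are arbitrary choices of mine, and the parser assumes every GPU reports a plain integer percentage.

```python
import statistics
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Parse 'index, utilization' CSV rows into {gpu_index: percent}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        result[int(index)] = int(util)
    return result


def sample_utilization(samples=30, interval=1.0):
    """Poll nvidia-smi and return the mean utilization per GPU."""
    history = {}
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for index, util in parse_utilization(out).items():
            history.setdefault(index, []).append(util)
        time.sleep(interval)
    return {index: statistics.mean(values) for index, values in history.items()}


# The GPU-Util column of the snapshot above, in the CSV form nvidia-smi emits.
snapshot = "0, 100\n1, 100\n2, 100\n3, 74\n4, 100\n5, 47\n6, 23\n7, 1\n"
per_gpu = parse_utilization(snapshot)
print(f"mean utilization in the snapshot: {statistics.mean(per_gpu.values()):.1f}%")
```

Calling `sample_utilization()` while training is running would return the per-GPU averages over about 30 seconds.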

My server has 8 NVIDIA A100 GPUs (SXM4, 80 GB) and 512 GB of RAM, which is comparable to the hardware used in your experiments.

What should I check to find out what went wrong? Could you please try to reproduce the above on your side? I expect it would take only a few minutes: the write_imagenet step is slow, but you already have the files, so it can be skipped.
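
In case it helps narrow things down, this is a minimal sketch of the host-side checks I can run in the meantime. It rests on two assumptions of mine: that `data.num_workers` applies per process, so 8 processes x 8 workers compete for CPU cores, and that `data.in_memory=1` depends on the dataset files fitting in RAM. `host_report` is a hypothetical helper, and the RAM query only works on Linux.

```python
import os


def host_report(dataset_paths, workers_per_process, world_size):
    """Compare data-loading demand with what the host can offer."""
    # Hypothetical helper for illustration; not part of FFCV.
    total_bytes = sum(os.path.getsize(p) for p in dataset_paths if os.path.exists(p))
    report = {
        "cpu_cores": os.cpu_count(),
        "loader_workers": workers_per_process * world_size,
        "dataset_gib": total_bytes / 2**30,
    }
    try:
        # Linux only: physical RAM = page size * number of physical pages.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        report["ram_gib"] = ram_bytes / 2**30
    except (AttributeError, ValueError, OSError):
        report["ram_gib"] = None
    return report


home = os.path.expanduser("~")
paths = [
    os.path.join(home, "data/imagenet_ffcv/jpg50/train_500_0.50_90.ffcv"),
    os.path.join(home, "data/imagenet_ffcv/jpg50/val_500_0.50_90.ffcv"),
]
for key, value in host_report(paths, workers_per_process=8, world_size=8).items():
    print(f"{key}: {value}")
```

If `loader_workers` exceeds `cpu_cores`, or `dataset_gib` approaches `ram_gib`, the data pipeline rather than the GPUs may be the bottleneck.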

Thank you very much in advance for your response!
