libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

CIFAR example not training at all #246

Closed aph-naph closed 1 year ago

aph-naph commented 1 year ago

I am new to FFCV and wanted to try the CIFAR example to see how well it works before using it for my research. Unfortunately, the example does not train the model at all and constantly predicts at 10% accuracy. I changed the example code only to run evaluation during training (printing test accuracy each epoch so that improvements would be visible), and it shows no improvement over the iterations: it always prints 10% test accuracy. It seems that no gradient updates are happening.

[screenshot: per-epoch evaluation output with test accuracy stuck at 10%]

Apart from the minor changes shown in the diffs below, I have not modified the example code or the default config.

default_config.yaml

diff --git a/examples/cifar/default_config.yaml b/examples/cifar/default_config.yaml
index 2afe5c4..f13a05e 100644
--- a/examples/cifar/default_config.yaml
+++ b/examples/cifar/default_config.yaml
@@ -1,8 +1,8 @@
 data:
   gpu: 0
-  num_workers: 8
-  train_dataset: /tmp/cifar_train.beton
-  val_dataset: /tmp/cifar_test.beton
+  num_workers: 16
+  train_dataset: ./cifar10/cifar_train.beton
+  val_dataset: ./cifar10/cifar_test.beton
 training:
   batch_size: 512
   epochs: 24
@@ -13,4 +13,4 @@ training:
   weight_decay: 5e-4
   label_smoothing: 0.1
   lr_tta: true
-  num_workers: 8
\ No newline at end of file
+  num_workers: 16

train_cifar.py

diff --git a/examples/cifar/train_cifar.py b/examples/cifar/train_cifar.py
index c46b1a3..206262d 100644
--- a/examples/cifar/train_cifar.py
+++ b/examples/cifar/train_cifar.py
@@ -174,12 +174,14 @@ def train(model, loaders, lr=None, epochs=None, label_smoothing=None,
             scaler.step(opt)
             scaler.update()
             scheduler.step()
+        evaluate(model, loaders, test_only=True)

 @param('training.lr_tta')
-def evaluate(model, loaders, lr_tta=False):
+def evaluate(model, loaders, lr_tta=False, test_only=False):
+    modes = ['test'] if test_only else ['train', 'test']
     model.eval()
     with ch.no_grad():
-        for name in ['train', 'test']:
+        for name in modes:
             total_correct, total_num = 0., 0.
             for ims, labs in tqdm(loaders[name]):
                 with autocast():
@@ -188,6 +190,7 @@ def evaluate(model, loaders, lr_tta=False):
                         out += model(ims.flip(-1))
                     total_correct += out.argmax(1).eq(labs).sum().cpu().item()
                     total_num += ims.shape[0]
+            print(f"{name} stats: total correct: {total_correct}, total num: {total_num}")
             print(f'{name} accuracy: {total_correct / total_num * 100:.1f}%')

I have also run the example completely unmodified, with the same result: both the train-set and test-set evaluations performed after training print only 10% accuracy.

[screenshot: final evaluation output showing 10% train and test accuracy]

PC configuration:
CPU: Intel i7-10700 (16) @ 4.800GHz
GPU: NVIDIA GeForce GTX 1660 SUPER
Memory: 32 GB
OS: Ubuntu 20.04.4 LTS x86_64

Would love your help, as I was very excited to try this library.

aph-naph commented 1 year ago

The problem is due to null outputs from the model. I'm not sure why the default model is producing them, but looking into it further, the output of the very first layer is already null, so I suspect the input is the cause. I am using the unmodified example code, which processes the CIFAR images with FFCV and passes them to the model, so the processed output from FFCV may be what is causing the error.
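For reference, here is roughly how such a check can be done (a minimal sketch, assuming the model and loaders objects built by train_cifar.py are in scope; not the exact code I ran):

# Sketch: pull one batch from the FFCV loader and check the inputs and the
# model outputs for NaN/Inf values, to narrow down where the bad values appear.
import torch as ch
from torch.cuda.amp import autocast

model.eval()
with ch.no_grad(), autocast():
    ims, labs = next(iter(loaders['train']))
    print('inputs finite :', ch.isfinite(ims).all().item(),
          '| min/max:', ims.min().item(), ims.max().item())
    out = model(ims)
    print('outputs finite:', ch.isfinite(out).all().item())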

@andrewilyas tagging you, as this could be critical for new users since the default example is not working as expected.

andrewilyas commented 1 year ago

Hi @aph-naph ! Can you post your PyTorch and Python configurations?

aph-naph commented 1 year ago

Hi @andrewilyas Thank you for looking into this

PyTorch versions:
torch: 1.12.1
torchvision: 0.13.1
pytorch-pfn-extras: 0.5.8

Python version: 3.9.13
FFCV version: 0.0.3

The conda environment was created using the following commands from the README:

conda create -y -n ffcv python=3.9 cupy pkg-config compilers libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge
conda activate ffcv
pip install ffcv

Happy to share further details!

aph-naph commented 1 year ago

Hi, did you get a chance to reproduce the issue? Is there a Docker image available? I'd be glad to use it instead, since I'm not able to get the package working locally; any FFCV version is fine. @andrewilyas

andrewilyas commented 1 year ago

Hi @aph-naph ! Sorry for the late reply. Do you think you could try this docker instance and see if it works?

https://github.com/libffcv/ffcv/blob/main/docker/Dockerfile

aph-naph commented 1 year ago

I ran it in Docker and it unfortunately gives the same results as the attached screenshot :( Does the example work for you locally, @andrewilyas? I'm wondering whether everyone else is having the same problem or if it's just me.

andrewilyas commented 1 year ago

It works locally for me for sure---did you try regenerating the dataset files?

aph-naph commented 1 year ago

Yes, I've tried regenerating the dataset files by deleting /tmp/* and re-running write_datasets.py, and it produces the same errors :( Could you please share the PyTorch and Python configuration of the environment in which running train_cifar.sh on a fresh clone of the repo works as expected, @andrewilyas? I will switch to that environment and give it a try.

andrewilyas commented 1 year ago

Hi @aph-naph ! I know this is super late, but I finally figured out the reason for this bug---it was a bug in the RandomTranslate augmentation that was present on the main branch but not the dev branch I was using. It's now fixed on main, and in FFCV 1.0.1, released today!
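For anyone landing here later, a quick sanity check that the installed FFCV includes the fix (a minimal sketch using the standard library; any way of reading the installed version works):

from importlib.metadata import version  # Python 3.8+
print(version('ffcv'))  # should report 1.0.1 or newer to pick up the RandomTranslate fix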