Closed by aph-naph 1 year ago
The problem is due to null outputs from the model. I'm not sure why the default model produces null outputs, but on closer inspection the output of the first layer is already null, so I suspect the input is causing the problem. I am using the default example code, which loads CIFAR images with FFCV and passes them to the model. This suggests that the processed output from FFCV could be causing the error.
@andrewilyas Tagging you since this could be critical for new-user acquisition: the default example is not working as expected.
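One way to narrow down where the bad values first appear is to check each stage in order and stop at the first one whose output is not finite. The sketch below is framework-free and assumes that "null" here means NaN or inf; the stage functions (`scale`, `broken_augment`, `layer1`) are hypothetical stand-ins, not FFCV or model code. In PyTorch the same idea is usually expressed with `torch.isfinite` on the batch and forward hooks on the layers.

```python
import math
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Vector = List[float]
Stage = Tuple[str, Callable[[Vector], Vector]]


def all_finite(values: Iterable[float]) -> bool:
    """Return True if no value is NaN or +/-inf."""
    return all(math.isfinite(v) for v in values)


def first_bad_stage(x: Vector, stages: Sequence[Stage]) -> Optional[str]:
    """Run the stages in order and name the first one that emits a non-finite value.

    Returns "input" if the raw input is already bad, or None if every stage is clean.
    """
    if not all_finite(x):
        return "input"
    for name, fn in stages:
        x = fn(x)
        if not all_finite(x):
            return name
    return None


# Hypothetical stand-ins for a preprocessing step, an augmentation, and a layer.
def scale(x: Vector) -> Vector:
    return [v / 255.0 for v in x]


def broken_augment(x: Vector) -> Vector:
    # Simulates an augmentation that corrupts one value.
    return [float("nan")] + x[1:]


def layer1(x: Vector) -> Vector:
    return [2.0 * v + 1.0 for v in x]


pixels = [0.0, 128.0, 255.0]
clean = [("scale", scale), ("layer1", layer1)]
faulty = [("scale", scale), ("augment", broken_augment), ("layer1", layer1)]

print(first_bad_stage(pixels, clean))   # None
print(first_bad_stage(pixels, faulty))  # augment
```

If the first flagged stage is in the data pipeline rather than the model, the first layer's output will be bad regardless of the weights, which matches the symptom described above.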
Hi @aph-naph ! Can you post your PyTorch and Python configurations?
Hi @andrewilyas Thank you for looking into this
PyTorch versions:
torch: 1.12.1
torchvision: 0.13.1
pytorch-pfn-extras: 0.5.8
Python version: 3.9.13
FFCV version: 0.0.3
The conda environment was created using the following commands from the README:
conda create -y -n ffcv python=3.9 cupy pkg-config compilers libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge
conda activate ffcv
pip install ffcv
Happy to share further details !
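For anyone else asked to post their configuration, the versions above can be collected in one go. This is a minimal sketch using only the standard library; the package names passed in are assumed to match the installed distribution names, and anything missing is reported as "not installed" instead of raising.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages: Iterable[str]) -> Dict[str, str]:
    """Collect the Python version plus the version of each named package."""
    report = {"python": platform.python_version()}
    for name in packages:
        report[name] = installed_version(name)
    return report


# Distribution names are assumptions; adjust them to your environment.
for name, version in environment_report(["torch", "torchvision", "ffcv", "numba"]).items():
    print(f"{name}: {version}")
```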
Hi, did you get a chance to reproduce the issue? If there is a docker image available, I'll be glad to use it instead since I'm not able to get the package working locally. Is there one available? Any ffcv version is fine @andrewilyas
Hi @aph-naph ! Sorry for the late reply. Do you think you could try this docker instance and see if it works?
I ran it in Docker and it unfortunately gives the same results as in the attached screenshot :( Does the example work for you locally, @andrewilyas? I'm wondering whether everyone else is having the same problem or if it's just me.
It works locally for me for sure---did you try regenerating the dataset files?
Yes, I've tried regenerating the dataset files by deleting /tmp/* and re-running write_datasets.py, and it produces the same errors :( @andrewilyas, could you please share the PyTorch and Python configuration of an environment in which train_cifar.sh works as expected on a fresh clone of the repo? I will switch to that environment and give it a try.
Hi @aph-naph ! I know this is super late, but I finally figured out the reason for this bug: it was a bug in the RandomTranslate augmentation that was present on the main branch but not on the dev branch I was using. It's now fixed on main, and in FFCV 1.0.1, released today!
I am new to FFCV and wanted to try the CIFAR example to see how well it works before using it in my research. Unfortunately, the example does not train the model and instead constantly reports 10% accuracy. I changed the example code only to run evaluation during training (printing the test loss so that any improvement would be visible), and it shows no improvement across iterations: test accuracy is always 10%. It seems that there are no gradient updates.
I have not otherwise modified the example code; the diffs below show the minor changes to the default config and code.
default_config.yaml
train_cifar.py
I have also run the unmodified example, which gave the same results: the train-set and test-set evaluations performed after training both reported only 10% accuracy.
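A flat 10% is worth reading as a signal in itself: CIFAR-10 has 10 classes with 1,000 test images each, so a model that always predicts the same class scores exactly 10%. The sketch below illustrates this with synthetic labels; the helper names `accuracy` and `looks_collapsed` are made up for illustration and are not part of the example code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def looks_collapsed(predictions: Sequence[int]) -> bool:
    """True if every prediction is the same class, a sign the model is not learning."""
    return len(set(predictions)) == 1


num_classes = 10
per_class = 1000  # the CIFAR-10 test set is balanced: 1,000 images per class

# Synthetic balanced labels standing in for the real test set.
labels = [c for c in range(num_classes) for _ in range(per_class)]

# A degenerate model that always predicts class 0.
constant_predictions = [0] * len(labels)

print(accuracy(constant_predictions, labels))  # 0.1
print(looks_collapsed(constant_predictions))   # True
```

Checking the distribution of predicted classes alongside the accuracy helps distinguish a collapsed model from one that is merely training slowly.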
PC Configuration
CPU: Intel i7-10700 (16) @ 4.800GHz
GPU: NVIDIA GeForce GTX 1660 SUPER
Memory: 32 GB
OS: Ubuntu 20.04.4 LTS x86_64
Would love your help as I was very excited to try this library