maciej-sypetkowski / kaggle-rcic-1st

1st Place Solution for Kaggle Recursion Cellular Image Classification Challenge -- https://www.kaggle.com/c/recursion-cellular-image-classification/
MIT License

ValueError: cannot convert float NaN to integer #6

Closed evan-wehi closed 3 years ago

evan-wehi commented 3 years ago

After several hours of training with default options and after remediating the sirna coding issue, I get

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Traceback (most recent call last):
  File "main.py", line 504, in <module>
    main(args)
  File "main.py", line 492, in main
    train(args, model)
  File "main.py", line 444, in train
    for i, (X, S, _, Y) in enumerate(train_loader):
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 7.
Original Traceback (most recent call last):
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/stornext/HPCScratch/home/thomas.e/.conda/envs/rxrx/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/vast/scratch/users/thomas.e/kaggle-rcic-1st/dataset.py", line 317, in __getitem__
    r.append(torch.tensor(d[-1], dtype=torch.long))
ValueError: cannot convert float NaN to integer

Is it a bad value in one of the CSV files?

maciej-sypetkowski commented 3 years ago

"after remediating the sirna coding issue"

You're referring to https://github.com/maciej-sypetkowski/kaggle-rcic-1st/issues/4 , right?

Yes, it seems possible that the sirna column is either missing or doesn't contain an integer in some rows. Also check the control CSVs, including test_controls.csv.
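One way to run that check is sketched below. It is not part of the repository: the helper name `find_bad_sirna_rows` is hypothetical, and it assumes each CSV has a header row with a column literally named `sirna`. It uses only the standard library and reports every row whose value is empty or does not parse as an integer, which are the values that would reach `torch.tensor(..., dtype=torch.long)` as NaN.

```python
import csv


def find_bad_sirna_rows(csv_path, column="sirna"):
    """Return (row_position, raw_value) pairs for rows whose `column`
    is empty or not an integer. The header counts as row 1.

    Hypothetical helper; the column name `sirna` is an assumption.
    """
    bad = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"{csv_path} has no '{column}' column")
        for position, row in enumerate(reader, start=2):
            # Short rows yield None for missing fields; treat that as empty.
            raw = (row[column] or "").strip()
            try:
                int(raw)
            except ValueError:
                # Empty strings, "nan", and values like "12.0" all land here.
                bad.append((position, raw))
    return bad
```

Running it on train.csv, test.csv, and each control CSV (for example `find_bad_sirna_rows("test_controls.csv")`) should return an empty list if every row has a usable label.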

evan-wehi commented 3 years ago

There were missing values in train.csv and test.csv even after I fixed the sirna ID, because some entries in those files aren't in the metadata from RXRX. ¯\_(ツ)_/¯
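For anyone hitting the same thing, here is a rough sketch of how such entries could be listed. It is an illustration, not what was actually run in this thread: the function name is hypothetical, and it assumes both files share a key column named `id_code`, which may differ in your copy of the data.

```python
import csv


def ids_missing_from_metadata(data_csv, metadata_csv, key="id_code"):
    """Return the `key` values found in data_csv but absent from metadata_csv.

    Hypothetical helper; the shared key column `id_code` is an assumption.
    """
    with open(metadata_csv, newline="") as f:
        known = {row[key] for row in csv.DictReader(f)}
    with open(data_csv, newline="") as f:
        # Keep file order so the result is easy to compare against the CSV.
        return [row[key] for row in csv.DictReader(f) if row[key] not in known]
```

Rows returned by this check have no metadata to take a label from, so they would need to be dropped or given a label before training.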