pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Assertion error during kinetics400 validation #4839

Closed datumbox closed 2 years ago

datumbox commented 2 years ago

🐛 Describe the bug

Running on main:

torchrun --nproc_per_node=8 train.py --data-path /datasets01/kinetics/070618/400/ --train-dir=val --val-dir=val --batch-size=16 --sync-bn --test-only --pretrained --cache-dataset

throws the following error:

Test:  [2200/3008]  eta: 0:11:24  loss: 2.6703 (2.1475)  acc1: 43.7500 (57.4938)  acc5: 68.7500 (77.8623)  time: 0.9043  data: 0.6405  max mem: 5888
Traceback (most recent call last):
  File "train.py", line 392, in <module>
    main(args)
  File "train.py", line 273, in main
    evaluate(model, criterion, data_loader_test, device=device)
  File "train.py", line 62, in evaluate
    for video, target in metric_logger.log_every(data_loader, 100, header):
  File "/private/home/vvryniotis/vision/references/video_classification/utils.py", line 128, in log_every
    for obj in iterable:
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/_utils.py", line 438, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/private/home/vvryniotis/.conda/envs/datumbox/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/private/home/vvryniotis/vision/torchvision/datasets/kinetics.py", line 233, in __getitem__
    video, audio, info, video_idx = self.video_clips.get_clip(idx)
  File "/private/home/vvryniotis/vision/torchvision/datasets/video_utils.py", line 362, in get_clip
    assert len(video) == self.num_frames, f"{video.shape} x {self.num_frames}"
AssertionError: torch.Size([17, 288, 352, 3]) x 16

If we apply the following patch:

$ git diff
diff --git a/torchvision/datasets/video_utils.py b/torchvision/datasets/video_utils.py
index f0f19e33..2254f8c5 100644
--- a/torchvision/datasets/video_utils.py
+++ b/torchvision/datasets/video_utils.py
@@ -359,8 +359,8 @@ class VideoClips:
                 resampling_idx = resampling_idx - resampling_idx[0]
             video = video[resampling_idx]
             info["video_fps"] = self.frame_rate
-        assert len(video) == self.num_frames, f"{video.shape} x {self.num_frames}"
-        return video, audio, info, video_idx
+        #assert len(video) == self.num_frames, f"{video.shape} x {self.num_frames}"
+        return video[:self.num_frames], audio[:self.num_frames], info, video_idx

     def __getstate__(self):
         video_pts_sizes = [len(v) for v in self.video_pts]

We get an accuracy that is noticeably below the expected one (about one point on both metrics):

Result:
 * Clip Acc@1 56.488 Clip Acc@5 77.773

Expected:
 * Clip Acc@1 57.50 Clip Acc@5 78.81

Questions:

cc @pmeier @fmassa @bjuncek

Versions

Latest main 0817f7f

fmassa commented 2 years ago

Thanks for spotting this @datumbox !

To answer your questions:

datumbox commented 2 years ago

Thanks for the reply Francisco. Given that our reference is basically broken, I bumped the priority.

An alternative that could buy us time is to submit a temporary patch to remove the assertion until this is properly investigated and fixed. I'm not particularly fond of this, but it might be worth considering if the actual fix is complex and requires time. I'll leave @prabhat00155 and @bjuncek to comment on that.
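As a rough illustration of what such a temporary patch could do instead of silently dropping the assertion, here is a minimal sketch that trims overlong clips with a warning but still fails on clips that are too short. The helper name `trim_clip` is hypothetical and not part of torchvision. It works on any sliceable sequence and handles only video frames; whether audio should be trimmed too is left open.

```python
import warnings


def trim_clip(frames, num_frames):
    """Return exactly num_frames frames, or raise if there are too few.

    Hypothetical helper, not part of torchvision. `frames` is any sliceable
    sequence (a plain list here; an object indexed along its first dimension
    would be sliced the same way).
    """
    if len(frames) < num_frames:
        # Too few frames cannot be repaired by slicing, so still fail loudly.
        raise ValueError(f"expected {num_frames} frames, got {len(frames)}")
    if len(frames) > num_frames:
        # Keep a trace of the problem instead of hiding it completely.
        warnings.warn(f"trimming clip from {len(frames)} to {num_frames} frames")
    return frames[:num_frames]


# Toy example mirroring the failing case above: 17 frames where 16 were expected.
clip = trim_clip(list(range(17)), 16)
print(len(clip))
```

The warning keeps the affected clips visible in the logs, so a temporary workaround of this kind would not hide the underlying issue.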

Concerning the accuracies, I feel that the frame issue shouldn't affect them too much. It's unclear how many records have issues, but as you can see from the log, we parsed roughly 75% of the validation data before hitting the first problematic record. While working on the multi-weights project, I've run tests on multiple existing models and noticed quite some variation compared to our documentation (though not as bad as this one). Definitely worth investigating more.
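One way to find out how many records are affected would be to tally clip lengths over the whole dataset. The sketch below is only an illustration: `get_clip` is a hypothetical callable that returns the frames of one clip, and it assumes the length assertion has been disabled so that offending clips are returned rather than raising. The toy data is made up.

```python
from collections import Counter


def count_clip_lengths(get_clip, num_clips):
    """Tally how many clips have each frame count.

    `get_clip(idx)` is a hypothetical callable returning the frames of clip
    `idx`; it stands in for whatever the real dataset exposes.
    """
    lengths = Counter()
    for idx in range(num_clips):
        lengths[len(get_clip(idx))] += 1
    return lengths


# Toy stand-in for a real dataset: every 1000th clip has 17 frames, not 16.
def fake_get_clip(idx):
    return [0] * (17 if idx % 1000 == 999 else 16)


hist = count_clip_lengths(fake_get_clip, 3000)
print(hist)
```

A histogram of this kind would show whether the 17-frame clip is a one-off or part of a larger pattern.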

bjuncek commented 2 years ago

In terms of accuracy, this is weird and not expected for a single-frame difference. Having said that, many things have changed in the setup, potentially even the files used: IIRC, we trained it on a resampled 480 version of Kinetics from the FAIR cluster, which was also hosted on /datasets/kinetics/07062018, not under 400 but in another subfolder there.

@datumbox if possible, one should a) apply the patch mentioned and b) run it on the publicly available Kinetics version (downloading via the torchvision dataset should work fine) and re-run the ref scripts. For a long time there were many different dataset versions, which depended on resampling, the region from which the dataset was downloaded, and general dataset degradation due to the YouTube TOS. Now that we have a publicly available version of the videos for the first time, let's use this opportunity to update our references.

datumbox commented 2 years ago

@bjuncek Thanks for confirming that we were not using the right dataset. Indeed, I can see a 480 version called val_avi-480p. Unfortunately I don't have the bandwidth to run complex investigations for you, but I will run the model on the 480 version and let you know if the accuracy matches.

datumbox commented 2 years ago

I ran the following and got:

torchrun --nproc_per_node=8 train.py --data-path /datasets01/kinetics/070618/ --train-dir=val_avi-480p --val-dir=val_avi-480p --batch-size=64 --sync-bn --test-only --pretrained --cache-dataset
Acc@1 57.029 Clip Acc@5 78.352

As you can see, the results are closer but not identical to the reported numbers. I think this requires additional investigation, potentially trying all available datasets on DevFAIR to see which one is the right one.

@bjuncek Do you have other information you could share on how the models were trained? Logs? Training paths? Anything can help.

@prabhat00155 I see that you self-assigned the ticket so I assume you plan to investigate. Let me know if you need anything from me.

bjuncek commented 2 years ago

@datumbox I've been able to confirm that I don't run into an error anymore. Could you double check on your end (I just used kinetics400 val set like in the example above)?

datumbox commented 2 years ago

@bjuncek I confirm that this is solved on the latest main. Thanks!