hche11 / VGGSound

VGGSound: A Large-scale Audio-Visual Dataset
http://www.robots.ox.ac.uk/~vgg/data/vggsound/

Number of entries in vggsound.csv does not match the test and train split files #19

Open carandraug opened 5 months ago

carandraug commented 5 months ago

The file vggsound.csv lists 199467 entries. That number does not match the sum of the test and train files. See:

$ wc -l data/train.csv data/test.csv 
 183730 data/train.csv
  15446 data/test.csv
 199176 total
$ wc -l data/vggsound.csv 
199467 data/vggsound.csv

The vggsound.csv file has an extra 291 entries. The extra entries are in both the train and test splits:

$ python3 -c 'import csv; [print(x[3]) for x in csv.reader(open("data/vggsound.csv"))]' | sort | uniq -c
  15496 test
 183971 train
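
For reference, the per-split excess follows directly from the counts above. This is only a quick arithmetic check; the dictionary names are made up for illustration and the numbers are copied from the `wc -l` and `uniq -c` output:

```python
# Counts copied from the command output above (not re-read from disk).
split_file_counts = {"train": 183730, "test": 15446}   # wc -l on train.csv / test.csv
master_counts = {"train": 183971, "test": 15496}       # split column of vggsound.csv

# Extra entries per split = entries in vggsound.csv minus entries in the split file.
excess = {split: master_counts[split] - split_file_counts[split]
          for split in master_counts}

print(excess)                # {'train': 241, 'test': 50}
print(sum(excess.values()))  # 291
```

So 241 of the extra entries are labelled train and 50 are labelled test.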

I happen to have a copy of the file vggsound.csv as downloaded from the VGG website and these numbers matched.

ppx-hub commented 3 months ago

I checked the full compressed video archive provided by the author here, and the total number of videos after decompression is 199,176, which matches the combined count of the train and test files. So I think vggsound.csv does list 291 extra entries.
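
To see which rows are the extra ones, a set difference between vggsound.csv and the split files should work. The sketch below is untested against the real data and rests on two assumptions: that vggsound.csv rows are `youtube_id, start_seconds, label, split` with no header (consistent with the `x[3]` column used above), and that the first column of train.csv and test.csv is an identifier of the form `<youtube_id>_<start padded to 6 digits>`, optionally with a file extension. The second assumption in particular may not match the actual files, so adjust `split_key` as needed. All function names here are hypothetical.

```python
import csv


def master_key(row):
    # ASSUMPTION: vggsound.csv row is [youtube_id, start_seconds, label, split].
    return f"{row[0]}_{int(row[1]):06d}"


def split_key(row):
    # ASSUMPTION: first column of train.csv/test.csv looks like
    # "<youtube_id>_<start:06d>" with an optional extension such as ".mp4".
    return row[0].split(".")[0]


def read_rows(path):
    # Read a CSV file into a list of rows, skipping blank lines.
    with open(path, newline="") as f:
        return [row for row in csv.reader(f) if row]


def find_extra(master_rows, split_rows):
    # Rows of the master list whose key does not appear in any split file.
    known = {split_key(row) for row in split_rows}
    return [row for row in master_rows if master_key(row) not in known]


# Example usage with the paths from this issue:
#   master = read_rows("data/vggsound.csv")
#   splits = read_rows("data/train.csv") + read_rows("data/test.csv")
#   for row in find_extra(master, splits):
#       print(",".join(row))
```

If the key format assumption holds, this should print the 291 rows in question; a much larger number would suggest the keys are built differently in the split files.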