Open pengzhiliang opened 2 months ago
Upon further inspection of the 00000_stats.json file, I expected to find 98 unique long videos within the subdirectory 00000/, each with a distinct prefix (key). However, my observations contradict this expectation:
xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4
0000009_00000.mp4
0000009_00001.mp4
0000013_00000.mp4
0000013_00001.mp4
...
...
...
0000094_00003.mp4
0000094_00004.mp4
xxx: /mnt/data/pandam70m/data/part_00001# ls *.mp4 | cut -d'_' -f1 | sort | uniq | wc -l
21 # To find out the number of unique prefixes for the MP4 files, the filenames are processed to extract the part before the underscore (_).
The actual count of unique prefixes is only 21, which is substantially lower than the expected 98. This discrepancy leads me to question whether the downloaded long videos were deleted before they could be segmented into shorter clips. Could there be an issue with the processing or storage logic that is causing the complete videos to be removed prematurely?
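For anyone wanting to repeat this check outside the shell, here is a minimal Python sketch of the same prefix count. It assumes clip files are named `<key>_<clip index>.mp4`, as in the listing above; the helper names are hypothetical and not part of the dataset tooling.

```python
from collections import Counter
from pathlib import Path


def clips_per_prefix(filenames):
    """Map each prefix (the text before the first underscore) to its clip count."""
    # Path(...).name drops any leading directory such as "00000/".
    return Counter(Path(name).name.split("_", 1)[0] for name in filenames)


def count_unique_prefixes(filenames):
    """Number of distinct source-video prefixes among the given clip files."""
    return len(clips_per_prefix(filenames))


def count_in_directory(directory):
    """Apply the same count to every .mp4 file directly inside `directory`."""
    return count_unique_prefixes(p.name for p in Path(directory).glob("*.mp4"))


# A few filenames taken from the listing above.
sample = [
    "0000009_00000.mp4",
    "0000009_00001.mp4",
    "0000013_00000.mp4",
    "0000013_00001.mp4",
    "0000094_00003.mp4",
    "0000094_00004.mp4",
]
print(count_unique_prefixes(sample))  # 3
```

Calling `count_in_directory("00000")` from inside the part directory should match the `cut | sort | uniq | wc -l` pipeline, and `clips_per_prefix` additionally shows how many clips each surviving source video produced.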
I would greatly appreciate any insights or suggestions you might have to resolve this matter. Thank you for your time and assistance.
Hi @pengzhiliang, thanks for your interest in our dataset! Did you notice any errors or warning messages during downloading? Messages like "...private video ..." or "...Skipping player responses..." are fine, but no other error messages should appear.
Hello,
I’ve encountered an odd phenomenon where the amount of downloadable data appears to be significantly less than anticipated, potentially well below 70M.
Here are the details:
I downloaded the first 10,000 rows of a CSV file, which should contain approximately 187,111 video clips based on the following calculation: 70,723,513 / 3,779,764 * 10,000. These clips were to be downloaded using the files output_format, distributed into 100 subfolders, with each subfolder expected to contain about 1,871 video clips (calculated as 187,111 / 100).
However, upon counting the clips in each subfolder, I discovered only around 200 clips each, which is significantly lower than the expected 1,871. This discrepancy is puzzling, especially since the download success rate is notably high (>99%).
The Bash output is as follows:
Here is my configuration:
Would you be able to help me analyze what might be causing this issue? Your assistance would be greatly appreciated.
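The expected-count estimate in this comment can be reproduced with a short Python sketch. It assumes, as the comment does, that clips are spread evenly across CSV rows and across subfolders; the constants are the figures quoted in the comment and the function name is hypothetical.

```python
TOTAL_CLIPS = 70_723_513  # total clip count quoted in the comment
TOTAL_ROWS = 3_779_764    # total CSV row count quoted in the comment


def expected_clips(rows_downloaded, num_subfolders):
    """Estimate total clips and clips per subfolder, assuming a uniform spread."""
    clips_per_row = TOTAL_CLIPS / TOTAL_ROWS
    total = clips_per_row * rows_downloaded
    return total, total / num_subfolders


total, per_folder = expected_clips(rows_downloaded=10_000, num_subfolders=100)
print(round(total), round(per_folder))  # 187111 1871

# Fraction of the expected clips actually present at about 200 per subfolder.
observed_per_folder = 200
print(f"{observed_per_folder / per_folder:.1%}")
```

Under these assumptions, roughly 200 clips per subfolder is only about a tenth of the estimate, which is hard to reconcile with a download success rate above 99% if that rate is counted per clip.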