snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

Discrepancy in Expected vs. Actual Number of Video Clips Downloaded #44

Open pengzhiliang opened 2 months ago

pengzhiliang commented 2 months ago

Hello,

I’ve encountered an odd issue: the number of clips I can actually download appears to be significantly less than anticipated, which would put the full dataset well below 70M clips.

Here are the details:

I downloaded the first 10,000 rows of the CSV file, which should correspond to approximately 187,111 video clips (70,723,513 clips / 3,779,764 videos × 10,000 rows). These clips were downloaded with output_format set to files and distributed across 100 subfolders, so each subfolder should contain about 1,871 clips (187,111 / 100).

However, upon counting the clips in each subfolder, I found only around 200 each, far below the expected ~1,871. This discrepancy is puzzling, especially since the reported download success rate is notably high (>99%).
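For reference, here is the expectation arithmetic as a minimal Python sketch (the totals come from the dataset release; the uniform spread of clips across rows and subfolders is my assumption):

# Back-of-the-envelope check; assumes clips are spread uniformly
# across CSV rows and output subfolders.
TOTAL_CLIPS = 70_723_513   # clips in Panda-70M
TOTAL_VIDEOS = 3_779_764   # source (long) videos, i.e. CSV rows
ROWS_DOWNLOADED = 10_000
SUBFOLDERS = 100

expected_clips = TOTAL_CLIPS / TOTAL_VIDEOS * ROWS_DOWNLOADED
print(round(expected_clips))               # ~187111
print(round(expected_clips / SUBFOLDERS))  # ~1871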

The Bash output is as follows:

xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.mp4 |wc -l  
223  

xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.json |wc -l  
225  # (success rate of clips is also high)

xxx: /mnt/data/pandam70m/data/part_00001# cat 00000_stats.json  
{  
    "count": 100,  
    "successes": 98,  
    "failed_to_download": 2,  
    "failed_to_subsample": 0,  
    "duration": 745.6420252323151,  
    "bytes_downloaded": 250422000,  
    "start_time": 1712764675.583739,  
    "end_time": 1712765421.2257643,  
    "status_dict": {  
        "success": 98,  
        "[Errno 2] No such file or directory: '/tmp/7ed953b1-50ca-4631-a2cf-5d388d2ad70a.mp4'": 1,  
        "[Errno 2] No such file or directory: '/tmp/578de3a9-0df1-4941-8aaf-1ccd29563093.mp4'": 1  
    }  
}  
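For anyone hitting the same thing, here is a minimal sketch that aggregates every per-shard stats file and compares the reported successes against the clips actually on disk (the paths mirror my layout above; adjust to yours):

import glob
import json
import os

part_dir = "/mnt/data/pandam70m/data/part_00001"

totals = {"count": 0, "successes": 0, "failed_to_download": 0, "failed_to_subsample": 0}
for stats_path in sorted(glob.glob(os.path.join(part_dir, "*_stats.json"))):
    with open(stats_path) as f:
        stats = json.load(f)
    for key in totals:
        totals[key] += stats[key]
    shard = os.path.basename(stats_path).split("_")[0]  # e.g. "00000"
    n_clips = len(glob.glob(os.path.join(part_dir, shard, "*.mp4")))
    print(f"{shard}: {stats['successes']}/{stats['count']} videos ok, {n_clips} clips on disk")

print(f"overall success rate: {totals['successes'] / totals['count']:.1%}")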

Here is my configuration:

subsampling: {}  

reading:  
    yt_args:  
        download_size: 360  
        download_audio: True  
        yt_metadata_args:  
        writesubtitles: True  
            subtitleslangs: ['en']  
            writeautomaticsub: True  
            get_info: True  
    timeout: 60  
    sampler: null  

storage:  
    number_sample_per_shard: 100  
    oom_shard_count: 5  
    captions_are_subtitles: False  

distribution:  
    processes_count: 1  
    thread_count: 2  
    subjob_size: 10000  
    distributor: "multiprocessing"  
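
For completeness, a minimal sketch of the video2dataset call this config is paired with; the CSV path and the column names (url, timestamp) are assumptions about the Panda-70M layout, so adjust them to your copy:

# Sketch of the download invocation (not the repo's exact script).
# url_col/clip_col names are assumptions about the Panda-70M CSV.
from video2dataset import video2dataset

video2dataset(
    url_list="panda70m_first10k.csv",  # hypothetical: the first 10,000 rows
    output_folder="/mnt/data/pandam70m/data/part_00001",
    input_format="csv",
    output_format="files",
    url_col="url",
    clip_col="timestamp",
    config="panda70m.yaml",  # the YAML shown above
)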

Would you be able to help me analyze what might be causing this issue? Your assistance would be greatly appreciated.

pengzhiliang commented 2 months ago

Upon further inspection: given the 00000_stats.json file, I expected to find 98 unique long videos within the subdirectory 00000/, each identified by a distinct filename prefix (key). However, my observations contradict this expectation:

xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4   
0000009_00000.mp4  
0000009_00001.mp4  
0000013_00000.mp4  
0000013_00001.mp4  
...  
...  
...  
0000094_00003.mp4  
0000094_00004.mp4  

xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4 | cut -d'_' -f1 | sort | uniq | wc -l  
21  # unique prefixes, i.e. the part of each filename before the underscore (_)

The actual count of unique prefixes is only 21, substantially lower than the expected 98. This makes me wonder whether the downloaded long videos were deleted before they could be segmented into shorter clips. Could there be an issue in the processing or storage logic that removes the full videos prematurely?
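Here is the cross-check as a minimal Python sketch (the paths mirror my layout; splitting on the underscore to recover the long-video key is an assumption based on the filenames above):

import glob
import json
import os

shard_dir = "/mnt/data/pandam70m/data/part_00001/00000"

# Unique long-video keys present on disk, from the "<key>_<clip>.mp4" naming.
prefixes = {os.path.basename(p).split("_")[0]
            for p in glob.glob(os.path.join(shard_dir, "*.mp4"))}

with open(shard_dir + "_stats.json") as f:
    stats = json.load(f)

print(len(prefixes))        # 21 long videos on disk
print(stats["successes"])   # 98 successes reported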

I would greatly appreciate any insights or suggestions you might have to resolve this matter. Thank you for your time and assistance.

tsaishien-chen commented 2 months ago

Hi @pengzhiliang, thanks for your interest in our dataset! Did you notice any errors or warning messages during downloading? Error messages like "...private video..." or "...Skipping player responses..." are fine, but any messages other than these should not appear.
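(One way to surface such messages from the per-shard stats, sketched under the assumption that every non-benign failure shows up in status_dict as in the dump earlier in this thread:)

import glob
import json

# Substrings the comment above describes as harmless.
BENIGN = ("private video", "Skipping player responses")

for stats_path in sorted(glob.glob("/mnt/data/pandam70m/data/part_00001/*_stats.json")):
    with open(stats_path) as f:
        status_dict = json.load(f)["status_dict"]
    for message, count in status_dict.items():
        if message == "success" or any(b in message for b in BENIGN):
            continue
        print(f"{stats_path}: {count}x {message}")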