If YoutubeDL fails to fully download a video, it often leaves .part and .part-FragNN files behind in the tmp directory. For large datasets these can consume a significant amount of disk space unless something cleans them up.
The exception handler deletes the target video .mp4, but often the transfer never completed and the .mp4 file was never created:
https://github.com/iejMac/video2dataset/blob/83afef059ba1a29eb92bb6cb922f1f8e0ffd5965/video2dataset/data_reader.py#L230-L232
Please consider also cleaning up the .part(-Frag*) files in the exception path. Since a unique name (UUID) is used for the file, the leftovers can't be reused by a later video2dataset run anyway. However, if retry logic were added to the download (which would actually be nice), yt-dlp might be able to reuse the temporary parts and resume the download (I believe that's why yt-dlp itself doesn't clean them up, though I'm not entirely sure such a resume is possible).