iejMac / video2dataset

Easily create large video dataset from video urls
MIT License
533 stars 65 forks

Clean up tmp part files in case of d/l failure #304

Open pabl0 opened 8 months ago

pabl0 commented 8 months ago

If YoutubeDL fails to fully download a video, it often leaves .part and .part-FragNN files behind in the tmp directory. For large datasets, these can consume a significant amount of disk space unless something else cleans them up.

The exception handler deletes the final video .mp4, but often the transfer did not complete and that .mp4 file was never created:

https://github.com/iejMac/video2dataset/blob/83afef059ba1a29eb92bb6cb922f1f8e0ffd5965/video2dataset/data_reader.py#L230-L232

Please consider cleaning up the .part(-Frag*) files when an exception occurs. Since a unique name (UUID) is used for each file, the leftover parts can't be reused if video2dataset is run again later. But if some retry logic is added to downloading (which would be nice, actually), then I believe yt-dlp might be able to reuse the temporary parts and resume the download (I believe that's why yt-dlp itself does not clean them up, but I am not entirely sure such a resume is possible).
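
A minimal sketch of what such a cleanup could look like, assuming the temporary files all sit in one directory and share the UUID as their filename prefix (for example `<uuid>.mp4.part` and `<uuid>.mp4.part-Frag3`). The helper name `remove_partial_downloads` and its parameters are made up for illustration; they are not part of video2dataset or yt-dlp:

```python
import glob
import os


def remove_partial_downloads(tmp_dir, stem):
    """Delete leftover .part / .part-FragNN files for one download.

    tmp_dir: directory the downloader writes to (assumed, e.g. "/tmp").
    stem:    unique filename prefix of the download (assumed to be the UUID).
    Returns the sorted list of paths that were removed.
    """
    removed = []
    # Match "<stem>.<anything>" so that a different stem sharing the same
    # leading characters is not touched.
    pattern = os.path.join(glob.escape(tmp_dir), glob.escape(stem) + ".*")
    for path in glob.glob(pattern):
        suffix = os.path.basename(path)[len(stem):]
        # Only remove partial-download files; a completed .mp4 is left alone.
        if ".part" not in suffix:
            continue
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            # The file may already be gone; cleanup is best effort.
            pass
    return sorted(removed)
```

Something like `remove_partial_downloads("/tmp", file_id)` could then be called from the existing exception handler, where `file_id` stands for whatever UUID the download was given. If retry logic is added later, it might make sense to call this only after the last attempt has failed, so that the parts stay available in case yt-dlp can resume from them.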