snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

[Errno 2] No such file or directory: '/tmp/395ca4ed-4ed1-4c27-b9a1-f9e0e9294c16.mp4' #9

Open qihao067 opened 4 months ago

qihao067 commented 4 months ago

The first time I ran the code, it worked but was slow, so I stopped it and changed 'processes_count' in the cfg file. When I ran the video2dataset command a second time, it started giving me this error and skipped these videos, saving only the JSON and txt files. I believe these videos were downloaded during the first run, but I have since cleaned the /tmp directory. I cannot find the code to re-download these videos. Can you help me fix this? Thanks

error:
[Errno 2] No such file or directory: '/tmp/a7191cb8-d948-48b1-ba70-1e890953518b.mp4'
[Errno 2] No such file or directory: '/tmp/446b7a87-a878-4b06-8df0-88facffb3d24.mp4'
[Errno 2] No such file or directory: '/tmp/a866306e-8405-428b-9455-3584856f5fbb.mp4'
[Errno 2] No such file or directory: '/tmp/2e93b9f5-a683-4b09-b2a7-13631b7def87.mp4'
[Errno 2] No such file or directory: '/tmp/609b1ad1-4eee-4e0e-a6e1-91eb853d95b8.mp4'
.....

pabl0 commented 4 months ago

I think this is the normal behavior of the upstream video2dataset tool as well: this is the error recorded in the stats.json file regardless of why the download failed. I'd like to see this improved to work more like its img2dataset counterpart, which records the actual failure reason, such as HTTP 403.
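To see how many samples failed with this generic message, the per-shard stats files can be tallied. The following is only a rough sketch: it assumes each shard writes a file ending in `_stats.json` that contains a `status_dict` mapping each status message to a count. That layout is an assumption here, so check it against your own output folder. The function name `tally_statuses` is made up for illustration. UUID temp paths are collapsed into one placeholder so that identical failures are grouped together.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Matches a UUID-named temp file such as
# /tmp/a7191cb8-d948-48b1-ba70-1e890953518b.mp4
UUID_PATH = re.compile(
    r"/tmp/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.mp4"
)


def tally_statuses(output_dir):
    """Sum status counts over all shard stats files in output_dir.

    Assumption: files are named '*_stats.json' and hold a 'status_dict'
    of {message: count}. Adjust both if your output looks different.
    """
    totals = Counter()
    for stats_file in sorted(Path(output_dir).glob("*_stats.json")):
        with open(stats_file) as f:
            stats = json.load(f)
        for message, count in stats.get("status_dict", {}).items():
            # Replace the random temp path so all such errors share one key.
            totals[UUID_PATH.sub("/tmp/<uuid>.mp4", message)] += count
    return totals
```

Calling `tally_statuses("path/to/output")` and printing `most_common()` on the result should show whether nearly every failure is the same "No such file or directory" message, which would suggest the download step itself is failing.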

With large datasets drawn from public sources like YouTube, a 100% success rate is not really possible. When you stop the download, some temporary files are left behind. Since they use unique UUID-based names, I don't think they will be reused, so you can clean them up before starting a new attempt.
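The cleanup described above could be scripted roughly as follows. This is a minimal sketch, not part of video2dataset: the function name `remove_stale_temp_videos` and its `dry_run` flag are made up for illustration, and it assumes the leftovers are `.mp4` files in `/tmp` named with a lowercase UUID, as in the error messages in this thread.

```python
import re
from pathlib import Path

# File names like 395ca4ed-4ed1-4c27-b9a1-f9e0e9294c16.mp4
UUID_MP4 = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.mp4$"
)


def remove_stale_temp_videos(tmp_dir="/tmp", dry_run=True):
    """Return UUID-named .mp4 files in tmp_dir, deleting them unless dry_run.

    Assumption: only leftover download files match this naming pattern.
    """
    matched = []
    for path in sorted(Path(tmp_dir).iterdir()):
        if path.is_file() and UUID_MP4.match(path.name):
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched
```

Running it first with the default `dry_run=True` only lists the matching files. Pass `dry_run=False` to delete them, and only while no download is running, since an active run may still be writing to files with the same naming pattern.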

qihao067 commented 3 months ago

Thanks for the swift response. But the problem is that if the code thinks the videos are already downloaded and looks for them in /tmp, then even if that fails, it will not download the videos again. (Not sure)

Specifically, the first time I downloaded the videos, everything was fine. But the second time, most of the videos were not downloaded. The data file looks like this:

Screenshot 2024-03-08 at 5 50 21 PM

This is the json file:

Screenshot 2024-03-08 at 5 53 02 PM

wowfingerlicker commented 3 months ago

Is it possible that my IP has been rate-limited by YouTube?

ChdDongyang commented 3 months ago

I encountered the same problem. Have you solved it?

gw00259532 commented 3 months ago

Me too. All the downloaded files are JSON and txt files.

nankepan commented 2 months ago

Hi, I encountered the same problem. How did you solve it?

lzhangbj commented 1 month ago

Same error here. Once the error happens, the corresponding shard stops downloading and the program hangs. Is there any solution, @tsaishien-chen? Thank you.

tsaishien-chen commented 1 month ago

Hi @lzhangbj, thanks for your interest in the Panda-70M dataset! I am guessing your server's IP might be blocked. Where is your server located? When you click the link in the CSV file, can you access the video source? Have you tried downloading the videos through a different proxy/IP?