snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

Debug request #39

Open syguan96 opened 3 months ago

syguan96 commented 3 months ago

Sorry to bother you. I was only able to download 653 videos of the test split, so I tried to debug. Have you run into this problem?

Failed to delete the file: /tmp/3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4. Error: [Errno 2] No such file or directory: '/tmp/3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4'

syguan96 commented 3 months ago

I can find 3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4.mkv in /tmp.

kuno989 commented 3 months ago

I have a bug like this too, but I can't find the cause yet. haha

Zhidong-Gao commented 3 months ago

I got the same error. It is caused by a mismatch between the downloaded file and the predefined name (yt_dlp appends an extra .mkv extension to the original name).

I solved the problem by removing the .mp4 extension from the predefined name and then globbing for the matching files in the cache folder.

Below is my modification to dataset_dataloading/video2dataset/video2dataset/data_reader.py:

Line 214, original:

video_path = f"{self.tmp_dir}/{str(uuid.uuid4())}.mp4"

now:

video_path = f"{self.tmp_dir}/{str(uuid.uuid4())}"

Lines 269-271, original:

with portalocker.Lock(modality_path, 'rb', timeout=180) as locked_file:
    streams[modality] = locked_file.read()
os.remove(modality_path)

now:

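# Requires `import glob` at the top of data_reader.py.
# yt_dlp may append an extra extension (e.g. ".mkv"), so match on the prefix.
matching_files = glob.glob(modality_path+'*')
with portalocker.Lock(matching_files[0], 'rb', timeout=180) as locked_file:
    streams[modality] = locked_file.read()
# Remove every file that shares the prefix.
for file in matching_files:
    os.remove(file)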

The above solution works for me, but it's not perfect. I hope the authors can fix this bug.
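
For reference, the prefix-matching idea can be sketched as a standalone helper. This is only an illustration under a few assumptions: `read_and_remove` is a hypothetical name that does not exist in video2dataset, it uses a plain `open` instead of `portalocker.Lock` to stay self-contained, and it returns `None` when nothing matches rather than raising an `IndexError`.

```python
import glob
import os


def read_and_remove(path_prefix):
    """Read the first file whose name starts with path_prefix, then delete all matches.

    Hypothetical helper, not part of video2dataset.
    Returns the file's bytes, or None when no file matches the prefix.
    """
    # glob.escape keeps special characters in the prefix from being
    # interpreted as wildcards.
    matches = sorted(glob.glob(glob.escape(path_prefix) + "*"))
    if not matches:
        return None
    with open(matches[0], "rb") as f:
        data = f.read()
    # Clean up every file sharing the prefix, e.g. "<uuid>.mp4.mkv".
    for match in matches:
        os.remove(match)
    return data
```

Returning `None` on an empty match makes a failed download easy to tell apart from a naming mismatch; the caller would still need to decide how to handle that case.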

tsaishien-chen commented 3 months ago

Hi @syguan96, @kuno989, @Zhidong-Gao, thanks for your interest in the dataset, and thanks for reporting the bug. I dug into the problem and found the cause: there is no extension constraint on the audio stream, so if the audio is not downloaded in mp4 format, the downloaded video gets a double extension. To fix this, please replace the format string here https://github.com/snap-research/Panda-70M/blob/6ec1ca4d4807804633d22147708964e353d4aa77/dataset_dataloading/video2dataset/video2dataset/data_reader.py#L181-L185 with this one:

    video_format_string = (
        f"wv*[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
        f"w[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
        f"bv[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
        f"b[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}"
    )

This should help. I'll also push the fix to this repo soon. By the way, the workaround from @Zhidong-Gao will miss a lot of samples, so I strongly recommend following the steps above instead. Please let me know if you run into any problems!
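
To check what the selector above expands to for a given configuration, the same string can be built outside the class. This is a minimal sketch: `build_format_string` is a hypothetical helper that is not part of the repo, and the `360` in the usage line is an arbitrary example value for `video_size`.

```python
def build_format_string(video_size, specify_codec=False, download_audio=True):
    """Hypothetical helper mirroring the format selector from the comment above."""
    codec = "[codec=avc1]" if specify_codec else ""
    # The audio stream is constrained to ext=mp4, as in the fix above.
    audio = "+ba[ext=mp4]" if download_audio else ""
    return (
        f"wv*[height>={video_size}][ext=mp4]{codec}{audio}/"
        f"w[height>={video_size}][ext=mp4]{codec}{audio}/"
        f"bv[ext=mp4]{codec}{audio}/"
        f"b[ext=mp4]{codec}{audio}"
    )


# Example value only; prints the four fallback alternatives separated by "/".
print(build_format_string(360))
```

Each of the four alternatives carries the same `+ba[ext=mp4]` suffix whenever audio is requested, which is the only difference this fix is meant to introduce.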