Open syguan96 opened 3 months ago
I can find 3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4.mkv in \tmp.
I have a bug like this too, but I can't find the cause yet. haha
I got the same error. It was caused by a mismatch between the downloaded file name and the pre-defined name (yt_dlp appends an extra .mkv extension to the original name).
I solved the problem by removing the .mp4 extension from the pre-defined name and globbing for the matching files in the cache folder.
Below is my modification to dataset_dataloading/video2dataset/video2dataset/data_reader.py.
line 214
original:
video_path = f"{self.tmp_dir}/{str(uuid.uuid4())}.mp4"
now:
video_path = f"{self.tmp_dir}/{str(uuid.uuid4())}"
lines 269-271, original:
with portalocker.Lock(modality_path, 'rb', timeout=180) as locked_file:
    streams[modality] = locked_file.read()
os.remove(modality_path)
now:
matching_files = glob.glob(modality_path+'*')
with portalocker.Lock(matching_files[0], 'rb', timeout=180) as locked_file:
    streams[modality] = locked_file.read()
for file in matching_files:
    os.remove(file)
The above solution works for me, but it's not perfect. I hope the authors can fix this bug.
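For anyone who wants to try the glob-based workaround outside of data_reader.py, here is a minimal standalone sketch of the same idea. It is not the repository's code: `read_and_cleanup` is a hypothetical helper name, it uses plain `open` instead of `portalocker.Lock` to stay dependency-free, and it adds `glob.escape` plus an explicit error when nothing matches, which the snippet above does not do.

```python
import glob
import os


def read_and_cleanup(base_path):
    """Read the first file whose name starts with base_path, then delete all matches.

    base_path is the pre-defined name WITHOUT an extension, so a download saved
    as base_path + ".mkv" or base_path + ".mp4.mkv" is still found.
    """
    # glob.escape guards against special characters such as "[" in the path.
    matching_files = sorted(glob.glob(glob.escape(base_path) + "*"))
    if not matching_files:
        raise FileNotFoundError(f"no downloaded file matches {base_path}*")
    try:
        # Assumption: plain open() here; the original code takes a portalocker lock.
        with open(matching_files[0], "rb") as f:
            data = f.read()
    finally:
        # Remove every variant so the cache folder does not fill up.
        for path in matching_files:
            os.remove(path)
    return data
```

Note that this only reads the first match, so if several files share the prefix the others are deleted unread.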
Hi @syguan96, @kuno989, @Zhidong-Gao, thanks for your interest in the dataset, and thanks for reporting the bug. I have dug into the problem and found the cause: there is no extension restriction on the audio stream. So if the audio is not downloaded in mp4 format, the downloaded video gets a double extension. To fix that, please replace the format string here https://github.com/snap-research/Panda-70M/blob/6ec1ca4d4807804633d22147708964e353d4aa77/dataset_dataloading/video2dataset/video2dataset/data_reader.py#L181-L185 with this one:
video_format_string = (
    f"wv*[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
    f"w[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
    f"bv[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
    f"b[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}"
)
This should help. I'll also update the code in this repo soon. By the way, the solution from @Zhidong-Gao will miss a lot of samples, so I strongly recommend following the steps above to fix this issue. Please let me know if there is any problem!
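To check what the suggested format string expands to for a given configuration, a small standalone sketch may help. `build_format_string` is a hypothetical helper, not a function in the repo; it only reassembles the four fallback selectors from the snippet above and does not call yt_dlp, so it says nothing about which formats a particular video actually offers.

```python
def build_format_string(video_size, specify_codec=False, download_audio=True):
    """Assemble the four '/'-separated fallback selectors from the fix above."""
    codec = "[codec=avc1]" if specify_codec else ""
    # The key change in the fix: the audio stream is also restricted to ext=mp4.
    audio = "+ba[ext=mp4]" if download_audio else ""
    selectors = [
        f"wv*[height>={video_size}][ext=mp4]",
        f"w[height>={video_size}][ext=mp4]",
        "bv[ext=mp4]",
        "b[ext=mp4]",
    ]
    return "/".join(f"{selector}{codec}{audio}" for selector in selectors)


if __name__ == "__main__":
    print(build_format_string(360))
```

Every alternative carries `[ext=mp4]` on both the video and the audio side, which is the extension restriction described above as missing for the audio.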
Sorry to bother you. I only downloaded 653 videos of the test split, so I tried to debug. May I ask whether you ran into this problem as well?