snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

The error occurs when downloading with the argument "--clip_col='timestamp'". #10

Open peiliu0408 opened 4 months ago

peiliu0408 commented 4 months ago

Before downloading the full dataset, I ran a test with 200 videos. Without the argument --clip_col="timestamp", the download worked correctly (193 of the 200 videos were retrieved).

With the argument included, an error occurred, and in that run none of the videos were downloaded correctly.

[Screenshots: 2024-03-07 16:51:31 and 2024-03-07 16:52:05]

I am new to this tool, so I am unsure if there might be any formatting errors in the timestamp.

pabl0 commented 4 months ago

It seems you are using the upstream video2dataset tool rather than the copy included in this repository. The upstream tool does not work with the provided .csv files: they store the timestamps as Python arrays, which the upstream video2dataset does not parse properly, so the value gets passed to clip_spans as a string.

If you want to use the standard video2dataset tool, you might want to convert the csv files to proper JSON. It would also be nice if this repo documented the changes made to its "vendored" copy of video2dataset and made it clear that the normal command does not work. Maybe also consider publishing the metadata in a better format than csv.
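A rough sketch of such a conversion, in case it helps. It assumes the column is named "timestamp" (as in the --clip_col argument above) and that each cell holds a Python list literal of [start, end] string pairs; the sample row and URL are made up, and csv_to_json is a hypothetical helper, not part of video2dataset. Whether the upstream tool accepts the result has not been verified here.

```python
import ast
import csv
import io
import json


def csv_to_json(src, clip_col="timestamp"):
    """Read CSV rows from a file-like object and return them as a JSON string.

    The clip column is parsed from a Python list literal into a real list,
    so it is serialized as a JSON array instead of an opaque string.
    """
    rows = []
    for row in csv.DictReader(src):
        # literal_eval only evaluates literals, so it is safe on untrusted text.
        row[clip_col] = ast.literal_eval(row[clip_col])
        rows.append(row)
    return json.dumps(rows, indent=2)


# Hypothetical one-row sample; the real files may have more columns.
sample = io.StringIO(
    'url,timestamp\n'
    '"https://example.com/v","[[\'0:00:01.000\', \'0:00:05.000\']]"\n'
)
result = csv_to_json(sample)
print(result)
```

For a real file, pass an open file handle (opened with newline="") instead of the StringIO sample and write the returned string to a .json file.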

tsaishien-chen commented 4 months ago

@peiliu0408 @pabl0 Yes, it seems you are using the upstream video2dataset for downloading, which does not work with the Panda-70M csv files. Please check here for the reason we needed to modify the video2dataset tool, and please use the video2dataset in this repo to download the dataset.

peiliu0408 commented 4 months ago

Thanks for your reply! Just to confirm: I had installed the official video2dataset version, so now I should reinstall the tool from this repository?

tsaishien-chen commented 4 months ago

Yes, please try to uninstall the original video2dataset and reinstall our version!
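After reinstalling, one way to see which copy Python actually picks up is to print where the package would be imported from. This is only a sketch using the standard library; package_location is a hypothetical helper, and how to read the path depends on how the package was installed (a path inside a clone of this repository would suggest the vendored copy, assuming an editable install).

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # The package is not installed in this environment.
        return None
    return spec.origin


# Prints a file path, or None if video2dataset is not installed.
print(package_location("video2dataset"))
```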

peiliu0408 commented 4 months ago

I have some questions regarding how to set the optimal processes_count and thread_count in the config file for a machine with 8 cores and 16GB of memory. Can you help me with this?