snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/
442 stars 15 forks

The download script is invalid #3

Open howardgriffin opened 4 months ago

howardgriffin commented 4 months ago

Thank you for your great work! I wonder when the download script will be released; the link seems invalid now: https://github.com/snap-research/Panda-70M/dataset_dataloading

tsaishien-chen commented 4 months ago

It is available now 🙂

Lyken17 commented 4 months ago

Just noticed the authors have updated the instructions.

Alternatively, you can try my implementation on Hugging Face, which is friendlier to Slurm/k8s-based clusters.

huggingface-cli download Ligeng-Zhu/panda70m \
  --local-dir panda70m --repo-type dataset --local-dir-use-symlinks False

cd panda70m/

# split by shards to accelerate downloading
python main.py --csv=<your csv files> --shards=0 --total=10
python main.py --csv=<your csv files> --shards=1 --total=10
...
python main.py --csv=<your csv files> --shards=9 --total=10
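The `--shards`/`--total` split above can be reproduced with a simple modulo over row indices. This is a hypothetical sketch of that idea (`select_shard` is a made-up name; `main.py`'s actual logic may differ):

```python
# Hypothetical sketch: each worker keeps only the rows whose index maps
# to its shard, so 10 workers download disjoint subsets in parallel.
def select_shard(rows, shard, total):
    """Round-robin split: keep row i when i % total == shard."""
    return [row for i, row in enumerate(rows) if i % total == shard]
```

Launching one process per shard value then covers every row exactly once.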
pabl0 commented 4 months ago

Please consider documenting the modifications to the "vendored" copy of video2dataset.

It seems to me that the dataset .csv files, which contain Python-format (not e.g. JSON) arrays, are not compatible with the upstream version of video2dataset without some conversion. There may be other relevant changes as well. video2dataset was imported in the initial commit, so there is no separate git history describing the modifications.

tsaishien-chen commented 4 months ago

Please consider documenting the modifications to the "vendored" copy of video2dataset.

It seems to me that the dataset .csv files, which contain Python-format (not e.g. JSON) arrays, are not compatible with the upstream version of video2dataset without some conversion. There may be other relevant changes as well. video2dataset was imported in the initial commit, so there is no separate git history describing the modifications.

Yes, the original video2dataset cannot work on the Panda-70M csv files. The reason for the modification is this: the original video2dataset tool downloads and cuts a video for each row in the csv file, but in Panda-70M each row represents a long YouTube video, so it is more reasonable to download the long video only once and cut it into multiple video clips. I have added some notes to the readme.
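The one-download-many-clips idea can be sketched as follows. This is an illustrative sketch, not the fork's actual code: it assumes each row carries a `url` column and a `timestamp` column holding a Python-literal list of `(start, end)` pairs, and it only builds the download plan (the actual fetching and cutting are out of scope here):

```python
import ast

def plan_downloads(rows):
    """Group clip timestamps by video URL so each long video is fetched
    exactly once and then cut into all of its clips.

    rows: dicts with a 'url' key and a 'timestamp' key whose value is a
    Python-literal string like "[['0:00:01.000', '0:00:05.000']]".
    Returns {url: [(start, end), ...]}.
    """
    plan = {}
    for row in rows:
        clips = ast.literal_eval(row["timestamp"])  # parse the Python literal
        plan.setdefault(row["url"], []).extend(clips)
    return plan
```

A downloader can then iterate over `plan.items()`, download each URL once, and cut every listed clip from the local file.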

Thanks for raising the issue!

pabl0 commented 4 months ago

The original video2dataset tool downloads and cuts a video for each row in the csv file, but in Panda-70M each row represents a long YouTube video, so it is more reasonable to download the long video only once and cut it into multiple video clips.

Thanks. Can you clarify this a bit more? I don't think the original video2dataset tool downloads the long video again for each clip either, if that is what you mean.

CSV is not a great format for storing this kind of data; its only benefit is a slightly smaller file size than JSON (or something like Parquet). It can be converted fairly easily to JSON as an array of dicts, one per video URL entry. The caption, timestamp, and matching_score columns need to be read with something like ast.literal_eval, since they are Python literals (with single-quoted strings). The original video2dataset works just fine with such a JSON input file.
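The conversion described above fits in a few lines of stdlib Python. This sketch assumes the column names mentioned in this thread (`caption`, `timestamp`, `matching_score`); any other columns pass through unchanged:

```python
import ast
import csv
import json

def csv_to_json(csv_path, json_path):
    """Convert a Panda-70M-style csv to a JSON array of dicts.

    The caption/timestamp/matching_score columns hold Python literals
    (single-quoted strings), which json.loads cannot parse, so they are
    decoded with ast.literal_eval before dumping to JSON.
    """
    literal_cols = ("caption", "timestamp", "matching_score")
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        for col in literal_cols:
            if col in row:
                row[col] = ast.literal_eval(row[col])
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f)
```

The resulting file is plain JSON (double-quoted strings, real arrays), so downstream tools that expect a JSON input list can read it directly.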

I don't think requiring a custom fork of the download tool to cope with one dataset's non-standard CSV format is optimal.

tsaishien-chen commented 4 months ago

I agree with you; that makes sense. I'll write a script so the dataset can be downloaded with the original video2dataset tool. Thanks for the suggestion!

Sunrise1111 commented 3 months ago

Just noticed the authors have updated the instructions.

Alternatively, you can try my implementation on Hugging Face, which is friendlier to Slurm/k8s-based clusters.

huggingface-cli download Ligeng-Zhu/panda70m \
  --local-dir panda70m --repo-type dataset --local-dir-use-symlinks False

cd panda70m/

# split by shards to accelerate downloading
python main.py --csv=<your csv files> --shards=0 --total=10
python main.py --csv=<your csv files> --shards=1 --total=10
...
python main.py --csv=<your csv files> --shards=9 --total=10

you are god, bro