Open howardgriffin opened 8 months ago
Thank you for your great work! I wonder when the downloading script will be released; the link seems invalid now:
https://github.com/snap-research/Panda-70M/dataset_dataloading
It is available now 🙂
Just noticed the authors have updated the instructions.
Or you can also try my implementation on Hugging Face, which is friendlier to Slurm/k8s-based clusters.
huggingface-cli download Ligeng-Zhu/panda70m \
--local-dir panda70m --repo-type dataset --local-dir-use-symlinks False
cd panda70m/
# split by shards to accelerate downloading
python main.py --csv=<your csv files> --shards=0 --total=10
python main.py --csv=<your csv files> --shards=1 --total=10
...
python main.py --csv=<your csv files> --shards=9 --total=10
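Presumably --shards/--total just partition the CSV rows across processes. A minimal sketch of that kind of row selection, under that assumption (not main.py's actual logic):
import pandas as pd

def select_shard(csv_path: str, shard: int, total: int) -> pd.DataFrame:
    # Contiguous chunking: shard i gets rows [i*n/total, (i+1)*n/total).
    # Assumed behavior for illustration only, not main.py's actual code.
    df = pd.read_csv(csv_path)
    bounds = [round(len(df) * k / total) for k in range(total + 1)]
    return df.iloc[bounds[shard]:bounds[shard + 1]]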
Please consider documenting the modifications to "vendored" copy of video2dataset.
It seems to me the dataset .csv files, which contain Python-format (not e.g. JSON) arrays, are not compatible with the upstream version of video2dataset without some conversion. Maybe there are other relevant changes as well. video2dataset was imported in the initial commit, so there is no separate git history describing the modifications.
Yes, the original video2dataset cannot work on the Panda-70M csv files. The reason for the modification is this: the original video2dataset tool downloads and cuts videos for each row in the csv file, but in Panda-70M each row represents a long YouTube video, so it is more reasonable to download the long video only once and cut it into multiple video clips. I have added some notes to the README.
Thanks for raising the issue!
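For illustration, the download-once, cut-many idea might look roughly like this (a minimal sketch using yt-dlp and ffmpeg; not the actual code in this repo):
import subprocess

def download_and_cut(url: str, timestamps: list[tuple[str, str]], out_prefix: str) -> None:
    # One download per CSV row, i.e. per long YouTube video.
    source = f"{out_prefix}_full.mp4"
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", source, url], check=True)
    for i, (start, end) in enumerate(timestamps):
        # Stream-copy each clip out of the already-downloaded file
        # (cuts land on keyframes, which is fine for a sketch).
        clip = f"{out_prefix}_{i:04d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", source, "-ss", start, "-to", end,
             "-c", "copy", clip],
            check=True,
        )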
The original video2dataset tool downloads and cuts videos for each row in the csv file, but in Panda-70M each row represents a long YouTube video, so it is more reasonable to download the long video only once and cut it into multiple video clips.
Thanks. Can you clarify this a bit more? I don't think the original video2dataset tool downloads the long video again for each clip either, if that is what you mean.
CSV is not a great format for storing this kind of data. The only benefit it has is a slightly smaller file size than JSON (or something like Parquet). It can be pretty easily converted to JSON with an array of dicts containing each video URL entry. The caption, timestamp, and matching_score columns need to be read with something like ast.literal_eval, since they are Python literals (with single-quoted strings); a sketch of such a conversion follows below. The original video2dataset works just fine with such a JSON input file.
I don't think requiring a custom fork of the download tool to cope with the non-standard CSV format of one dataset is optimal.
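A minimal sketch of such a CSV-to-JSON conversion (the column names and filenames are assumptions based on the discussion above):
import ast
import csv
import json

def convert_csv_to_json(csv_path: str, json_path: str) -> None:
    # Columns holding Python-literal lists (single-quoted strings);
    # the names are assumptions based on the discussion above.
    literal_columns = {"caption", "timestamp", "matching_score"}
    entries = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for col in literal_columns & row.keys():
                # ast.literal_eval parses Python literals that json.loads rejects.
                row[col] = ast.literal_eval(row[col])
            entries.append(row)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f)

# Hypothetical filenames for illustration.
convert_csv_to_json("panda70m_train.csv", "panda70m_train.json")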
I agree with you; that makes sense. I'll write a script so the dataset can be downloaded with the original video2dataset tool. Thanks for your suggestions!
Just noticed the authors have updated the instructions.
Or you can also try my implementation on Hugging Face, which is friendlier to Slurm/k8s-based clusters.
huggingface-cli download Ligeng-Zhu/panda70m \
--local-dir panda70m --repo-type dataset --local-dir-use-symlinks False
cd panda70m/
# split by shards to accelerate downloading
python main.py --csv=<your csv files> --shards=0 --total=10
python main.py --csv=<your csv files> --shards=1 --total=10
...
python main.py --csv=<your csv files> --shards=9 --total=10
you are god, bro