Closed SeonghaEom closed 10 months ago
Relevant Issue: https://superuser.com/questions/1661649/how-to-stop-aria2-from-renaming-html-files
Solution 1. Specify the output filename explicitly with the -o option.
for i in {0000..2314}; do aria2c -x 16 -o metadata_$i.parquet https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet; done
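If the loop gets interrupted, it may help to wrap the same command in a small helper that skips shards already on disk. This is only a sketch: `fetch_shard` and `BASE_URL` are names made up here, and it assumes that a non-empty file with no leftover `.aria2` control file next to it is a finished download.

```shell
# Base URL taken from the command above.
BASE_URL="https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata"

# fetch_shard is a hypothetical helper, not part of aria2.
# It downloads one shard under an explicit name (-o), unless a
# non-empty copy without an aria2 control file is already present.
fetch_shard() {
    out="metadata_$1.parquet"
    # Assumption: non-empty file and no .aria2 control file => complete.
    if [ -s "$out" ] && [ ! -e "$out.aria2" ]; then
        echo "skip $out"
        return 0
    fi
    aria2c -x 16 -o "$out" "$BASE_URL/$out"
}
```

In bash it could then be called as `for i in {0000..2314}; do fetch_shard "$i"; done`.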
Solution 2. Use another downloader like wget. You can simply replace
for i in {0000..2314}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet; done
with
for i in {0000..2314}; do wget https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet; done
However, it is about 3x slower in my case.
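Whichever downloader is used, a quick sanity check before the next step can catch empty or misnamed downloads. The sketch below relies on the fact that a Parquet file starts and ends with the 4-byte magic string `PAR1`; `is_parquet` is a made-up helper name, and the check says nothing about whether the contents in between are valid.

```shell
# is_parquet is a hypothetical helper: it returns 0 if the file is
# non-empty and both its first and last 4 bytes are the magic "PAR1".
is_parquet() {
    [ -s "$1" ] || return 1
    [ "$(head -c 4 "$1")" = "PAR1" ] && [ "$(tail -c 4 "$1")" = "PAR1" ]
}

# Report every metadata_* file in the current directory that fails the check.
for f in metadata_*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    is_parquet "$f" || echo "suspect: $f"
done
```

Any file reported as suspect (for example an HTML error page saved under a different name) would need to be downloaded again.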
There is probably a more general solution than the two above, but I hope this solves your problem.
Thank you for the reply. :)
Hi, I was following the instructions from this link, which describe how to download the laion5b_h14 embeddings to build a local backend.
But I am stuck at step 6, which downloads the embeddings as parquet files from Hugging Face (I am currently downloading only the en embeddings, not the others). The next step fails with an error message saying that the input data (the parquet file) is empty.
The downloaded files also look wrong, since their names do not end with .parquet.
How can I solve this issue? Thanks in advance.