rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License
2.42k stars 213 forks

Issues downloading parquet file from huggingface #279

Closed SeonghaEom closed 10 months ago

SeonghaEom commented 1 year ago

Hi, I was following the instructions from this link for downloading the laion5b_h14 embeddings to build a local backend.

I am stuck at step 6, which downloads the embeddings as parquet files from Hugging Face (so far I am downloading only the en embeddings, not the others). The next step fails with an error message saying that the input data (the parquet file) is empty.

The downloaded file also looks odd: its name does not end with .parquet. (image attached)

How can I solve this issue? Thanks in advance.

yongsubaek commented 1 year ago

Relevant Issue: https://superuser.com/questions/1661649/how-to-stop-aria2-from-renaming-html-files

Solution 1. Specify the output filename with the -o option.

for i in {0000..2314}; do aria2c -x 16 -o metadata_$i.parquet https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet; done
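If the one-liner is interrupted partway through, a small wrapper makes it easier to re-run. The following is only a sketch under a few assumptions: the helper names `metadata_name` and `download_metadata` are made up for illustration, the base URL is copied from the command above, and a file is treated as finished when it is non-empty and aria2 has left no `.aria2` control file next to it.

```shell
# Base URL copied from the commands in this thread.
BASE_URL="https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata"

# Hypothetical helper: print the zero-padded file name for a plain integer
# index (pass 7, not 0007).
metadata_name() {
  printf 'metadata_%04d.parquet\n' "$1"
}

# Hypothetical helper: download indices $1..$2 into the current directory.
# A file is skipped when it is non-empty and has no "<name>.aria2" control
# file beside it (assumption: that means a previous run finished it).
download_metadata() {
  i="$1"
  while [ "$i" -le "$2" ]; do
    name="$(metadata_name "$i")"
    if [ ! -s "$name" ] || [ -e "$name.aria2" ]; then
      # -o forces the output name, as in Solution 1.
      aria2c -x 16 -o "$name" "$BASE_URL/$name" || echo "failed: $name" >&2
    fi
    i=$((i + 1))
  done
}

# Example call (not executed here):
# download_metadata 0 2314
```

Failed indices are only reported on stderr, so the same call can be repeated until nothing is printed.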

Solution 2. Use another downloader like wget. You can simply replace

for i in {0000..2314}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet; done

with

for i in {0000..2314}; do wget https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet; done

However, wget was about 3x slower in my case.
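Whichever downloader is used, it may be worth checking the files before moving on to the next step, since that step complained about an empty parquet file. A Parquet file begins and ends with the 4-byte magic `PAR1`, so a rough sanity check is possible from the shell. This is a sketch, not a full validation; `is_parquet` and `list_bad_parquet` are hypothetical helper names, and the `metadata_*` pattern assumes the file naming used in the commands above.

```shell
# Hypothetical helper: succeed only if the file is non-empty and both its
# first and last 4 bytes are the Parquet magic "PAR1".
is_parquet() {
  [ -s "$1" ] || return 1
  [ "$(head -c 4 "$1" 2>/dev/null)" = "PAR1" ] || return 1
  [ "$(tail -c 4 "$1" 2>/dev/null)" = "PAR1" ] || return 1
}

# Hypothetical helper: print every metadata_* file in a directory
# (default: current directory) that fails the check above.
list_bad_parquet() {
  for f in "${1:-.}"/metadata_*; do
    [ -e "$f" ] || continue
    is_parquet "$f" || printf '%s\n' "$f"
  done
}

# Example call (not executed here):
# list_bad_parquet .
```

Any file this prints (an empty file, or an HTML page saved under the wrong name) would need to be deleted and downloaded again. Passing the check only shows that the magic bytes are in place, not that the contents are intact.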

There is probably a more general solution than these two, but I hope one of them solves your problem.

SeonghaEom commented 1 year ago

Thank you for the reply. :)