rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License
2.38k stars 208 forks

Do colab to show how end2end works #75

Closed. rom1504 closed this issue 9 months ago.

ig-perez commented 2 years ago

Hi, thanks for building this and sharing it. I just want to know if you were able to make any progress with this? I want to run some inference on COCO with a notebook, but I'm having a hard time making it work.

I tried downloading the COCO dataset on Kaggle, but it seems Kaggle killed the process at some point and the tar files were left incomplete. The quota limit is around 20GB; the process was cancelled when the disk was about 15GB full (after ~2h).

Last logged message was: total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 96 - count: 481753

Edit: After this I tried !clip-retrieval inference --input_dataset "./mscoco" --output_folder "./embeddings", but I got an error at line 69 of clip_inference: keys is None. Perhaps because the downloading/preprocessing didn't finish?

Do you think there might be a way to cap the number of examples to download? Perhaps to 200K or something like that?

Right now I'm trying with Colab, but the runtime is constantly disconnecting and reconnecting; I'm not sure if that affects the downloading process. I'll comment here after it finishes.

Edit: Right now I'm in:

!img2dataset \
--url_list /content/sample_data/mscoco.parquet \
--input_format "parquet" \
--url_col "URL" \
--caption_col "TEXT" \
--output_format webdataset \
--output_folder /content/sample_data/mscoco \
--processes_count 16 \
--thread_count 64 \
--image_size 256 \
--enable_wandb False

Thanks!

rom1504 commented 2 years ago

Indeed, these environments are quite unstable. Maybe you could try locally? Inference will be slower, but it might be OK on a small dataset. You can also take only a small subset of the COCO input file to start with.

ig-perez commented 2 years ago

I can try locally (CPU only). BTW, how can I take a small subset of COCO?

rom1504 commented 2 years ago

import pandas as pd

# keep only the first 10k url/caption rows as a quick-start subset
df = pd.read_parquet("mscoco.parquet")
df = df[:10000]
df.to_parquet("small.parquet")  # pass this file to img2dataset via --url_list

ig-perez commented 2 years ago

Awesome, thanks, and sorry for the basic question. I'm new to the vision domain.

rom1504 commented 2 years ago

Glad that this repo is helping! I'm trying to make everything work both at very large scale and at low scale. It seems there's still some work to do on the low-scale part. I'll see what I can provide in terms of an even smaller dataset so it works best as a quick start on Colab.

ig-perez commented 2 years ago

Thanks Romain. Low scale will indeed be useful for research and learning.

ig-perez commented 2 years ago

Hey @rom1504, Colab took about 2h, but it seems it downloaded all the images. This is the latest logged output:

60it [1:09:18, 69.30s/it]
worker  - success: 0.000 - failed to download: 0.000 - failed to resize: 1.000 - images per sec: 13 - count: 10000
total   - success: 0.000 - failed to download: 0.000 - failed to resize: 1.000 - images per sec: 144 - count: 591753

Nevertheless, when running !clip-retrieval inference --input_dataset "/content/sample_data/mscoco" --output_folder "/content/sample_data/embeddings"

I get:

/usr/local/lib/python3.7/dist-packages/clip/clip.py:23: UserWarning: PyTorch version 1.7.1 or higher is recommended
  warnings.warn("PyTorch version 1.7.1 or higher is recommended")
Traceback (most recent call last):
  File "/usr/local/bin/clip-retrieval", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/cli.py", line 21, in main
    "front": clip_front,
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 323, in clip_inference
    dataset = get_image_dataset()(preprocess, input_dataset, enable_text=enable_text, enable_image=enable_image)
  File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 69, in __init__
    self.keys = list(keys)
TypeError: 'NoneType' object is not iterable

I looked at the latest stats JSON file (/content/sample_data/mscoco/00059_stats.json), and it contains this:

{
    "count": 1753,
    "successes": 0,
    "failed_to_download": 0,
    "failed_to_resize": 1753,
    "duration": 149.17603397369385,
    "start_time": 1644969973.1418736,
    "end_time": 1644970122.3179076,
    "status_dict": {
        "module 'albumentations' has no attribute 'longest_max_size'": 1753
    }
}

I updated albumentations from 0.1.12 to the latest version (1.1.0), but the problem is still happening. I imagine this is because that library is used during preprocessing, right? If that's the case, I'll update it prior to downloading and preprocessing tomorrow and see if it works.

I'll keep you posted.

ig-perez commented 2 years ago

Update: Even after updating albumentations to its latest version (which also required updating opencv-python), I get the same error (TypeError: 'NoneType' object is not iterable at clip_inference.py, line 69) when trying to run !clip-retrieval inference --input_dataset "/content/sample_data/mscoco" --output_folder "/content/sample_data/embeddings".

You can check the notebook here

I'd appreciate your support.

Thanks!

rom1504 commented 2 years ago

Did you rerun img2dataset, and did it say 100% success?

ig-perez commented 2 years ago

Previously I ran:

!img2dataset \
--url_list /content/sample_data/mscoco.parquet \
--input_format "parquet" \
--url_col "URL" \
--caption_col "TEXT" \
--output_format webdataset \
--output_folder /content/sample_data/mscoco \
--processes_count 16 \
--thread_count 64 \
--image_size 256 \
--enable_wandb False

The last logged output was:

60it [1:49:22, 109.37s/it]
worker  - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total   - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 85 - count: 551753
worker  - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total   - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 86 - count: 561753
worker  - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total   - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 88 - count: 571753
worker  - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total   - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 89 - count: 581753
worker  - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total   - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 91 - count: 591753

There was no 100% success message, but it seems it finished. Is there a way to check that the img2dataset process ended correctly?

The content of the latest stats JSON file (/content/sample_data/mscoco/00059_stats.json) is:

{
    "count": 1753,
    "successes": 1753,
    "failed_to_download": 0,
    "failed_to_resize": 0,
    "duration": 234.95621633529663,
    "start_time": 1645025798.0262048,
    "end_time": 1645026032.9824212,
    "status_dict": {
        "success": 1753
    }
}
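
For reference, one quick way to verify completion across all shards is to aggregate the per-shard stats files; a minimal sketch, assuming every shard writes a *_stats.json with the schema shown above:

import glob
import json

# sum the per-shard counters written by img2dataset
total, ok = 0, 0
for path in sorted(glob.glob("/content/sample_data/mscoco/*_stats.json")):
    with open(path) as f:
        stats = json.load(f)
    total += stats["count"]
    ok += stats["successes"]
print(f"{ok}/{total} images downloaded successfully")
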
rom1504 commented 2 years ago

Ok, what is the full error printed by clip inference?

ig-perez commented 2 years ago

Sorry, I forgot to mention:

/usr/local/lib/python3.7/dist-packages/clip/clip.py:23: UserWarning: PyTorch version 1.7.1 or higher is recommended
  warnings.warn("PyTorch version 1.7.1 or higher is recommended")
Traceback (most recent call last):
  File "/usr/local/bin/clip-retrieval", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/cli.py", line 21, in main
    "front": clip_front,
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 323, in clip_inference
    dataset = get_image_dataset()(preprocess, input_dataset, enable_text=enable_text, enable_image=enable_image)
  File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 69, in __init__
    self.keys = list(keys)
TypeError: 'NoneType' object is not iterable
rom1504 commented 2 years ago

Ok, I see: the problem is that you saved with the webdataset format, but you're then reading with the files format. Simply pass webdataset to the input_format option of clip inference.

I will improve the error messages
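
For context, the two input formats expect different layouts on disk; here is a minimal sketch for checking which one a folder actually contains (paths assumed from this thread):

from pathlib import Path

folder = Path("/content/sample_data/mscoco")
tars = sorted(folder.glob("*.tar"))
if tars:
    # img2dataset's webdataset output: sharded tar archives plus *_stats.json files
    print(f"webdataset format: {len(tars)} shards, e.g. {tars[0].name}")
else:
    # the files format keeps loose image/caption files instead of tar shards
    print("no .tar shards found, so this is likely the files format")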

ig-perez commented 2 years ago

Oh, OK, sorry about that; I was following the related documentation in the img2dataset repo.

I added the input_format flag like this: !clip-retrieval inference --input_dataset "/content/sample_data/mscoco" --output_folder "/content/sample_data/embeddings" --input_format webdataset

But I'm getting this message and nothing happens:

/usr/local/lib/python3.7/dist-packages/clip/clip.py:23: UserWarning: PyTorch version 1.7.1 or higher is recommended
  warnings.warn("PyTorch version 1.7.1 or higher is recommended")
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
0it [00:00, ?it/s]/usr/local/lib/python3.7/dist-packages/webdataset/handlers.py:34: UserWarning: IsADirectoryError(21, 'Is a directory', 'mscoco')
  warnings.warn(repr(exn))
0it [00:01, ?it/s]

I see the folders were created under embeddings, but all of them are empty.

I added the flag --num_prepro_workers 2, but still nothing happened.

rom1504 commented 2 years ago

You need to give the file names as input, like /content/sample_data/mscoco/{00000..00001}.tar, when using webdataset.

ig-perez commented 2 years ago

I see. OK. !clip-retrieval inference --input_dataset "/content/sample_data/mscoco/00000.tar" --output_folder "/content/sample_data/embeddings" --input_format webdataset --num_prepro_workers 2 seems to be working now.

So this means I need to manually run the inference command for each dataset part, right? (60 in this case.)

Would choosing --output_format files make the process automatic?

Sorry if I'm asking questions present in the documentation. This happens when working in a hurry :)

rom1504 commented 2 years ago

So this means I need to manually run the inference command for each dataset part, right? (60 in this case.)

No, you can put /content/sample_data/mscoco/{00000..00059}.tar to run the inference on all of them.
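
For what it's worth, when the pattern is quoted the shell doesn't expand it; webdataset resolves the {00000..00059} range itself (via the braceexpand package). A minimal sketch of what the pattern expands to:

from braceexpand import braceexpand

# one path per shard, 00000 through 00059
shards = list(braceexpand("/content/sample_data/mscoco/{00000..00059}.tar"))
print(len(shards))  # 60
print(shards[0])    # /content/sample_data/mscoco/00000.tar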

ig-perez commented 2 years ago

OK! Thanks a lot for your support and help, Romain. I successfully finished the notebook. I processed just the first 10K examples (/content/sample_data/mscoco/00000) from the COCO dataset, then ran some inference.

I guess Colab is not the right tool to generate the entire index for the 600K examples, but it's enough for this kind of test.

As a suggestion, would it be possible to generate a txt file for each prediction in the output folder when using the filter command? Those text files could contain the score of the prediction and also textual attributes. Also, it would be nice to sort the results by score.

Amiineh commented 2 years ago

Hey, I'm trying to train a mini dalle2 from the lucidrains repo with the mscoco dataset. I used img2dataset with this command to get the data:

img2dataset --url_list mscoco.parquet --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder mscoco --processes_count 16 --thread_count 64 --image_size 256 \
    --enable_wandb True

Then I tried this repo to get the embeddings from the data, using this:

clip-retrieval inference --input_dataset image_folder --output_folder embeddings_folder 

which gave me the same error as @ig-perez:

TypeError: 'NoneType' object is not iterable

So I followed the suggestions in this thread. But now I'm getting this error:

>clip-retrieval inference --input_dataset "mscoco/00000.tar" --output_folder embeddings_folder --input_format webdataset  --num_prepro_workers 2
The number of samples has been estimated to be 10000
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\Scripts\clip-retrieval.exe\__main__.py", line 7, in <module>
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\cli.py", line 16, in main
    fire.Fire(
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\fire\core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\main.py", line 144, in main
    distributor()
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\distributor.py", line 13, in __call__
    self.runner(i)
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\runner.py", line 36, in __call__
    batch = iterator.__next__()
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\reader.py", line 242, in __iter__
    for batch in self.dataloader:
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 368, in __iter__
    return self._get_iterator()
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 927, in __init__
    w.start()
  File "C:\Program Files\Python39\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Program Files\Python39\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Program Files\Python39\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Program Files\Python39\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Program Files\Python39\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_webdataset.<locals>.filter_dataset'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Do you have any suggestions on how I can fix this, @rom1504? Thank you!

rom1504 commented 2 years ago

You need to specify the webdataset type and give the .tar files to the command; check the readme for details.

Amiineh commented 2 years ago

In the latest command, I am giving it the webdataset type and the .tar files:

clip-retrieval inference --input_dataset "mscoco/{00000..00059}.tar" --output_folder embeddings_folder --input_format webdataset  --num_prepro_workers 2

and I get the error above.

Amiineh commented 2 years ago

I was able to fix it by setting --num_prepro_workers to 0. I think there is an issue with threads and multiprocessing, if you want to look into it.
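
That matches the traceback above: on Windows, worker processes are spawned, so the dataset object (including the local filter_dataset closure) has to be pickled, and nested functions can't be pickled. With --num_prepro_workers 0 no subprocess is created, so nothing needs pickling. A minimal repro of the mechanism (illustrative only, not clip-retrieval code):

import multiprocessing as mp

def make_worker():
    def work():  # a nested (local) function, like create_webdataset.<locals>.filter_dataset
        print("hello from worker")
    return work

if __name__ == "__main__":
    fn = make_worker()
    # the spawn start method (the default on Windows) pickles the target, raising:
    # AttributeError: Can't pickle local object 'make_worker.<locals>.work'
    p = mp.get_context("spawn").Process(target=fn)
    p.start()
    p.join()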

rom1504 commented 2 years ago

Are you using a virtual env?

Amiineh commented 2 years ago

No. Could that be it?

YUHANG-Ma commented 2 years ago

Hi, I've run into a problem when using clip-retrieval. My dataset path is like this: /data1/train-{00000..00099}.tar. Each tar file contains .jpg and .cls files that match each other. I want to use clip-retrieval to get image embeddings. I run it like this:

clip-retrieval inference --input_dataset /root/data0601/train-0001.tar --output_folder /root/npy0602 --input_format webdataset

I didn't encounter any issue, but there are no image embeddings nor text embeddings in the output folder. Could I ask how I can fix it?