unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and šŸ”œ video, up to 5x faster than OpenAI CLIP and LLaVA šŸ–¼ļø & šŸ–‹ļø
https://unum-cloud.github.io/uform/
Apache License 2.0

Benchmark script errors on loading InstructBLIP processor #59

Open lmmx opened 6 months ago

lmmx commented 6 months ago

I've tried running the code and found what looks like a bug in the benchmark script; I'm diagnosing it now.

The traceback seems to point to the type of the image parameter at line 68:

 53 def bench_captions(
 54     model,
 55     processor,
 56     prompt: str,
 57     images: List[Image.Image],
 58 ) -> List[str]:
 59     total_duration = 0
 60     total_length = 0
 61     model = torch.compile(model)
 62     for image in images:
 63         seconds, text = duration(
 64             lambda: caption(
 65                 model=model,
 66                 processor=processor,
 67                 prompt=prompt,
 68                 image=image,
 69             )
 70         )
 71         total_duration += seconds
 72         total_length += len(text)
 73 
 74     del model
 75     del processor
 76     print(f"Throughput: {total_length/total_duration:.2f} tokens/s")
Click to expand traceback (captured by pytest)

```
scripts/bench.py:141: in <module>
    bench_captions(
scripts/bench.py:63: in bench_captions
    seconds, text = duration(
scripts/bench.py:48: in duration
    result = callable()
scripts/bench.py:64: in <lambda>
    lambda: caption(
scripts/bench.py:22: in caption
    inputs = processor(prompt, image, return_tensors="pt")
/home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/models/instructblip/processing_instructblip.py:89: in __call__
    text_encoding = self.tokenizer(
/home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2802: in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
/home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2860: in _call_one
    raise ValueError(
E   ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
>>>>>> entering PDB >>>>>>
>>>>>> PDB post_mortem >>>>>>
> /home/louis/miniconda3/envs/uform/lib/python3.11/site-packages/transformers/tokenization_utils_base.py(2860)_call_one()
```

I expanded this code out (removing the lambda) and it still gives the same error, but the data flow is clearer:

def bench_captions(
    model,
    processor,
    prompt: str,
    images: List[Image.Image],
) -> List[str]:
    total_duration = 0
    total_length = 0
    model = torch.compile(model)

    def caption_image(image, model=model, processor=processor, prompt=prompt):
        return caption(model=model, processor=processor, prompt=prompt, image=image)

    for image in images:
        seconds, text = duration(partial(caption_image, image=image))
        total_duration += seconds
        total_length += len(text)

    del model
    del processor
    print(f"Throughput: {total_length/total_duration:.2f} tokens/s")
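For reference, `duration` (the frame at scripts/bench.py:48 in the traceback) is presumably a small timing wrapper along these lines; this is a sketch of what I assume it does, not the repo's actual code:

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def duration(callable: Callable[[], T]) -> Tuple[float, T]:
    """Time a zero-argument callable; return (elapsed seconds, its result)."""
    start = time.perf_counter()
    result = callable()
    return time.perf_counter() - start, result
```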

The traceback points to the call into the InstructBLIP model's processor.

A similar error was reported in transformers but not resolved (I think it's unrelated: https://github.com/huggingface/transformers/issues/21366).

The bug seems to be that we are passing positional arguments, and they end up bound to the wrong parameters:

        inputs = processor(prompt, image, return_tensors="pt")

The InstructBLIP processor's signature is `__call__(self, images, text)`:

(Pdb) pp self.__call__.__func__.__code__.co_varnames
('self',
 'images',
 'text',
...

The docs say that

> The InstructBlipForConditionalGeneration forward method, overrides the `__call__` special method.

so I think this must be what is supposed to be getting called.

Debugging in PDB confirms this is what's happening: the prompt string is bound to `images` and the PIL image to `text`:

(Pdb) p images
'Summarize the visual content of the image.'
(Pdb) p text
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2787x4181 at 0x7FBE4A910090>
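The swap is easy to reproduce with two toy callables whose parameter orders differ (hypothetical stand-ins, not the real processors):

```python
def uform_style(texts, images):
    # text-first order, like uform's VLMProcessor
    return {"texts": texts, "images": images}

def instructblip_style(images, text):
    # images-first order, like InstructBlipProcessor
    return {"images": images, "text": text}

prompt, image = "Summarize the visual content of the image.", "<a PIL image>"

assert uform_style(prompt, image)["texts"] == prompt          # lands as intended
assert instructblip_style(prompt, image)["images"] == prompt  # prompt mis-bound
```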

Does this reproduce for you?

Cause

Update: I found the cause is indeed the positional arguments. If you print each processor's parameter names, you can see they take the prompt and image in different orders.

I'm surprised this benchmark was working before.

Solution

Since the parameter order varies between processors, positional args can't be used, but the parameter names differ too: `text` vs. `texts`.

In fact the odd one out here is uform's own processor, so that should change, and then a single keyword call will work everywhere.

You also can't just pass `images=image` while leaving the prompt positional: `InstructBlipProcessor` then gets multiple values for the argument `images` (the positional prompt binds to `images`, and the keyword repeats it).
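That failure is plain Python argument binding, nothing processor-specific; a toy reproduction:

```python
def process(images, text=None, return_tensors=None):
    """Stand-in with InstructBlipProcessor's images-first order."""
    return images, text

try:
    # The positional prompt already binds to `images`, so the keyword repeats it.
    process("a prompt", images="a PIL image")
except TypeError as err:
    print(err)  # process() got multiple values for argument 'images'
```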

Nor can this be solved by passing `text=text` to UForm-Gen's `VLMProcessor`; that leads to a later error in the `model.generate` step.

It looks like switching the order of these arguments in `VLMProcessor` is the best solution.

If I patch the benchmark, everything works (but that's not to say the `VLMProcessor` argument order shouldn't still be fixed!):

def caption(model, processor, prompt: str, image: Image.Image) -> str:
    # Processors name their text parameter differently ("text" vs "texts"),
    # so look the name up dynamically and pass everything by keyword.
    var_names = processor.__call__.__func__.__code__.co_varnames
    prompt_kwarg = next(kw for kw in var_names if kw.startswith("text"))
    processor_kwargs = {prompt_kwarg: prompt, "images": image, "return_tensors": "pt"}
    inputs = processor(**processor_kwargs)
...
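If relying on `co_varnames` feels fragile (it only works for pure-Python `__call__` implementations, and it lists local variables after the parameters), an `inspect.signature` variant of the same idea could look like this; untested against the real processors, so treat it as a sketch:

```python
import inspect

def pick_prompt_kwarg(processor) -> str:
    """Return the name of the processor's text parameter ('text' or 'texts')."""
    params = inspect.signature(type(processor).__call__).parameters
    return next(name for name in params if name.startswith("text"))

class FakeProcessor:
    """Stand-in, not the real InstructBlipProcessor."""
    def __call__(self, images=None, text=None, return_tensors=None): ...

print(pick_prompt_kwarg(FakeProcessor()))  # text
```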

Environment details

Click to show full pip list

```
(uform) louis 🌟 ~/lab/uform/uform $ pip list
Package            Version    Editable project location
------------------ ---------- ---------------------------
Brotli             1.0.9
certifi            2023.11.17
cffi               1.16.0
charset-normalizer 2.0.4
cryptography       41.0.7
filelock           3.13.1
fsspec             2023.12.2
gmpy2              2.1.2
huggingface-hub    0.20.1
idna               3.4
iniconfig          2.0.0
Jinja2             3.1.2
MarkupSafe         2.1.1
mkl-fft            1.3.8
mkl-random         1.2.4
mkl-service        2.4.0
mpmath             1.3.0
networkx           3.1
numpy              1.26.2
packaging          23.2
Pillow             10.0.1
pip                23.3.1
pluggy             1.3.0
pycparser          2.21
pyOpenSSL          23.2.0
PySocks            1.7.1
pytest             7.4.4
PyYAML             6.0.1
regex              2023.12.25
requests           2.31.0
safetensors        0.4.1
setuptools         68.2.2
sympy              1.12
tokenizers         0.15.0
torch              2.1.2
torchaudio         2.1.2
torchvision        0.16.2
tqdm               4.66.1
transformers       4.36.2
triton             2.1.0
typing_extensions  4.7.1
uform              1.0.3      /home/louis/lab/uform/uform
urllib3            1.26.18
wheel              0.41.2
```
Click to show full conda list

```
# packages in environment at /home/louis/miniconda3/envs/uform:
#
# Name                Version      Build                          Channel
_libgcc_mutex         0.1          main
_openmp_mutex         5.1          1_gnu
blas                  1.0          mkl
brotli-python         1.0.9        py311h6a678d5_7
bzip2                 1.0.8        h7b6447c_0
ca-certificates       2023.12.12   h06a4308_0
certifi               2023.11.17   py311h06a4308_0
cffi                  1.16.0       py311h5eee18b_0
charset-normalizer    2.0.4        pyhd3eb1b0_0
cryptography          41.0.7       py311hdda0065_0
cuda-cudart           11.8.89      0                              nvidia
cuda-cupti            11.8.87      0                              nvidia
cuda-libraries        11.8.0       0                              nvidia
cuda-nvrtc            11.8.89      0                              nvidia
cuda-nvtx             11.8.86      0                              nvidia
cuda-runtime          11.8.0       0                              nvidia
ffmpeg                4.3          hf484d3e_0                     pytorch
filelock              3.13.1       py311h06a4308_0
freetype              2.12.1       h4a9f257_0
fsspec                2023.12.2    pypi_0                         pypi
giflib                5.2.1        h5eee18b_3
gmp                   6.2.1        h295c915_3
gmpy2                 2.1.2        py311hc9b5ff0_0
gnutls                3.6.15       he1e5248_0
huggingface-hub       0.20.1       pypi_0                         pypi
idna                  3.4          py311h06a4308_0
iniconfig             2.0.0        pypi_0                         pypi
intel-openmp          2023.1.0     hdb19cb5_46306
jinja2                3.1.2        py311h06a4308_0
jpeg                  9e           h5eee18b_1
lame                  3.100        h7b6447c_0
lcms2                 2.12         h3be6417_0
ld_impl_linux-64      2.38         h1181459_1
lerc                  3.0          h295c915_0
libcublas             11.11.3.6    0                              nvidia
libcufft              10.9.0.58    0                              nvidia
libcufile             1.8.1.2      0                              nvidia
libcurand             10.3.4.101   0                              nvidia
libcusolver           11.4.1.48    0                              nvidia
libcusparse           11.7.5.86    0                              nvidia
libdeflate            1.17         h5eee18b_1
libffi                3.4.4        h6a678d5_0
libgcc-ng             11.2.0       h1234567_1
libgomp               11.2.0       h1234567_1
libiconv              1.16         h7f8727e_2
libidn2               2.3.4        h5eee18b_0
libjpeg-turbo         2.0.0        h9bf148f_0                     pytorch
libnpp                11.8.0.86    0                              nvidia
libnvjpeg             11.9.0.86    0                              nvidia
libpng                1.6.39       h5eee18b_0
libstdcxx-ng          11.2.0       h1234567_1
libtasn1              4.19.0       h5eee18b_0
libtiff               4.5.1        h6a678d5_0
libunistring          0.9.10       h27cfd23_0
libuuid               1.41.5       h5eee18b_0
libwebp               1.3.2        h11a3e52_0
libwebp-base          1.3.2        h5eee18b_0
llvm-openmp           14.0.6       h9e868ea_0
lz4-c                 1.9.4        h6a678d5_0
markupsafe            2.1.1        py311h5eee18b_0
mkl                   2023.1.0     h213fc3f_46344
mkl-service           2.4.0        py311h5eee18b_1
mkl_fft               1.3.8        py311h5eee18b_0
mkl_random            1.2.4        py311hdb19cb5_0
mpc                   1.1.0        h10f8cd9_1
mpfr                  4.0.2        hb69a4c5_1
mpmath                1.3.0        py311h06a4308_0
ncurses               6.4          h6a678d5_0
nettle                3.7.3        hbbd107a_1
networkx              3.1          py311h06a4308_0
numpy                 1.26.2       py311h08b1b3b_0
numpy-base            1.26.2       py311hf175353_0
openh264              2.1.1        h4ff587b_0
openjpeg              2.4.0        h3ad879b_0
openssl               3.0.12       h7f8727e_0
packaging             23.2         pypi_0                         pypi
pillow                10.0.1       py311ha6cbd5a_0
pip                   23.3.1       py311h06a4308_0
pluggy                1.3.0        pypi_0                         pypi
pycparser             2.21         pyhd3eb1b0_0
pyopenssl             23.2.0       py311h06a4308_0
pysocks               1.7.1        py311h06a4308_0
pytest                7.4.4        pypi_0                         pypi
python                3.11.5       h955ad1f_0
pytorch               2.1.2        py3.11_cuda11.8_cudnn8.7.0_0   pytorch
pytorch-cuda          11.8         h7e8668a_5                     pytorch
pytorch-mutex         1.0          cuda                           pytorch
pyyaml                6.0.1        py311h5eee18b_0
readline              8.2          h5eee18b_0
regex                 2023.12.25   pypi_0                         pypi
requests              2.31.0       py311h06a4308_0
safetensors           0.4.1        pypi_0                         pypi
setuptools            68.2.2       py311h06a4308_0
sqlite                3.41.2       h5eee18b_0
sympy                 1.12         py311h06a4308_0
tbb                   2021.8.0     hdb19cb5_0
tk                    8.6.12       h1ccaba5_0
tokenizers            0.15.0       pypi_0                         pypi
torchaudio            2.1.2        py311_cu118                    pytorch
torchtriton           2.1.0        py311                          pytorch
torchvision           0.16.2       py311_cu118                    pytorch
tqdm                  4.66.1       pypi_0                         pypi
transformers          4.36.2       pypi_0                         pypi
typing_extensions     4.7.1        py311h06a4308_0
tzdata                2023c        h04d1e81_0
uform                 1.0.3        pypi_0                         pypi
urllib3               1.26.18      py311h06a4308_0
wheel                 0.41.2       py311h06a4308_0
xz                    5.4.5        h5eee18b_0
yaml                  0.2.5        h7b6447c_0
zlib                  1.2.13       h5eee18b_0
zstd                  1.5.5        hc292b87_0
```
lmmx commented 6 months ago

These are the results I get on a 3090; I'm not sure whether they're meant to correspond to the table in the README or whether something's changed:

UForm-Gen
Throughput: 193.65 tokens/s (run 1)
Throughput: 198.49 tokens/s (run 2)
LLaVA
Throughput: 164.27 tokens/s (run 1)
Throughput: 166.39 tokens/s (run 2)
InstructBLIP
Throughput: 167.85 tokens/s (run 1)
Throughput: 165.90 tokens/s (run 2)
UForm-English
Throughput: 10.68 images/s (run 1)
Throughput: 12.66 images/s (run 2)
Throughput: 202.97 queries/s (run 1)
Throughput: 203.07 queries/s (run 2)
UForm-Multilingual
Throughput: 11.95 images/s (run 1)
Throughput: 12.49 images/s (run 2)
Throughput: 235.77 queries/s (run 1)
Throughput: 240.95 queries/s (run 2)