huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0
755 stars, 55 forks

TypeError: _to_copy() takes from 2 to 3 positional arguments but 4 were given #289

Open arseniybelkov opened 3 weeks ago

arseniybelkov commented 3 weeks ago

I use the following code

from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(

from optimum.quanto import quantize, qint8, freeze

quantize(model, weights=qint8, activations=qint8)

with torch.amp.autocast("cuda"):
    text_features = model.generate(
        max_new_tokens=100, do_sample=False,

It results in the error below (ignore the Thread [127] prefix; everything runs inside Triton Inference Server):

Thread [127] had error: in ensemble 'ensemble_model', Failed to process the request(s) for model instance 'generator_0_0', message: TypeError: _to_copy() takes from 2 to 3 positional arguments but 4 were given

  /usr/local/lib/python3.10/dist-packages/optimum/quanto/tensor/ __torch_dispatch__
  /usr/local/lib/python3.10/dist-packages/optimum/quanto/tensor/ __torch_function__
  /usr/local/lib/python3.10/dist-packages/transformers/models/paligemma/ _merge_input_ids_with_image_features
  /usr/local/lib/python3.10/dist-packages/transformers/models/paligemma/ forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/ _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/ _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/transformers/generation/ _sample
  /usr/local/lib/python3.10/dist-packages/transformers/generation/ generate
  /usr/local/lib/python3.10/dist-packages/torch/utils/ decorate_context
  /usr/src/app/model_repository/generator/1/ execute

dacorvo commented 2 weeks ago

On the current main, this works:

import torch
import requests

from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from optimum.quanto import quantize, qint8, freeze

model_id = "google/paligemma-3b-mix-224"

processor = AutoProcessor.from_pretrained(model_id)
prompt = "What are the cats doing ?"
url = ""
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda')

model = PaliGemmaForConditionalGeneration.from_pretrained(

quantize(model, weights=qint8)

generate_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

But it does not work if you also quantize the activations (see #299).

arseniybelkov commented 2 weeks ago

@dacorvo Regarding quantizing only the weights: I ran your example, but the difference in memory usage is almost negligible. Could you please explain why that's happening?
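
For a sense of scale, here is a back-of-the-envelope estimate of the memory taken by the weights at different precisions. It is only a sketch: the 3e9 parameter count is an assumption read off the "3b" in the model name, the dtype the model was originally loaded in is not shown in this thread, and the estimate ignores quantization scales, activations, and any cache built during generation.

```python
# Rough weight-memory estimate; every figure here is an illustrative assumption.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


# Assumption: roughly 3 billion parameters, based on "3b" in the model name.
ASSUMED_PARAMS = 3_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.2f} GiB")
```

Under these assumptions, int8 weights would take about a quarter of the float32 footprint, or half of the bfloat16 one, which is why a near-zero difference looks surprising.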

dacorvo commented 2 weeks ago

@arseniybelkov did you empty the CUDA cache after freezing the model? Some of the GPU memory saved by quantization might not have been released yet.
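
A minimal sketch of what that check could look like, assuming the model was quantized and frozen as in the example above. The helper names cuda_memory_mib and release_cached_memory are made up for illustration and are not part of optimum-quanto; only gc.collect() and the torch.cuda calls are real APIs.

```python
import gc

try:
    import torch
except ImportError:  # lets the helpers be imported on machines without PyTorch
    torch = None


def cuda_memory_mib():
    """Return (allocated, reserved) CUDA memory in MiB, or None without a GPU."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


def release_cached_memory():
    """Drop unreferenced Python objects, then return cached CUDA blocks to the driver."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return cuda_memory_mib()


# Intended use (hypothetical, after the calls shown earlier in this thread):
#   quantize(model, weights=qint8)
#   freeze(model)
#   print(release_cached_memory())
print(release_cached_memory())
```

The allocated figure counts live tensors, while the reserved figure also includes blocks held by PyTorch's caching allocator, so tools such as nvidia-smi can keep reporting high usage until the cache is emptied.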

arseniybelkov commented 2 weeks ago

yeah, I emptied that

arseniybelkov commented 2 weeks ago

if you mean doing
