huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

TypeError: _to_copy() takes from 2 to 3 positional arguments but 4 were given #289

Open arseniybelkov opened 3 weeks ago

arseniybelkov commented 3 weeks ago

I use the following code:

import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",
    torch_dtype=torch.float16,
    device_map="auto",
).eval()

from optimum.quanto import quantize, qint8, freeze

quantize(model, weights=qint8, activations=qint8)
freeze(model)

with torch.amp.autocast("cuda"):
    text_features = model.generate(
        input_ids=input_ids_batch,
        pixel_values=pixel_values_batch,
        attention_mask=attention_mask_batch,
        max_new_tokens=100, do_sample=False,
    )

It results in the following error (don't pay attention to the Thread [127] prefix; this all runs inside a Triton Inference Server):

Thread [127] had error: in ensemble 'ensemble_model', Failed to process the request(s) for model instance 'generator_0_0', message: TypeError: _to_copy() takes from 2 to 3 positional arguments but 4 were given

At:
  /usr/local/lib/python3.10/dist-packages/optimum/quanto/tensor/qbytes.py(130): __torch_dispatch__
  /usr/local/lib/python3.10/dist-packages/optimum/quanto/tensor/qtensor.py(93): __torch_function__
  /usr/local/lib/python3.10/dist-packages/transformers/models/paligemma/modeling_paligemma.py(314): _merge_input_ids_with_image_features
  /usr/local/lib/python3.10/dist-packages/transformers/models/paligemma/modeling_paligemma.py(435): forward
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1562): _call_impl
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1553): _wrapped_call_impl
  /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py(2982): _sample
  /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py(2024): generate
  /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(116): decorate_context
  /usr/src/app/model_repository/generator/1/model.py(80): execute
dacorvo commented 2 weeks ago

On the current main, this works:

import torch
import requests

from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from optimum.quanto import quantize, qint8, freeze

model_id = "google/paligemma-3b-mix-224"

processor = AutoProcessor.from_pretrained(model_id)
prompt = "What are the cats doing ?"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda')

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to('cuda').eval()

quantize(model, weights=qint8)
freeze(model)

generate_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

But it does not work if you also quantize the activations (see #299).
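For reference, a minimal sketch of the activation-quantization path that triggers the error, assuming the same model_id and inputs as in the snippet above; Calibration is optimum-quanto's context manager for recording activation ranges before freezing:

from optimum.quanto import quantize, qint8, freeze, Calibration

# Quantizing activations in addition to weights is what triggers the error.
quantize(model, weights=qint8, activations=qint8)

# Activation quantization needs a calibration pass so activation ranges are
# recorded before the model is frozen (the single sample here stands in for
# a proper calibration set).
with Calibration():
    model(**inputs)

freeze(model)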

arseniybelkov commented 2 weeks ago

@dacorvo Regarding quantizing only the weights: I ran your example, but the difference in memory usage is almost negligible. Could you please explain why that's happening?

dacorvo commented 2 weeks ago

@arseniybelkov did you empty the CUDA cache after freezing the model? Not all of the GPU memory saved by quantization may have been released yet.

arseniybelkov commented 2 weeks ago

Yeah, I already emptied the cache.

arseniybelkov commented 2 weeks ago

If you mean doing:

import gc
import torch

gc.collect()
torch.cuda.empty_cache()
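One way to check whether the savings actually materialize (a minimal sketch; report_cuda_memory is a hypothetical helper, and torch.cuda.memory_allocated only counts live tensors, whereas nvidia-smi also shows the caching allocator's reserved pool):

import gc
import torch

def report_cuda_memory(tag):
    # memory_allocated: bytes held by live tensors; memory_reserved: bytes kept
    # by PyTorch's caching allocator (roughly what nvidia-smi reflects).
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB")

report_cuda_memory("before quantization")
quantize(model, weights=qint8)
freeze(model)
gc.collect()
torch.cuda.empty_cache()
report_cuda_memory("after freeze + empty_cache")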