Open arseniybelkov opened 3 weeks ago
On the current `main`, this works:
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from optimum.quanto import quantize, qint8, freeze

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)

prompt = "What are the cats doing ?"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to("cuda").eval()

quantize(model, weights=qint8)
freeze(model)

generate_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```
But it does not work if you also quantize the activations (see #299).
@dacorvo regarding quantizing only the weights: I ran your example, but the difference in memory usage is almost negligible. Could you please explain why that's happening?
@arseniybelkov did you empty the CUDA cache after freezing the model? Not all of the GPU memory saved by quantization may have been released yet.
Yeah, I emptied it, if you mean doing:

```python
gc.collect()
torch.cuda.empty_cache()
```
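As an aside, an easy way to check whether freezing actually released memory is to snapshot the allocator stats before and after. A small sketch; `report_cuda_memory` is a hypothetical helper (not part of quanto), and it degrades to a no-op message on machines without a GPU:

```python
import gc
import torch

def report_cuda_memory(tag: str) -> dict:
    # Hypothetical helper: force a cleanup pass, then report CUDA memory in MiB.
    gc.collect()
    if not torch.cuda.is_available():
        print(f"{tag}: no CUDA device available")
        return {}
    torch.cuda.empty_cache()
    stats = {
        "allocated_mib": torch.cuda.memory_allocated() / 2**20,
        "reserved_mib": torch.cuda.memory_reserved() / 2**20,
    }
    print(f"{tag}: {stats}")
    return stats

# Call once after loading the fp16 model and once after quantize() + freeze(),
# then compare the "allocated_mib" numbers.
stats = report_cuda_memory("after freeze")
```

Note that `memory_reserved` can stay high even when `memory_allocated` drops, because the caching allocator holds on to freed blocks until `empty_cache()` returns them to the driver.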
I use the following code, and it results in the error below (don't pay attention to Thread[127]; it is all running inside Triton Inference Server).