merveenoyan / smol-vision

Recipes for shrinking, optimizing, customizing cutting edge vision models. 💜
Apache License 2.0

Running on H100? #9

Open joris-sense opened 2 weeks ago

joris-sense commented 2 weeks ago

Hey, when trying to run Idefics_FT.ipynb on an H100 machine, I seem to be getting the problem described here. Is there a way around this, maybe using something other than bitsandbytes?

merveenoyan commented 2 weeks ago

@joris-sense I ran on an A100 instance and not an H100 :( can't you do only LoRA or full FT since you have access to an H100?
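
(For reference, a LoRA-only setup without quantization would look roughly like the sketch below; the rank, dropout, and target modules are illustrative assumptions, not the notebook's exact values.)

    # Minimal LoRA-only sketch (no 4-bit quantization), assuming the peft library
    # and that `model` is the already-loaded Idefics3 model. Hyperparameters and
    # target modules here are guesses, not copied from the notebook.
    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        init_lora_weights="gaussian",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # shows how few params are actually trained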

joris-sense commented 2 weeks ago

I am working on it right now, and the training loop seems to work when I replace bitsandbytes with quanto =)

So I use

from transformers import QuantoConfig

if USE_QLORA:
    quanto_config = QuantoConfig(weights="int4")

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quanto_config if USE_QLORA else None,  # was bnb_config
    _attn_implementation="flash_attention_2",
)
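
(For comparison, the bnb_config this replaces is presumably the usual 4-bit bitsandbytes setup; a sketch with typical QLoRA values, not necessarily the notebook's exact config:)

    # Typical 4-bit bitsandbytes config that QuantoConfig stands in for here.
    # Values are common QLoRA defaults, assumed rather than copied from the notebook.
    import torch
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store base weights in 4-bit
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    )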

Does this have disadvantages compared to bitsandbytes, or is there something else I should use?

merveenoyan commented 2 weeks ago

@joris-sense I think they're the same thing; if anything, Quanto is more up to date

joris-sense commented 2 weeks ago

@joris-sense I ran on an A100 instance and not an H100 :( can't you do only LoRA or full FT since you have access to an H100?

My understanding was that the main advantage of LoRA/QLoRA is the reduced memory requirement rather than improved speed? In any case, trying it out, the H100 is similarly fast with all three methods.

Thinking about it, why does the Jupyter notebook take 50 GB of VRAM even when training a QLoRA model with 8B parameters? Shouldn't it be a lot less with 4-bit weights, on the order of 4 GB?
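
(A quick back-of-the-envelope sketch of where that 4 GB intuition comes from, using assumed round numbers:)

    # Rough VRAM arithmetic for an 8B-parameter model; illustrative numbers only.
    params = 8e9

    weights_4bit_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per weight -> ~4 GB
    weights_bf16_gb = params * 2.0 / 1e9   # unquantized bf16 weights      -> ~16 GB

    print(f"~{weights_4bit_gb:.0f} GB for 4-bit weights, ~{weights_bf16_gb:.0f} GB for bf16 weights")
    # On top of the weights come the LoRA adapters, their gradients and Adam states,
    # plus activations for the long image-token sequences, so real usage ends up
    # noticeably higher than the raw weight footprint.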

merveenoyan commented 2 weeks ago

@joris-sense I forgot to mention it there, but my training setup was only freezing the image encoder and not doing LoRA training. I have now uploaded new versions of the notebook and script that are much more QLoRA focused, and I realized it takes around 17 GB of VRAM, with 0.002% of params being trained. Can you try?
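
(One way to verify that trainable-parameter fraction in your own run; a generic PyTorch sketch, where model is whatever comes back after applying LoRA:)

    # Count trainable vs. total parameters; generic PyTorch, nothing notebook-specific.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")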

joris-sense commented 1 week ago

That still seems like a lot to me; it looks like the model's weights are stored unquantized, as this is more than 2*8 GB of VRAM (as I understand it, with QLoRA the model weights should be stored quantized as well).

I didn't get your new script to work on my machine with QLoRA, so I am sticking to full fine-tuning for now. I seem to get different errors on each execution, but one of them was that FlashAttention complains during inference (which I copied from your last version -- I am also missing an inference part in the new one) that the model's weights are stored in float32 (as seen by

dtype = next(model.parameters()).dtype
print(dtype)

) and it throws the error "FlashAttention only support fp16 and bf16 data type". If I convert the weights, the model does run inference, but the output seems unrelated to what it was trained on. I also didn't get below 50 GB of VRAM and sometimes still get out-of-memory errors, and with your new global variables the script didn't seem to find a GPU on a single-H100 setup (I know this is probably too vague to act on; maybe I will figure out more and put together a more reproducible report over the weekend). Also note that your notebook's default settings are still USE_LORA=False and USE_QLORA=False, and it still references model before it is defined in cell 8.
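
(The FlashAttention dtype complaint above can usually be silenced by casting the loaded weights before inference; a generic workaround sketch, not necessarily what the notebook intends:)

    # FlashAttention-2 only supports fp16/bf16, so if the weights come back in
    # float32, cast them before running inference. Generic workaround sketch.
    import torch

    model = model.to(torch.bfloat16)
    print(next(model.parameters()).dtype)  # should now be torch.bfloat16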