Does one need cuDNN installed for this to work, or is CUDA enough?
Hi, the FLAN-T5-gist model on HuggingFace has 11B parameters and needs around 20-30 GB of VRAM in bf16 inference mode. If you have less GPU VRAM you have two options: (1) look into lower-precision inference, e.g. https://github.com/TimDettmers/bitsandbytes, or (2) train a smaller gist model from scratch (the training commands in the README support this, but unfortunately I don't have checkpoints for smaller gist models).
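As a rough sanity check on that number, a back-of-the-envelope estimate (weights only; actual usage also includes activations and allocator overhead):

n_params = 11e9      # parameter count of FLAN-T5-XXL
bytes_per_param = 2  # bf16 stores each parameter in 2 bytes
print(f"{n_params * bytes_per_param / 1e9:.0f} GB for the weights alone")  # ~22 GB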
Is the codebase configured to automatically run everything in parallel across multiple GPUs (if available)? Or do we have to manually step in and run every instance of a HuggingFace model in nn.DataParallel mode?
If you want to train with multiple GPUs in parallel, I would use the deepspeed command as shown in run_deepspeed.sh. I haven't tried HuggingFace's automatic data-parallel functionality with this codebase.
Oh no, I meant just trying to do large model runs, because when I use multiple GPUs it only uses one, and the only way I saw to circumvent that was HF data parallelism, which I don't see inside compress.py. Curious to know if you have some config set up to make use of all available memory.
Ah, for compress.py there isn't any automatic data-parallel functionality, sorry!
My apologies, I meant model parallelism. I tried modifying compress.py (using your T5 model on HuggingFace) and passed load_in_8bit=True, device_map="auto", and this quantization config into the from_pretrained method when it is called, but I was still unable to get it to load onto multiple GPUs. Any pointers? Here is the quantization config:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,        # quantize linear-layer weights to int8 at load time
    llm_int8_threshold=6.0,   # outlier threshold used by the LLM.int8() scheme
)
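For context, this is roughly the shape of the call being described, a sketch using the stock transformers API and the quantization config above; compress.py loads its own gist model classes rather than the stock ones, so loading the checkpoint this way is illustrative only and won't by itself give you gist behaviour.

from transformers import AutoModelForSeq2SeqLM

# device_map="auto" asks accelerate to shard the quantized weights across the
# visible GPUs (and CPU if needed); this is weight placement, not data parallelism.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "jayelm/flan-t5-xxl-gist-1",              # gist T5 checkpoint mentioned later in this thread
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)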
Unfortunately I can't be of much help here, as I haven't used BitsAndBytes myself. You might ask in the BitsAndBytes repo for help, as getting this working should be the same process as getting any default HF model working? Sorry!
@FFFiend (commenting here so it's cleaner)
Given the discussion in this issue, I was under the impression you were trying to add some sort of data/model parallelism to compress.py that it currently doesn't support, which I probably can't help with much. However, in #9 it seems like you're having trouble getting any version of compress.py to work; is that correct?
Note that for the base compress.py, you don't need 4x A100 80GB GPUs. In fact, you should only need a single 40GB GPU. It seems like you have access to 4x A100 40GB, which should be enough. Can you try running unmodified compress.py using just one of the GPUs (e.g. by setting CUDA_VISIBLE_DEVICES=0)? Do you still run into issues?
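If you end up launching compress from Python rather than the shell (as in the Modal setup described further down), one way to pin the process to a single GPU is to set the variable before anything initializes CUDA. A minimal sketch, assuming no GPU work has happened yet in the process:

import os

# CUDA_VISIBLE_DEVICES is only honored if it is set before torch initializes CUDA,
# so do this before importing anything that touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# ...then import and call compress as usual.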
I ran it on a single A100 40 GB GPU initially, but I kept running into an Out of Memory error each time I tried, which led me to explore model parallelism. And yep, still unable to run it, unfortunately. I even tried modifying the T5ForConditionalGeneration class for model parallelism to that end haha :)
@jayelm was wondering if you had any thoughts about this^ 🙏
Can you give more details of the command you are trying to run that OOMs? FLAN-T5-XXL in inference mode should be doable on a 40GB GPU in bfloat16 mode. What prompt are you trying to compress? Did you try the example in the README, namely
python -m src.compress --model_name_or_path jayelm/llama-7b-gist-1 --base_llama_path llama-7b \
--instruction "Name the top cities in France that should not be missed. Include the best aspects of each place as well."
I ran that exact command, albeit with shorter prompts than the one in the example. Unfortunately I ran into OOMs all the way.
Also, tangentially related: wouldn't the code for compress need modification if we want to run it with more than one GPU? I don't see device_map="auto" anywhere, so I'm curious how you run it on your multi-GPU setup.
I ran that exact command, albeit with shorter prompts than the one in the example. Unfortunately I ran into OOMs all the way.
You'll need to give me a bit more detail so I can help debug—can you paste the exact command you ran, the exact traceback you get, the machine configuration you're running on, and any intuition about where in the script the GPU memory OOMs (e.g. during the model loading phase, the prompt compression phase, or the sampling phase)?
Also, tangentially related: wouldn't the code for compress need modification if we want to run it with more than one GPU? I don't see device_map="auto" anywhere, so I'm curious how you run it on your multi-GPU setup.
compress.py does not need, nor is it designed to be run on, more than one GPU. In fact, it may even break if you run it on a machine with more than one GPU and Huggingface tries to map to all GPUs (e.g. without setting CUDA_VISIBLE_DEVICES). Multi-GPU is only needed for model training.
So I actually created a Python module from the repo that I then used with this cloud computing service, https://modal.com, and executed this piece of code: compress.main("jayelm/flan-t5-xxl-gist-1","Show me compression:")
using one A100 40 GB GPU that they offer. I'm not sure how much CPU RAM their cloud instances have, but I was able to get compress to run up until it hit this error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 39.42 GiB total capacity; 38.90 GiB already allocated; 11.00 MiB free; 38.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
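For reference, the allocator option the error message points to can be set as below; note that in this trace reserved and allocated memory are both about 38.9 GiB, so fragmentation is not the underlying problem and this setting alone is unlikely to help. The 128 MiB split size is just an illustrative value.

import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so set it before any GPU work happens in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"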
It's a bit hard for me to debug since you've wrapped this in additional code. Are you able to just clone this repo and run exactly the example command in the README on your cloud computing service to see whether it OOMs? I want to verify it's an issue solely with my codebase.
Alright, so I copied nearly the exact instruction (I had to use decapoda-research/llama-7b-hf) and avoided the OOM error :D, but now I run into this:
Compressing instruction
Traceback (most recent call last):
File "/pkg/modal/_container_entrypoint.py", line 330, in handle_input_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 403, in call_function_sync
res = fun(*args, **kwargs)
File "/root/pipe.py", line 27, in complete
compress.main(model_name_or_path="jayelm/llama-7b-gist-1",base_llama_path="decapoda-research/llama-7b-hf",
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/compress.py", line 148, in main
gist_activations = model.get_gist_activations(
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 643, in get_gist_activations
model_outputs = self.model(
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 583, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 315, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 206, in forward
query_states, key_states = apply_rotary_pos_emb(
TypeError: apply_rotary_pos_emb() got an unexpected keyword argument 'offset'
See #10, your transformers version is likely wrong.
I'm unable to actually install that specific transformers version; curious how it works on your end 🤔
Did you try pip install -r requirements.txt? Does it throw an error?
Assuming you cloned my repository, you should alternatively be able to clone the huggingface transformers repository as well, check out the relevant commit, then do pip install -e . in the repo directory to install the package locally.
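If it's unclear which transformers build is actually being picked up at runtime, a quick check from Python; compare the printed version against the one pinned in requirements.txt:

import transformers

print(transformers.__version__)  # should match the version pinned in requirements.txt
print(transformers.__file__)     # shows whether the local/editable install is the one in use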
SOLVED! Thank you for your patience Jesse!
Glad to hear! So you were able to run the compress command and it doesn't OOM?
yep, at long last haha
Basically the title: my spec is 6 GB of VRAM on a 1070. I used a gist model, specifically your flan-t5-gist model on HuggingFace, along with bf16 precision as suggested inside compress, but I keep running into a CUDA Out of Memory error. Is there a minimum amount of VRAM a system needs before it can make use of gisting? (In another issue you pointed at 12 GB being able to work, so I'm guessing my only option is to use Accelerate.)