Does one need cuDNN installed for this to work, or is CUDA enough?
Hi, the FLAN-T5-gist model on HuggingFace has 11B parameters and needs around 20-30 GB of VRAM in bf16 inference mode. If you have less GPU VRAM you have two options: (1) look into lower-precision inference, e.g. https://github.com/TimDettmers/bitsandbytes, or (2) train a smaller gist model from scratch (the training commands in the README support this, but unfortunately I don't have checkpoints for smaller gist models).
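As a rough sanity check on that number, a back-of-the-envelope estimate (weights only; actual usage also includes activations and allocator overhead):

n_params = 11e9      # parameter count of FLAN-T5-XXL
bytes_per_param = 2  # bf16 stores each parameter in 2 bytes
print(f"{n_params * bytes_per_param / 1e9:.0f} GB for the weights alone")  # ~22 GB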
Is the codebase configured to automatically run everything in parallel across multiple GPUs (if available)? Or do we have to manually step in and run every instance of a HuggingFace model in nn.DataParallel mode?
If you want to train with multiple GPUs in parallel, I would use the deepspeed command as shown in run_deepspeed.sh. I haven't tried HuggingFace's automatic data-parallel functionality with this codebase.
Oh no, I meant just trying to do large model runs, because when I use multiple GPUs it only uses one, and the only way I saw to circumvent that was HF data parallelism, which I don't see inside compress.py. Curious to know if you have some config set up to make use of all available memory.
Ah, for compress.py there isn't any automatic data-parallel functionality, sorry!
My apologies, I meant model parallelism. I tried modifying compress.py (using your T5 model on HuggingFace) and passed load_in_8bit=True, device_map="auto", and this quantization config into the from_pretrained method when it is called, but I was still unable to get it to load onto multiple GPUs. Any pointers? Here is the quantization config:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,        # quantize linear-layer weights to int8 at load time
    llm_int8_threshold=6.0,   # outlier threshold used by the LLM.int8() scheme
)
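For context, this is roughly the shape of the call being described, a sketch using the stock transformers API and the quantization config above; compress.py loads its own gist model classes rather than the stock ones, so loading the checkpoint this way is illustrative only and won't by itself give you gist behaviour.

from transformers import AutoModelForSeq2SeqLM

# device_map="auto" asks accelerate to shard the quantized weights across the
# visible GPUs (and CPU if needed); this is weight placement, not data parallelism.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "jayelm/flan-t5-xxl-gist-1",              # gist T5 checkpoint mentioned later in this thread
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)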
Unfortunately I can't be of much help here, as I haven't used BitsAndBytes myself. You might ask in the BitsAndBytes repo for help, as getting this working should be the same process as getting any default HF model working? Sorry!
@FFFiend (commenting here so it's cleaner)
Given the discussion in this issue, I was under the impression you were trying to add some sort of data/model parallelism to compress.py that it currently doesn't support, which I probably can't help with much. However, in #9 it seems like you're having trouble getting any version of compress.py to work; is that correct?
Note that for the base compress.py, you don't need 4x A100 80GB GPUs. In fact, you should only need a single 40GB GPU. It seems like you have access to 4x A100 40GB, which should be enough. Can you try running unmodified compress.py using just one of the GPUs (e.g. by setting CUDA_VISIBLE_DEVICES=0)? Do you still run into issues?
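If you end up launching compress from Python rather than the shell (as in the Modal setup described further down), one way to pin the process to a single GPU is to set the variable before anything initializes CUDA. A minimal sketch, assuming no GPU work has happened yet in the process:

import os

# CUDA_VISIBLE_DEVICES is only honored if it is set before torch initializes CUDA,
# so do this before importing anything that touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# ...then import and call compress as usual.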
I ran it on a single A100 40 GB GPU initially, but I kept running into an Out of Memory error each time I tried, which led me to explore model parallelism. And yep, still unable to run it, unfortunately. I even tried modifying the T5ForConditionalGeneration class for model parallelism to that end haha :)
@jayelm was wondering if you had any thoughts about this^ 🙏
Can you give more details of the command you are trying to run that OOMs? FLAN-T5-XXL in inference mode should be doable on a 40GB GPU in bfloat16 mode. What prompt are you trying to compress? Did you try the example in the README, namely
python -m src.compress --model_name_or_path jayelm/llama-7b-gist-1 --base_llama_path llama-7b \
--instruction "Name the top cities in France that should not be missed. Include the best aspects of each place as well."
I ran that exact command, albeit with shorter prompts than the one in the example. Unfortunately I ran into OOMs all the way.
Also, tangentially related: wouldn't the code for compress need modification if we want to run it with more than one GPU? I don't see device_map="auto" anywhere, so I'm curious how you run it on your multi-GPU setup.
I ran that exact command, albeit with shorter prompts than the one in the example. Unfortunately I ran into OOMs all the way.
You'll need to give me a bit more detail so I can help debug—can you paste the exact command you ran, the exact traceback you get, the machine configuration you're running on, and any intuition about where in the script the GPU memory OOMs (e.g. during the model loading phase, the prompt compression phase, or the sampling phase)?
Also, tangentially related: wouldn't the code for compress need modification if we want to run it with more than one GPU? I don't see device_map="auto" anywhere, so I'm curious how you run it on your multi-GPU setup.
compress.py does not need, nor is it designed to be run on, more than one GPU. In fact, it may even break if you run it on a machine with more than one GPU and Huggingface tries to map to all GPUs (e.g. without setting CUDA_VISIBLE_DEVICES). Multi-GPU is only needed for model training.
So I actually created a Python module from the repo that I then used with this cloud computing service, https://modal.com, and executed this piece of code: compress.main("jayelm/flan-t5-xxl-gist-1","Show me compression:")
using one A100 40 GB GPU that they offer. I'm not sure how much CPU RAM their cloud instances have, but I was able to get compress to run up until it hit this error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 39.42 GiB total capacity; 38.90 GiB already allocated; 11.00 MiB free; 38.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
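For reference, the allocator option the error message points to can be set as below; note that in this trace reserved and allocated memory are both about 38.9 GiB, so fragmentation is not the underlying problem and this setting alone is unlikely to help. The 128 MiB split size is just an illustrative value.

import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so set it before any GPU work happens in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"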
It's a bit hard for me to debug since you've wrapped this in additional code. Are you able to just clone this repo and run exactly the example command in the README on your cloud computing service to see whether it OOMs? I want to verify it's an issue solely with my codebase.
Alright, so I copied nearly the exact instruction (I had to use decapoda-research/llama-7b-hf) and avoided the OOM error :D, but now I run into this:
Compressing instruction
Traceback (most recent call last):
File "/pkg/modal/_container_entrypoint.py", line 330, in handle_input_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 403, in call_function_sync
res = fun(*args, **kwargs)
File "/root/pipe.py", line 27, in complete
compress.main(model_name_or_path="jayelm/llama-7b-gist-1",base_llama_path="decapoda-research/llama-7b-hf",
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/compress.py", line 148, in main
gist_activations = model.get_gist_activations(
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 643, in get_gist_activations
model_outputs = self.model(
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 583, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 315, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/gisting_test/src/gist_llama.py", line 206, in forward
query_states, key_states = apply_rotary_pos_emb(
TypeError: apply_rotary_pos_emb() got an unexpected keyword argument 'offset'
See #10, your transformers version is likely wrong.
I'm unable to actually install that specific transformers version; curious how it works on your end 🤔
Did you try pip install -r requirements.txt? Does it throw an error?
Assuming you cloned my repository, you should alternatively be able to clone the huggingface transformers repository as well, check out the relevant commit, then do pip install -e . in the repo directory to install the package locally.
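If it's unclear which transformers build is actually being picked up at runtime, a quick check from Python; compare the printed version against the one pinned in requirements.txt:

import transformers

print(transformers.__version__)  # should match the version pinned in requirements.txt
print(transformers.__file__)     # shows whether the local/editable install is the one in use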
SOLVED! Thank you for your patience Jesse!
Glad to hear! So you were able to run the compress command and it doesn't OOM?
yep, at long last haha
Basically the title: my spec is 6 GB of VRAM on a 1070. I used a gist model, specifically your flan-t5-gist model on HuggingFace, along with bf16 precision as suggested inside compress, but I keep running into a CUDA Out of Memory error. Is there a minimum amount of VRAM a system needs before it can make use of gisting? (In another issue you pointed at 12 GB being able to work, so I'm guessing my only option is to use Accelerate.)