meta-llama / llama

Inference code for Llama models

CUDA out of memory error with RTX 4070 12 GB #466

Open · dhruvildarji opened 1 year ago

dhruvildarji commented 1 year ago

I am trying to understand what I am doing wrong here.

Is it true that even the smallest Llama 2 checkpoint (llama-2-7b/consolidated.00.pth) is 13 GB? Is that why it is not working on my 12 GB Nvidia 4070 GPU?

Is there any workaround?

Here is the error I am receiving.

```
idea@myidea:~/dhruvil/git/llama$ torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/idea/dhruvil/git/llama/example_text_completion.py", line 55, in <module>
    fire.Fire(main)
  File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/idea/dhruvil/git/llama/example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "/home/idea/dhruvil/git/llama/llama/generation.py", line 96, in build
    model = Transformer(model_args)
  File "/home/idea/dhruvil/git/llama/llama/model.py", line 259, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/idea/dhruvil/git/llama/llama/model.py", line 222, in __init__
    self.feed_forward = FeedForward(
  File "/home/idea/dhruvil/git/llama/llama/model.py", line 207, in __init__
    self.w3 = ColumnParallelLinear(
  File "/home/idea/.local/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 262, in __init__
    self.weight = Parameter(torch.Tensor(self.output_size_per_partition, self.in_features))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.72 GiB total capacity; 10.93 GiB already allocated; 59.19 MiB free; 10.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 330097) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/idea/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-07-20_16:08:32
  host       : myidea
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 330097)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
realhaik commented 1 year ago

Yes, I think the minimum VRAM for 7B is 16 GB.
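
For a rough sanity check on that number (assuming ~7B parameters stored in half precision, 2 bytes each), the weights alone already come to about 13 GiB before any activations or KV cache are allocated:

```shell
# ~7e9 parameters * 2 bytes (half precision), converted to GiB.
# This counts only the weights, not activations or the KV cache.
python3 -c 'print(f"{7e9 * 2 / 2**30:.1f} GiB")'
# -> 13.0 GiB, which already exceeds a 12 GiB card
```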

aroncds commented 1 year ago

I think it should work. I tried with a Ryzen 3600X, 32 GB RAM, and a 1070 Ti 8 GB, and it works.

dhruvildarji commented 1 year ago

Did you try with 3 GPUs together, or individually?

For me, it doesn't work on a single GPU.

Can you tell me how you made it work with one GPU?


aroncds commented 1 year ago

Individually.

I think I have not done anything different.

I used Ubuntu on Windows via WSL.

  1. I installed CUDA toolkit 11.7

  2. I installed the requirements, but I used a different torch package -> pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117

  3. And I tested "torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4"
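
If it still runs out of memory on the 12 GB card, the OOM message itself points at two knobs: the allocator's max_split_size_mb setting and the generation parameters. A rough sketch of what could be tried (the value 128 for max_split_size_mb is just an example, and shrinking max_seq_len/max_batch_size only trims activation and KV-cache memory, so it may not be enough if the half-precision weights alone do not fit):

```shell
# Suggested by the OOM message: cap allocator block splitting to reduce fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Keep the KV cache and batch as small as possible while debugging.
torchrun --nproc_per_node 1 example_chat_completion.py \
  --ckpt_dir llama-2-7b-chat/ \
  --tokenizer_path tokenizer.model \
  --max_seq_len 128 --max_batch_size 1
```
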

dhruvildarji commented 1 year ago

Interesting!!

I am doing the same thing, but it still gives me this error. I am not sure how to debug this any further. I installed the same package as yours: pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117

```
idea@myidea:~/dhruvil/git/llama$ torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
/home/idea/.local/lib/python3.10/site-packages/torch/__init__.py:615: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/home/idea/dhruvil/git/llama/example_chat_completion.py", line 73, in <module>
    fire.Fire(main)
  File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/idea/dhruvil/git/llama/example_chat_completion.py", line 20, in main
    generator = Llama.build(
  File "/home/idea/dhruvil/git/llama/llama/generation.py", line 96, in build
    model = Transformer(model_args)
  File "/home/idea/dhruvil/git/llama/llama/model.py", line 259, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/idea/dhruvil/git/llama/llama/model.py", line 222, in __init__
    self.feed_forward = FeedForward(
  File "/home/idea/dhruvil/git/llama/llama/model.py", line 207, in __init__
    self.w3 = ColumnParallelLinear(
  File "/home/idea/.local/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 262, in __init__
    self.weight = Parameter(torch.Tensor(self.output_size_per_partition, self.in_features))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 11.72 GiB of which 93.19 MiB is free. Including non-PyTorch memory, this process has 11.43 GiB memory in use. Of the allocated memory 10.77 GiB is allocated by PyTorch, and 1.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-20 18:45:19,855] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 330867) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/idea/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-07-20_18:45:19
  host       : myidea
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 330867)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
dhruvildarji commented 1 year ago

This is what my nvidia-smi output looks like.

I have a 4070 with 12 GB.

```
idea@myidea:~/dhruvil/git/llama$ nvidia-smi
Thu Jul 20 18:47:14 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4070       Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   40C    P8               2W / 200W |    197MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1834      G   /usr/lib/xorg/Xorg                          155MiB |
|    0   N/A  N/A      1978      G   /usr/bin/gnome-shell                         11MiB |
|    0   N/A  N/A     52927      G   ...76579054,1620300079093577791,262144       25MiB |
|    0   N/A  N/A    178873      G   gnome-control-center                          2MiB |
+---------------------------------------------------------------------------------------+
```


aroncds commented 1 year ago

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti     On  | 00000000:07:00.0  On |                  N/A |
|  0%   43C    P5              10W / 180W |    373MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

aroncds commented 1 year ago

I am not an expert in this, but maybe the number of CUDA cores affects how much memory is required; just sharing my thoughts. My card is a pretty old model now.

pzim-devdata commented 1 year ago

If you want to try Llama with a CPU-only installation, you can install https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama. The complete installation process (a condensed shell version is sketched after the list):

  1. Download the original version of Llama from https://github.com/facebookresearch/llama and extract it to a llama-main folder.
  2. Download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the llama-main folder.
  3. Run the download.sh script in a terminal and, when prompted, paste the URL you were given to start the download.
  4. Go to the llama-main folder.
  5. Create a Python 3 env: python3 -m venv env and activate it: source env/bin/activate
  6. Install the CPU version of PyTorch: python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu # for the CPU version
  7. Install the dependencies of llama: python3 -m pip install -e .
  8. If you have downloaded llama-2-7b, run:
    torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 1 #(instead of 4)
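
For convenience, here is a condensed shell version of the steps above (assuming the default llama-main folder name and that download.sh has already fetched llama-2-7b):

```shell
cd llama-main

# CPU-only environment
python3 -m venv env
source env/bin/activate
python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python3 -m pip install -e .

# Run the text-completion example on the CPU build
torchrun --nproc_per_node 1 example_text_completion.py \
  --ckpt_dir llama-2-7b/ \
  --tokenizer_path tokenizer.model \
  --max_seq_len 128 --max_batch_size 1
```
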
hdnh2006 commented 1 year ago

I tried with an RTX 2060 8 GB and 64 GB RAM and it doesn't work. I am impressed that you were able to deploy it on a local PC.

lolevsky commented 11 months ago

@dhruvildarji Were you able to solve the issue? I am trying to run on an RTX 4070 12 GB in Ubuntu and have the same issue.