juncongmoo / pyllama

LLaMA: Open and Efficient Foundation Language Models

Killed #62

Open javierp183 opened 1 year ago

javierp183 commented 1 year ago

Hello all, I installed the project's requirements, but when I try to execute the following command:

python -m llama.llama_quant decapoda-research/llama-7b-hf c4 --wbits 2 --save pyllama-7B2b.pt

I just got the message "Killed". Could you help me narrow down the issue and fix it? Thanks.
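
On Linux, a bare "Killed" message usually means the kernel's OOM killer terminated the process after it ran out of memory. A quick way to confirm that (standard kernel-log tooling, not specific to pyllama; may require sudo) is to check the log right after the process dies:

# look for an OOM-killer entry in the kernel log
sudo dmesg | grep -i -E "out of memory|killed process"
# or, on systemd-based distros, query the kernel log via journalctl
journalctl -k | grep -i "out of memory"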

raviteja5 commented 1 year ago

I get the same 'Killed' message when I run Single GPU inference without quantization on Linux:

python inference.py --ckpt_dir $CKPT_DIR --tokenizer_path $TOKENIZER_PATH

javierp183 commented 1 year ago

> I get the same 'Killed' message when I run Single GPU inference without quantization on Linux:
>
> python inference.py --ckpt_dir $CKPT_DIR --tokenizer_path $TOKENIZER_PATH

Hi, I think the problem is the amount of memory used if you don't apply quantization to the model.
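
If you want to see how much memory the run actually peaks at, one option (assuming GNU time is installed at /usr/bin/time, as on most distros) is to wrap the command and read the "Maximum resident set size" it prints when the process exits or is killed:

# report peak RAM usage after the run finishes or is killed
/usr/bin/time -v python -m llama.llama_quant decapoda-research/llama-7b-hf c4 --wbits 2 --save pyllama-7B2b.pt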

sskorol commented 1 year ago

I got the same while doing:

python3 -m llama.convert_llama --ckpt_dir $CKPT_DIR --tokenizer_path $TOKENIZER_PATH --model_size 65B --output_dir ./converted_meta_hf_65 --to hf --max_batch_size 4

[1]    16261 killed     python3 -m llama.convert_llama --ckpt_dir $CKPT_DIR --tokenizer_path   65B

sskorol commented 1 year ago

Ok, it seems like I figured it out: it tries to load the whole model into RAM. I currently have 64 GB (62.7) and had to allocate 70 GB of swap to make it work. 🤣

[Screenshot from 2023-04-09 14-37-06]

Also note that you should have enough disk space to do the conversion.
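
For reference, checking available RAM, swap, and disk space before starting the conversion only needs standard Linux tools (the output directory below is just the one from the earlier command):

# show total/used/free RAM and swap in human-readable units
free -h
# show free disk space on the filesystem holding the output directory
df -h ./converted_meta_hf_65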

javierp183 commented 1 year ago

> Ok, it seems like I figured it out: it tries to load the whole model into RAM. I currently have 64 GB (62.7) and had to allocate 70 GB of swap to make it work. 🤣
>
> [Screenshot from 2023-04-09 14-37-06]
>
> Also note that you should have enough disk space to do the conversion.

Could you tell me how? Thanks.

sskorol commented 1 year ago

How to increase swap? Just follow this guide.
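
For anyone who doesn't want to dig through a separate guide, a minimal sketch of adding a swap file on a typical Linux system looks roughly like this (the 70G size and /swapfile path are only examples; adjust them to your disk and model size):

# create and enable a 70 GB swap file (needs that much free disk space)
sudo fallocate -l 70G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# verify the new swap is active
free -h
# to keep it across reboots, append this line to /etc/fstab:
# /swapfile none swap sw 0 0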