Hi @ashim95, what are your system details? (# GPUs, GPU type, GPU memory)
To eval a 30B model in FP32, you will need at least 120GB of total memory across your GPUs just to store the weights.
You can reduce this requirement to 60GB with the following edits, though the numerics will be slightly different (I believe still safe for the LLaMa models):
```yaml
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
precision: amp_fp16  # since LLaMa config.json reports `torch_dtype: float16`
```
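For intuition, those numbers are just bytes-per-parameter arithmetic (a back-of-the-envelope sketch; the 30B count is approximate):

```python
# Rough weight-memory footprint for a ~30B-parameter model.
n_params = 30e9  # llama-30b is ~32.5B parameters; 30B is close enough here

print(f"FP32 weights: {n_params * 4 / 1e9:.0f} GB")  # 4 bytes/param -> ~120 GB
print(f"FP16 weights: {n_params * 2 / 1e9:.0f} GB")  # 2 bytes/param -> ~60 GB
```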
On top of this, there will be additional activation memory, which can be reduced via `eval_batch_size` and by reducing `max_seq_len`. Personally, I think `max_seq_len: 2048` is a bit overkill since the samples in `boolq` are not that big -- you are basically going to be padding up to 2048 for each sample. If you know the characteristics of your eval dataset (e.g. if you know all samples are < 512 tokens), you can safely reduce `max_seq_len` -- see the sketch below.
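For example, a quick way to verify that is to tokenize the eval file and look at the longest sample (a rough sketch; the `query`/`choices` field names are an assumption -- adjust them to whatever `boolq.jsonl` actually contains):

```python
import json
from transformers import AutoTokenizer

# Find the longest tokenized sample to pick a safe max_seq_len.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")

max_len = 0
with open("eval/local_data/boolq.jsonl") as f:
    for line in f:
        sample = json.loads(line)  # assumed fields: 'query' and 'choices'
        text = sample["query"] + max(sample["choices"], key=len)
        max_len = max(max_len, len(tokenizer(text)["input_ids"]))

print(f"Longest sample: {max_len} tokens")  # if < 512, max_seq_len: 512 is safe
```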
Thank you @abhi-mosaic for responding.
I have access to a node with 8 A100s, each with 80GB of GPU memory. I quickly ran tests with the following configurations on `llama-30b`:
| FSDP Config, Precision | Seq Length | Batch Size | Does it run? |
|---|---|---|---|
| FULL, fp32 | 2048 | 1 | no (OOM) |
| FULL, fp32 | 512 | 1 | no (OOM) |
| PURE, amp_bf16 | 2048 | 1 | no (OOM) |
| PURE, amp_bf16 | 512 | 1 | no (OOM) |
| PURE, amp_bf16 | 256 | 1 | no (OOM) |
| PURE, amp_bf16 | 128 | 1 | no (OOM) |
| PURE, amp_bf16 | 64 | 1 | no (OOM) |
| PURE, amp_bf16 | 2 | 1 | no (OOM) |
Also, as an aside, I tried using the lm-evaluation-harness toolkit to evaluate `llama-30b`, and there I was able to run inference on a single A100 80GB GPU (although the problem with their repo is that results are generally worse: with this model I get 86.39% accuracy, but the smaller llama variants come out much worse than reported).
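For reference, a single-GPU run with that harness looks roughly like this (a sketch of the older `main.py` CLI; exact flags vary by harness version):

```bash
python main.py \
  --model hf-causal \
  --model_args pretrained=huggyllama/llama-30b \
  --tasks boolq \
  --device cuda:0
```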
Let me know if there's anything else I should try.
Thanks,
Hi @ashim95, thank you for the table of results. This is unexpected! I am able to run eval on `huggyllama/llama-30b` on 8x A100-80GB using the following command and parameters:
```bash
$ composer eval/eval.py parameters.yaml
```
```yaml
icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: '
max_seq_len: 2048
model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: huggyllama/llama-30b
tokenizer:
  kwargs:
    model_max_length: 2048
  name: huggyllama/llama-30b
device_eval_batch_size: 8
fsdp_config:
  mixed_precision: FULL
  sharding_strategy: FULL_SHARD
precision: amp_fp16
seed: 1
```
Could you confirm whether you are prepending your command with `composer` rather than `python`? The former is a launcher script that will run N processes on your system for N GPUs, which is necessary to use all GPUs and enable FSDP sharding; the latter will just run a single process on a single GPU.
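Concretely, using the same files as above:

```bash
# Single process on a single GPU -- the 30B weights cannot fit, hence the OOMs:
python eval/eval.py parameters.yaml

# One process per visible GPU, letting FSDP shard the model across all 8
# (composer -n 8 ... should also work if you want to pin the process count):
composer eval/eval.py parameters.yaml
```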
@abhi-mosaic Thanks for responding.
Using `composer` did the trick for the full-precision model. For anyone running it in the future, on `boolq` in 0-shot I get the following accuracy:
```
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.8383252024650574
```
Now I can't seem to run it with half precision, though. I have the following config file:
```yaml
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-30b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 8
precision: amp_fp16

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
```
I ran it with the command `composer eval/eval.py eval/yamls/llama_30b_eval_new_half_precision.yaml` on 8 A100s with 80GB each, and I get the following error:
```
{'max_seq_len': 2048, 'seed': 1, 'model_name_or_path': 'huggyllama/llama-30b', 'tokenizer': {'name': '${model_name_or_path}', 'kwargs': {'model_max_length': '${max_seq_len}'}}, 'model': {'name': 'hf_causal_lm', 'pretrained_model_name_or_path': '${model_name_or_path}', 'init_device': 'cpu', 'pretrained': True}, 'load_path': None, 'device_eval_batch_size': 8, 'precision': 'amp_fp16', 'fsdp_config': {'sharding_strategy': 'FULL_SHARD', 'mixed_precision': 'FULL'}, 'icl_tasks': [{'label': 'boolq', 'dataset_uri': 'eval/local_data/boolq.jsonl', 'num_fewshot': [0], 'icl_task_type': 'multiple_choice', 'continuation_delimiter': 'Answer: '}]}
ERROR:composer.cli.launcher:Rank 5 crashed with exit code -9.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 115898) exited with code 143
Global rank 1 (PID 115899) exited with code 143
----------Begin global rank 1 STDOUT----------
[same config dump as above]
----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
----------End global rank 1 STDERR----------
[identical STDOUT/STDERR blocks for global ranks 2, 3, 4, 6, and 7, all exited with code 143]
ERROR:composer.cli.launcher:Global rank 0 (PID 115898) exited with code 143
```
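(Side note on the exit codes: -9 is a SIGKILL, and 143 is the SIGTERM the launcher sends the surviving ranks during cleanup. A SIGKILL while 8 ranks each materialize the full checkpoint with `init_device: cpu` is often the host's OOM killer running out of CPU RAM rather than a GPU OOM -- that is an inference on my part, not something confirmed here. A quick check on the host after the crash:)

```bash
# Look for evidence that the kernel OOM killer terminated a rank.
dmesg | grep -i -E 'killed process|out of memory'
```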
Is there anything obvious I am doing wrong?
Thanks,
Great to hear that the `precision: fp32` run worked!
I can't see anything obviously wrong with your YAML, but here is one that is working for me with both torch 1.13 and torch 2:
```yaml
icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: '
max_seq_len: 2048
model_name_or_path: huggyllama/llama-30b
model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}
tokenizer:
  kwargs:
    model_max_length: ${max_seq_len}
  name: ${model_name_or_path}
device_eval_batch_size: 8
fsdp_config:
  mixed_precision: FULL
  sharding_strategy: FULL_SHARD
precision: amp_fp16
seed: 1
```
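Saved as, say, `eval/yamls/llama_30b_eval_fp16.yaml` (the filename is arbitrary), it launches the same way as before:

```bash
composer eval/eval.py eval/yamls/llama_30b_eval_fp16.yaml
```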
Closing as stale
❓ Question
Hi,
I am trying to run zero-shot evaluation for the 30-billion-parameter `llama-30b`. Even for a `batch_size = 1`, I am getting a `torch.cuda.OutOfMemoryError: CUDA out of memory`. Following is my config file:
One alternative is to reduce the maximum sequence length, but I don't want to do that, since all the smaller models are run with 2048.
Does the toolkit support multi-gpu inference? Perhaps I need to change something with the FSDP config?
Thank you,