mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Multi GPU inference #293

Closed ashim95 closed 1 year ago

ashim95 commented 1 year ago

❓ Question

Hi,

I am trying to run zero-shot evaluation for the 30-billion-parameter llama-30b. Even with batch_size = 1, I get torch.cuda.OutOfMemoryError: CUDA out of memory. Here is my config file:

max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-30b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 1
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers

One alternative is to reduce the maximum sequence length, but I would rather not do that, since all the smaller models were evaluated with 2048.

Does the toolkit support multi-GPU inference? Perhaps I need to change something in the FSDP config?

Thank you,

abhi-mosaic commented 1 year ago

Hi @ashim95 , what are your system details? (# GPUs, GPU type, GPU memory)

To eval a 30B model in FP32, you will need at least 120GB of total memory across your GPUs just to store the weights.
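As a back-of-the-envelope check, that figure is just parameter-count arithmetic (illustrative only, using a round 30B count):

n_params = 30e9  # approximate parameter count of llama-30b
print(f"fp32 weights: ~{n_params * 4 / 1e9:.0f} GB")  # ~120 GB
print(f"fp16 weights: ~{n_params * 2 / 1e9:.0f} GB")  # ~60 GB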

You can reduce this requirement to 60GB with the following edits; the numerics will be slightly different, but I believe still safe for the LLaMA models:

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

precision: amp_fp16 # since LLaMa config.json reports `torch_dtype: float16`
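For context, mixed_precision: FULL keeps parameters, gradient reductions, and buffers in fp32, while PURE casts all of them to the low-precision dtype. In plain torch FSDP terms the two policies look roughly like this (a sketch of the mapping as I understand it, not llm-foundry's exact code):

import torch
from torch.distributed.fsdp import MixedPrecision

# 'FULL': everything stays in fp32 (unchanged numerics, 4 bytes per weight)
full = MixedPrecision(param_dtype=torch.float32,
                      reduce_dtype=torch.float32,
                      buffer_dtype=torch.float32)

# 'PURE' with precision: amp_fp16: everything in fp16 (2 bytes per weight)
pure_fp16 = MixedPrecision(param_dtype=torch.float16,
                           reduce_dtype=torch.float16,
                           buffer_dtype=torch.float16)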

On top of this, there will be additional activation memory, which you can reduce via device_eval_batch_size and a smaller max_seq_len. Personally, I think max_seq_len: 2048 is overkill since the samples in boolq are not that long -- you are basically padding every sample up to 2048. If you know the characteristics of your eval dataset (e.g. that all samples are < 512 tokens), you can safely reduce max_seq_len.
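If you want to verify that before lowering max_seq_len, a quick check along these lines works (a rough sketch; the 'query' and 'choices' field names are guesses, so adjust to the actual schema of your JSONL file):

import json
from transformers import AutoTokenizer

# measure tokenized lengths of the eval samples (field names are assumptions)
tok = AutoTokenizer.from_pretrained("huggyllama/llama-30b")
lengths = []
with open("eval/local_data/boolq.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        text = ex.get("query", "") + max(ex.get("choices", [""]), key=len)
        lengths.append(len(tok(text)["input_ids"]))

lengths.sort()
print("max:", lengths[-1], "p99:", lengths[int(0.99 * len(lengths))])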

ashim95 commented 1 year ago

Thank you @abhi-mosaic for responding.

I have access to a node with 8 A100s, each with 80GB of GPU memory. I quickly ran tests on llama-30b with the following configurations:

| FSDP config, precision | Seq length | Batch size | Does it run? |
|---|---|---|---|
| FULL, fp32 | 2048 | 1 | no (OOM) |
| FULL, fp32 | 512 | 1 | no (OOM) |
| PURE, amp_bf16 | 2048 | 1 | no (OOM) |
| PURE, amp_bf16 | 512 | 1 | no (OOM) |
| PURE, amp_bf16 | 256 | 1 | no (OOM) |
| PURE, amp_bf16 | 128 | 1 | no (OOM) |
| PURE, amp_bf16 | 64 | 1 | no (OOM) |
| PURE, amp_bf16 | 2 | 1 | no (OOM) |

Also, as an aside, I tried using the lm-evaluation-harness toolkit to evaluate llama-30b, and I was able to run inference on a single A100 80GB GPU (although the problem with their repo is that results are generally worse: with this model I get 86.39% accuracy, but for the smaller llama variants the results are much worse than reported).

Let me know if there's anything else I should try.

Thanks,

abhi-mosaic commented 1 year ago

Hi @ashim95 , thank you for the table of results. This is unexpected! I am able to run eval on huggyllama/llama-30b on 8x A100-80GB using the following command and parameters:

$ composer eval/eval.py parameters.yaml
icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: '

max_seq_len: 2048

model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: huggyllama/llama-30b

tokenizer:
  kwargs:
    model_max_length: 2048
  name: huggyllama/llama-30b

device_eval_batch_size: 8

fsdp_config:
  mixed_precision: FULL
  sharding_strategy: FULL_SHARD

precision: amp_fp16

seed: 1

Could you confirm whether you are prepending your command with composer rather than python? The former is a launcher script that will run N processes on your system for N GPUs. This is necessary to use all GPUs and enable FSDP sharding. The latter will just run single-process on a single GPU.

ashim95 commented 1 year ago

@abhi-mosaic Thanks for responding.

Using composer did the trick for the full-precision model. For anyone running this in the future, on boolq 0-shot I get the following accuracy:

metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.8383252024650574

Now I can't seem to run it with half precision, though. I have the following config file:

max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-30b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 8
precision: amp_fp16

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers

I ran it with the command: composer eval/eval.py eval/yamls/llama_30b_eval_new_half_precision.yaml on 8 A100s with 80GB. I get the following error:

{'max_seq_len': 2048, 'seed': 1, 'model_name_or_path': 'huggyllama/llama-30b', 'tokenizer': {'name': '${model_name_or_path}', 'kwargs': {'model_max_length': '${max_seq_len}'}}, 'model': {'name': 'hf_causal_lm', 'pretrained_model_name_or_path': '${model_name_or_path}', 'init_device': 'cpu', 'pretrained': True}, 'load_path': None, 'device_eval_batch_size': 8, 'precision': 'amp_fp16', 'fsdp_config': {'sharding_strategy': 'FULL_SHARD', 'mixed_precision': 'FULL'}, 'icl_tasks': [{'label': 'boolq', 'dataset_uri': 'eval/local_data/boolq.jsonl', 'num_fewshot': [0], 'icl_task_type': 'multiple_choice', 'continuation_delimiter': 'Answer: '}]}
ERROR:composer.cli.launcher:Rank 5 crashed with exit code -9.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 115898) exited with code 143
Global rank 1 (PID 115899) exited with code 143
Global rank 2 (PID 115900) exited with code 143
Global rank 3 (PID 115901) exited with code 143
Global rank 4 (PID 115902) exited with code 143
Global rank 6 (PID 115904) exited with code 143
Global rank 7 (PID 115905) exited with code 143
ERROR:composer.cli.launcher:Global rank 0 (PID 115898) exited with code 143

Is there anything obvious I am doing wrong?

Thanks,

abhi-mosaic commented 1 year ago

Great to hear that the precision: fp32 run worked!

I can't see anything obviously wrong with your YAML, but here is one that works for me with both torch 1.13 and torch 2:

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: '

max_seq_len: 2048
model_name_or_path: huggyllama/llama-30b

model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}

tokenizer:
  kwargs:
    model_max_length: ${max_seq_len}
  name: ${model_name_or_path}

device_eval_batch_size: 8

fsdp_config:
  mixed_precision: FULL
  sharding_strategy: FULL_SHARD

precision: amp_fp16

seed: 1

abhi-mosaic commented 1 year ago

Closing as stale