philschmid / huggingface-inferentia2-samples


Recompilation of Llama 3 70B: Estimated peak HBM usage (19.000523) exceeds 16GB. Neff won't be able to load on chip #12

Open jonathangull opened 5 months ago

jonathangull commented 5 months ago

@philschmid

Big fan of your work.

Referring to https://www.philschmid.de/inferentia2-llama3-70b

Trying to recompile Llama 3 70B with different parameters on an Inferentia inf2.48xlarge machine.

Most of the process went well until I hit:

2024-06-01 03:27:57.000941: 7571 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/root/neuroncc_compile_workdir/476cf8e5-17ac-4b1e-9e9c-05e34e9ddda2/model.MODULE_9450abe5705d16fa70fb+2c2d707e.hlo_module.pb', '--output', '/tmp/root/neuroncc_compile_workdir/476cf8e5-17ac-4b1e-9e9c-05e34e9ddda2/model.MODULE_9450abe5705d16fa70fb+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-06-01T03:27:57Z [XCG815] Estimated peak HBM usage (19.000523) exceeds 16GB. Neff won't be able to load on chip - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new

2024-06-01 03:27:57.000942: 7571 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/root/neuroncc_compile_workdir/476cf8e5-17ac-4b1e-9e9c-05e34e9ddda2/model.MODULE_9450abe5705d16fa70fb+2c2d707e.hlo_module.pb after 0 retries.

(aws_neuron_venv_pytorch) root@ip-172-31-41-35:/home/ubuntu# ps -ef | grep 14689
root 14689 3248 7 07:48 pts/1 00:00:15 /home/ubuntu/aws_neuron_venv_pytorch/bin/python /home/ubuntu/aws_neuron_venv_pytorch/bin/optimum-cli export neuron --task text-generation --model meta-llama/Meta-Llama-3-70B-Instruct --batch_size 8 --dynamic-batch-size --sequence_length 1200 --auto_cast_type bf16 --num_cores 12 llama3_neuron_summary/
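For context, a rough back-of-envelope (assuming bf16 weights at 2 bytes per parameter and the ~16 GB per-NeuronCore HBM limit the error message cites):

    70B parameters × 2 bytes ≈ 140 GB of weights
    140 GB ÷ 12 cores ≈ 11.7 GB per core for weights alone, before the KV cache for batch_size 8 × sequence_length 1200
    140 GB ÷ 24 cores ≈ 5.8 GB per core, leaving headroom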

Considering that the inf2.48xlarge has 12 cores, and my application needs a shorter sequence length but a bigger batch size of 8, I am trying to recompile, but:

  1. Since AWS's biggest Inferentia instance is the inf2.48xlarge and my recompilation is failing there, I don't have a way to do it and I am stuck.

  2. Did you ever recompile Llama 3 70B? If so, on which host did you do it?

  3. General question: the already-compiled Neuron cache configuration shows,

    https://huggingface.co/aws-neuron/optimum-neuron-cache/blob/main/inference-cache-config/llama3.json

    "meta-llama/Meta-Llama-3-70B": [
      { "batch_size": 1, "sequence_length": 4096, "num_cores": 24, "auto_cast_type": "fp16" },
      { "batch_size": 4, "sequence_length": 4096, "num_cores": 24, "auto_cast_type": "fp16" }
    ]

    num_cores is 24, but the inf2.48xlarge has only 12 cores, so how come it's compiled to work for 24? (A quick way to check the core count follows after this list.)

    Please unblock me by suggesting how I can recompile Llama 3 70B on AWS Inferentia instances.
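On the core-count question: a quick way to see how many NeuronCores an instance actually exposes is the Neuron SDK's neuron-ls tool. A minimal sketch, assuming the Neuron tools are installed and on PATH (as on the Neuron DLAMI):

    # List the Inferentia devices on this host along with their NeuronCore counts
    neuron-ls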

philschmid commented 5 months ago

What instance and AMI are you using to compile the model?

> num_cores is 24, but the inf2.48xlarge has only 12 cores, so how come it's compiled to work for 24?

1 Inferentia device == 2 cores
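In other words, the inf2.48xlarge's 12 Inferentia2 devices expose 12 × 2 = 24 NeuronCores, so compilation can shard the model across all 24 cores and roughly halve the per-core HBM footprint. A minimal sketch of the failing command from above with --num_cores 24 (every other flag taken unchanged from that command; verify the flag spellings against your optimum-neuron version):

    optimum-cli export neuron \
      --task text-generation \
      --model meta-llama/Meta-Llama-3-70B-Instruct \
      --batch_size 8 \
      --dynamic-batch-size \
      --sequence_length 1200 \
      --auto_cast_type bf16 \
      --num_cores 24 \
      llama3_neuron_summary/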