huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Failing Causal Language Modeling with Expanded Mistral-7B Model (1.75B Trainable Parameters) #528

Open shamanez opened 6 months ago

shamanez commented 6 months ago

System Info

I used the official Optimum Neuron AMI for Trainium.

https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2

Who can help?

No response

Information

Tasks

Reproduction (minimal, reproducible, runnable)

I attempted to train a block-expanded Mistral model (LLaMA Pro style). I added a function that freezes all layers except the newly added blocks; a simplified sketch of it is included below the model details.

- Model name: https://huggingface.co/arcee-ai/Mistral-7B-Instruct-v0.2-expanded
- Modified clm.py script that freezes the layers: https://github.com/arcee-ai/arcee-trainium-recipes/blob/main/model_training/optimum_neurone_hf/block_expanded_clm.py#L438
- Parameters of the model: 8.99B
- Trainable parameters of the model: 1.75B
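
For context, here is a minimal sketch of the freezing logic (simplified; the real implementation is in the linked `block_expanded_clm.py`, and the indices of the newly added layers are hypothetical placeholders here):

```python
# Minimal sketch of the layer-freezing step, assuming a Mistral-style model whose
# decoder layers live under `model.model.layers`. The indices of the newly added
# blocks are placeholders; in the actual script they come from the expansion setup.
def freeze_all_but_expanded_blocks(model, expanded_layer_indices):
    # Freeze everything first.
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze only the newly inserted decoder layers.
    for idx in expanded_layer_indices:
        for param in model.model.layers[idx].parameters():
            param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable / 1e9:.2f}B / {total / 1e9:.2f}B")
```

With the expanded checkpoint above, this reports roughly 1.75B trainable parameters out of 8.99B total.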

Although compilation succeeded, training failed with the errors shown below.

It looks like an out-of-memory (OOM) error on the device.

Please help us with this.
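
For reference, these are the kinds of memory-reduction settings we could try next. This is only a sketch with placeholder values; `tensor_parallel_size` is assumed to be available on `NeuronTrainingArguments` per the optimum-neuron distributed-training docs, and the remaining arguments are standard `transformers.TrainingArguments` options.

```python
# Hedged sketch of memory-reduction knobs (placeholder values, not our exact config).
from optimum.neuron import NeuronTrainingArguments

training_args = NeuronTrainingArguments(
    output_dir="mistral_expanded_clm",
    per_device_train_batch_size=1,    # smallest micro-batch per NeuronCore
    gradient_accumulation_steps=16,   # recover the effective batch size
    gradient_checkpointing=True,      # trade recompute for activation memory
    bf16=True,                        # train in bfloat16
    tensor_parallel_size=8,           # shard the model across cores (assumed supported)
)
```

Any guidance on which of these knobs matters most for the TMPBUF limit in the log below would be appreciated.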

Expected behavior

```
0%|▎ | 1/498 [00:09<1:19:32, 9.60s/it]
2024-Mar-22 11:54:25.793636 2087184:2087514 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:25.794143 2087184:2087514 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:25.794154 2087184:2087514 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:25.794170 2087184:2087514 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:25.794181 2087184:2087514 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:25.794203 2087184:2087514 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-Mar-22 11:54:25.794228 2087184:2087514 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-Mar-22 11:54:25.794241 2087184:2087514 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-Mar-22 11:54:25.794252 2087184:2087514 ERROR NRT:nrt_infodump NRT version: 2.20.11.0 (b7d33e68b9902cf258ef1b9712517514fad25763)
2024-Mar-22 11:54:25.794278 2087184:2087514 ERROR NRT:nrt_infodump CCOM version: 2.20.11.0-c101c322e940b1 (compat 36)
2024-Mar-22 11:54:25.794333 2087184:2087514 ERROR NRT:nrt_infodump Instance ID: i-0ff3b9d837253a002
2024-Mar-22 11:54:25.794350 2087184:2087514 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-Mar-22 11:54:25.794362 2087184:2087514 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1053-aws #58~20.04.1-Ubuntu SMP Mon Jan 22 17:15:01 UTC 2024
2024-Mar-22 11:54:25.794372 2087184:2087514 ERROR NRT:nrt_infodump Nodename: ip-172-31-38-96
2024-Mar-22 11:54:25.794391 2087184:2087514 ERROR NRT:nrt_infodump Driver version: 2.15.9.0
2024-Mar-22 11:54:25.794401 2087184:2087514 ERROR NRT:nrt_infodump Environment:
2024-Mar-22 11:54:25.794414 2087184:2087514 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=172.31.38.96:62182
2024-Mar-22 11:54:25.794424 2087184:2087514 ERROR NRT:nrt_infodump NEURON_USE_LOAD_COLLECTIVES=1
2024-Mar-22 11:54:25.794434 2087184:2087514 ERROR NRT:nrt_infodump NEURON_GLOBAL_DEVICE_ID=0
2024-Mar-22 11:54:25.794444 2087184:2087514 ERROR NRT:nrt_infodump NEURON_GLOBAL_DEVICE_COUNT=32
2024-Mar-22 11:54:25.794461 2087184:2087514 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
2024-Mar-22 11:54:25.794473 2087184:2087514 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-Mar-22 11:54:25.798230 2087184:2087400 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:25.798254 2087184:2087400 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:25.798260 2087184:2087400 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:25.798268 2087184:2087400 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:25.798274 2087184:2087400 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:25.798283 2087184:2087400 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-Mar-22 11:54:25.793237 2087184:2087570 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:25.792699 2087184:2087458 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:26.142121 2087184:2087570 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:26.142711 2087184:2087458 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:26.149471 2087184:2087570 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:26.156897 2087184:2087458 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:26.163847 2087184:2087570 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:26.170944 2087184:2087458 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:26.177807 2087184:2087570 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:26.184848 2087184:2087458 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:26.192555 2087184:2087570 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-Mar-22 11:54:26.248905 2087184:2087458 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-03-22 11:54:26.269194: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.275794: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.280751: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.282478: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.318620: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.326587: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.329967: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.335937: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498211: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2024-03-22 11:54:26.498257: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Begin stack trace
2024-03-22 11:54:26.498261: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2024-03-22 11:54:26.498264: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const const>, absl::lts_20220623::Span<xla::Shape const const>)
2024-03-22 11:54:26.498268: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2024-03-22 11:54:26.498272: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498275: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2024-03-22 11:54:26.498278: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498281: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498284: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498286: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2024-03-22 11:54:26.498289: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] End stack trace
2024-03-22 11:54:26.498292: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498295: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2024-03-22 11:54:26.498298: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2024-03-22 11:54:26.498301: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498304: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2024-03-22 11:54:26.498307: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G15]]
2024-03-22 11:54:26.498310: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498313: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2024-03-22 11:54:26.498316: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2024-03-22 11:54:26.498319: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2024-03-22 11:54:26.498322: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2024-03-22 11:54:26.498325: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498329: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498333: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498338: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
Traceback (most recent call last):
  File "examples/language-modeling/run_clm.py", line 709, in <module>
    main()
  File "examples/language-modeling/run_clm.py", line 657, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/trainers.py", line 1367, in train
    result = super().train(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/trainers.py", line 1028, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/trainers.py", line 417, in _maybe_log_save_evaluate
    tr_loss_scalar = tr_loss_scalar.detach().item()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: stream did not block host until done; was already in an error state
    [[{{node XRTExecute}}]]
    [[XRTExecute_G15]]
  (1) INTERNAL: stream did not block host until done; was already in an error state
    [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
  Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.514293: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2024-03-22 11:54:26.514338: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Begin stack trace
2024-03-22 11:54:26.514342: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2024-03-22 11:54:26.514346: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const const>, absl::lts_20220623::Span<xla::Shape const const>)
2024-03-22 11:54:26.514350: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2024-03-22 11:54:26.514353: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514356: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2024-03-22 11:54:26.514359: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514362: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514365: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514368: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2024-03-22 11:54:26.514371: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] End stack trace
```

dacorvo commented 6 months ago

Gently tagging @michaelbenayoun