huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Failing Causal Language Modeling with Expanded Mistral-7B Model (1.75B Trainable Parameters) #528

Open shamanez opened 6 months ago

shamanez commented 6 months ago

System Info

I used the official Optimum Neuron AMI for Trainium.

https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2

Who can help?

No response

Information

Tasks

Reproduction (minimal, reproducible, runnable)

I attempted to train a block-expanded Mistral model (LLaMA Pro style). I added a function that freezes all layers except the newly added blocks; a simplified sketch of it is included below the model details.

- Model name: https://huggingface.co/arcee-ai/Mistral-7B-Instruct-v0.2-expanded
- Modified clm.py script that freezes the layers: https://github.com/arcee-ai/arcee-trainium-recipes/blob/main/model_training/optimum_neurone_hf/block_expanded_clm.py#L438
- Parameters of the model: 8.99B
- Trainable parameters of the model: 1.75B
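
For context, here is a minimal sketch of the freezing logic (simplified; the real implementation is in the linked `block_expanded_clm.py`, and the indices of the newly added layers are hypothetical placeholders here):

```python
# Minimal sketch of the layer-freezing step, assuming a Mistral-style model whose
# decoder layers live under `model.model.layers`. The indices of the newly added
# blocks are placeholders; in the actual script they come from the expansion setup.
def freeze_all_but_expanded_blocks(model, expanded_layer_indices):
    # Freeze everything first.
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze only the newly inserted decoder layers.
    for idx in expanded_layer_indices:
        for param in model.model.layers[idx].parameters():
            param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable / 1e9:.2f}B / {total / 1e9:.2f}B")
```

With the expanded checkpoint above, this reports roughly 1.75B trainable parameters out of 8.99B total.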

Although compilation succeeded, training failed with the errors shown below.

It looks like an out-of-memory (OOM) error on the device.

Please help us with this.
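
For reference, these are the kinds of memory-reduction settings we could try next. This is only a sketch with placeholder values; `tensor_parallel_size` is assumed to be available on `NeuronTrainingArguments` per the optimum-neuron distributed-training docs, and the remaining arguments are standard `transformers.TrainingArguments` options.

```python
# Hedged sketch of memory-reduction knobs (placeholder values, not our exact config).
from optimum.neuron import NeuronTrainingArguments

training_args = NeuronTrainingArguments(
    output_dir="mistral_expanded_clm",
    per_device_train_batch_size=1,    # smallest micro-batch per NeuronCore
    gradient_accumulation_steps=16,   # recover the effective batch size
    gradient_checkpointing=True,      # trade recompute for activation memory
    bf16=True,                        # train in bfloat16
    tensor_parallel_size=8,           # shard the model across cores (assumed supported)
)
```

Any guidance on which of these knobs matters most for the TMPBUF limit in the log below would be appreciated.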

Expected behavior

```
0%|▎ | 1/498 [00:09<1:19:32, 9.60s/it]
2024-Mar-22 11:54:25.793636 2087184:2087514 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:25.794143 2087184:2087514 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:25.794154 2087184:2087514 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:25.794170 2087184:2087514 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:25.794181 2087184:2087514 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:25.794203 2087184:2087514 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-Mar-22 11:54:25.794228 2087184:2087514 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-Mar-22 11:54:25.794241 2087184:2087514 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-Mar-22 11:54:25.794252 2087184:2087514 ERROR NRT:nrt_infodump NRT version: 2.20.11.0 (b7d33e68b9902cf258ef1b9712517514fad25763)
2024-Mar-22 11:54:25.794278 2087184:2087514 ERROR NRT:nrt_infodump CCOM version: 2.20.11.0-c101c322e940b1 (compat 36)
2024-Mar-22 11:54:25.794333 2087184:2087514 ERROR NRT:nrt_infodump Instance ID: i-0ff3b9d837253a002
2024-Mar-22 11:54:25.794350 2087184:2087514 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-Mar-22 11:54:25.794362 2087184:2087514 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1053-aws #58~20.04.1-Ubuntu SMP Mon Jan 22 17:15:01 UTC 2024
2024-Mar-22 11:54:25.794372 2087184:2087514 ERROR NRT:nrt_infodump Nodename: ip-172-31-38-96
2024-Mar-22 11:54:25.794391 2087184:2087514 ERROR NRT:nrt_infodump Driver version: 2.15.9.0
2024-Mar-22 11:54:25.794401 2087184:2087514 ERROR NRT:nrt_infodump Environment:
2024-Mar-22 11:54:25.794414 2087184:2087514 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=172.31.38.96:62182
2024-Mar-22 11:54:25.794424 2087184:2087514 ERROR NRT:nrt_infodump NEURON_USE_LOAD_COLLECTIVES=1
2024-Mar-22 11:54:25.794434 2087184:2087514 ERROR NRT:nrt_infodump NEURON_GLOBAL_DEVICE_ID=0
2024-Mar-22 11:54:25.794444 2087184:2087514 ERROR NRT:nrt_infodump NEURON_GLOBAL_DEVICE_COUNT=32
2024-Mar-22 11:54:25.794461 2087184:2087514 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
2024-Mar-22 11:54:25.794473 2087184:2087514 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-Mar-22 11:54:25.798230 2087184:2087400 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:25.798254 2087184:2087400 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:25.798260 2087184:2087400 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:25.798268 2087184:2087400 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:25.798274 2087184:2087400 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:25.798283 2087184:2087400 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-Mar-22 11:54:25.793237 2087184:2087570 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:25.792699 2087184:2087458 ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 18, max allowed: 16).
2024-Mar-22 11:54:26.142121 2087184:2087570 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:26.142711 2087184:2087458 ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
2024-Mar-22 11:54:26.149471 2087184:2087570 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:26.156897 2087184:2087458 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2024-Mar-22 11:54:26.163847 2087184:2087570 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:26.170944 2087184:2087458 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2024-Mar-22 11:54:26.177807 2087184:2087570 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:26.184848 2087184:2087458 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-0.json to NeuronCore
2024-Mar-22 11:54:26.192555 2087184:2087570 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-Mar-22 11:54:26.248905 2087184:2087458 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/f566ca79-9797-4099-8fc1-57840de6225b/model.MODULE_5934394564483856852+d41d8cd9.neff, err: 4
2024-03-22 11:54:26.269194: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.275794: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.280751: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.282478: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.318620: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.326587: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.329967: W tensorflow/compiler/xla/stream_executor/stream.cc:262] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.335937: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498211: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2024-03-22 11:54:26.498257: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Begin stack trace
2024-03-22 11:54:26.498261: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2024-03-22 11:54:26.498264: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const const>, absl::lts_20220623::Span<xla::Shape const const>)
2024-03-22 11:54:26.498268: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2024-03-22 11:54:26.498272: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498275: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2024-03-22 11:54:26.498278: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498281: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498284: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498286: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2024-03-22 11:54:26.498289: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] End stack trace
2024-03-22 11:54:26.498292: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.498295: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2024-03-22 11:54:26.498298: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2024-03-22 11:54:26.498301: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498304: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2024-03-22 11:54:26.498307: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G15]]
2024-03-22 11:54:26.498310: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498313: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2024-03-22 11:54:26.498316: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2024-03-22 11:54:26.498319: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2024-03-22 11:54:26.498322: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2024-03-22 11:54:26.498325: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498329: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498333: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.498338: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
Traceback (most recent call last):
  File "examples/language-modeling/run_clm.py", line 709, in <module>
    main()
  File "examples/language-modeling/run_clm.py", line 657, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/trainers.py", line 1367, in train
    result = super().train(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/trainers.py", line 1028, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum_neuron-0.0.20.dev0-py3.8.egg/optimum/neuron/trainers.py", line 417, in _maybe_log_save_evaluate
    tr_loss_scalar = tr_loss_scalar.detach().item()
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: stream did not block host until done; was already in an error state
    [[{{node XRTExecute}}]]
    [[XRTExecute_G15]]
  (1) INTERNAL: stream did not block host until done; was already in an error state
    [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
  Error blocking host until done in stream destructor: INTERNAL: stream did not block host until done; was already in an error state
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: stream did not block host until done; was already in an error state
2024-03-22 11:54:26.514293: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2024-03-22 11:54:26.514338: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Begin stack trace
2024-03-22 11:54:26.514342: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2024-03-22 11:54:26.514346: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const const>, absl::lts_20220623::Span<xla::Shape const const>)
2024-03-22 11:54:26.514350: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2024-03-22 11:54:26.514353: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514356: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2024-03-22 11:54:26.514359: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514362: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514365: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2024-03-22 11:54:26.514368: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2024-03-22 11:54:26.514371: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] End stack trace
```

dacorvo commented 6 months ago

Gently tagging @michaelbenayoun