huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

"List index out of range" error when using BART-base summarization model #19

Closed: njbrake closed this issue 6 months ago

njbrake commented 1 year ago

Hi,

When running training of BART-base on a trn1.2xlarge using the command

 torchrun --nproc_per_node=2 run_summarization.py

I receive the compilation error below. Do you have any tips on how to understand what is going wrong? From the output I can't tell whether the error is on the optimum-neuron side or the neuron-sdk (neuronx-cc) side.


.......03/17/2023 01:52:37 PM ERROR 3541 [Tensorizer]: Transformation error on operator: _scatter.4
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: ***************************************************************
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:  An Internal Compiler Error has occurred
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: ***************************************************************
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: 
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Error message:  list index out of range
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: 
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Error class:    IndexError
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Error location: Unknown
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Command line:   /opt/aws_neuron_venv_pytorch/bin/neuronx-cc --target=trn1 compile --framework XLA /tmp/MODULE_1_SyncTensorsGraph.30344_9843381109720227859.hlo.pb --output /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_9843381109720227859/MODULE_1_SyncTensorsGraph.30344_9843381109720227859/87416364-87e0-42b6-b9e8-ee59db979679/MODULE_1_SyncTensorsGraph.30344_9843381109720227859.neff --verbose=35
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: 
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Internal details:
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/CommandDriver.py", line 235, in neuronxcc.driver.CommandDriver.CommandDriver.run
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1014, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 965, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 990, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 994, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 591, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 387, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 168, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 243, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 244, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 266, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 183, in neuronxcc.starfish.penguin.Compile.compile_cu
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 185, in neuronxcc.starfish.penguin.Compile.compile_cu
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 222, in neuronxcc.starfish.penguin.Compile.compile_cu
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 432, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformFunction
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 440, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformFunction
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 142, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runOnFunction
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 196, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 178, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 208, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 210, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 211, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 240, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 241, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 334, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 329, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runTransforms
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 318, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmts
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 366, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 369, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 356, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmt
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/sunda/passes/SundaISel.py", line 549, in neuronxcc.starfish.penguin.targets.sunda.passes.SundaISel.SundaISel.transformTGenericAtomicRMWOperator
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 174, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodegenBase.codegenMacro
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 200, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodegenBase.codegenInsts
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 225, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodeGen.codegenInst
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 214, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodeGen.dispatch_codegen
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/sunda/passes/SundaISel.py", line 408, in neuronxcc.starfish.penguin.targets.sunda.passes.SundaISel.ScatterCodegen.codegenGenericAtomicRMW
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 321, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodeGen.codegenGenericAccess
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: 
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Version information:
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   NeuronX Compiler version 2.4.0.21+b7621be18
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   HWM version 2.4.0.1-90172456c
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   NEFF version Dynamic
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   TVM not available
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   NumPy version 1.20.0
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]:   MXNet not available
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: 
03/17/2023 01:52:37 PM ERROR 3541 [neuronx-cc]: Artifacts stored in: /home/ssm-user/scm/scribe_summarization_huggingface/data/neuronxcc-rcxp7i5p

Compiler status ERROR
2023-03-17 13:52:38.000614: ERROR ||NCC_WRAPPER||: There was a compilation error for /tmp/MODULE_1_SyncTensorsGraph.30344_9843381109720227859.hlo.pb graph. Returning with an errored graph
2023-03-17 13:52:39.090977: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-03-17 13:52:39.099414: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
  0%|                                                                                                                                                                                              | 2/10068 [02:19<228:11:10, 81.61s/it]2023-03-17 13:52:39.262232: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-17 13:52:39.262272: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-17 13:52:39.262283: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        tsl::CurrentStackTrace()
2023-03-17 13:52:39.262293: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-17 13:52:39.262300: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-17 13:52:39.262322: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.262331: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-03-17 13:52:39.262337: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.262345: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.262354: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.262362: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        clone
2023-03-17 13:52:39.262370: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-17 13:52:39.262378: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.262386: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-03-17 13:52:39.262395: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-17 13:52:39.262403: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INTERNAL: neuronx-cc compilation failed.
2023-03-17 13:52:39.262411: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[{{node XRTExecute}}]]
2023-03-17 13:52:39.262419: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[XRTExecute_G15]]
2023-03-17 13:52:39.262427: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INTERNAL: neuronx-cc compilation failed.
2023-03-17 13:52:39.262436: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[{{node XRTExecute}}]]
2023-03-17 13:52:39.262444: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-17 13:52:39.262452: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-17 13:52:39.262460: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-17 13:52:39.262469: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
  File "/home/ssm-user/scm/scribe_summarization_huggingface/src/hf_summ/train.py", line 269, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/trainer.py", line 1765, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 951, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
         [[{{node XRTExecute}}]]
         [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
         [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../src/run_summarization.py", line 15, in <module>
    train()
  File "/home/ssm-user/scm/scribe_summarization_huggingface/src/hf_summ/train.py", line 374, in train
    return results
UnboundLocalError: local variable 'results' referenced before assignment
2023-03-17 13:52:39.277978: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-17 13:52:39.278010: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-17 13:52:39.278020: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        tsl::CurrentStackTrace()
2023-03-17 13:52:39.278030: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-17 13:52:39.278040: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-17 13:52:39.278046: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.278055: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-03-17 13:52:39.278064: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.278072: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.278081: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.278090: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        clone
2023-03-17 13:52:39.278098: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-17 13:52:39.278106: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 13:52:39.278116: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-03-17 13:52:39.278125: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-17 13:52:39.278134: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INTERNAL: neuronx-cc compilation failed.
2023-03-17 13:52:39.278142: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[{{node XRTExecute}}]]
2023-03-17 13:52:39.278151: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[XRTExecute_G15]]
2023-03-17 13:52:39.278160: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INTERNAL: neuronx-cc compilation failed.
2023-03-17 13:52:39.278169: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[{{node XRTExecute}}]]
2023-03-17 13:52:39.278177: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-17 13:52:39.278186: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-17 13:52:39.278195: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-17 13:52:39.278204: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-03-17 13:52:39.278213: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
  File "/home/ssm-user/scm/scribe_summarization_huggingface/src/hf_summ/train.py", line 269, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/trainer.py", line 1765, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 951, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
         [[{{node XRTExecute}}]]
         [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
         [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../src/run_summarization.py", line 15, in <module>
    train()
  File "/home/ssm-user/scm/scribe_summarization_huggingface/src/hf_summ/train.py", line 374, in train
    return results
UnboundLocalError: local variable 'results' referenced before assignment
  0%|                                                                                                                                                                                              | 2/10068 [02:20<195:46:22, 70.02s/it]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1585) of binary: /opt/aws_neuron_venv_pytorch/bin/python3.8
Traceback (most recent call last):
  File "/opt/aws_neuron_venv_pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../src/run_summarization.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-17_13:52:43
  host      : ip-10-255-154-196.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1586)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-17_13:52:43
  host      : ip-10-255-154-196.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1585)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
philschmid commented 1 year ago

Hello @njbrake,

Thank you for opening the issue! In the logs I can see train.py being used, but at the beginning you say you are using the run_summarization.py example. Did you copy it, or is train.py another script? Did you pass any hyperparameters along, e.g. model id, dataset, batch size, etc.?

michaelbenayoun commented 1 year ago

Hi @njbrake, I see the error being:

UnboundLocalError: local variable 'results' referenced before assignment

But from experience, what could be happening is that you are using a batch size that is too big. What is the per-device batch size?

Here are some tips:

njbrake commented 1 year ago

Hi @michaelbenayoun and @philschmid, Thanks for the fast response!

Sorry, that "local variable 'results'" error is irrelevant; it only came up because of the error thrown by Neuron.

In my case, run_summarization.py is a simple wrapper that calls the train.py script.

I tried adding the bf16 flag, but that didn't help either. Here are my hyperparameters.

{
    "model_name_or_path": "facebook/bart-large",
    "dataset_name" : "my_dataset",
    "save_predictions" : false,
    "do_train": true,
    "do_eval": true,
    "do_predict": true,
    "max_source_length": 1024,
    "max_seq_length": 1024,
    "max_target_length": 1024,
    "predict_with_generate": true,
    "num_beams": 5,
    "generation_num_beams": 5,
    "generation_max_length": 512,
    "generation_no_repeat_ngram_size": 3,
    "label_smoothing_factor": 0.1,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "num_train_epochs": 2,
    "warmup_steps": 200,
    "learning_rate": 4e-5,
    "lr_scheduler_type": "polynomial",
    "weight_decay": 0.0001,
    "max_grad_norm": 0.1,
    "evaluation_strategy": "steps",
    "logging_strategy": "steps",
    "save_strategy": "steps",
    "eval_steps": 100,
    "logging_steps": 100,
    "save_steps": 100,
    "load_best_model_at_end": true,
    "metric_for_best_model": "rougeL",
    "early_stopping_patience": 5,
    "save_total_limit": 3,
    "pad_to_multiple_of": 8,
    "pad_to_max_length": true,
    "gradient_checkpointing": false,
    "fp16": false,
    "bf16": true
}

I also tried setting bf16 to false; both produced the same error. And here is my requirements.txt, pulled with pip freeze:

absl-py==1.4.0
aiobotocore==2.4.2
aiofiles==22.1.0
aiohttp==3.8.4
aioitertools==0.11.0
aiosignal==1.3.1
aiosqlite==0.18.0
alembic==1.10.2
amqp==5.1.1
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
astroid==2.15.0
asttokens==2.2.1
async-timeout==4.0.2
attrs==22.2.0
Automat==22.10.0
aws-neuronx-runtime-discovery==2.9
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.11.2
billiard==3.6.4.0
black==23.1.0
bleach==6.0.0
boto3==1.26.91
botocore==1.27.59
build==0.10.0
cachetools==5.3.0
celery==5.2.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
cloud-tpu-client==0.10
cloudpickle==2.2.1
cmake==3.25.2
colorama==0.4.4
coloredlogs==15.0.1
comm==0.1.2
constantly==15.1.0
contourpy==1.0.7
cryptography==39.0.2
cssselect==1.2.0
cycler==0.11.0
dask==2023.3.1
databricks-cli==0.17.5
datasets==2.10.1
debugpy==1.6.6
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
distlib==0.3.6
docker==6.0.1
docutils==0.16
dparse==0.6.2
entrypoints==0.4
exceptiongroup==1.1.1
executing==1.2.0
fastapi==0.94.1
fastjsonschema==2.16.3
filelock==3.9.1
Flask==2.2.3
fonttools==4.39.0
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.3.0
gitdb==4.0.10
GitPython==3.1.31
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.16.2
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
googleapis-common-protos==1.58.0
greenlet==2.0.2
grpcio==1.51.3
gunicorn==20.1.0
httpie==3.2.1
httplib2==0.21.0
huggingface-hub==0.13.2
humanfriendly==10.0
hyperlink==21.0.0
idna==3.4
imageio==2.26.0
importlib-metadata==6.0.0
importlib-resources==5.12.0
incremental==22.10.0
iniconfig==2.0.0
ipykernel==6.21.3
ipython==8.11.0
ipython-genutils==0.2.0
ipywidgets==8.0.4
islpy==2022.1.1
isoduration==20.11.0
isort==5.12.0
itemadapter==0.7.0
itemloaders==1.0.6
itsdangerous==2.1.2
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.3
jupyter_client==8.0.3
jupyter_core==5.2.0
jupyter_server==2.4.0
jupyter_server_fileid==0.8.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.6.1
jupyterlab==3.6.1
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.5
jupyterlab_server==2.20.0
kiwisolver==1.4.4
kombu==5.2.4
lazy-object-proxy==1.9.0
libneuronxla==0.5.144
llvmlite==0.39.1
locket==1.0.0
lockfile==0.12.2
lxml==4.9.2
Mako==1.2.4
Markdown==3.4.1
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==2.0.5
mlflow==2.2.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
nbclassic==0.5.3
nbclient==0.7.2
nbconvert==7.2.10
nbformat==5.7.3
nest-asyncio==1.5.6
networkx==2.6.3
neuronx-cc==2.4.0.21+b7621be18
neuronx-hwm==2.4.0.1+90172456c
nltk==3.8.1
notebook==6.5.3
notebook_shim==0.2.2
numba==0.56.4
numpy==1.20.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
oauthlib==3.2.2
opencv-python==4.7.0.72
optimum==1.7.1
optimum-neuron @ git+https://github.com/huggingface/optimum-neuron.git@4ef49022bda9136205140ed11eb5b877d056a017
packaging==23.0
pandas==1.4.4
pandocfilters==1.5.0
parsel==1.7.0
parso==0.8.3
partd==1.3.0
pathspec==0.11.1
pexpect==4.8.0
pgzip==0.3.4
pickleshare==0.7.5
Pillow==9.4.0
pip-tools==6.12.3
pipenv==2023.2.4
pkgutil_resolve_name==1.3.10
platformdirs==3.1.1
plotly==5.13.1
pluggy==1.0.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
Protego==0.2.1
protobuf==3.20.2
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==11.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.10.6
PyDispatcher==2.0.7
Pygments==2.14.0
PyJWT==2.6.0
pylint==2.17.0
pyOpenSSL==23.0.0
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyrsistent==0.19.3
PySocks==1.7.1
pytest==7.2.2
python-daemon==3.0.1
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2022.7.1
PyYAML==5.4.1
pyzmq==25.0.1
querystring-parser==1.2.4
queuelib==1.6.2
regex==2022.10.31
requests==2.28.2
requests-file==1.5.1
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
requests-unixsocket==0.3.0
responses==0.18.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.3.2
rouge-score==0.1.2
rsa==4.7.2
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.7
s3fs==2023.3.0
s3transfer==0.6.0
scikit-learn==1.2.2
scipy==1.3.3
Scrapy==2.8.0
seaborn==0.12.2
Send2Trash==1.8.0
sentencepiece==0.1.97
service-identity==21.1.0
shap==0.41.0
six==1.16.0
sklearn==0.0.post1
slicer==0.0.7
smmap==5.0.0
sniffio==1.3.0
soupsieve==2.4
SQLAlchemy==2.0.6
sqlparse==0.4.3
stack-data==0.6.2
starlette==0.26.1
sympy==1.11.1
tabulate==0.9.0
tenacity==8.2.2
tensorboard==2.12.0
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
terminado==0.17.1
threadpoolctl==3.1.0
tinycss2==1.2.1
tldextract==3.4.0
tokenizers==0.13.2
toml==0.10.2
tomli==2.0.1
tomlkit==0.11.6
toolz==0.12.0
torch==1.13.1
torch-neuronx==1.13.0.1.5.0
torch-xla==1.13.0+torchneuron4
torchvision==0.14.0
tornado==6.2
tqdm==4.65.0
traitlets==5.9.0
transformers==4.26.1
Twisted==22.10.0
typing_extensions==4.5.0
uri-template==1.2.0
uritemplate==3.0.1
urllib3==1.26.15
vine==5.0.0
virtualenv==20.21.0
virtualenv-clone==0.5.7
w3lib==2.1.1
wcwidth==0.2.6
webcolors==1.12
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.2.3
widgetsnbextension==4.0.5
wrapt==1.15.0
xxhash==3.2.0
y-py==0.5.9
yarl==1.8.2
ypy-websocket==0.8.2
zipp==3.15.0
zope.interface==5.5.2
michaelbenayoun commented 1 year ago

Alright, the sequence length is too big. Since you are running a summarization task, do you need the target length to be as big as 1024?

By default, we have max_target_length=128. Keep bf16=True as well; it will be the default in the next releases.

Also, we do not support generation yet, so only the loss is computed.

njbrake commented 1 year ago

Oh awesome! I’ll do that. Thanks for the support on this 💯. Any idea when generation might be supported? That’s pretty important for my team, which has factuality-based eval metrics that rely upon generation.

njbrake commented 1 year ago

Ok, I adjusted max_target_length, and unfortunately I'm still getting the vague "list index out of range" error 😢

  0%|                                                                                                                                                               | 1/10068 [00:03<11:00:33,  3.94s/it]2023-03-17 16:56:36.000674: INFO ||NCC_WRAPPER||: No candidate found under /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_15278458310435703474.
2023-03-17 16:56:36.000675: INFO ||NCC_WRAPPER||: Cache dir for the neff: /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_15278458310435703474/MODULE_1_SyncTensorsGraph.30530_15278458310435703474_ip-10-255-154-196.ec2.internal-5c19433d-8386-5f71b75d4b1c5/6b211c67-7520-42da-aec8-4c19a88e157d
.......03/17/2023 04:58:44 PM ERROR 9835 [Tensorizer]: Transformation error on operator: _scatter.4
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: ***************************************************************
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:  An Internal Compiler Error has occurred
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: ***************************************************************
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: 
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Error message:  list index out of range
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: 
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Error class:    IndexError
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Error location: Unknown
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Command line:   /opt/aws_neuron_venv_pytorch/bin/neuronx-cc --target=trn1 compile --framework XLA /tmp/MODULE_1_SyncTensorsGraph.30530_15278458310435703474_ip-10-255-154-196.ec2.internal-5c19433d-8386-5f71b75d4b1c5.hlo.pb --output /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_15278458310435703474/MODULE_1_SyncTensorsGraph.30530_15278458310435703474_ip-10-255-154-196.ec2.internal-5c19433d-8386-5f71b75d4b1c5/6b211c67-7520-42da-aec8-4c19a88e157d/MODULE_1_SyncTensorsGraph.30530_15278458310435703474_ip-10-255-154-196.ec2.internal-5c19433d-8386-5f71b75d4b1c5.neff --verbose=35
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: 
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Internal details:
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/CommandDriver.py", line 235, in neuronxcc.driver.CommandDriver.CommandDriver.run
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1014, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 965, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 990, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 994, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 591, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 387, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 168, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 243, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 244, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 266, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 183, in neuronxcc.starfish.penguin.Compile.compile_cu
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 185, in neuronxcc.starfish.penguin.Compile.compile_cu
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 222, in neuronxcc.starfish.penguin.Compile.compile_cu
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 432, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformFunction
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 440, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformFunction
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 142, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runOnFunction
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 196, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 178, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 208, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 210, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 211, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 240, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 241, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 334, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 329, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runTransforms
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 318, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmts
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 366, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 369, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/sunda/passes/SundaISel.py", line 549, in neuronxcc.starfish.penguin.targets.sunda.passes.SundaISel.SundaISel.transformTGenericAtomicRMWOperator
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 174, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodegenBase.codegenMacro
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 200, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodegenBase.codegenInsts
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 225, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodeGen.codegenInst
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 214, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodeGen.dispatch_codegen
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/sunda/passes/SundaISel.py", line 408, in neuronxcc.starfish.penguin.targets.sunda.passes.SundaISel.ScatterCodegen.codegenGenericAtomicRMW
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/passes/TongaISel.py", line 321, in neuronxcc.starfish.penguin.targets.tonga.passes.TongaISel.CodeGen.codegenGenericAccess
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: 
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Version information:
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   NeuronX Compiler version 2.4.0.21+b7621be18
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   HWM version 2.4.0.1-90172456c
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   NEFF version Dynamic
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   TVM not available
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   NumPy version 1.20.0
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]:   MXNet not available
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: 
03/17/2023 04:58:44 PM ERROR 9835 [neuronx-cc]: Artifacts stored in: /home/ssm-user/scm/scribe_summarization_huggingface/data/neuronxcc-58zb37y4

Compiler status ERROR
2023-03-17 16:58:45.000295: ERROR ||NCC_WRAPPER||: There was a compilation error for /tmp/MODULE_1_SyncTensorsGraph.30530_15278458310435703474_ip-10-255-154-196.ec2.internal-5c19433d-8386-5f71b75d4b1c5.hlo.pb graph. Returning with an errored graph
2023-03-17 16:58:45.665131: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-03-17 16:58:45.665283: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
  0%|                                                                                                                                                              | 2/10068 [02:14<219:17:59, 78.43s/it]2023-03-17 16:58:45.836788: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-03-17 16:58:45.836831: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-03-17 16:58:45.836836: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        tsl::CurrentStackTrace()
2023-03-17 16:58:45.836843: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-03-17 16:58:45.836847: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-03-17 16:58:45.836852: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 16:58:45.836855: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-03-17 16:58:45.836863: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 16:58:45.836867: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 16:58:45.836871: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 16:58:45.836874: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]        clone
2023-03-17 16:58:45.836877: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-03-17 16:58:45.836881: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-03-17 16:58:45.836884: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-03-17 16:58:45.836888: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-03-17 16:58:45.836895: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INTERNAL: neuronx-cc compilation failed.
2023-03-17 16:58:45.836901: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[{{node XRTExecute}}]]
2023-03-17 16:58:45.836905: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[XRTExecute_G15]]
2023-03-17 16:58:45.836909: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INTERNAL: neuronx-cc compilation failed.
2023-03-17 16:58:45.836912: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]         [[{{node XRTExecute}}]]
2023-03-17 16:58:45.836916: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-03-17 16:58:45.836919: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-03-17 16:58:45.836923: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-03-17 16:58:45.836926: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Traceback (most recent call last):
  File "/home/ssm-user/scm/scribe_summarization_huggingface/src/hf_summ/train.py", line 269, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/transformers/trainer.py", line 1765, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/opt/aws_neuron_venv_pytorch/lib64/python3.8/site-packages/torch_xla/core/xla_model.py", line 951, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
         [[{{node XRTExecute}}]]
         [[XRTExecute_G15]]
  (1) INTERNAL: neuronx-cc compilation failed.
         [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.

Here are my new hyperparameters:

{
    "model_name_or_path": "facebook/bart-base",
    "dataset_name" : "my_dataset",
    "save_predictions" : false,
    "do_train": true,
    "do_eval": true,
    "do_predict": true,
    "max_source_length": 1024,
    "max_target_length": 128,
    "attention_dropout": 0.1,
    "predict_with_generate": false,
    "num_beams": 5,
    "generation_num_beams": 5,
    "generation_max_length": 512,
    "generation_no_repeat_ngram_size": 3,
    "label_smoothing_factor": 0.1,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "num_train_epochs": 2,
    "warmup_steps": 200,
    "learning_rate": 4e-5,
    "lr_scheduler_type": "polynomial",
    "weight_decay": 0.0001,
    "max_grad_norm": 0.1,
    "evaluation_strategy": "steps",
    "logging_strategy": "steps",
    "save_strategy": "steps",
    "eval_steps": 100,
    "logging_steps": 100,
    "save_steps": 100,
    "load_best_model_at_end": true,
    "metric_for_best_model": "rougeL",
    "early_stopping_patience": 5,
    "save_total_limit": 3,
    "pad_to_multiple_of": 8,
    "pad_to_max_length": true,
    "gradient_checkpointing": false,
    "bf16": true
}

Do you happen to have any other pointers on what to look for? I'm having a hard time extracting any meaningful information from the error and stack trace that neuronx-cc is providing.

michaelbenayoun commented 1 year ago

I just ran the following command:

python examples/summarization/run_summarization.py --model_name_or_path facebook/bart-base --dataset_name cnn_dailymail --dataset_config_name 3.0.0 --do_train --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --max_source_length 1024 --max_target_length 128 --overwrite_output_dir --output_dir test_bart_base --bf16

And it works nicely.

I agree that the error message is not helpful at all. In my experience, the issue usually comes from not having enough memory.

Could you try this:

max_source_length = 100
max_target_length = 100

Just to check if this is memory related?
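
Concretely, that would be the command from above rerun with just the two lengths reduced (illustrative only; everything else unchanged):

python examples/summarization/run_summarization.py --model_name_or_path facebook/bart-base --dataset_name cnn_dailymail --dataset_config_name 3.0.0 --do_train --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --max_source_length 100 --max_target_length 100 --overwrite_output_dir --output_dir test_bart_base --bf16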

njbrake commented 1 year ago

Ok, I ran the command you pasted and it worked for me as well, so that at least helps narrow it down to something in my particular hyperparameters. It looks like I won't quite get to it today, but I'll put in some more work on Monday to try to narrow down which hyperparameter is causing it. Now that I have that one script running, I can slowly add in my hyperparameter tweaks to see where it breaks!

If these issues are indeed related to memory, would the problems be resolved if I switched to the trn1.32xlarge instance? I guess my question is also: when we are talking about "not having enough memory", is that referring to CPU memory or Trainium chip memory?

michaelbenayoun commented 1 year ago

I mean Trainium chip memory. Just to make sure: does the process fail right at the beginning? I am trying to figure out which parameter makes things fail as well.

Also, maybe try running your script on the CPU first and see if everything goes well (just for 5-10 steps), just to validate it.
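
One way to cap such a sanity run at a handful of steps, assuming the example script accepts the standard Trainer arguments, is something like:

python run_summarization.py <your usual arguments> --max_steps 10

(How to force CPU execution depends on your setup, so treat this only as a sketch for limiting the number of steps.)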

njbrake commented 1 year ago

Found it!

If you add --label_smoothing_factor 0.1, that will reproduce the error:
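
For reference, that is the same command as in the earlier comment with the single extra flag appended (repeated here only for clarity):

python examples/summarization/run_summarization.py --model_name_or_path facebook/bart-base --dataset_name cnn_dailymail --dataset_config_name 3.0.0 --do_train --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --max_source_length 1024 --max_target_length 128 --overwrite_output_dir --output_dir test_bart_base --bf16 --label_smoothing_factor 0.1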

03/20/2023 12:30:05 PM ERROR 3459 [Tensorizer]: Transformation error on operator: _scatter.2
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: ***************************************************************
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]:  An Internal Compiler Error has occurred
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: ***************************************************************
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: 
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: Error message:  list index out of range
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: 
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: Error class:    IndexError
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: Error location: Unknown
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: Command line:   /opt/aws_neuron_venv_pytorch/bin/neuronx-cc --target=trn1 compile --framework XLA /tmp/MODULE_1_SyncTensorsGraph.28326_10581472026669971527_ip-10-255-154-196.ec2.internal-353f0e91-2609-5f7540e46efd9.hlo.pb --output /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_10581472026669971527/MODULE_1_SyncTensorsGraph.28326_10581472026669971527_ip-10-255-154-196.ec2.internal-353f0e91-2609-5f7540e46efd9/44a76ae5-b288-4580-8565-d89aa4baab01/MODULE_1_SyncTensorsGraph.28326_10581472026669971527_ip-10-255-154-196.ec2.internal-353f0e91-2609-5f7540e46efd9.neff --model-type=transformer --verbose=35
03/20/2023 12:30:05 PM ERROR 3459 [neuronx-cc]: 
michaelbenayoun commented 1 year ago

Nice, I am able to reproduce! Will keep you posted.

michaelbenayoun commented 1 year ago

> Oh awesome! I’ll do that. Thanks for the support on this 💯. Any idea when generation might be supported? That’s pretty important for my team, which has factuality-based eval metrics that rely upon generation.

The Neuron team is working to make the generate() method support the fixed tensor shapes needed by the compiler. They are also still working on support for key operators such as TopK and Sort, which are needed for generation (top-k, beam search, and sampling). TopK will most likely come first.

michaelbenayoun commented 1 year ago

FYI https://github.com/aws-neuron/aws-neuron-sdk/issues/643#issuecomment-1500641091