huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

T5 not running on summarization script #469

Open dtsengAmazon opened 5 months ago

dtsengAmazon commented 5 months ago

Running a T5 model with optimum-neuron (ON) and the run_summarization script fails. I ran four experiments:

  1. ON 0.0.18, using the script from the ON GitHub repository
  2. ON 0.0.14, using the script from the ON GitHub repository
  3. ON 0.0.18, using the script from the ON GitHub repository with line 460 ("pipeline_parallel_size=training_args.pipeline_parallel_size") commented out, since it was causing errors
  4. ON 0.0.14, using the script from the ON GitHub repository with line 460 ("pipeline_parallel_size=training_args.pipeline_parallel_size") commented out, since it was causing errors

Logs for all four experiments are attached below.

Experiments 1 and 3 (ON 0.0.18) fail with "NotImplementedError: Sequence parallelism is not supported for <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>", even though the same settings ran on 0.0.14 before. Experiment 2 fails with "AttributeError: 'Seq2SeqNeuronTrainingArguments' object has no attribute 'pipeline_parallel_size'"; shouldn't that attribute default to 1? Experiment 4 exits with exit code 15 and no error line.
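The AttributeError in experiment 2 is what you would expect when a script written for a newer release reads an attribute that the older arguments class does not define. As a minimal sketch of a defensive read that falls back to 1, the snippet below uses a hypothetical helper, get_pipeline_parallel_size, and SimpleNamespace objects as stand-ins for Seq2SeqNeuronTrainingArguments; neither is part of optimum-neuron, and the attribute values are made up for illustration.

```python
from types import SimpleNamespace


def get_pipeline_parallel_size(training_args, default: int = 1) -> int:
    """Return pipeline_parallel_size if the object defines it, else the default.

    Hypothetical helper for illustration only; not part of optimum-neuron.
    """
    return getattr(training_args, "pipeline_parallel_size", default)


# Stand-ins for the real training-arguments objects (assumption: the older
# release simply lacks the attribute, as the AttributeError suggests).
old_style_args = SimpleNamespace(tensor_parallel_size=8)
new_style_args = SimpleNamespace(tensor_parallel_size=8, pipeline_parallel_size=2)

print(get_pipeline_parallel_size(old_style_args))
print(get_pipeline_parallel_size(new_style_args))
```

This only works around the missing attribute in the script; it does not address the sequence-parallelism error seen on 0.0.18.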

Steps to reproduce:

  1. Launch a new trn1.32xlarge EC2 instance and install the Neuron 2.16.1 release
  2. Install the nxd compiler
  3. Clone optimum-neuron (https://github.com/huggingface/optimum-neuron/tree/main) and navigate to examples/summarization
  4. pip install optimum-neuron (tried with 0.0.14 and 0.0.18 [latest])
  5. pip install -r requirements.txt
  6. Add the run.sh script below (chmod +x run.sh to make it executable)
  7. Run with neuron_parallel_compile ./run.sh t5-11b 5 1 1 256 8 512 64

Additional:

  1. Run with neuron_parallel_compile ./run.sh t5-11b 5 1 1 256 1 512 64, which sets tensor parallelism to 1; this run fails with an out-of-memory (OOM) error
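The OOM with tensor parallelism set to 1 is plausible from rough arithmetic alone. The sketch below estimates weight memory per core under stated assumptions: about 11 billion parameters for t5-11b, weights sharded evenly across the tensor-parallel group, and roughly 16 GB of device memory per NeuronCore on trn1 (check the hardware documentation for the exact figure). It counts weights only, so gradients, optimizer state, and activations would add to these numbers. The function name is hypothetical.

```python
def weight_gib_per_core(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate weight memory per core in GiB.

    Assumes weights are sharded evenly across tp_size cores and ignores
    gradients, optimizer state, and activations (so this is a lower bound).
    """
    return num_params * bytes_per_param / tp_size / 2**30


NUM_PARAMS = 11e9  # assumption: nominal parameter count of t5-11b

for tp_size in (1, 8):
    for dtype_name, nbytes in (("bf16", 2), ("fp32", 4)):
        gib = weight_gib_per_core(NUM_PARAMS, nbytes, tp_size)
        print(f"tp={tp_size} {dtype_name}: {gib:.1f} GiB of weights per core")
```

Under these assumptions, the weights alone exceed the assumed per-core budget at tensor parallel size 1 even in bf16, while at size 8 they fit with room to spare.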

EXP1_0.0.18_withPPline.txt EXP2_0.0.14_withPPline.txt EXP3_0.0.18_noPPline.txt EXP4_0.0.14_noPPline.txt

HuggingFaceDocBuilderDev commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
