huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0
195 stars 59 forks source link

Optimum Neuron Training Tests Failures #597

Open aws-sadaf opened 4 months ago

aws-sadaf commented 4 months ago

System Info

Instance type: Trn1.32xlarge

pre-installed package:

aws-neuronx-runtime-discovery 2.9
libneuronxla                  2.0.965
neuronx-cc                    2.13.72.0+78a426937
torch-neuronx                 2.1.2.2.1.0
torch                         2.1.2
torch-neuronx                 2.1.2.2.1.0
torch-xla                     2.1.2
torchvision                   0.16.2

Optimum Neuron Version - 
optimum-neuron                0.0.23.dev0

Getting the following test failures when running the Training tests - 
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.0.0, pluggy-1.4.0 -- /home/ubuntu/aws_neuron_venv_pytorch/bin/python3
cachedir: .pytest_cache
rootdir: /home/ubuntu/optimum-neuron
configfile: pyproject.toml
plugins: timeout-2.3.1, remotedata-0.4.1, xdist-3.5.0
collecting ... collected 26 items

tests/test_examples.py::CausalLMExampleTester::test_run_clm_gpt_neo PASSED [  3%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_gpt_neo_with_tp_only FAILED [  7%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama PASSED [ 11%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama_with_pp_only FAILED [ 15%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama_with_tp_and_pp FAILED [ 19%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama_with_tp_only PASSED [ 23%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_mistral FAILED [ 26%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_mistral_with_tp_only FAILED [ 30%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_bert PASSED [ 34%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_bert_with_tp_only PASSED [ 38%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama FAILED [ 42%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama_with_pp_only FAILED [ 46%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama_with_tp_and_pp FAILED [ 50%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama_with_tp_only FAILED [ 53%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_mistral FAILED [ 57%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_mistral_with_tp_only FAILED [ 61%]
tests/test_examples.py::TokenClassificationExampleTester::test_run_ner_bert PASSED [ 65%]
tests/test_examples.py::TokenClassificationExampleTester::test_run_ner_bert_with_tp_only PASSED [ 69%]
tests/test_examples.py::MultipleChoiceExampleTester::test_run_swag_bert PASSED [ 73%]
tests/test_examples.py::MultipleChoiceExampleTester::test_run_swag_bert_with_tp_only PASSED [ 76%]
tests/test_examples.py::QuestionAnsweringExampleTester::test_run_qa_bert PASSED [ 80%]
tests/test_examples.py::QuestionAnsweringExampleTester::test_run_qa_bert_with_tp_only PASSED [ 84%]
tests/test_examples.py::SummarizationExampleTester::test_run_summarization_t5 FAILED [ 88%]
tests/test_examples.py::SummarizationExampleTester::test_run_summarization_t5_with_tp_only FAILED [ 92%]
tests/test_examples.py::TranslationExampleTester::test_run_translation_t5 FAILED [ 96%]
tests/test_examples.py::TranslationExampleTester::test_run_translation_t5_with_tp_only FAILED [100%]

Please refer to the attached document with complete logs.

Who can help?

No response

Information

Tasks

Reproduction (minimal, reproducible, runnable)

RUN_SLOW=true COVERAGE=high RUN_TINY=true USE_VENV=false pytest tests/test_examples.py -v

Expected behavior

Expect to see training suite passing. Complete logs are attached. ON_TrainingResults.txt

jeffhataws commented 4 months ago

Hi @aws-sadaf will you run the following also and report results?

pytest tests/distributed -v