Open aws-sadaf opened 4 months ago
Instance type: Trn1.32xlarge

Pre-installed packages:
aws-neuronx-runtime-discovery 2.9
libneuronxla 2.0.965
neuronx-cc 2.13.72.0+78a426937
torch-neuronx 2.1.2.2.1.0
torch 2.1.2
torch-xla 2.1.2
torchvision 0.16.2

Optimum Neuron Version - optimum-neuron 0.0.23.dev0

Getting the following test failures when running the Training tests:

============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.0.0, pluggy-1.4.0 -- /home/ubuntu/aws_neuron_venv_pytorch/bin/python3
cachedir: .pytest_cache
rootdir: /home/ubuntu/optimum-neuron
configfile: pyproject.toml
plugins: timeout-2.3.1, remotedata-0.4.1, xdist-3.5.0
collecting ... collected 26 items

tests/test_examples.py::CausalLMExampleTester::test_run_clm_gpt_neo PASSED [ 3%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_gpt_neo_with_tp_only FAILED [ 7%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama PASSED [ 11%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama_with_pp_only FAILED [ 15%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama_with_tp_and_pp FAILED [ 19%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama_with_tp_only PASSED [ 23%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_mistral FAILED [ 26%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_mistral_with_tp_only FAILED [ 30%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_bert PASSED [ 34%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_bert_with_tp_only PASSED [ 38%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama FAILED [ 42%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama_with_pp_only FAILED [ 46%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama_with_tp_and_pp FAILED [ 50%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_llama_with_tp_only FAILED [ 53%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_mistral FAILED [ 57%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_mistral_with_tp_only FAILED [ 61%]
tests/test_examples.py::TokenClassificationExampleTester::test_run_ner_bert PASSED [ 65%]
tests/test_examples.py::TokenClassificationExampleTester::test_run_ner_bert_with_tp_only PASSED [ 69%]
tests/test_examples.py::MultipleChoiceExampleTester::test_run_swag_bert PASSED [ 73%]
tests/test_examples.py::MultipleChoiceExampleTester::test_run_swag_bert_with_tp_only PASSED [ 76%]
tests/test_examples.py::QuestionAnsweringExampleTester::test_run_qa_bert PASSED [ 80%]
tests/test_examples.py::QuestionAnsweringExampleTester::test_run_qa_bert_with_tp_only PASSED [ 84%]
tests/test_examples.py::SummarizationExampleTester::test_run_summarization_t5 FAILED [ 88%]
tests/test_examples.py::SummarizationExampleTester::test_run_summarization_t5_with_tp_only FAILED [ 92%]
tests/test_examples.py::TranslationExampleTester::test_run_translation_t5 FAILED [ 96%]
tests/test_examples.py::TranslationExampleTester::test_run_translation_t5_with_tp_only FAILED [100%]

Please refer to the attached document with complete logs.
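For triage, it may help to group the outcomes above by test class. The following is a minimal sketch, not part of optimum-neuron: the `tally` helper and the regular expression are hypothetical, and they assume the log contains per-test lines in the `pytest -v` format shown above.

```python
import re
from collections import Counter

# Matches `pytest -v` result lines such as:
# tests/test_examples.py::CausalLMExampleTester::test_run_clm_llama PASSED [ 11%]
LINE_RE = re.compile(r"^(\S+)::(\w+)::(\w+)\s+(PASSED|FAILED|ERROR|SKIPPED)\b")


def tally(log_text):
    """Count outcomes per test class (hypothetical helper for this issue)."""
    counts = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip session headers and any other non-result lines
        _, test_class, _, outcome = match.groups()
        counts.setdefault(test_class, Counter())[outcome] += 1
    return counts


# Three lines copied from the output above, used as a small sample.
sample = """\
tests/test_examples.py::CausalLMExampleTester::test_run_clm_gpt_neo PASSED [ 3%]
tests/test_examples.py::CausalLMExampleTester::test_run_clm_gpt_neo_with_tp_only FAILED [ 7%]
tests/test_examples.py::TextClassificationExampleTester::test_run_glue_bert PASSED [ 34%]
"""

result = tally(sample)
for test_class, outcomes in sorted(result.items()):
    print(test_class, dict(outcomes))
```

The same function could be applied to the text of the attached log, assuming it contains the same per-test result lines.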
Hi @aws-sadaf, could you also run the following and report the results?
pytest tests/distributed -v
System Info
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
RUN_SLOW=true COVERAGE=high RUN_TINY=true USE_VENV=false pytest tests/test_examples.py -v
Expected behavior
The training test suite is expected to pass. Complete logs are attached: ON_TrainingResults.txt