intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

jenkins: orca auto estimator tests sometimes aborted #212

Closed pinggao187 closed 3 years ago

pinggao187 commented 3 years ago

Running orca auto estimator tests
============================= test session starts ==============================
platform linux -- Python 3.6.10, pytest-5.4.1, py-1.8.1, pluggy-0.12.0 -- /opt/work/conda/envs/py36/bin/python
cachedir: .pytest_cache
rootdir: /opt/work/jenkins/workspace/ZOO-NB-UnitTests-3.0-PYTHON/pyzoo
plugins: forked-1.1.2, xdist-1.31.0
collecting ... collected 11 items

../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit PASSED [ 9%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit_metric_func PASSED [ 18%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit_multiple_times PASSED [ 27%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_pytorch.py::TestPyTorchAutoEstimator::test_fit PASSED [ 36%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_pytorch.py::TestPyTorchAutoEstimator::test_fit_data_creator PASSED [ 45%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_pytorch.py::TestPyTorchAutoEstimator::test_fit_invalid_loss_name
pyzoo/dev/run-pytests-ray: line 85: 8298 Aborted python -m pytest -v ../test/zoo/orca/automl/autoestimator

jenkins link: http://10.239.176.111:18888/job/ZOO-NB-UnitTests-3.0-PYTHON/226/console

Running orca auto estimator tests
============================= test session starts ==============================
platform linux -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.13.0 -- /opt/work/conda/envs/py37/bin/python
cachedir: .pytest_cache
rootdir: /opt/work/jenkins/workspace/ZOO-PR-Python-Spark-3.0-py37-ray/pyzoo
plugins: forked-1.1.2, xdist-1.31.0
collecting ... collected 9 items

../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit PASSED [ 11%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit_multiple_times PASSED [ 22%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_pytorch.py::TestPyTorchAutoEstimator::test_fit PASSED [ 33%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_pytorch.py::TestPyTorchAutoEstimator::test_fit_data_creator
pyzoo/dev/run-pytests-ray: line 85: 164029 Aborted python -m pytest -v ../test/zoo/orca/automl/autoestimator

jenkins link: http://10.239.176.111:18888/job/ZOO-PR-Python-Spark-3.0-py37-ray/1183/console

Running orca auto estimator tests
============================= test session starts ==============================
platform darwin -- Python 3.7.7, pytest-5.4.1, py-1.8.1, pluggy-0.13.0 -- /Users/arda/anaconda3/envs/py37/bin/python
cachedir: .pytest_cache
rootdir: /private/var/jenkins_home/workspace/ZOO-NB-UnitTests-3.0-PYTHON-MAC/pyzoo
plugins: forked-1.1.2, xdist-1.31.0
collecting ... collected 9 items

../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit PASSED [ 11%]
../test/zoo/orca/automl/autoestimator/test_autoestimator_keras.py::TestTFKerasAutoEstimator::test_fit_multiple_times
pyzoo/dev/run-pytests-ray: line 85: 85724 Abort trap: 6 python -m pytest -v ../test/zoo/orca/automl/autoestimator

jenkins link: http://10.239.176.111:18888/view/ZOO-NB/job/ZOO-NB-UnitTests-3.0-PYTHON-MAC/185/console

TheaperDeng commented 3 years ago

https://github.com/intel-analytics/analytics-zoo/blob/02002c70bd8c3e8c737e2bd051c77c362fb8108b/pyzoo/test/zoo/orca/automl/autoestimator/test_autoestimator_pytorch.py#L118-L120

Maybe we can reduce the number of cores @yushan111
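
For reference, a minimal sketch of what lowering the core count could look like, assuming the linked test sets up Orca the usual way with init_orca_context (the exact arguments used at those lines are not reproduced here):

```python
# Hedged sketch: parameter values are illustrative, not the test's actual settings.
from zoo.orca import init_orca_context, stop_orca_context

# Start a local Orca/Ray context with fewer cores than before (e.g. 8 -> 4).
sc = init_orca_context(cluster_mode="local", cores=4, memory="4g", init_ray_on_spark=True)
try:
    # ... run the AutoEstimator tests here ...
    pass
finally:
    # Tear down the Ray cluster and Spark context between test modules.
    stop_orca_context()
```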

shanyu-sys commented 3 years ago

Still not sure about the cause...

It shouldn't be the number of cores, since almost all UTs in chronos.autots and orca.learn.ray use cores=8.

As for memory, I tested locally and found the AutoEstimator tests used less than 2 GB of memory, and there wasn't any memory leak.
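
(Not from the original report, just an illustration of how such a local check can be done: sample the resident set size of the pytest process and all of its children while the tests run.)

```python
# Illustrative helper, assuming psutil is available in the test environment.
import psutil

def total_rss_mb(pid: int) -> float:
    """Return resident memory of a process plus all of its children, in MB."""
    proc = psutil.Process(pid)
    procs = [proc] + proc.children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / (1024 ** 2)

# Example: poll the running pytest process by its PID from another shell.
# print(total_rss_mb(12345))  # hypothetical PID
```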

As for Ray processes, I tested locally and stop_orca_context killed all Ray processes.
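
A rough sketch of that kind of check (hypothetical, not the exact commands used): scan for leftover Ray daemons after stop_orca_context returns.

```python
import psutil
from zoo.orca import stop_orca_context

stop_orca_context()

# Look for Ray daemons (raylet, plasma store, ray:: workers) that survived teardown.
leftovers = []
for p in psutil.process_iter(attrs=["pid", "cmdline"]):
    cmd = " ".join(p.info.get("cmdline") or [])
    if "raylet" in cmd or "plasma_store" in cmd or "ray::" in cmd:
        leftovers.append((p.info["pid"], cmd))

print("leftover ray processes:", leftovers or "none")
```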

I also didn't find any abnormal process with large memory usage on the container.

To help locate the issue, I will change the order of the AutoEstimator tests. I will also reduce the number of trials to lower the memory required, as sketched below.
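
A rough sketch of the trial reduction (argument names and values are assumptions, not the test's actual code; `auto_est`, `train_data`, and `search_space` stand in for objects the test already builds):

```python
# Hypothetical snippet: n_sampling is assumed to be the trial-count knob in the fit call.
auto_est.fit(data=train_data,
             epochs=1,
             metric="accuracy",
             search_space=search_space,
             n_sampling=2)  # fewer trials -> fewer concurrent Ray workers and less memory
```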