Closed by miguelgfierro 3 months ago
If I comment out all tests except SLiRec, it works and runs quickly:
INFO:submit_groupwise_azureml_pytest.py:Executing tests now...
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.1.1, pluggy-1.4.0
rootdir: /mnt/azureml/cr/j/a2150f1522ab4d47941d5730dcfc5eb8/exe/wd
configfile: pyproject.toml
plugins: mock-3.14.0, hypothesis-6.99.13, typeguard-4.2.1, cov-5.0.0, anyio-4.3.0
collected 2 items
tests/unit/examples/test_notebooks_gpu.py . [ 50%]
tests/functional/examples/test_notebooks_gpu.py
============================== slowest durations ===============================
343.83s call tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional[recommenders/models/deeprec/config/sli_rec.yaml-tests/resources/deeprec/slirec-5-300-expected_values0-42]
1.36s call tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm
(4 durations < 0.005s hidden. Use -vv to show these durations.)
================== 2 passed, 12 warnings in 348.30s (0:05:48) ==================
INFO:submit_groupwise_azureml_pytest.py:Test execution completed!
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 5.421173095703125 seconds
See https://github.com/recommenders-team/recommenders/actions/runs/8422556385/job/23062115526
It works with:
"tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional",
"tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",
https://github.com/recommenders-team/recommenders/actions/runs/8424096579
If I run lightgcn, SLiRec and DKN together (https://github.com/recommenders-team/recommenders/pull/2076/commits/4a484e8e8df349e37a9f3f62a628cbb75e7ef881), I get an OOM: https://github.com/recommenders-team/recommenders/actions/runs/8427004517/job/23076600230#step:3:8023
@SimonYansenZhao feel free to continue in this branch https://github.com/recommenders-team/recommenders/pull/2076
The problem I'm seeing is that each test somehow keeps its GPU memory allocated, so the next test gets an OOM. I tried tf.keras.backend.clear_session(), but it didn't work. Also, in the BaseModel class we have gpu_options = tf.compat.v1.GPUOptions(allow_growth=True). However, when I run the code locally, I see TF grabbing all the memory.
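Since TensorFlow generally does not return GPU memory to the driver within a running process, one workaround worth trying (a sketch, not the repo's current mechanism; the helper name run_test_isolated is made up) is to run each heavy test in its own interpreter, so the memory is guaranteed to be freed when the child process exits:

```python
# Hypothetical helper: run a single test path in a fresh Python process.
# TensorFlow holds GPU memory for the lifetime of the process, so
# clear_session() alone may not free it; letting the child exit does.
import subprocess
import sys


def run_test_isolated(test_path):
    """Invoke pytest on one test path in a separate interpreter.

    Returns the pytest exit code (0 means the test passed). All GPU
    memory held by the test is released when the child terminates.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pytest", test_path],
        check=False,  # report the exit code instead of raising
    )
    return result.returncode


if __name__ == "__main__":
    # Isolate the notebooks that OOM each other when run in one session.
    for path in [
        "tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional",
        "tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",
    ]:
        print(path, "->", run_test_isolated(path))
```

The trade-off is one interpreter startup per test, which is negligible next to the 300+ seconds the SLiRec notebook itself takes.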
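One thing worth checking (a sketch, under the assumption that BaseModel's GPUOptions may not reach every session that touches the GPU): allow_growth only takes effect if it is attached to the config of the session that first initializes the device. TensorFlow also honors the TF_FORCE_GPU_ALLOW_GROWTH environment variable, which the test runner could set before TF starts:

```python
# Must be set before TensorFlow initializes the GPU for the first time;
# once the allocator has claimed the device, changing it has no effect.
import os

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Equivalent in-process configuration (TF 2.x API), guarded so this
# sketch still runs on machines without TensorFlow or without a GPU.
try:
    import tensorflow as tf

    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)
except (ImportError, RuntimeError):
    # RuntimeError is raised if the GPUs were already initialized.
    pass
```

Setting the environment variable in the AzureML job definition would enforce growth mode globally, without relying on each model class passing its GPUOptions through.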
Yeah, I ran each test individually today, and all passed. I'll try running the three tests together.
@miguelgfierro all tests including nightly builds passed after #2077 was merged. Not sure what's going on with the OOM issue. But I think we can merge staging into main now.
@SimonYansenZhao I'm freaking out.
Let me run the nightly builds again.
Description
Fix #2063
I ran both tests on a local V100 and they pass. In the nightly tests I get: https://github.com/recommenders-team/recommenders/actions/workflows/azureml-gpu-nightly.yml?query=branch%3Amiguel%2Fnightly_oom
The DKN test already runs with 5 epochs and a batch size of 200. I don't know whether the two tests conflict because TF takes the whole GPU memory by default.
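If the two tests really are fighting over the device, another knob to try (an illustrative sketch; the helper name and the 0.45 fraction are assumptions, not what BaseModel does) is per_process_gpu_memory_fraction, which caps the allocator so two TF sessions can coexist on one card:

```python
# Illustrative helper, not the BaseModel implementation. Guarded import
# so the sketch still loads on machines without TensorFlow.
try:
    import tensorflow as tf

    HAS_TF = True
except ImportError:
    HAS_TF = False


def make_capped_session(fraction=0.45):
    """Create a TF1-style session limited to a fraction of GPU memory.

    With two tests capped at ~45% each, neither can starve the other.
    """
    if not HAS_TF:
        raise RuntimeError("TensorFlow is required for this sketch")
    gpu_options = tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=fraction,  # hard cap on the allocator
        allow_growth=True,  # grow lazily within that cap
    )
    config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
    return tf.compat.v1.Session(config=config)
```

The downside is that the fraction has to be tuned so each model still fits, which is fragile compared with simply not sharing the GPU between tests.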
This conflict shouldn't happen, because in the BaseModel class we have
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
. However, when I run the code locally, I see TF grabbing all the memory.

Related Issues
References
Checklist:
- Commits are signed off with git commit -s -m "your commit message".
- The PR targets the staging branch AND NOT TO the main branch.