recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

Issue with SliRec and DKN OOM #2076

Closed miguelgfierro closed 3 months ago

miguelgfierro commented 3 months ago

Description

Fix #2063

I ran both tests on a local V100 and they pass. In the nightly tests I get: https://github.com/recommenders-team/recommenders/actions/workflows/azureml-gpu-nightly.yml?query=branch%3Amiguel%2Fnightly_oom

The DKN test already runs with 5 epochs and BS=200. I don't know whether the two tests conflict because TF takes the whole GPU memory by default.

This shouldn't happen, because in the BaseModel class we set gpu_options = tf.compat.v1.GPUOptions(allow_growth=True). However, when I run the code locally, I still see TF grabbing all the memory.
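For context, there are two ways to ask TF not to reserve the whole GPU up front. A minimal sketch of both (assuming TensorFlow 1.15+/2.x; the session line is commented out because the models build their own sessions):

```python
import tensorflow as tf

# TF1-style option (the one BaseModel uses): allocate GPU memory on demand
# instead of reserving the whole card when the session starts.
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
# session = tf.compat.v1.Session(config=config)

# TF2-style equivalent; must run before the first GPU op initializes CUDA,
# otherwise TF raises a RuntimeError. On a CPU-only machine this is a no-op.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

Note that allow_growth only prevents up-front reservation; memory a process has already grown into is not returned to the driver until the process exits, which may explain why sequential tests in one process still accumulate.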


miguelgfierro commented 3 months ago

If I comment out all tests except SLiRec, it works, and it is quick:

INFO:submit_groupwise_azureml_pytest.py:Executing tests now...
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.1.1, pluggy-1.4.0
rootdir: /mnt/azureml/cr/j/a2150f1522ab4d47941d5730dcfc5eb8/exe/wd
configfile: pyproject.toml
plugins: mock-3.14.0, hypothesis-6.99.13, typeguard-4.2.1, cov-5.0.0, anyio-4.3.0
collected 2 items

tests/unit/examples/test_notebooks_gpu.py .                              [ 50%]
tests/functional/examples/test_notebooks_gpu.py 

============================== slowest durations ===============================
343.83s call     tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional[recommenders/models/deeprec/config/sli_rec.yaml-tests/resources/deeprec/slirec-5-300-expected_values0-42]
1.36s call     tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm

(4 durations < 0.005s hidden.  Use -vv to show these durations.)
================== 2 passed, 12 warnings in 348.30s (0:05:48) ==================
INFO:submit_groupwise_azureml_pytest.py:Test execution completed!
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 5.421173095703125 seconds

See https://github.com/recommenders-team/recommenders/actions/runs/8422556385/job/23062115526

It works with:

        "tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional", 
        "tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",

https://github.com/recommenders-team/recommenders/actions/runs/8424096579

If I run lightgcn, SLiRec, and DKN together (https://github.com/recommenders-team/recommenders/pull/2076/commits/4a484e8e8df349e37a9f3f62a628cbb75e7ef881), I get an OOM: https://github.com/recommenders-team/recommenders/actions/runs/8427004517/job/23076600230#step:3:8023

miguelgfierro commented 3 months ago

@SimonYansenZhao feel free to continue in this branch https://github.com/recommenders-team/recommenders/pull/2076

The problem I'm seeing is that each test, for some reason, keeps its memory, so the next test gets an OOM. I tried tf.keras.backend.clear_session(), but it didn't work. Also, in the BaseModel class we set gpu_options = tf.compat.v1.GPUOptions(allow_growth=True). However, when I run the code locally, I still see TF grabbing all the memory.
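For reference, the cleanup attempt above can be packaged as a teardown helper, e.g. called from an autouse pytest fixture after each test. This is a sketch of what was tried, not a fix (it did not resolve the OOM here, likely because the TF1-style graph/session in these models holds memory outside the Keras backend):

```python
import gc


def release_tf_memory():
    """Best-effort cleanup of TF state; call after each test, e.g. from an
    autouse pytest fixture's teardown."""
    # Imported lazily so this helper can be collected without TensorFlow.
    import tensorflow as tf

    # Clear the global Keras graph/session state...
    tf.keras.backend.clear_session()
    # ...and drop dangling Python references so memory can be reclaimed.
    gc.collect()
```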

SimonYansenZhao commented 3 months ago

> @SimonYansenZhao feel free to continue in this branch #2076
>
> The problem I'm seeing is that each test, for some reason, keeps its memory, so the next test gets an OOM. I tried tf.keras.backend.clear_session(), but it didn't work. Also, in the BaseModel class we set gpu_options = tf.compat.v1.GPUOptions(allow_growth=True). However, when I run the code locally, I still see TF grabbing all the memory.

Yeah, I ran each test individually today, and all passed. I'll try to run the three tests together.
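If the tests still collide when run together, one pattern that guarantees memory is released between them, regardless of what TF holds internally, is to run each test body in its own process: when the child exits, the driver reclaims everything it held, including the CUDA context. A stdlib sketch of that isolation pattern (fake_training_step is a hypothetical stand-in, not part of the repo; pytest-forked provides a ready-made version of this idea):

```python
import multiprocessing as mp


def _worker(fn, args, queue):
    # Runs in the child process; ship the result back to the parent.
    queue.put(fn(*args))


def run_isolated(fn, *args):
    """Run fn in its own process and return its result.

    With real TF/CUDA code you would use the "spawn" start method (and
    module-level functions); "fork" keeps this sketch simple on Linux.
    """
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(fn, args, queue))
    proc.start()
    result = queue.get()  # read before join to avoid queue-buffer deadlock
    proc.join()
    return result


def fake_training_step(n):
    # Hypothetical stand-in for a test that would allocate GPU memory.
    return sum(range(n))
```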

SimonYansenZhao commented 3 months ago

@miguelgfierro all tests, including the nightly builds, passed after #2077 was merged. Not sure what's going on with the OOM issue, but I think we can merge staging into main now.

miguelgfierro commented 3 months ago

@SimonYansenZhao I'm freaking out.

miguelgfierro commented 3 months ago

Let me run the nightlies again.

miguelgfierro commented 3 months ago

All nightly runs pass:

https://github.com/recommenders-team/recommenders/actions/runs/8565349616
https://github.com/recommenders-team/recommenders/actions/runs/8565355817
https://github.com/recommenders-team/recommenders/actions/runs/8565353778

Closing this.