[BUG] OOM in nightly SLiRec and DKN

Description

SLiRec:

2024-02-19T18:34:57.2954866Z     @pytest.mark.gpu
2024-02-19T18:34:57.2955209Z     @pytest.mark.notebooks
2024-02-19T18:34:57.2955584Z     @pytest.mark.parametrize(
2024-02-19T18:34:57.2956266Z         "yaml_file, data_path, epochs, batch_size, expected_values, seed",
2024-02-19T18:34:57.2956825Z         [
2024-02-19T18:34:57.2957104Z             (
2024-02-19T18:34:57.2957503Z                 "recommenders/models/deeprec/config/sli_rec.yaml",
2024-02-19T18:34:57.2958128Z                 os.path.join("tests", "resources", "deeprec", "slirec"),
2024-02-19T18:34:57.2958630Z                 10,
2024-02-19T18:34:57.2958948Z                 400,
2024-02-19T18:34:57.2959620Z                 {"auc": 0.7183},  # Don't do logloss check as SLi-Rec uses ranking loss, not a point-wise loss
2024-02-19T18:34:57.2960251Z                 42,
2024-02-19T18:34:57.2960549Z             )
2024-02-19T18:34:57.2960841Z         ],
2024-02-19T18:34:57.2961116Z     )
2024-02-19T18:34:57.2961430Z     def test_slirec_quickstart_functional(
2024-02-19T18:34:57.2961851Z         notebooks,
2024-02-19T18:34:57.2962168Z         output_notebook,
2024-02-19T18:34:57.2962505Z         kernel_name,
2024-02-19T18:34:57.2962827Z         yaml_file,
2024-02-19T18:34:57.2963138Z         data_path,
2024-02-19T18:34:57.2963434Z         epochs,
2024-02-19T18:34:57.2963891Z         batch_size,
2024-02-19T18:34:57.2964215Z         expected_values,
2024-02-19T18:34:57.2964557Z         seed,
2024-02-19T18:34:57.2964842Z     ):
2024-02-19T18:34:57.2965198Z         notebook_path = notebooks["slirec_quickstart"]
2024-02-19T18:34:57.2965660Z     
2024-02-19T18:34:57.2965951Z         params = {
2024-02-19T18:34:57.2966294Z             "yaml_file": yaml_file,
2024-02-19T18:34:57.2966694Z             "data_path": data_path,
2024-02-19T18:34:57.2967077Z             "EPOCHS": epochs,
2024-02-19T18:34:57.2967455Z             "BATCH_SIZE": batch_size,
2024-02-19T18:34:57.2967858Z             "RANDOM_SEED": seed,
2024-02-19T18:34:57.2968237Z         }
2024-02-19T18:34:57.2968533Z >       execute_notebook(
2024-02-19T18:34:57.2969074Z             notebook_path, output_notebook, kernel_name=kernel_name, parameters=params
2024-02-19T18:34:57.2969651Z         )
...
...
2024-02-19T18:34:57.3146322Z E           Node: 'gradients/sequential/sli_rec/attention_fcn/attention_fcn/att_fcn/nn_part/Tensordot/MatMul_grad/MatMul'
2024-02-19T18:34:57.3147524Z E           OOM when allocating tensor with shape[100000,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2024-02-19T18:34:57.3148737Z E               [[***node gradients/sequential/sli_rec/attention_fcn/attention_fcn/att_fcn/nn_part/Tensordot/MatMul_grad/MatMul***]]
2024-02-19T18:34:57.3150345Z E           Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
...
...
2024-02-19T18:34:57.5143539Z 2024-02-19 18:33:33.543531: I external/local_tsl/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 19795968 totalling 18.88MiB
2024-02-19T18:34:57.5144130Z 2024-02-19 18:33:33.543537: I external/local_tsl/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 32000000 totalling 30.52MiB
2024-02-19T18:34:57.5144750Z 2024-02-19 18:33:33.543543: I external/local_tsl/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 76880128 totalling 73.32MiB
2024-02-19T18:34:57.5145349Z 2024-02-19 18:33:33.543549: I external/local_tsl/tsl/framework/bfc_allocator.cc:1107] Sum Total of in-use chunks: 490.36MiB
2024-02-19T18:34:57.5146324Z 2024-02-19 18:33:33.543555: I external/local_tsl/tsl/framework/bfc_allocator.cc:1109] Total bytes in pool: 735707136 memory_limit_: 735707136 available bytes: 0 curr_region_allocation_bytes_: 1471414272
2024-02-19T18:34:57.5146770Z 2024-02-19 18:33:33.543565: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Stats: 
2024-02-19T18:34:57.5146926Z Limit:                       735707136
2024-02-19T18:34:57.5147068Z InUse:                       514176000
2024-02-19T18:34:57.5147210Z MaxInUse:                    670024448
2024-02-19T18:34:57.5147351Z NumAllocs:                      339864
2024-02-19T18:34:57.5147496Z MaxAllocSize:                 87159808
2024-02-19T18:34:57.5147640Z Reserved:                            0
2024-02-19T18:34:57.5147787Z PeakReserved:                        0
2024-02-19T18:34:57.5147937Z LargestFreeBlock:                    0
2024-02-19T18:34:57.5147944Z 
2024-02-19T18:34:57.5148849Z 2024-02-19 18:33:33.543609: W external/local_tsl/tsl/framework/bfc_allocator.cc:497] *****************************************_***_*******_____**_***____***______*********___**********x
2024-02-19T18:34:57.5150187Z 2024-02-19 18:33:33.543631: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at matmul_op_impl.h:921 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[100000,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
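
As a rough sanity check on the SLiRec log above, the size of the failing allocation can be compared with the allocator stats by hand. The sketch below uses only numbers copied from the log and assumes that `type float` means 4-byte float32; the helper name `tensor_bytes` is hypothetical and not part of the repository.

```python
from math import prod

MIB = 1024 * 1024


def tensor_bytes(shape, itemsize=4):
    """Bytes needed for a dense tensor; itemsize=4 assumes float32."""
    return prod(shape) * itemsize


# Values copied from the BFC allocator "Stats" block in the log.
limit = 735_707_136            # Limit
in_use = 514_176_000           # InUse
largest_free_block = 0         # LargestFreeBlock

# Tensor that failed to allocate: shape[100000,160], type float.
requested = tensor_bytes((100_000, 160))
headroom = limit - in_use

print(f"requested: {requested} bytes ({requested / MIB:.1f} MiB)")
print(f"limit:     {limit} bytes ({limit / MIB:.1f} MiB)")
print(f"headroom:  {headroom} bytes ({headroom / MIB:.1f} MiB, limit minus in-use)")
print(f"largest free block reported: {largest_free_block} bytes")
```

Under these assumptions the request is 64,000,000 bytes (about 61 MiB), while the allocator limit is only about 700 MiB. The request is smaller than limit minus in-use, yet the log reports `LargestFreeBlock: 0`, so the numbers alone do not show whether the pool is fragmented or simply too small for this job.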

DKN:

2024-02-19T18:34:57.2599869Z     @pytest.mark.gpu
2024-02-19T18:34:57.2600449Z     @pytest.mark.notebooks
2024-02-19T18:34:57.2601381Z     def test_dkn_quickstart_functional(notebooks, output_notebook, kernel_name):
2024-02-19T18:34:57.2602467Z         notebook_path = notebooks["dkn_quickstart"]
2024-02-19T18:34:57.2603235Z >       execute_notebook(
2024-02-19T18:34:57.2603812Z             notebook_path,
2024-02-19T18:34:57.2604398Z             output_notebook,
2024-02-19T18:34:57.2605021Z             kernel_name=kernel_name,
2024-02-19T18:34:57.2606077Z             parameters=dict(EPOCHS=5, BATCH_SIZE=500),
2024-02-19T18:34:57.2606847Z         )
...
...
2024-02-19T18:34:57.2824248Z E             (0) NOT_FOUND: No algorithm worked!  Error messages:
2024-02-19T18:34:57.2825665Z E             Profiling failure on CUDNN engine eng11***: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16777216 bytes.
2024-02-19T18:34:57.2826837Z E             Profiling failure on CUDNN engine eng0***: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16777216 bytes.
2024-02-19T18:34:57.2827773Z E               [[***node DKN/attention_net/kcnn/conv-maxpool-3_1/relu***]]
2024-02-19T18:34:57.2828293Z E               [[pred/_15]]
2024-02-19T18:34:57.2828736Z E             (1) NOT_FOUND: No algorithm worked!  Error messages:
2024-02-19T18:34:57.2829605Z E             Profiling failure on CUDNN engine eng11***: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16777216 bytes.
2024-02-19T18:34:57.2830719Z E             Profiling failure on CUDNN engine eng0***: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16777216 bytes.
2024-02-19T18:34:57.2831642Z E               [[***node DKN/attention_net/kcnn/conv-maxpool-3_1/relu***]]

In which platform does it happen?

How do we replicate the issue?

See https://github.com/recommenders-team/recommenders/actions/runs/7963399372/job/21738878728

Expected behavior (i.e. solution)

Other Comments

FYI @SimonYansenZhao

The nightly GPU run still fails with the same error: https://github.com/recommenders-team/recommenders/actions/runs/8372395934/job/22923293913

2024-03-21 09:53:25.709498: W external/local_tsl/tsl/framework/bfc_allocator.cc:497] ********____***_____*_____*****************___*____*____***************************************xxxxx
2024-03-21 09:53:25.709518: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at conv_grad_input_ops.cc:345 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[10000,1,10,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
_ test_slirec_quickstart_functional[recommenders/models/deeprec/config/sli_rec.yaml-tests/resources/deeprec/slirec-10-400-expected_values0-42] _

notebooks = ***'als_deep_dive': '/mnt/azureml/cr/j/b321b89c582248be83f964f9e412bd91/exe/wd/examples/02_model_collaborative_filtering...rk_movielens': '/mnt/azureml/cr/j/b321b89c582248be83f964f9e412bd91/exe/wd/examples/06_benchmarks/movielens.ipynb', ...***
output_notebook = 'output.ipynb', kernel_name = 'python3'
yaml_file = 'recommenders/models/deeprec/config/sli_rec.yaml'
data_path = 'tests/resources/deeprec/slirec', epochs = 10, batch_size = 400
expected_values = {'auc': 0.7183}, seed = 42

    @pytest.mark.gpu
    @pytest.mark.notebooks
    @pytest.mark.parametrize(
        "yaml_file, data_path, epochs, batch_size, expected_values, seed",
        [
            (
                "recommenders/models/deeprec/config/sli_rec.yaml",
                os.path.join("tests", "resources", "deeprec", "slirec"),
                10,
                400,
                {
                    "auc": 0.7183
                },  # Don't do logloss check as SLi-Rec uses ranking loss, not a point-wise loss
                42,
            )
        ],
    )
    def test_slirec_quickstart_functional(
        notebooks,
        output_notebook,
        kernel_name,
        yaml_file,
        data_path,
        epochs,
        batch_size,
        expected_values,
        seed,
    ):
        notebook_path = notebooks["slirec_quickstart"]

        params = {
            "yaml_file": yaml_file,
            "data_path": data_path,
            "EPOCHS": epochs,
            "BATCH_SIZE": batch_size,
            "RANDOM_SEED": seed,
        }
>       execute_notebook(
            notebook_path, output_notebook, kernel_name=kernel_name, parameters=params
        )
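
To compare failures across nightly runs, the shape and size of the tensor named in each `RESOURCE_EXHAUSTED` message can be pulled out of the log text. This is a minimal sketch, not part of the test suite: `parse_oom`, `OOM_RE` and `ITEMSIZE` are hypothetical names, and the element sizes are assumptions about how TensorFlow names dtypes in these messages. The sample line is copied from the 2024-03-21 log above with the timestamp prefix dropped.

```python
import re
from math import prod

# Line copied from the 2024-03-21 nightly log (timestamp prefix dropped).
LOG_LINE = (
    "OP_REQUIRES failed at conv_grad_input_ops.cc:345 : RESOURCE_EXHAUSTED: "
    "OOM when allocating tensor with shape[10000,1,10,300] and type float "
    "on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
)

# Assumed element sizes in bytes for dtype names that may appear in the message.
ITEMSIZE = {"float": 4, "double": 8}

OOM_RE = re.compile(
    r"OOM when allocating tensor with shape\[([\d,]+)\] and type (\w+)"
)


def parse_oom(line):
    """Return (shape, dtype, nbytes) for an OOM message, or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    shape = tuple(int(dim) for dim in match.group(1).split(","))
    dtype = match.group(2)
    return shape, dtype, prod(shape) * ITEMSIZE[dtype]


shape, dtype, nbytes = parse_oom(LOG_LINE)
print(shape, dtype, f"{nbytes} bytes ({nbytes / 2**20:.1f} MiB)")
```

For this line the tensor works out to 120,000,000 bytes (about 114 MiB), assuming 4-byte floats.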

recommenders-team / recommenders

[BUG] OOM in nightly SliRec and DKN #2063