recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License
18.89k stars 3.07k forks source link

DeepRec smoke tests gives a OOM #478

Closed miguelgfierro closed 5 years ago

miguelgfierro commented 5 years ago

What is affected by this bug?

Smoke tests.

2019-01-31T18:59:24.6946026Z tests/smoke/test_deeprec_model.py ..FF                                   [ 57%]
2019-01-31T19:00:09.5820104Z tests/smoke/test_notebooks_gpu.py ..F                                    [100%]
2019-01-31T19:00:09.5820400Z 
2019-01-31T19:00:09.5823428Z =================================== FAILURES ===================================
2019-01-31T19:00:09.5824122Z ____________________________ test_notebook_xdeepfm _____________________________
2019-01-31T19:00:09.5824306Z 
2019-01-31T19:00:09.5825291Z notebooks = {'als_deep_dive': '/data/home/recocat/cicd/18/s/notebooks/02_model/als_deep_dive.ipynb', 'als_pyspark': '/data/home/re...aseline_deep_dive.ipynb', 'data_split': '/data/home/recocat/cicd/18/s/notebooks/01_prepare_data/data_split.ipynb', ...}
2019-01-31T19:00:09.5825418Z 
2019-01-31T19:00:09.5825517Z     @pytest.mark.smoke
2019-01-31T19:00:09.5825559Z     @pytest.mark.gpu
2019-01-31T19:00:09.5825598Z     @pytest.mark.deeprec
2019-01-31T19:00:09.5825994Z     def test_notebook_xdeepfm(notebooks):
2019-01-31T19:00:09.5826074Z         notebook_path = notebooks["xdeepfm_quickstart"]
2019-01-31T19:00:09.5828021Z         pm.execute_notebook(
2019-01-31T19:00:09.5828205Z             notebook_path,
2019-01-31T19:00:09.5828310Z             OUTPUT_NOTEBOOK,
2019-01-31T19:00:09.5828356Z             kernel_name=KERNEL_NAME,
2019-01-31T19:00:09.5828405Z >           parameters=dict(epochs_for_synthetic_run=20, epochs_for_criteo_run=1),
2019-01-31T19:00:09.5828507Z         )
2019-01-31T19:00:09.5828539Z 
2019-01-31T19:00:09.5828581Z tests/smoke/test_deeprec_model.py:72: 
2019-01-31T19:00:09.5828670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2019-01-31T19:00:09.5829162Z /anaconda/envs/nightly_gpu/lib/python3.6/site-packages/papermill/execute.py:78: in execute_notebook
2019-01-31T19:00:09.5829243Z     raise_for_execution_errors(nb, output_path)
2019-01-31T19:00:09.5829344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2019-01-31T19:00:09.5829380Z 
2019-01-31T19:00:09.5829784Z nb = {'cells': [{'cell_type': 'code', 'metadata': {'inputHidden': True, 'hide_input': True}, 'execution_count': None, 'sour...d_time': '2019-01-31T18:55:44.346403', 'duration': 139.097582, 'exception': True}}, 'nbformat': 4, 'nbformat_minor': 2}
2019-01-31T19:00:09.5830271Z output_path = 'output.ipynb'
2019-01-31T19:00:09.5830308Z 
2019-01-31T19:00:09.5830392Z     def raise_for_execution_errors(nb, output_path):
2019-01-31T19:00:09.5830438Z         error = None
2019-01-31T19:00:09.5830638Z         for cell in nb.cells:
2019-01-31T19:00:09.5832196Z             if cell.get("outputs") is None:
2019-01-31T19:00:09.5832399Z                 continue
2019-01-31T19:00:09.5832445Z     
2019-01-31T19:00:09.5832809Z             for output in cell.outputs:
2019-01-31T19:00:09.5833664Z                 if output.output_type == "error":
2019-01-31T19:00:09.5833911Z                     error = PapermillExecutionError(
2019-01-31T19:00:09.5835901Z                         exec_count=cell.execution_count,
2019-01-31T19:00:09.5835980Z                         source=cell.source,
2019-01-31T19:00:09.5836098Z                         ename=output.ename,
2019-01-31T19:00:09.5836143Z                         evalue=output.evalue,
2019-01-31T19:00:09.5836189Z                         traceback=output.traceback,
2019-01-31T19:00:09.5836276Z                     )
2019-01-31T19:00:09.5836318Z                     break
2019-01-31T19:00:09.5836357Z     
2019-01-31T19:00:09.5836435Z         if error:
2019-01-31T19:00:09.5836483Z             # Write notebook back out with the Error Message at the top of the Notebook.
2019-01-31T19:00:09.5842777Z             error_msg = ERROR_MESSAGE_TEMPLATE % str(error.exec_count)
2019-01-31T19:00:09.5843714Z             error_msg_cell = nbformat.v4.new_code_cell(
2019-01-31T19:00:09.5843767Z                 source="%%html\n" + error_msg,
2019-01-31T19:00:09.5843859Z                 outputs=[
2019-01-31T19:00:09.5843909Z                     nbformat.v4.new_output(output_type="display_data", data={"text/html": error_msg})
2019-01-31T19:00:09.5843958Z                 ],
2019-01-31T19:00:09.5844056Z                 metadata={"inputHidden": True, "hide_input": True},
2019-01-31T19:00:09.5844103Z             )
2019-01-31T19:00:09.5844146Z             nb.cells = [error_msg_cell] + nb.cells
2019-01-31T19:00:09.5844232Z             write_ipynb(nb, output_path)
2019-01-31T19:00:09.5844276Z >           raise error
2019-01-31T19:00:09.5844320Z E           papermill.exceptions.PapermillExecutionError: 
2019-01-31T19:00:09.5844751Z E           ---------------------------------------------------------------------------
2019-01-31T19:00:09.5844808Z E           Exception encountered at "In [12]":
2019-01-31T19:00:09.5845084Z E           ---------------------------------------------------------------------------
2019-01-31T19:00:09.5845141Z E           ResourceExhaustedError                    Traceback (most recent call last)
2019-01-31T19:00:09.5845399Z E           /anaconda/envs/nightly_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
2019-01-31T19:00:09.5845502Z E              1333     try:
2019-01-31T19:00:09.5845686Z E           -> 1334       return fn(*args)
2019-01-31T19:00:09.5845736Z E              1335     except errors.OpError as e:
2019-01-31T19:00:09.5845824Z E           
2019-01-31T19:00:09.5846113Z E           /anaconda/envs/nightly_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
2019-01-31T19:00:09.5846215Z E              1318       return self._call_tf_sessionrun(
2019-01-31T19:00:09.5846433Z E           -> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
2019-01-31T19:00:09.5846492Z E              1320 
2019-01-31T19:00:09.5846573Z E           
2019-01-31T19:00:09.5846872Z E           /anaconda/envs/nightly_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
2019-01-31T19:00:09.5846939Z E              1406         self._session, options, feed_dict, fetch_list, target_list,
2019-01-31T19:00:09.5847322Z E           -> 1407         run_metadata)
2019-01-31T19:00:09.5847371Z E              1408 
2019-01-31T19:00:09.5847413Z E           
2019-01-31T19:00:09.5847545Z E           ResourceExhaustedError: OOM when allocating tensor of shape [2300000,10] and type float
2019-01-31T19:00:09.5847621Z E               [[{{node XDeepFM/embedding/embedding_layer/Adam/Initializer/zeros}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2300000,10] values: [0 0 0...]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
2019-01-31T19:00:09.5847730Z E           

More info: https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/results?buildId=2265907

In which platform does it happen?

miguelgfierro commented 5 years ago

I got a weird error:

    @pytest.mark.smoke
    @pytest.mark.gpu
    @pytest.mark.deeprec
    def test_notebook_dkn(notebooks):
        notebook_path = notebooks["dkn_quickstart"]
        pm.execute_notebook(
            notebook_path,
            OUTPUT_NOTEBOOK,
            kernel_name=KERNEL_NAME,
            parameters=dict(epoch=1),
        )
        results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]

        assert results["res"]["auc"] == pytest.approx(0.4707, TOL2)
        assert results["res"]["acc"] == pytest.approx(0.5725, TOL2)
>       assert results["res"]["f1"] == pytest.approx(0.7281, TOL2)
E       assert 0.0 == 0.7281 ± 3.6e-01
E        +  where 0.7281 ± 3.6e-01 = <function approx at 0x7f10034c87b8>(0.7281, 0.5)
E        +    where <function approx at 0x7f10034c87b8> = pytest.approx

tests/smoke/test_notebooks_gpu.py:126: AssertionError
anargyri commented 5 years ago

This does not look good. You should not get zero F1 score.
Could you double check the initialization and the hyperparameters?

miguelgfierro commented 5 years ago

It looks that running: pytest tests/smoke -m "smoke and not spark and gpu" give an OOM, however, running pytest tests/smoke/test_deeprec_model.py and pytest tests/smoke/test_notebooks_gpu.py independently doesn't

Maybe there is an open process from test_deeprec_model.py ... weirddddd

miguelgfierro commented 5 years ago

this might be the solution https://stackoverflow.com/a/37454574