recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[FEATURE] Use rand-seed for reproducibility #736

Closed: loomlike closed this issue 5 years ago

loomlike commented 5 years ago

Description

Use a seed for NN-based models, both in notebooks and in tests.

Expected behavior with the suggested feature

- Produce the same results from notebooks
- Assert exact values in tests
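A minimal sketch of the idea (set_seed and run_model below are hypothetical stand-ins, not repo code): with every RNG pinned, two runs give identical numbers, so tests can assert exact values instead of tolerances.

```python
import random

import numpy as np


def set_seed(seed):
    # Hypothetical helper: pin every RNG the model would touch.
    random.seed(seed)
    np.random.seed(seed)


def run_model():
    # Stand-in for NN training: any computation driven by the seeded RNGs.
    return float(np.random.rand(100).mean())


set_seed(42)
first = run_model()
set_seed(42)
second = run_model()

# With the seeds pinned, a test can assert exact equality, no tolerance needed.
assert first == second
```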

miguelgfierro commented 5 years ago

We are using a seed in the algos: https://github.com/Microsoft/Recommenders/blob/master/reco_utils/recommender/rbm/rbm.py#L74. I'm not sure if any are missing.

When we tried to address reproducibility, @anargyri, @msalvaris and I had some fun with it. In DL it is very difficult: the optimization is stochastic, and some asynchronous processes on the GPU make complete reproducibility hard to achieve.

loomlike commented 5 years ago

Hmmm, it seems there are many threads out there discussing non-deterministic results from tf. In fastai, you can set several seeds to make the results reproducible.

Have you tried disabling multi-threading with all the seeds set, including tf.random.set_random_seed? Still no luck?
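For reference, a minimal sketch of that setup for TF 1.x (the API this thread refers to; the seed value is arbitrary): seed Python, NumPy and TF, then force single-threaded op execution.

```python
import os
import random

import numpy as np
import tensorflow as tf  # TF 1.x

SEED = 42  # arbitrary

# Pin every source of randomness we control.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_random_seed(SEED)

# Single-threaded op execution removes one source of scheduling
# non-determinism (at the cost of speed).
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
)
sess = tf.Session(config=config)
```

Even with all of this, GPU kernels (e.g. some cuDNN ops) can remain non-deterministic, which matches the difficulty described above.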

miguelgfierro commented 5 years ago

Yeah, we have that function in all the DL algos: https://github.com/Microsoft/Recommenders/search?q=tf.random.set_random_seed&unscoped_q=tf.random.set_random_seed

miguelgfierro commented 5 years ago

See this: just yesterday we got an error in the nightly builds for xdeepfm: https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/results?buildId=2873512

loomlike commented 5 years ago

Yeah... we should handle this randomness, but I haven't found time to work on this issue. If it's urgent, can anybody take this item and fix it? If not, we can temporarily loosen the tolerance values for the NN algos and I will take care of it in the next few weeks.

miguelgfierro commented 5 years ago

Adding more folks to discuss this: @anargyri @yueguoguo @gramhagen. I agree with Jun Ki, I think this is a very annoying issue. Our GPU nightly builds often fail because of it.

miguelgfierro commented 5 years ago

One example of a nightly build that failed:

tests/smoke/test_deeprec_model.py ..                                     [ 25%]
tests/smoke/test_notebooks_gpu.py ...F..                                 [100%]

=================================== FAILURES ===================================
____________________________ test_notebook_xdeepfm _____________________________

notebooks = {'als_deep_dive': '/data/home/recocat/cicd/17/s/notebooks/02_model/als_deep_dive.ipynb', 'als_pyspark': '/data/home/re...aseline_deep_dive.ipynb', 'data_split': '/data/home/recocat/cicd/17/s/notebooks/01_prepare_data/data_split.ipynb', ...}

    @pytest.mark.smoke
    @pytest.mark.gpu
    @pytest.mark.deeprec
    def test_notebook_xdeepfm(notebooks):
        notebook_path = notebooks["xdeepfm_quickstart"]
        pm.execute_notebook(
            notebook_path,
            OUTPUT_NOTEBOOK,
            kernel_name=KERNEL_NAME,
            parameters=dict(
                EPOCHS_FOR_SYNTHETIC_RUN=20,
                EPOCHS_FOR_CRITEO_RUN=1,
                BATCH_SIZE_SYNTHETIC=128,
                BATCH_SIZE_CRITEO=2048,
            ),
        )
        results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]

        assert results["res_syn"]["auc"] == pytest.approx(0.982, rel=TOL, abs=ABS_TOL)
>       assert results["res_syn"]["logloss"] == pytest.approx(0.2306, rel=TOL, abs=ABS_TOL)
E       assert 0.103 == 0.2306 ± 1.2e-01
E        +  where 0.2306 ± 1.2e-01 = <function approx at 0x7f8fa788b840>(0.2306, rel=0.5, abs=0.05)
E        +    where <function approx at 0x7f8fa788b840> = pytest.approx

I think in this case it is safe to widen the tolerances, because the smoke tests run a small number of iterations, so it is normal for the metrics to vary a lot. In the integration tests we can probably be stricter, because there we run more iterations and the model should converge to certain metric values.
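As a sketch of that split (the tolerance values below are illustrative, not the repo's actual TOL/ABS_TOL settings): smoke tests keep a loose pytest.approx band, integration tests a tight one.

```python
import pytest

# Illustrative tolerances only; the repo's real TOL/ABS_TOL may differ.
SMOKE_REL, SMOKE_ABS = 0.5, 0.15    # loose: few training iterations
INTEG_REL, INTEG_ABS = 0.05, 0.01   # strict: model close to convergence


def assert_smoke(logloss):
    # Smoke test: only sanity-check that the metric is in the right ballpark.
    assert logloss == pytest.approx(0.2306, rel=SMOKE_REL, abs=SMOKE_ABS)


def assert_integration(logloss):
    # Integration test: more epochs, so the assertion can be much tighter.
    assert logloss == pytest.approx(0.2306, rel=INTEG_REL, abs=INTEG_ABS)
```

Note that pytest.approx with both rel and abs passes if the value is within either tolerance, so the smoke band above would have accepted the 0.103 logloss from the failed build only if the bounds were widened further.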