reproducibility issues - Githubissues

adebiasio-estilos commented 1 year ago

When executing https://github.com/massquantity/LibRecommender/blob/master/examples/pure_ranking_example.py, SVD and NCF (the ones I tested) give a different result on every run.

What has changed compared to the previous versions?

adebiasio-estilos commented 1 year ago

By modifying the library and setting

  os.environ['PYTHONHASHSEED']=str(self.seed)
  random.seed(self.seed) 
  np.random.seed(self.seed) 
  tf.set_random_seed(self.seed)
  torch.manual_seed(self.seed)

in the build_model() function, SVD behaves a bit better, but the behavior of NCF is strange. I checked the initialization of the embeddings and the positive and negative examples that are fed to the model, and they are OK.

I don't know what is missing. Do you have any ideas?

massquantity commented 1 year ago

Most likely it's the shuffle behavior in the data. You can try adding torch.manual_seed(seed) before all the code.

adebiasio-estilos commented 1 year ago

Hi, first of all thanks for answering and for your great work with the library!

I also added

  os.environ['PYTHONHASHSEED']=str(seed)
  random.seed(seed) 
  np.random.seed(seed) 
  tf.set_random_seed(seed)
  torch.manual_seed(seed)

before all the code.
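The seed calls above can be collected into one helper that is called once at the top of the script. This is only a sketch: `seed_everything` is a hypothetical name, not part of LibRecommender, and it silently skips any framework that is not installed. Two caveats: setting `PYTHONHASHSEED` from inside a running interpreter does not change hashing in the current process, only in child processes, and the TF1-style seed applies to the current default graph, so it may need to be set again wherever the graph is built.

```python
import os
import random


def seed_everything(seed: int) -> list:
    """Seed every available RNG and return the names that were seeded.

    Hypothetical helper for illustration, not part of LibRecommender.
    """
    seeded = []

    # Read by the interpreter at startup, so this only affects child processes.
    os.environ["PYTHONHASHSEED"] = str(seed)

    random.seed(seed)
    seeded.append("random")

    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
        seeded.append("torch")
    except ImportError:
        pass

    try:
        import tensorflow
        # TF1-style seed, set on the current default graph.
        tensorflow.compat.v1.set_random_seed(seed)
        seeded.append("tensorflow")
    except ImportError:
        pass

    return seeded


seeded_names = seed_everything(42)
```

Calling `seed_everything(42)` again before a second run in the same process resets the Python, NumPy and torch generators to the same state.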

SVD looks reproducible now. Apart from some non-deterministic behavior in the probability computation due to hardware (i.e., GPUs), I almost always get reproducible results up to a certain approximation.

However, I'm still struggling with NCF. Do you have any other ideas? I checked the initialization of the embeddings and the positive and negative samples that are fed to the model at each epoch, and they are reproducible. I don't understand what is missing.

One strange thing I noticed with NCF: the first time I run the code I get a certain result, and if I re-run it 3 or 4 times, at some point I get that first result again.

massquantity commented 1 year ago

In my case, I can reproduce the results using NCF. I set tf.set_random_seed(seed) in the build_model() function and torch.manual_seed(seed) before all the code. You can also try setting shuffle=False to see what you get.

adebiasio-estilos commented 1 year ago

I tried with shuffle=False but the results are still strange.

I created a Jupyter notebook for reference that may help. If you execute it, do you still get reproducible results?

Test.zip

massquantity commented 1 year ago

Okay, it seems that adding torch.manual_seed before the code is not effective. However, adding it in the get_batch_loader function works.

def get_batch_loader(model, data, neg_sampling, batch_size, shuffle, num_workers=0):
    torch.manual_seed(42)
    ...
    sampler = RandomSampler(batch_data) if shuffle else SequentialSampler(batch_data)
    batch_sampler = BatchSampler(sampler, batch_size=batch_size, drop_last=False)
    collate_fn = get_collate_fn(model, neg_sampling, num_workers)
    return DataLoader(
        batch_data,
        batch_size=None,  # `batch_size=None` disables automatic batching
        sampler=batch_sampler,
        collate_fn=collate_fn,
        num_workers=num_workers,
    )
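One plausible reading of why seeding at the top of the script was not enough while seeding inside get_batch_loader worked: any code that draws from the global generator between the initial seed and the shuffle moves the generator's state, so the shuffle order then depends on how many draws happened in between. The sketch below illustrates this with Python's `random` module rather than torch; `make_batches` and its `extra_draws` parameter are hypothetical and stand in for whatever runs before the loader is built.

```python
import random


def make_batches(data, batch_size, extra_draws=0, rng=None):
    """Shuffle `data` and split it into batches.

    `extra_draws` simulates unrelated code consuming the global RNG
    before the shuffle. If `rng` is given, the shuffle uses it instead
    of the global generator.
    """
    for _ in range(extra_draws):
        random.random()  # unrelated draws shift the global state

    items = list(data)
    shuffler = rng if rng is not None else random
    shuffler.shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


data = range(10)

# Seeding once up front: the order depends on what ran in between.
random.seed(42)
early_a = make_batches(data, 4, extra_draws=0)
random.seed(42)
early_b = make_batches(data, 4, extra_draws=3)

# A dedicated generator seeded right where the shuffle happens:
# the order no longer depends on the global state.
local_a = make_batches(data, 4, extra_draws=0, rng=random.Random(42))
local_b = make_batches(data, 4, extra_draws=3, rng=random.Random(42))

print(early_a == early_b)
print(local_a == local_b)
```

In torch, `RandomSampler` and `DataLoader` both accept a `generator` argument that plays the role of `rng` here, which avoids re-seeding the global generator inside the function.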

adebiasio-estilos commented 1 year ago

Hmm, I actually tried that but I still get the previous results with NCF.

By the way, I don't know if you made any recent updates to the library; I downloaded last week's version.

Also, which versions of pandas, numpy, tensorflow, torch and the other libraries involved are you using?

massquantity commented 1 year ago

Yep, you should use the latest commit version. I think recent updates may affect this.

numpy 1.23.4, pandas 1.4.3, TensorFlow 2.12.0, torch 2.0.1, scikit-learn 1.1.1, scipy 1.8.1

My results on NCF: Screenshot from 2023-07-29 19-32-34

adebiasio-estilos commented 1 year ago

So, I also ran a test with the latest commit and the library versions you listed in your previous comment, but I still get strange results with NCF.

I don't know if it may depend on CUDA. I listed all my Python packages at the end of the attached notebook. Are there any differences from yours?

Test_2.zip

massquantity commented 1 year ago

What about CPU results? I'm using the CPU since I currently only have access to my laptop. Based on your list, I don't think the packages are the issue now.

adebiasio-estilos commented 1 year ago

You are right, with only CPUs (i.e., setting os.environ["CUDA_VISIBLE_DEVICES"] = "-1") it works.

But what may happen when using GPUs then?
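A common explanation, not confirmed for this particular model, is that floating-point addition is not associative: GPU kernels that accumulate values in parallel may add the same numbers in a different order from run to run, and the rounding depends on that order. The pure-Python sketch below only illustrates the rounding effect; it runs nothing on a GPU, and the values are random numbers generated for the illustration.

```python
import random

# Regrouping the same three numbers already changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Values with very different magnitudes make the effect easy to see.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]


def ordered_sum(xs, order_seed):
    """Sum `xs` sequentially after shuffling them with `order_seed`."""
    shuffled = list(xs)
    random.Random(order_seed).shuffle(shuffled)
    total = 0.0
    for x in shuffled:
        total += x
    return total


sums = {ordered_sum(values, s) for s in range(20)}
print(len(sums))              # number of distinct totals across 20 orderings
print(max(sums) - min(sums))  # spread caused purely by summation order
```

Each individual difference is tiny, but during training such differences feed into every later gradient step, so the final metrics of two runs can drift apart visibly.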

massquantity commented 1 year ago

I'm also uncertain about it. Try this, https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy

adebiasio-estilos commented 1 year ago

So, I actually tried setting:

random.seed(42)
np.random.seed(42)

#tf.random.set_seed(42) # 'tensorflow.compat.v1.random' has no attribute 'set_seed'
#tf.experimental.numpy.random.seed(42) # 'tensorflow.compat.v1.experimental' has no attribute 'numpy'
tf.set_random_seed(42) # 'tensorflow' has no attribute 'set_random_seed'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1' 
#os.environ['TF_DETERMINISTIC_OPS'] = '1' # Determinism is not yet supported in GPU implementation of Scatter ops with ref inputs.

torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

os.environ['PYTHONHASHSEED']=str(42)

before executing all the code, in the build_model() function, and in the get_batch_loader() function, but I'm still getting non-reproducible results with NCF when it is trained on GPUs.

Moreover, consider that since we are using tensorflow.compat.v1, some features are not supported:

tf.random.set_seed(42) -> 'tensorflow.compat.v1.random' has no attribute 'set_seed'
tf.experimental.numpy.random.seed(42) -> 'tensorflow.compat.v1.experimental' has no attribute 'numpy'

And there is a strange error when enabling determinism:

os.environ['TF_DETERMINISTIC_OPS'] = '1' -> Determinism is not yet supported in GPU implementation of Scatter ops with ref inputs. Consider using resource variables instead if you want to run Scatter when op determinism is enabled. [[{{node Adam/update_embedding/bu_var/ScatterAdd}}]]

massquantity commented 1 year ago

Although we mainly use tf1, the installed package is tf2, which may cause some issues. I think you can set both the tf1 and tf2 seeds:

import tensorflow as tf2

# tf2.set_random_seed(42) is not available on the top-level TF2 module; the TF1 seed is set through compat.v1 below
tf2.random.set_seed(42)
tf2.experimental.numpy.random.seed(42)
...

import tensorflow
tensorflow.compat.v1.set_random_seed(42)
...

adebiasio-estilos commented 1 year ago

So, I actually put:

import random
import numpy as np
import torch
import os
import tensorflow as tf2
from libreco.tfops import tf

random.seed(42)
np.random.seed(42)

torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

tf2.random.set_seed(42)
tf2.experimental.numpy.random.seed(42)
tf.set_random_seed(42)
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
#os.environ['TF_DETERMINISTIC_OPS'] = '1' # this is still commented due to previous bug

os.environ['PYTHONHASHSEED']=str(42)

before executing all the code, in the build_model() function and in the get_batch_loader() function.

I'm still getting non-reproducible results with NCF when trained with GPUs.

What I haven't mentioned earlier is that when training SVD on GPUs I do obtain reproducible results.

What's the difference between the models? Aren't both based on TensorFlow?

massquantity commented 1 year ago

Yes, they are both implemented in tf. The main difference is that NCF uses additional dense layers.

I've asked ChatGPT, and here is what I got:

massquantity commented 1 year ago

In TensorFlow 1.x, setting the random seed for GPU operations requires additional steps compared to setting the random seed for CPU operations. This is because GPU operations involve additional sources of randomness that are not directly controlled by the TensorFlow random seed.

To ensure deterministic behavior with GPU operations in TensorFlow 1.x, you need to follow these steps:

1. Set the global random seed: Set the random seed using tf.set_random_seed(seed). This will seed the random number generator for CPU operations.
2. Configure GPU behavior: To make GPU operations deterministic, you need to set the environment variable CUDA_VISIBLE_DEVICES to restrict TensorFlow to use only one visible GPU. This step is necessary because multiple GPUs might introduce non-determinism due to their asynchronous nature.

Additionally, you can set the environment variable TF_CUDNN_USE_AUTOTUNE to 0 to disable cuDNN's auto-tuner, which can introduce non-determinism in cuDNN-based operations.

Here's how you can set the random seed for GPU operations in TensorFlow 1.x:

import tensorflow.compat.v1 as tf

# Set the random seed for TensorFlow CPU operations
seed = 42
tf.set_random_seed(seed)

# Configure GPU behavior to make it deterministic
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set this to the GPU you want to use (e.g., GPU 0)

# Optionally, disable cuDNN's auto-tuner
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

Please note that the steps mentioned above are specifically for TensorFlow 1.x. In TensorFlow 2.x, the process for setting random seeds and ensuring determinism with GPU operations has been simplified. In TensorFlow 2.x, you can typically achieve determinism by setting the random seed without additional steps for GPU operations. However, the specific details may vary depending on the version and the GPU backend being used.

adebiasio-estilos commented 1 year ago

Ok.. So I also added:

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

as per ChatGPT's suggestion.

But I'm still struggling with NCF, unfortunately :(

massquantity commented 1 year ago

Yeah, looks like there is nothing we can do about this problem :)

adebiasio-estilos commented 1 year ago

We can only cry XD. Anyway, apart from this curious issue, the library is very good. I'm going to use it for an incremental training project.

adebiasio-estilos commented 1 year ago

I found that with batch_size=256 I get non-reproducible results, but when reducing batch_size to 128 I get (almost) reproducible ones.
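To put a number on "(almost) reproducible", the metrics of two runs can be compared within a tolerance instead of by exact equality. A minimal sketch, assuming each run yields a dict mapping metric names to floats; `diff_runs`, the metric names and the values below are made up for illustration and are not results from this thread.

```python
import math


def diff_runs(run_a, run_b, rel_tol=1e-4):
    """Return the metrics whose values differ by more than `rel_tol`.

    Hypothetical helper: a metric missing from one run counts as a mismatch.
    """
    mismatched = {}
    for name in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(name), run_b.get(name)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            mismatched[name] = (a, b)
    return mismatched


# Made-up numbers, only to show the call.
run_1 = {"loss": 0.45210, "roc_auc": 0.81234}
run_2 = {"loss": 0.45211, "roc_auc": 0.81900}

mismatches = diff_runs(run_1, run_2)
print(mismatches)
```

With the default tolerance the two `loss` values count as equal, while `roc_auc` is reported as a mismatch. Running the same comparison for each batch size would show whether 128 is truly reproducible or just closer.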

LOL

massquantity / LibRecommender

reproducibility issues #357