google-research / recsim

A Configurable Recommender Systems Simulation Platform
https://github.com/google-research/recsim
Apache License 2.0

Returning same slate after every iteration #21

Open saxena-priyansh opened 3 years ago

saxena-priyansh commented 3 years ago

Thanks for the great work @cwhsu-google. Our team is trying to use RecSim for slate recommendation.

After training the agent (slate_decomp_q_agent) for 300k steps, I tried loading different checkpoints and generating slates for the same user (to understand the convergence of the Q-values), but the slates returned after every iteration are the same.

Here is the script I used for prediction:

inference.py

import time

import numpy as np

import prediction  # local module, listed below as prediction.py
from recsim.agents import slate_decomp_q_agent
from recsim.environments import interest_evolution


def create_decomp_q_agent(sess, environment, eval_mode, summary_writer=None):
    """Creates one variant of the agent featured in the SlateQ paper."""
    kwargs = {
        'observation_space': environment.observation_space,
        'action_space': environment.action_space,
        'summary_writer': summary_writer,
        'eval_mode': eval_mode,
    }
    return slate_decomp_q_agent.create_agent(
        agent_name='slate_topk_sarsa', sess=sess, **kwargs)

seed = 0  
slate_size = 3  
np.random.seed(seed)  
env_config = {  
  'num_candidates': 30,  
  'slate_size': slate_size,  
  'resample_documents': True,  
  'seed': seed,  
}  

tmp_decomp_q_dir = '../results12/'  

user_vec = [-0.00598616, 0.1760635, -0.0913329, 0.59239239, -0.90903912,  
  -0.17019989, 0.00312255, -0.32639151, -0.5325127, -0.47683574,  
  -0.86847277, 0.32046379, -0.56788602, -0.69480169, 0.071154,  
  0.33922171, 0.04820297, 0.97037383, 0.04213649, -0.16748408]  

user_obs = np.array(user_vec)  
print('Shape of user observation:', user_obs.shape)  
runner = prediction.PredRunner(
    base_dir=tmp_decomp_q_dir,
    create_agent_fn=create_decomp_q_agent,
    env=interest_evolution.create_environment(env_config))
print('Going to predict...')  
start_time = time.time()  
print(runner.predict(user_obs_features=user_obs))  
print('Prediction Time taken', time.time()-start_time, 'seconds')

prediction.py

import os
import time
from dopamine.discrete_domains import checkpointer
from recsim.simulator.runner_lib import Runner
import tensorflow.compat.v1 as tf

class PredRunner(Runner):
    def __init__(self,
                 train_base_dir=None,
                 **kwargs):
        st = time.time()
        super(PredRunner, self).__init__(**kwargs)
        self._output_dir = os.path.join(self._base_dir, 'pred')
        tf.io.gfile.makedirs(self._output_dir)
        if train_base_dir is None:
            train_base_dir = self._base_dir
        self._checkpoint_dir = os.path.join(train_base_dir, 'train', 'checkpoints')
        self._set_up(eval_mode=True)
        # Use the checkpointer class.
        self._checkpointer = checkpointer.Checkpointer(
            self._checkpoint_dir, self._checkpoint_file_prefix)
        checkpoint_version = -1
        latest_checkpoint_version = checkpointer.get_latest_checkpoint_number(
            self._checkpoint_dir)
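        # Debug override: hard-code the checkpoint number to load
        # (this replaces the value looked up just above).
        latest_checkpoint_version = 100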
        print('Checkpoint that would be read:', latest_checkpoint_version)
        # checkpoints_iterator already makes sure a new checkpoint exists.
        if latest_checkpoint_version <= checkpoint_version:
            time.sleep(self._min_interval_secs)
        experiment_data = self._checkpointer.load_checkpoint(
            latest_checkpoint_version)
        assert self._agent.unbundle(self._checkpoint_dir,
                                    latest_checkpoint_version, experiment_data)
        # Saving weights to file for debugging
        tvars = tf.trainable_variables()
        tvars_vals = self._sess.run(tvars)
        var_list = []
        tensor_list = []
        for var, val in zip(tvars, tvars_vals):
            var_list.append(var.name)
            tensor_list.append(val)
        import pandas as pd
        df = pd.DataFrame({'var': var_list, 'tensor': tensor_list})
        df.to_pickle('youtube-test-weights{}.pickle'.format(latest_checkpoint_version))
        print('Model loading time taken: {}'.format(time.time() - st))

    def predict(self, user_obs_features):
        st = time.time()
        self._env.reset_sampler()
        self._initialize_metrics()
        observation = self._env.reset()
        observation['user'] = user_obs_features
        start = time.time()
        action = self._agent.begin_episode(observation)
        print('Step time taken: {}'.format(time.time() - start))
        slate = [0] * len(action)
        doc_keys = list(observation['doc'].keys())
        for i in range(len(action)):
            slate[i] = doc_keys[action[i]]
        print('Time taken: {} ms'.format(1000*(time.time()-st)))
        return slate

These graphs were generated on tensorboard:

[Nine TensorBoard screenshots, taken 2020-12-08 between 5:40 PM and 5:43 PM; images not preserved]

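Most importantly, I am looking for answers to the following:

  1. Why are the Q-values over different epochs turning out to be the same?

  2. This in turn returns the same slates for all the checkpoints.

  3. This raises the question of whether the model is training or not.

  4. We also see that the watch time for each video is 4 min. Since Q-values reflect the cumulative reward over a (state, action) pair, how come their scale is 10exp-2?

Any help would be appreciated.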

vihanjain commented 3 years ago

Hi, thanks for using RecSim. Here are some pointers for further debugging:

  1. "Why q values over different epochs are turning out to be same?" It is not clear which Q-values you mean. Are you printing Q-values manually, or does this refer to one of the TensorBoard charts attached to the question?

  2. "Which in turn is returning same slates for all the checkpoints." This is concerning indeed. How many examples/users are you evaluating per checkpoint? Can you double-check (e.g., by logging) that the inference code is actually using different checkpoints? Two things to try: (1) evaluate more users (say 100) per checkpoint and compare the slates (you can sample users from a normal distribution); (2) evaluate checkpoints spaced further apart (e.g., 50k, 100k, 150k, 200k, etc.).

  3. "This raises question, whether model is training or not." I do see that the average episodic reward improves with the number of steps, so my guess is that it is training. Have you tried hyperparameter tuning, or are you using the network parameters from the default implementation?

  4. "Also we see watch time for each video is 4 min, since q values reflect cumulative reward over state, action pair, how come their scale is 10exp-2?" Again, I am not sure which Q-values this refers to. The average episode length is ~60 and the average episodic reward is ~160, which seems to be in the right range, since some actions also yield small negative rewards.
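Suggestion (1) in point 2 could be sketched roughly as follows. This is only an illustration, not RecSim code: `sample_users`, `slate_agreement`, and `make_topk_predictor` are hypothetical helpers, the linear top-k scorers stand in for agents restored from two different checkpoints, and clipping the sampled users to [-1, 1] is an assumption based on the range of the user vector in the question.

```python
import numpy as np


def sample_users(num_users, num_features, seed=0):
    """Samples user vectors from a standard normal distribution."""
    rng = np.random.default_rng(seed)
    users = rng.normal(size=(num_users, num_features))
    # Assumption: user features live in [-1, 1], like the vector in the question.
    return np.clip(users, -1.0, 1.0)


def slate_agreement(predict_a, predict_b, users):
    """Returns the fraction of users for which both predictors give the same slate."""
    same = sum(list(predict_a(u)) == list(predict_b(u)) for u in users)
    return same / len(users)


def make_topk_predictor(doc_weights, slate_size):
    """Hypothetical stand-in for a restored agent: linear scores, top-k slate."""
    def predict(user):
        scores = doc_weights @ user
        return np.argsort(-scores)[:slate_size]
    return predict


users = sample_users(num_users=100, num_features=20)

# Two random weight matrices play the role of two different checkpoints.
rng = np.random.default_rng(1)
predictor_a = make_topk_predictor(rng.normal(size=(30, 20)), slate_size=3)
predictor_b = make_topk_predictor(rng.normal(size=(30, 20)), slate_size=3)

agreement_same = slate_agreement(predictor_a, predictor_a, users)
agreement_diff = slate_agreement(predictor_a, predictor_b, users)
print('same checkpoint:', agreement_same)
print('different checkpoints:', agreement_diff)
```

In the real setup, each predictor would wrap the `predict` method of a runner restored from a different checkpoint. An agreement of 1.0 across many sampled users would suggest that both runners end up with the same effective weights.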

Hope this helps!

saxena-priyansh commented 3 years ago

Thanks @vihanjain for your response. To elaborate on this:

What we have tried so far to debug:

LOGS...

Checkpoint that would be read: 70
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-70
Model loading time taken: 20.099586963653564
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.3186337947845459
Time taken: 320.206880569458 ms
['51', '31', '52']
Prediction Time taken 0.3202650547027588 seconds

Checkpoint that would be read: 80
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-80
Model loading time taken: 20.884052991867065
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.29644083976745605
Time taken: 297.63102531433105 ms
['51', '31', '52']
Prediction Time taken 0.2976522445678711 seconds

Checkpoint that would be read: 90
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-90
Model loading time taken: 19.210211753845215
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.3162529468536377
Time taken: 317.17681884765625 ms
['51', '31', '52']
Prediction Time taken 0.3171958923339844 seconds

Checkpoint that would be read: 100
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-100
Model loading time taken: 22.500069856643677
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.29529881477355957
Time taken: 296.33116722106934 ms
['51', '31', '52']
Prediction Time taken 0.29636096954345703 seconds

The screenshot below shows the weights at different iterations: the networks/network weights are changing, but the networks/network_1 and networks/network_2 weights stay the same.

[Screenshot, 2020-12-15 11:44 PM: network weights at different iterations; image not preserved]
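A minimal sketch of how the same comparison could be done numerically instead of by eye, assuming each checkpoint's weights are available as a dict from variable name to array. `diff_variables` is a hypothetical helper, and the variable names and values in the example are made up for illustration.

```python
import numpy as np


def diff_variables(weights_a, weights_b, atol=0.0):
    """Maps each shared variable name to True if its values differ."""
    changed = {}
    for name in sorted(set(weights_a) & set(weights_b)):
        a = np.asarray(weights_a[name])
        b = np.asarray(weights_b[name])
        changed[name] = (a.shape != b.shape
                         or not np.allclose(a, b, rtol=0.0, atol=atol))
    return changed


# Toy snapshots; names and values are illustrative only.
ckpt_70 = {
    'networks/network/w:0': np.array([1.0, 2.0]),
    'networks/network_1/w:0': np.array([0.5, 0.5]),
}
ckpt_80 = {
    'networks/network/w:0': np.array([1.5, 2.0]),
    'networks/network_1/w:0': np.array([0.5, 0.5]),
}

changed = diff_variables(ckpt_70, ckpt_80)
for name, is_changed in changed.items():
    print(name, 'changed' if is_changed else 'unchanged')
```

With the pickles written by prediction.py, the dicts could be built with `df = pd.read_pickle(path)` followed by `dict(zip(df['var'], df['tensor']))`.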
saxena-priyansh commented 3 years ago

@vihanjain any help would be appreciated.

getsanjeevdubey commented 3 years ago

@vihanjain Please let us know your thoughts on this, and whether any more details are required.