Returning same slate after every iteration

saxena-priyansh commented 3 years ago

Thanks for the great work @cwhsu-google. Our team is trying to use RecSim for slate recommendation.

After training the agent (slate_decomp_q_agent) for 300k steps, I tried loading different checkpoints to generate slates for the same user (to understand the convergence of the q-values), but the slates returned at every iteration are the same.

Here is my script that I used for prediction:

inference.py

import time

import numpy as np
from recsim.agents import slate_decomp_q_agent
from recsim.environments import interest_evolution

import prediction  # local module, listed below as prediction.py


def create_decomp_q_agent(sess, environment, eval_mode, summary_writer=None):
    """This is one variant of the agent featured in the SlateQ paper."""
    kwargs = {
        'observation_space': environment.observation_space,
        'action_space': environment.action_space,
        'summary_writer': summary_writer,
        'eval_mode': eval_mode,
    }
    return slate_decomp_q_agent.create_agent(agent_name='slate_topk_sarsa', sess=sess, **kwargs)

seed = 0  
slate_size = 3  
np.random.seed(seed)  
env_config = {  
  'num_candidates': 30,  
  'slate_size': slate_size,  
  'resample_documents': True,  
  'seed': seed,  
}  

tmp_decomp_q_dir = '../results12/'  

user_vec = [-0.00598616, 0.1760635, -0.0913329, 0.59239239, -0.90903912,  
  -0.17019989, 0.00312255, -0.32639151, -0.5325127, -0.47683574,  
  -0.86847277, 0.32046379, -0.56788602, -0.69480169, 0.071154,  
  0.33922171, 0.04820297, 0.97037383, 0.04213649, -0.16748408]  

user_obs = np.array(user_vec)  
print('Shape of user observation:', user_obs.shape)  
runner = prediction.PredRunner(
    base_dir=tmp_decomp_q_dir,
    create_agent_fn=create_decomp_q_agent,
    env=interest_evolution.create_environment(env_config))
print('Going to predict...')  
start_time = time.time()  
print(runner.predict(user_obs_features=user_obs))  
print('Prediction Time taken', time.time()-start_time, 'seconds')

prediction.py

import os
import time
from dopamine.discrete_domains import checkpointer
from recsim.simulator.runner_lib import Runner
import tensorflow.compat.v1 as tf

class PredRunner(Runner):
    def __init__(self,
                 train_base_dir=None,
                 **kwargs):
        st = time.time()
        super(PredRunner, self).__init__(**kwargs)
        self._output_dir = os.path.join(self._base_dir, 'pred')
        tf.io.gfile.makedirs(self._output_dir)
        if train_base_dir is None:
            train_base_dir = self._base_dir
        self._checkpoint_dir = os.path.join(train_base_dir, 'train', 'checkpoints')
        self._set_up(eval_mode=True)
        # Use the checkpointer class.
        self._checkpointer = checkpointer.Checkpointer(
            self._checkpoint_dir, self._checkpoint_file_prefix)
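        checkpoint_version = -1
        latest_checkpoint_version = checkpointer.get_latest_checkpoint_number(
            self._checkpoint_dir)
        # Debugging override: hard-code the checkpoint to load. This value is
        # changed by hand between runs (e.g. 70, 80, 90, 100).
        latest_checkpoint_version = 100
        print('Checkpoint that would be read:', latest_checkpoint_version)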
        # checkpoints_iterator already makes sure a new checkpoint exists.
        if latest_checkpoint_version <= checkpoint_version:
            time.sleep(self._min_interval_secs)
        experiment_data = self._checkpointer.load_checkpoint(
            latest_checkpoint_version)
        assert self._agent.unbundle(self._checkpoint_dir,
                                    latest_checkpoint_version, experiment_data)
        # Saving weights to file for debugging
        tvars = tf.trainable_variables()
        tvars_vals = self._sess.run(tvars)
        var_list = []
        tensor_list = []
        for var, val in zip(tvars, tvars_vals):
            var_list.append(var.name)
            tensor_list.append(val)
        import pandas as pd
        df = pd.DataFrame({'var': var_list, 'tensor': tensor_list})
        df.to_pickle('youtube-test-weights{}.pickle'.format(latest_checkpoint_version))
        print('Model loading time taken: {}'.format(time.time() - st))

    def predict(self, user_obs_features):
        st = time.time()
        self._env.reset_sampler()
        self._initialize_metrics()
        observation = self._env.reset()
        observation['user'] = user_obs_features
        start = time.time()
        action = self._agent.begin_episode(observation)
        print('Step time taken: {}'.format(time.time() - start))
        slate = [0] * len(action)
        doc_keys = list(observation['doc'].keys())
        for i in range(len(action)):
            slate[i] = doc_keys[action[i]]
        print('Time taken: {} ms'.format(1000*(time.time()-st)))
        return slate

These graphs were generated in TensorBoard:

Most importantly, I am looking for answers to the following:

Why are the q-values the same across different epochs?
This in turn returns the same slates for all the checkpoints.
This raises the question of whether the model is training at all.
Also, the watch time for each video is 4 min. Since q-values reflect the cumulative reward for a state-action pair, how come their scale is 10^-2?

Any help would be appreciated.

vihanjain commented 3 years ago

Hi, thanks for using RecSim. I can provide some pointers to help with further debugging:

"Why are the q-values the same across different epochs?" It is not clear which q-values are meant here. Are you printing them manually, or does this refer to one of the TensorBoard charts attached to the question?
"This in turn returns the same slates for all the checkpoints." Hmm, this is indeed concerning. How many examples/users are you evaluating per checkpoint? Can you double-check (e.g., by logging) that the inference code is indeed using different checkpoints? Can you try two things: (1) evaluate more users (say 100) per checkpoint and compare the slates (you can sample users from a normal distribution); (2) evaluate checkpoints spaced further apart (e.g., 50k, 100k, 150k, 200k steps).
"This raises the question of whether the model is training at all." I do see that the average episodic reward improves with the number of steps, so my guess is that it is training. Have you tried hyperparameter tuning, or are you using the network parameters from the default implementation?
"Since q-values reflect the cumulative reward for a state-action pair, how come their scale is 10^-2?" Again, I am not sure which q-values are being referred to here. The average episode length is ~60 and the average episodic reward is ~160, which seems to be in the right range, as there will also be some small negative rewards for some actions.
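The suggestion to evaluate many users per checkpoint and compare the slates could be sketched roughly as follows. This is only an illustration: `slate_agreement` and the toy predictors are hypothetical, and in practice each callable would wrap a per-checkpoint prediction routine such as the `PredRunner.predict` method posted above.

```python
import numpy as np


def slate_agreement(predict_fns, num_users=100, user_dim=20, seed=0):
    """Fraction of sampled users that get an identical slate from every checkpoint.

    predict_fns: dict mapping a checkpoint label to a callable that takes a
    user vector and returns a slate (a sequence of document ids). The callables
    are hypothetical stand-ins for a per-checkpoint prediction routine.
    """
    rng = np.random.default_rng(seed)
    # Sample users from a standard normal and clip to [-1, 1]; the clipping
    # range is an assumption based on the example user vector in this thread.
    users = np.clip(rng.standard_normal((num_users, user_dim)), -1.0, 1.0)
    same = 0
    for user in users:
        slates = {tuple(fn(user)) for fn in predict_fns.values()}
        same += len(slates) == 1
    return same / num_users


# Toy predictors so the sketch runs without RecSim: each one ranks 30
# candidate documents with a fixed random linear scorer and returns the top 3.
def make_toy_predictor(weight_seed, num_docs=30, user_dim=20, slate_size=3):
    w = np.random.default_rng(weight_seed).standard_normal((num_docs, user_dim))
    return lambda user: np.argsort(-(w @ user))[:slate_size].tolist()


identical = {"ckpt_a": make_toy_predictor(1), "ckpt_b": make_toy_predictor(1)}
different = {"ckpt_a": make_toy_predictor(1), "ckpt_b": make_toy_predictor(2)}
print("agreement, same weights:     ", slate_agreement(identical))
print("agreement, different weights:", slate_agreement(different))
```

An agreement close to 1.0 across checkpoints that are far apart in training would suggest that the same weights are effectively being used for every prediction.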

Hope this helps!

saxena-priyansh commented 3 years ago

Thanks @vihanjain for your response. To elaborate on this:

"Not clear which q values is mentioned here? Are you manually printing q values? Or is this referring to one of the Tensorboard charts attached in the question?" We are referring to the q-values printed during training/evaluation through https://github.com/google-research/recsim/blob/master/recsim/agents/slate_decomp_q_agent.py#L618. We have also tried printing the values manually by adding a tf.Print operation at https://github.com/google-research/recsim/blob/master/recsim/agents/slate_decomp_q_agent.py#L510: q_values = tf.Print(q_values, [q_values], 'q_values', summarize=1000).
"How many examples/users are you evaluating per checkpoint?" We are evaluating one user across various checkpoints (saved after different numbers of steps: 330k steps in total with checkpoint_freq=10, so essentially one checkpoint every 33k steps). The idea is to check whether the q-values associated with each document converge over time for the same user. If we change the user, we get different q-values, as expected. But if the q-values are not converging, they could just be random for any user, with the difference in user_vector being the only thing that changes them.
Yes, the graphs do indicate that training is going fine. We seem to have trouble with evaluation.
About the scale of the q-values: they seem to be far lower than expected. We have attached the exact values below.

What we have tried so far to debug:

To check whether training is proceeding properly, we extracted the weights of the checkpoint-initialised network (the code above includes a small snippet to do so) and used them to initialise a new Keras network (default architecture: input -> 256 + ReLU -> 32 + ReLU -> 1). The weight values differ across epochs/iterations (screenshot attached below). A forward pass through this network with the same input gives different q-values across epochs (their scale turned out to be between 3.0 and 4.2). Is this scale correct? This makes us doubt whether the weights are being initialised correctly in EvalRunner/PredRunner.
We also could not understand how the mapping between a document_id and its corresponding deep network (its q-value approximator) is maintained. To our understanding, there is a separate 2-hidden-layer network for each doc.
What is the purpose of resampling documents before each episode?
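For reference, the manual forward pass described above (input -> 256 + ReLU -> 32 + ReLU -> 1) can be written in a few lines of NumPy. This is a minimal sketch: the weights are random placeholders rather than checkpoint values, and the input size of 40 (a 20-dimensional user vector concatenated with a 20-dimensional document vector) is an assumption that is not confirmed anywhere in this thread.

```python
import numpy as np


def mlp_q_value(x, params):
    """Forward pass: dense(256) + ReLU -> dense(32) + ReLU -> dense(1)."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:  # no activation on the output layer
            h = np.maximum(h, 0.0)
    return h


def random_params(sizes, seed):
    """Placeholder weights; real ones would come from a checkpoint."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.05, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


sizes = [40, 256, 32, 1]  # assumed input size of 40
x = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1, sizes[0]))

q_a = mlp_q_value(x, random_params(sizes, seed=1))
q_b = mlp_q_value(x, random_params(sizes, seed=2))
print(q_a.shape, q_a.item(), q_b.item())
```

With the same input, two different sets of weights should in general give different outputs, which matches what we see with the Keras reimplementation but not with PredRunner.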

LOGS...

Checkpoint that would be read: 70
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-70
Model loading time taken: 20.099586963653564
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.3186337947845459
Time taken: 320.206880569458 ms
['51', '31', '52']
Prediction Time taken 0.3202650547027588 seconds

Checkpoint that would be read: 80
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-80
Model loading time taken: 20.884052991867065
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.29644083976745605
Time taken: 297.63102531433105 ms
['51', '31', '52']
Prediction Time taken 0.2976522445678711 seconds

Checkpoint that would be read: 90
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-90
Model loading time taken: 19.210211753845215
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.3162529468536377
Time taken: 317.17681884765625 ms
['51', '31', '52']
Prediction Time taken 0.3171958923339844 seconds

Checkpoint that would be read: 100
Not in early execution... model_weights/results12/train/checkpoints/tf_ckpt-100
Model loading time taken: 22.500069856643677
Going to predict...
q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]]
[cp 1][21 1 22][0.016246004 0.0518380962 0.0206836071 0.0115500968 0.0287562609 0.0341853239 0.0287562609 0.0756491944 0.0241801739 0.0242459327 0.0300822686 0.0402437486 0.0284955166 0.0307806712 0.0206836071 0.0284955166 0.0120282751 0.0394958928 0.0299003273 0.0284955166 0.0115500968 0.0756491944 0.0402437486 0.0341853239 0.0341853239 0.0299003273 0.0284955166 0.0120282751 0.0394958928 0.0115500968][-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308 -0.0287395343 0.107342854 -0.044202745 -0.00707248785 0.0450880975 -0.207072049 0.027151769 0.134151459 0.0761770904 -0.413350075 0.066539 0.31752333 -0.0904344171 -0.170975745 0.195714355 0.0547820404 0.166659713 0.207024947 -0.199312985 -0.418331027 -0.00844216906 0.0422177538 0.0487211384 -0.0899787843 0.0510536507]
Step time taken: 0.29529881477355957
Time taken: 296.33116722106934 ms
['51', '31', '52']
Prediction Time taken 0.29636096954345703 seconds
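One way to confirm that the logged q-values are exactly identical across checkpoints, rather than merely similar, is to parse the `q_values[[...]]` lines and compare them. A small sketch: the helper `parse_q_values` is hypothetical, and the two log lines are shortened to the first five values of the logs above.

```python
import re

import numpy as np


def parse_q_values(line):
    """Extract the floats between 'q_values[[' and ']]' in a log line."""
    match = re.search(r"q_values\[\[(.*?)\]\]", line)
    if match is None:
        raise ValueError("no q_values found in line")
    return np.array([float(v) for v in match.group(1).split()])


# First five values of the q_values line logged for checkpoints 70 and 80.
log_ckpt_70 = "q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308]]"
log_ckpt_80 = "q_values[[-0.13537176 0.234581396 0.259239793 0.0990562439 0.00532283308]]"

q70 = parse_q_values(log_ckpt_70)
q80 = parse_q_values(log_ckpt_80)
print("identical:", np.array_equal(q70, q80))
print("max abs difference:", np.max(np.abs(q70 - q80)))
```

Outputs that match to every printed digit for the same input would be consistent with the same weights being used in every run, which seems worth ruling out before looking at training itself.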

The screenshot below shows the weights at different iterations: the networks/network weights are changing, but the networks/network_1 and networks/network_2 weights stay the same.
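To make this comparison repeatable, the saved weights from two checkpoints can be diffed variable by variable. A minimal sketch, assuming each checkpoint's weights are available as a dict from variable name to NumPy array; `unchanged_variables`, the variable names ending in `/w`, and the toy arrays are all hypothetical.

```python
import numpy as np


def unchanged_variables(weights_a, weights_b, atol=0.0):
    """Return names of variables present in both dicts whose values match."""
    shared = sorted(set(weights_a) & set(weights_b))
    return [name for name in shared
            if weights_a[name].shape == weights_b[name].shape
            and np.allclose(weights_a[name], weights_b[name], rtol=0.0, atol=atol)]


# Toy data mimicking the reported pattern: one network changes, two do not.
rng = np.random.default_rng(0)
frozen_1 = rng.standard_normal((4, 3))
frozen_2 = rng.standard_normal((4, 3))
ckpt_a = {"networks/network/w": rng.standard_normal((4, 3)),
          "networks/network_1/w": frozen_1,
          "networks/network_2/w": frozen_2}
ckpt_b = {"networks/network/w": rng.standard_normal((4, 3)),
          "networks/network_1/w": frozen_1.copy(),
          "networks/network_2/w": frozen_2.copy()}

print(unchanged_variables(ckpt_a, ckpt_b))
```

For the pickled DataFrame written in `PredRunner.__init__`, such a dict could presumably be built with `dict(zip(df['var'], df['tensor']))`.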

saxena-priyansh commented 3 years ago

@vihanjain any help would be appreciated.

getsanjeevdubey commented 3 years ago

@vihanjain Let us know your thoughts on this, and whether any more details are required.

google-research / recsim

Returning same slate after every iteration #21