tensorflow / ranking

Learning to Rank in TensorFlow
Apache License 2.0

Considering adding complex DL models such as Wide & Deep Learning Model #171

Closed LiangqunLu closed 4 years ago

LiangqunLu commented 4 years ago

Hi,

It appears that TF-Ranking uses a 3-layer MLP for learning to rank, so I am considering adding more complex models, such as those used for recommender systems or CTR prediction, to the TFR library. The Wide & Deep Learning model (Google, 2016) uses a wide structure for memorization and a DNN for generalization. I tried adding a linear combination part to _score_fn and then summing the wide and deep logits, which are optimized with the listwise softmax loss. I am pasting my code here. Could someone have a look and point out any mistakes? Thanks!

  def _score_fn(unused_context_features, group_features, mode, unused_params,
                unused_config):
    """Defines the network to score a group of documents."""
    with tf.compat.v1.name_scope("input_layer"):
      group_input = [
          tf.compat.v1.layers.flatten(group_features[name])
          for name in sorted(example_feature_columns())
      ]
      input_layer = tf.concat(group_input, 1)
#      tf.compat.v1.summary.scalar("input_sparsity",
#                                  tf.nn.zero_fraction(input_layer))
#      tf.compat.v1.summary.scalar("input_max",
#                                  tf.reduce_max(input_tensor=input_layer))
#      tf.compat.v1.summary.scalar("input_min",
#                                  tf.reduce_min(input_tensor=input_layer))

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)
    cur_layer = input_layer

    cur_layer = tf.compat.v1.layers.batch_normalization(
        input_layer, training=is_training)

    for i, layer_width in enumerate(int(d) for d in FLAGS.hidden_layer_dims):
      cur_layer = tf.compat.v1.layers.dense(cur_layer, units=layer_width)
      cur_layer = tf.compat.v1.layers.batch_normalization(
          cur_layer, training=is_training)
      cur_layer = tf.nn.relu(cur_layer)

#      tf.compat.v1.summary.scalar("fully_connected_{}_sparsity".format(i),
#                                  tf.nn.zero_fraction(cur_layer))
      cur_layer = tf.compat.v1.layers.dropout(
            cur_layer, rate=FLAGS.dropout_rate, training=is_training)

    logits = tf.compat.v1.layers.dense(cur_layer, units=FLAGS.group_size)

    print("Checkpoint 1: ", logits)

    dnn_logits = logits

    ## Build linear classifier
    group_input = [
      tf.compat.v1.layers.flatten(group_features[name])
      for name in sorted(example_feature_columns())
    ]
    input_layer = tf.concat(group_input, 1)

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)
    cur_layer = tf.compat.v1.layers.batch_normalization(
        input_layer, training=is_training)

    cur_layer = tf.compat.v1.layers.dense(cur_layer, units=2)
    #cur_layer = tf.compat.v1.layers.batch_normalization(
        #cur_layer, training=is_training)
    #cur_layer = tf.nn.sigmoid(cur_layer)

    linear_logits = tf.compat.v1.layers.dense(cur_layer, units=FLAGS.group_size)    

    print("Checkpoint 2: ", linear_logits)

    # Combine logits and build full model.
    if dnn_logits is not None and linear_logits is not None:
      logits = dnn_logits + linear_logits
    elif dnn_logits is not None:
      logits = dnn_logits
    else:
      logits = linear_logits

    if _use_multi_head():
      # Duplicate the logits for both heads.
      return {_PRIMARY_HEAD: logits, _SECONDARY_HEAD: logits}
    else:
      return logits

  return _score_fn
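
A side note on the wide branch in the code above: its two dense layers have no activation between them, so (setting batch normalization aside) they compose into a single linear map that passes through a 2-unit bottleneck. The sketch below checks this in plain Python. The sizes, weights and helper names are made up for illustration and are not taken from the thread or from TF-Ranking.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def dense(x, w, b):
    """Linear layer without activation: x @ w + b."""
    return [[v + bias for v, bias in zip(row, b)] for row in matmul(x, w)]


# Hypothetical sizes: 3 input features, a 2-unit hidden layer, 1 output logit.
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
w1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
b1 = [0.01, -0.02]
w2 = [[0.7], [-0.8]]
b2 = [0.05]

# Two stacked linear layers, as in the wide branch.
two_layer = dense(dense(x, w1, b1), w2, b2)

# The same map collapsed into one layer: W = w1 @ w2, b = b1 @ w2 + b2.
w = matmul(w1, w2)
b = [v + c for v, c in zip(matmul([b1], w2)[0], b2)]
one_layer = dense(x, w, b)
```

Both computations give the same logits up to floating-point error, so the extra 2-unit layer only constrains the wide part rather than adding capacity.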

eggie5 commented 4 years ago

Beyond your modeling question, you might want to think about the core data structure of TF-Ranking, if you haven't yet. TF-Ranking operates on a query data structure composed of a context feature set and a list of example (document) feature sets. You will have to consider how to fit your recommender data into this paradigm. For example, typical recsys data consists of (user, item, click) tuples.
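
One possible way to fit such tuples into the context/examples structure is to treat each user as a query and each candidate item as an example. The following is only a sketch in plain Python; the function name and the dictionary layout are hypothetical and do not correspond to any TF-Ranking API.

```python
from collections import defaultdict


def group_into_queries(interactions):
    """Turn flat (user, item, click) tuples into one list of examples per user."""
    per_user = defaultdict(list)
    for user, item, click in interactions:
        per_user[user].append({"item_id": item, "label": click})
    # The user plays the role of the query: user-level data is the context,
    # and each candidate item is one example in that query's list.
    return [
        {"context": {"user_id": user}, "examples": examples}
        for user, examples in per_user.items()
    ]


# Made-up interactions for illustration.
interactions = [("u1", "a", 1), ("u1", "b", 0), ("u2", "a", 0), ("u2", "c", 1)]
queries = group_into_queries(interactions)
```

Each entry in `queries` then holds one context and a variable-length list of labeled examples, which is the shape a listwise loss expects.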

Also, one issue you might encounter is that a lot of recsys algorithms depend on global negative sampling. TF-Ranking cannot do that directly; it only samples within a query's documents.
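
If global negatives are needed, one option is to draw them in a preprocessing step, before the data is grouped into queries. A minimal sketch, assuming each user's positive items and the full item catalog are available in memory; the function name and data layout are hypothetical and not part of TF-Ranking.

```python
import random


def sample_global_negatives(user_positives, catalog, num_negatives, seed=0):
    """For each user, sample items from the whole catalog that are not positives."""
    rng = random.Random(seed)
    negatives = {}
    for user, positives in user_positives.items():
        # Candidates come from the entire catalog, not from the user's own list.
        candidates = sorted(set(catalog) - set(positives))
        k = min(num_negatives, len(candidates))
        negatives[user] = rng.sample(candidates, k)
    return negatives


# Made-up data for illustration.
catalog = ["a", "b", "c", "d", "e"]
user_positives = {"u1": ["a"], "u2": ["c", "d"]}
negatives = sample_global_negatives(user_positives, catalog, num_negatives=2)
```

The sampled negatives can then be appended to each user's example list with label 0 before the data is written out.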

LiangqunLu commented 4 years ago

Thanks, Alex! I thought about that issue and came up with a simple solution. For the wide linear combination that I wanted to add, I just used the TF-Ranking MLP structure and added one more dense layer to act as the linear cross. In this case, the input data is exactly the same, which saved me the trouble of modifying the input. I am just not quite sure whether it is correct.

eggie5 commented 4 years ago

You seem to be referring to some previous example that you are modifying. "The input data is exactly the same": the same as what?

LiangqunLu commented 4 years ago

Awkward... It is exactly the same input as in the TF-Ranking MLP example. I used the example script tf_ranking_libsvm.py.

# In the TF-Ranking canned estimator.
# This basically defines the feature columns, feature transformation,
# model structure, loss and metrics.
estimator = tf.estimator.Estimator(
    model_fn=tfr.model.make_groupwise_ranking_fn(
        group_score_fn=make_score_fn(),
        group_size=FLAGS.group_size,
        transform_fn=make_transform_fn(),
        ranking_head=ranking_head),
    config=tf.estimator.RunConfig(
        FLAGS.output_dir, save_checkpoints_steps=1000))

def make_score_fn():
    '''
    Defines the MLP structure, which is supposed to handle the feature
    columns, compute the parameters and return logits.
    '''

So what I meant above is that I reused the TF-Ranking input as is. I only added a wide structure inside the TF-Ranking make_score_fn.

eggie5 commented 4 years ago

Ok, so you are using libsvm as your input format. I think the example that comes w/ TF-Ranking is a sample of MSLR-WEB30K, which is composed of (rel, qid, features) tuples. I'm just curious how you will retrofit this data format for recsys data.

LiangqunLu commented 4 years ago

Yeah, I use the libsvm input format, which basically consists of label/rel, query IDs and features. In my recsys project, it is pretty much the same except that my labels are binary. The queries have varying numbers of documents. However, my results in terms of AUC and NDCG@3 are not satisfactory. Indeed, AUC is ~0.51 when I use the listwise softmax loss.
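
For reference, the layout described above can be sketched as a small parser. This assumes the common ranking variant of libsvm, where each line reads `<label> qid:<id> <index>:<value> ...`; the helper names and sample lines are illustrative and are not the loader used by tf_ranking_libsvm.py.

```python
from collections import defaultdict


def parse_line(line):
    """Parse '<label> qid:<id> <idx>:<val> ...' into (label, qid, features)."""
    tokens = line.split("#", 1)[0].split()  # drop trailing comments
    label = float(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features


def group_by_query(lines):
    """Collect (label, features) pairs per query; list sizes may differ."""
    queries = defaultdict(list)
    for line in lines:
        if not line.split("#", 1)[0].strip():
            continue  # skip blank or comment-only lines
        label, qid, features = parse_line(line)
        queries[qid].append((label, features))
    return dict(queries)


# Made-up lines with binary labels and two queries of different sizes.
sample = [
    "1 qid:7 1:0.12 2:0.90",
    "0 qid:7 1:0.40 2:0.10",
    "0 qid:8 1:0.33 2:0.75",
]
grouped = group_by_query(sample)
```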

eggie5 commented 4 years ago

What are your features? In a typical recsys latent factor model, it's just a user ID and an item ID...

LiangqunLu commented 4 years ago

Oh, I may have mixed things up a little. More precisely, my project is learning to rank. The features in the libsvm format are item embeddings and user embeddings. I used TF-Ranking for the ranking and do not need matrix factorization at this moment.

LiangqunLu commented 4 years ago

Hi @eggie5

I read through quite a few closed issues and learned that TF-Ranking is finally working on your platform. If so, may I ask how you set the parameters and which metrics you mainly evaluate? By parameters, I mean group_size, list_size, loss function, dropout_rate and so on. For the metrics, do you evaluate ARP, OPA or NDCG? Did you eventually implement L2 regularization? I understand these settings may vary across projects. I would appreciate learning some ideas from successful examples.

eggie5 commented 4 years ago

I've used it for a few projects now: one recommender and one LTR system. For the recommender I built a latent factor model, a Deep Triplet Network. For the LTR system, I used a Two Tower Network. Most of my models are trained on clickstream data with binary feedback, which helped drive the choice of loss and metrics: the pairwise losses and OPA were very relevant and interpretable. With clickstream feedback, MRR is possibly a meaningful metric, but you must realize that an MRR of 0.5, which would typically look great, is actually the lower bound of a random classifier in this case! Also, in the binary feedback setting NDCG pretty much recovers MRR and tracks it, in my experience. Regarding L2, I think there's an issue I started here w/ details.
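
One way to see why NDCG tends to track MRR under binary feedback: with a single clicked item at rank r, the reciprocal rank is 1/r and NDCG reduces to 1/log2(r + 1), so both fall monotonically as the click moves down the list. The sketch below illustrates this in plain Python, assuming one relevant item per list; the function names are illustrative and this is not TF-Ranking's metric implementation.

```python
import math


def reciprocal_rank(labels):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def dcg(labels, k=None):
    """Discounted cumulative gain with gain 2^label - 1 and a log2 discount."""
    top = labels[:k] if k else labels
    return sum((2 ** label - 1) / math.log2(rank + 1)
               for rank, label in enumerate(top, start=1))


def ndcg(labels, k=None):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0


# Move a single click from rank 1 down to rank 5 and record both metrics.
rows = []
for rank in range(1, 6):
    labels = [1 if position == rank else 0 for position in range(1, 6)]
    rows.append((rank, reciprocal_rank(labels), ndcg(labels)))
```

In `rows`, both metrics decrease together as the rank grows, although NDCG decays more slowly than the reciprocal rank.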

bendersky commented 4 years ago

Closing this discussion for now. A lot of good advice here; we may use it for future reference.