karlhigley opened this issue 4 years ago
The base Model class itself probably shouldn't define these methods: it may well be used to implement non-factorized ranking models which do not employ the dot-product method of making predictions, or more complex multi-task models that make many predictions.
In general, we aim for composition of layers over large base classes. Along those lines, we could write a score computation layer and use it to replace the piece of code you link to. I am on the fence about the code deduplication benefits of this.
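Roughly something like this, just to sketch the idea (DotProductScore is not an existing class, only a name for illustration):

import tensorflow as tf


class DotProductScore(tf.keras.layers.Layer):
    """Sketch of a reusable score computation layer (name is made up)."""

    def call(self, query_embeddings, candidate_embeddings):
        # One dot-product score per aligned (query, candidate) pair.
        return tf.reduce_sum(
            query_embeddings * candidate_embeddings, axis=1, keepdims=True)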
Additionally, we don't define call to return predictions on purpose: being able to separate query and candidate representations is incredibly important both for evaluation and serving of factorized models (where we want to compute a query representation once over many batches of candidates).
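Concretely, the serving/evaluation pattern we care about looks roughly like this (a sketch; query_model, candidate_model, and candidate_batches are stand-ins, not library objects):

import tensorflow as tf

# Compute the query representations once...
query_emb = query_model(query_features)  # [num_queries, dim]

# ...then reuse them across many batches of candidates.
scores = []
for candidate_features in candidate_batches:
    candidate_emb = candidate_model(candidate_features)  # [batch_size, dim]
    scores.append(tf.matmul(query_emb, candidate_emb, transpose_b=True))
scores = tf.concat(scores, axis=1)  # [num_queries, num_candidates]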
The base Model class itself probably shouldn't define these methods: it may well be used to implement non-factorized ranking models which do not employ the dot-product method of making predictions, or more complex multi-task models that make many predictions.
Right, I'm not suggesting that all models will be factorized (the opposite, in fact.) I'm saying that the current structure of the library seems to bake the assumption of factorized models into some things it probably shouldn't (like metrics.)
Since the base Model class doesn't even define an abstract call() placeholder (raising NotImplementedError) though, other code throughout the library can't rely on the existence of a (possibly non-factorized) method of making predictions using the model's forward pass. So, the forward pass ends up getting implemented in multiple places, leading to code duplication and increased potential for errors.
(AFAIK, there's no restriction that would prevent call() from returning multiple predictions, right?)
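Even a minimal placeholder like this is all I have in mind (sketch only, not existing code):

import tensorflow as tf


class Model(tf.keras.Model):
    """Base model with an explicit (abstract) forward pass."""

    def call(self, features, training=False):
        # Subclasses define how predictions (one or many) are computed.
        raise NotImplementedError("Subclasses must implement call().")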
Additionally, we don't define call to return predictions on purpose: being able to separate query and candidate representations is incredibly important both for evaluation and serving of factorized models (where we want to compute a query representation once over many batches of candidates).
I'm with you on using the model a bit differently for serving. That's why I think there might be two methods here: one which computes predictions on a one-to-one basis between queries and candidates (i.e. the forward pass, used during training), and one which computes many-to-many predictions between query vectors and candidate vectors (i.e. bulk predictions, used for evaluation and retrieval.)
It's not quite clear to me how foundational the concept of retrieving candidates based solely on factorized query and candidate embeddings is intended to be in this library. Seems reasonable that it might be intended to be used for all models (in which case it would make sense for the Model class to support it directly), or might be only one of several different ways of performing retrieval that the library is intended to support (in which case a FactorizedModel class might make more sense.)
I realize there's already a DatasetTopK layer to compose with other layers for retrieval; it assumes a factorized model without actually being labeled as such. Meanwhile, the Model class assumes nothing, and it seems like maybe you want to keep the idea of factorized models outside it? Hence my confusion.
This is basically what I'm imagining:
import tensorflow as tf
from tensorflow_recommenders import models, tasks


class FactorizedModel(models.Model):
    def __init__(self, candidate_dataset):
        super().__init__()
        self.query_model = tf.keras.layers.Dense(16)
        self.candidate_model = tf.keras.layers.Dense(16)
        self.tasks = [
            # tasks.Retrieval(model=self, ...)
            # tasks.Ranking(model=self, ...)
        ]

    # Maintains API compatibility with other TF models
    def call(self, features):
        return self.predict(features)

    # Makes one-to-one predictions between queries and candidates
    def predict(self, features):
        query_features, candidate_features = features
        query_emb = self.query_model(query_features)
        candidate_emb = self.candidate_model(candidate_features)
        scores = tf.reduce_sum(query_emb * candidate_emb, axis=1, keepdims=True)
        return scores

    # Makes many-to-many predictions between queries and candidates
    def bulk_predict(self, features):
        query_features, candidate_features = features
        query_emb = self.query_model(query_features)
        candidate_emb = self.candidate_model(candidate_features)
        scores = tf.matmul(query_emb, candidate_emb, transpose_b=True)
        return scores

    # Computes losses for the tasks registered above
    # (Could probably be extracted to base class as a default implementation)
    def compute_loss(self, inputs, training: bool = False) -> tf.Tensor:
        features, labels = inputs
        predictions = self(features)
        loss = sum([task(predictions, labels) for task in self.tasks])
        return loss
Then the tasks and evaluations can rely on the model to know how to compute its own predictions, both for computing losses during training and for scoring items during evaluation/retrieval.
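For example, an evaluation loop could lean on the model instead of re-deriving the scoring logic (rough sketch; model and eval_dataset are placeholders, and each batch is assumed to pair every query with the full candidate set, with labels giving the index of the true candidate):

import tensorflow as tf

metric = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10)

for (query_features, candidate_features), labels in eval_dataset:
    # Many-to-many scores: [num_queries, num_candidates]
    scores = model.bulk_predict((query_features, candidate_features))
    metric.update_state(labels, scores)

print("Top-10 accuracy:", float(metric.result()))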
Right, I'm not suggesting that all models will be factorized (the opposite, in fact.) I'm saying that the current structure of the library seems to bake the assumption of factorized models into some things it probably shouldn't (like metrics.)
The FactorizedTopK metric assumes a factorized model because it's meant to be specialized for use with factorized models. For models that are not factorized, we can use standard Keras metrics (like AUC) or ranking-oriented metrics from TensorFlow Ranking (for example, we have an upcoming tutorial that shows how to do this).
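For example, something along these lines (a sketch; the exact wiring depends on your model):

import tensorflow as tf
import tensorflow_recommenders as tfrs

# A ranking task for a non-factorized model, using a stock Keras metric.
rating_task = tfrs.tasks.Ranking(
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(name="auc")],
)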
Since the base Model class doesn't even define an abstract call() placeholder (raising NotImplementedError) though, other code throughout the library can't rely on the existence of a (possibly non-factorized) method of making predictions using the model's forward pass. So, the forward pass ends up getting implemented in multiple places, leading to code duplication and increased potential for errors.
We have actually toyed with a lot of the alternatives you suggest; in particular, the Retrieval task exposing scoring methods that are then used in factorized prediction in evaluation. This gives you the benefit of centralizing the prediction computation. However, we've found that the factorized prediction computation is always a dot product, and so hard-coding this in a bunch of places gives us simpler code and a more streamlined API at no cost that we could see.
I think what would really help me see the issue is an example of a use case that's made difficult or confusing by the way we've set up the API.
How would you build a DCN using this framework as currently constructed? Or a factorization machine?
Like, if this is a toolkit to build specific kinds of factorized two-tower models and I'm supposed to use normal Keras features and ranking metrics from elsewhere for other kinds of models, that's fine—but it makes this library a lot less useful to me (and also kind of confusingly named.)
A DCN is a component for building models rather than a model in itself, and we're adding a DCN layer right now! It'll be here in a couple of weeks. Once it's here, you can use it as a building block for your model, just as any other layer:
my_submodel = tf.keras.Sequential([
    DCN(...hyperparameters...),
    Dense(...),
    Dense(...),
])

predictions = my_submodel(inputs)
You can use that as a component of either a factorized or non-factorized model, depending on what makes sense to you.
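For example, continuing the sketch above, it could sit under a single-unit head as the scoring network of a (non-factorized) ranking model:

ranker = tf.keras.Sequential([
    my_submodel,
    tf.keras.layers.Dense(1),
])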
A factorization machine layer would in principle be just another component, for you to use as you please as part of a larger model.
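To sketch what that could look like (FMInteraction is a made-up name, not a planned API):

import tensorflow as tf


class FMInteraction(tf.keras.layers.Layer):
    """Second-order factorization machine interactions (illustrative only).

    Expects feature embeddings of shape [batch, num_features, dim] and
    returns the pairwise interaction term of shape [batch, 1].
    """

    def call(self, embeddings):
        square_of_sum = tf.square(tf.reduce_sum(embeddings, axis=1))   # [batch, dim]
        sum_of_squares = tf.reduce_sum(tf.square(embeddings), axis=1)  # [batch, dim]
        return 0.5 * tf.reduce_sum(
            square_of_sum - sum_of_squares, axis=1, keepdims=True)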
I think the principle we'd like to follow in developing this library is to (1) add specialized recommender system components where these are lacking in the wider TensorFlow ecosystem, and (2) put together good tutorials on how to assemble these components into models for both common and more sophisticated recommendation tasks. However, if a perfectly good component is available in plain TF or Keras (such as metrics/CNN/RNN layers), we'll definitely use those.
I don't want to be Reviewer Number 2, but...
I have experience with recs using TensorFlow, really appreciate your previous work (as you know), want to use this library, and found it confusing enough that I wanted to smash my keyboard. The WTFs per minute reading the code were pretty high for me—not because there's anything wrong with the code as code, but because the concepts are not clear from reading the code.
There are a bunch of things specialized for factorized models, but only some of them are labeled. Which things are specialized and which things aren't is surprising. The same concepts are implemented in multiple places in slightly different ways. The batteries aren't included, in the sense that using this library requires knowledge of a bunch of other TF libraries to fill in pieces that seem like they ought to be covered here but aren't.
None of those things would stop me if it seemed like I could contribute to improving the library (which I have time, energy, bandwidth, and motivation to do), but it sounds like the issues I've encountered are the result of the design philosophy of the library, so... :shrug:
That sounds unpleasant, hope the keyboard is OK!
Could you give me one (or two) examples of where you ran into a WTF moment? Something along these lines:
Just to say that we're taking your feedback seriously - the 0.2.0 release introduced some changes that aim to make it clearer what layers and metrics are intended for use with factorized models.
Since the base Model class doesn't implement (at least a placeholder for) the call() method, other parts of the library can't rely on that method being defined. As a result, the details of how to compute predictions from a model show up in multiple other places. There are two flavors represented among those examples: element-wise scores for aligned query/candidate pairs, and matmuls between batches of query and candidate embeddings.

In order to consolidate the prediction code and abstract the way predictions are computed, it seems like these two modes could either be captured in a single Model method with a flag that selects between element-wise pairs and batch predictions, or represented as two Model methods.

Thoughts?
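To illustrate the first option, something like this (class and method names are illustrative only):

import tensorflow as tf


class FactorizedModelSketch(tf.keras.Model):
    """Illustration of a single scoring method with a mode flag."""

    def __init__(self, query_model, candidate_model):
        super().__init__()
        self.query_model = query_model
        self.candidate_model = candidate_model

    def score(self, features, pairwise: bool = True):
        query_features, candidate_features = features
        query_emb = self.query_model(query_features)
        candidate_emb = self.candidate_model(candidate_features)
        if pairwise:
            # Element-wise: one score per aligned (query, candidate) pair.
            return tf.reduce_sum(query_emb * candidate_emb, axis=1, keepdims=True)
        # Batch mode: scores for every query against every candidate.
        return tf.matmul(query_emb, candidate_emb, transpose_b=True)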