karlhigley opened this issue 4 years ago
The base Model class itself probably shouldn't define these methods: it may well be used to implement non-factorized ranking models which do not employ the dot-product method of making predictions, or more complex multi-task models that make many predictions.
In general, we aim for composition of layers over large base classes. Along those lines, we could write a score computation layer and use it to replace the piece of code you link to. I am on the fence about the code deduplication benefits of this.
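Roughly something like this, just to sketch the idea (DotProductScore is not an existing class, only a name for illustration):

import tensorflow as tf


class DotProductScore(tf.keras.layers.Layer):
    """Sketch of a reusable score computation layer (name is made up)."""

    def call(self, query_embeddings, candidate_embeddings):
        # One dot-product score per aligned (query, candidate) pair.
        return tf.reduce_sum(
            query_embeddings * candidate_embeddings, axis=1, keepdims=True)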
Additionally, we don't define call to return predictions on purpose: being able to separate query and candidate representations is incredibly important both for evaluation and serving of factorized models (where we want to compute a query representation once over many batches of candidates).
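Concretely, the serving/evaluation pattern we care about looks roughly like this (a sketch; query_model, candidate_model, and candidate_batches are stand-ins, not library objects):

import tensorflow as tf

# Compute the query representations once...
query_emb = query_model(query_features)  # [num_queries, dim]

# ...then reuse them across many batches of candidates.
scores = []
for candidate_features in candidate_batches:
    candidate_emb = candidate_model(candidate_features)  # [batch_size, dim]
    scores.append(tf.matmul(query_emb, candidate_emb, transpose_b=True))
scores = tf.concat(scores, axis=1)  # [num_queries, num_candidates]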
The base Model class itself probably shouldn't define these methods: it may well be used to implement non-factorized ranking models which do not employ the dot-product method of making predictions, or more complex multi-task models that make many predictions.
Right, I'm not suggesting that all models will be factorized (the opposite, in fact.) I'm saying that the current structure of the library seems to bake the assumption of factorized models into some things it probably shouldn't (like metrics.)
Since the base Model class doesn't even define an abstract call() placeholder (raising NotImplementedError) though, other code throughout the library can't rely on the existence of a (possibly non-factorized) method of making predictions using the model's forward pass. So, the forward pass ends up getting implemented in multiple places, leading to code duplication and increased potential for errors.
(AFAIK, there's no restriction that would prevent call() from returning multiple predictions, right?)
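Even a minimal placeholder like this is all I have in mind (sketch only, not existing code):

import tensorflow as tf


class Model(tf.keras.Model):
    """Base model with an explicit (abstract) forward pass."""

    def call(self, features, training=False):
        # Subclasses define how predictions (one or many) are computed.
        raise NotImplementedError("Subclasses must implement call().")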
Additionally, we don't define call to return predictions on purpose: being able to separate query and candidate representations is incredibly important both for evaluation and serving of factorized models (where we want to compute a query representation once over many batches of candidates).
I'm with you on using the model a bit differently for serving. That's why I think there might be two methods here: one which computes predictions on a one-to-one basis between queries and candidates (i.e. the forward pass, used during training), and one which computes many-to-many predictions between query vectors and candidate vectors (i.e. bulk predictions, used for evaluation and retrieval.)
It's not quite clear to me how foundational the concept of retrieving candidates based solely on factorized query and candidate embeddings is intended to be in this library. Seems reasonable that it might be intended to be used for all models (in which case it would make sense for the Model class to support it directly), or might be only one of several different ways of performing retrieval that the library is intended to support (in which case a FactorizedModel class might make more sense.)
I realize there's already a DatasetTopK layer to compose with other layers for retrieval; it assumes a factorized model without actually being labeled as such. Meanwhile, the Model class assumes nothing, and it seems like maybe you want to keep the idea of factorized models outside it? Hence my confusion.
This is basically what I'm imagining:
import tensorflow as tf
from tensorflow_recommenders import models, tasks


class FactorizedModel(models.Model):
    def __init__(self, candidate_dataset):
        super().__init__()
        self.query_model = tf.keras.layers.Dense(16)
        self.candidate_model = tf.keras.layers.Dense(16)
        self.tasks = [
            # tasks.Retrieval(model=self, ...)
            # tasks.Ranking(model=self, ...)
        ]

    # Maintains API compatibility with other TF models
    def call(self, features):
        return self.predict(features)

    # Makes one-to-one predictions between queries and candidates
    def predict(self, features):
        query_features, candidate_features = features
        query_emb = self.query_model(query_features)
        candidate_emb = self.candidate_model(candidate_features)
        scores = tf.reduce_sum(query_emb * candidate_emb, axis=1, keepdims=True)
        return scores

    # Makes many-to-many predictions between queries and candidates
    def bulk_predict(self, features):
        query_features, candidate_features = features
        query_emb = self.query_model(query_features)
        candidate_emb = self.candidate_model(candidate_features)
        scores = tf.matmul(query_emb, candidate_emb, transpose_b=True)
        return scores

    # Computes losses for the tasks registered above
    # (Could probably be extracted to base class as a default implementation)
    def compute_loss(self, inputs, training: bool = False) -> tf.Tensor:
        features, labels = inputs
        predictions = self(features)
        loss = sum([task(predictions, labels) for task in self.tasks])
        return loss
Then the tasks and evaluations can rely on the model to know how to compute its own predictions, both for computing losses during training and for scoring items during evaluation/retrieval.
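For example, an evaluation loop could lean on the model instead of re-deriving the scoring logic (rough sketch; model and eval_dataset are placeholders, and each batch is assumed to pair every query with the full candidate set, with labels giving the index of the true candidate):

import tensorflow as tf

metric = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10)

for (query_features, candidate_features), labels in eval_dataset:
    # Many-to-many scores: [num_queries, num_candidates]
    scores = model.bulk_predict((query_features, candidate_features))
    metric.update_state(labels, scores)

print("Top-10 accuracy:", float(metric.result()))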
Right, I'm not suggesting that all models will be factorized (the opposite, in fact.) I'm saying that the current structure of the library seems to bake the assumption of factorized models into some things it probably shouldn't (like metrics.)
The FactorizedTopK metric assumes a factorized model because it's meant to be specialized for use with factorized models. For models that are not factorized, we can use standard Keras metrics (like AUC) or ranking-oriented metrics from TensorFlow Ranking (for example, we have an upcoming tutorial that shows how to do this).
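For example, something along these lines (a sketch; the exact wiring depends on your model):

import tensorflow as tf
import tensorflow_recommenders as tfrs

# A ranking task for a non-factorized model, using a stock Keras metric.
rating_task = tfrs.tasks.Ranking(
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(name="auc")],
)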
Since the base Model class doesn't even define an abstract call() placeholder (raising NotImplementedError) though, other code throughout the library can't rely on the existence of a (possibly non-factorized) method of making predictions using the model's forward pass. So, the forward pass ends up getting implemented in multiple places, leading to code duplication and increased potential for errors.
We have actually toyed with a lot of the alternatives you suggest; in particular, the Retrieval task exposing scoring methods that are then used in factorized prediction in evaluation. This gives you the benefit of centralizing the prediction computation. However, we've found that the factorized prediction computation is always a dot product, and so hard-coding this in a bunch of places gives us simpler code and a more streamlined API at no cost that we could see.
I think what would really help me see the issue is an example of a use case that's made difficult or confusing by the way we've set up the API.
How would you build a DCN using this framework as currently constructed? Or a factorization machine?
Like, if this is a toolkit to build specific kinds of factorized two-tower models and I'm supposed to use normal Keras features and ranking metrics from elsewhere for other kinds of models, that's fine—but it makes this library a lot less useful to me (and also kind of confusingly named.)
A DCN is a component for building models rather than a model in itself, and we're adding a DCN layer right now! It'll be here in a couple of weeks. Once it's here, you can use it as a building block for your model, just as any other layer:
my_submodel = tf.keras.Sequential([
    DCN(...hyperparameters...),
    Dense(...),
    Dense(...),
])

predictions = my_submodel(inputs)
You can use that as a component of either a factorized or non-factorized model, depending on what makes sense to you.
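For example, continuing the sketch above, it could sit under a single-unit head as the scoring network of a (non-factorized) ranking model:

ranker = tf.keras.Sequential([
    my_submodel,
    tf.keras.layers.Dense(1),
])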
A factorization machine layer would in principle be just another component, for you to use as you please as part of a larger model.
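To sketch what that could look like (FMInteraction is a made-up name, not a planned API):

import tensorflow as tf


class FMInteraction(tf.keras.layers.Layer):
    """Second-order factorization machine interactions (illustrative only).

    Expects feature embeddings of shape [batch, num_features, dim] and
    returns the pairwise interaction term of shape [batch, 1].
    """

    def call(self, embeddings):
        square_of_sum = tf.square(tf.reduce_sum(embeddings, axis=1))   # [batch, dim]
        sum_of_squares = tf.reduce_sum(tf.square(embeddings), axis=1)  # [batch, dim]
        return 0.5 * tf.reduce_sum(
            square_of_sum - sum_of_squares, axis=1, keepdims=True)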
I think the principle we'd like to follow in developing this library is to (1) add specialized recommender system components where these are lacking in the wider TensorFlow ecosystem, and (2) put together good tutorials on how to assemble these components into models for both common and more sophisticated recommendation tasks. However, if a perfectly good component is available in plain TF or Keras (such as metrics/CNN/RNN layers), we'll definitely use those.
I don't want to be Reviewer Number 2, but...
I have experience with recs using TensorFlow, really appreciate your previous work (as you know), want to use this library, and found it confusing enough that I wanted to smash my keyboard. The WTFs per minute reading the code were pretty high for me—not because there's anything wrong with the code as code, but because the concepts are not clear from reading the code.
There are a bunch of things specialized for factorized models, but only some of them are labeled. Which things are specialized and which things aren't is surprising. The same concepts are implemented in multiple places in slightly different ways. The batteries aren't included, in the sense that using this library requires knowledge of a bunch of other TF libraries to fill in pieces that seem like they ought to be covered here but aren't.
None of those things would stop me if it seemed like I could contribute to improving the library (which I have time, energy, bandwidth, and motivation to do), but it sounds like the issues I've encountered are the result of the design philosophy of the library, so... :shrug:
That sounds unpleasant, hope the keyboard is OK!
Could you give me one (or two) examples of where you ran into a WTF moment? Something along these lines:
Just to say that we're taking your feedback seriously - the 0.2.0 release introduced some changes that aim to make it clearer what layers and metrics are intended for use with factorized models.
Since the base Model class doesn't implement (at least a placeholder for) the call() method, other parts of the library can't rely on that method being defined. As a result, the details of how to compute predictions from a model show up in multiple other places. There are two flavors represented among those examples: element-wise scores for aligned query/candidate pairs, and matmuls between batches of query and candidate embeddings.

In order to consolidate the prediction code and abstract the way predictions are computed, it seems like these two modes could either be captured in a single Model method with a flag that selects between element-wise pairs and batch predictions, or represented as two Model methods.

Thoughts?
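To illustrate the first option, something like this (class and method names are illustrative only):

import tensorflow as tf


class FactorizedModelSketch(tf.keras.Model):
    """Illustration of a single scoring method with a mode flag."""

    def __init__(self, query_model, candidate_model):
        super().__init__()
        self.query_model = query_model
        self.candidate_model = candidate_model

    def score(self, features, pairwise: bool = True):
        query_features, candidate_features = features
        query_emb = self.query_model(query_features)
        candidate_emb = self.candidate_model(candidate_features)
        if pairwise:
            # Element-wise: one score per aligned (query, candidate) pair.
            return tf.reduce_sum(query_emb * candidate_emb, axis=1, keepdims=True)
        # Batch mode: scores for every query against every candidate.
        return tf.matmul(query_emb, candidate_emb, transpose_b=True)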