Some issues not covered in the tutorials

dgoldenberg-audiomack commented 3 years ago

Apologies, a noob here but I suspect my questions will be repeated by other noobs :)

Could the developers please elaborate on the below (I think these could be good additions to the tutorials too).

1. How can one specify the number of recommendations to generate per user? The default appears to be 10; where can this be overridden?
2. How does the caller retrieve the generated recommendations along with their respective ratings? In the tutorial, `_, titles = index(tf.constant(["42"]))` followed by `print(f"Recommendations for user 42: {titles[0, :3]}")` retrieves the movie titles but not the rating values. (OK, with `x, titles = index(tf.constant(["42"]))` it looks like the `x` tensor contains the ratings.)
3. The generation of the embedding values, e.g. for user IDs: must the IDs be contiguous integers? Can I reuse my own ID values? E.g. I have two users, user 1 with ID=123 and user 2 with ID=998. Can I use 123 and 998, or must I map these IDs to 1 and 2? (I assume it's the latter approach but please clarify.)
4. Is there a way to instruct the recommenders code not to include the user's history items in the generated recommendations? The idea is to avoid having to do my own post-filter.
5. Is there a way to instruct the recommenders code not to include multiples of the same item in the generated recommendations? There was a sentence in the tutorial that led me to believe that duplicates might be present (please clarify).
6. During featurization/tokenization, have folks worked out the I18N aspects? Say, if I have text features in English and Spanish, are both an English tokenizer and a Spanish tokenizer available? How well would they work with short strings? Is there a Language Identifier to wire in?
7. What is a 'good' range of values for the top-100 accuracy? Should it be close to 1? How close?
8. How can RMSE be brought down? When running some of the examples, RMSE tends to be > 1. Is there an optimal number of features to use, perhaps? Any suggestions as to the tuning of the hyperparameters to keep the RMSE below 1?
9. How to scale/distribute the processing? If I have several million users and several million items, with 3-4 features on the users and 5-7 features on the items, what would be some of the approaches to scale this? I'm looking at this writeup: https://towardsdatascience.com/scaling-up-with-distributed-tensorflow-on-spark-afc3655d8f95. Are there any examples of how to code up a TFRS recommender that could run on Spark, or some other way to distribute? TensorFlowOnSpark seems like a way to go...
10. How can the amount of prints in the console be toned down?

Thanks.

dgoldenberg-audiomack commented 3 years ago

Retrieving recs with respective ratings looks like this for me now:

```
import tensorflow as tf

# Get recommendations for user 42. `index` is the top-K retrieval layer
# built in the tutorial; it returns a (ratings, titles) tuple.
ratings_42, titles_42 = index(tf.constant(["42"]))
print(">> Recommendations for user 42:")
# Row 0 holds the results for the single query above.
for rating, title in zip(ratings_42[0], titles_42[0]):
    print(f"\t{rating.numpy()} -- {title.numpy().decode('utf-8')}")
```
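
If history items need to be excluded, one option is a post-filter over the (rating, title) pairs retrieved above. The following is a minimal pure-Python sketch, not part of TFRS: `filter_recommendations`, `history`, and the sample data are hypothetical, and it assumes the ratings and titles have already been converted to plain Python lists sorted by descending rating. It also drops repeated titles.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def filter_recommendations(
    scores: Sequence[float],
    titles: Sequence[str],
    seen_titles: Iterable[str],
    k: int,
) -> List[Tuple[float, str]]:
    """Return up to k (score, title) pairs, skipping seen and repeated titles."""
    if k <= 0:
        return []
    excluded: Set[str] = set(seen_titles)
    kept: List[Tuple[float, str]] = []
    for score, title in zip(scores, titles):
        if title in excluded:
            continue
        excluded.add(title)  # suppresses later duplicates of this title
        kept.append((score, title))
        if len(kept) == k:
            break
    return kept


# Hypothetical data: candidates are assumed to arrive sorted by score.
scores = [3.9, 3.7, 3.7, 3.2, 2.8]
titles = ["Movie A", "Movie B", "Movie B", "Movie C", "Movie D"]
history = {"Movie A"}  # items the user has already interacted with

top_2 = filter_recommendations(scores, titles, history, k=2)
print(top_2)
```

Because filtering shrinks the list, it generally helps to retrieve more candidates than are ultimately needed.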

dgoldenberg-audiomack commented 3 years ago

Toning down the prints to the console:

```
import logging
import os

# Must be set before TensorFlow is imported.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
logging.getLogger('tensorflow').setLevel(logging.FATAL)
```

Also:

```
model.fit(cached_train, epochs=3, verbose=0)
```

This doesn't remove all prints but it does remove quite a few.

maciejkula commented 3 years ago

1. Use the `k` parameter detailed in the docs.
2. The first element of that tuple contains the predictions.
3. These can be anything, but non-contiguous values waste memory. This is why the tutorials use vocabulary lookup layers.
4. Unfortunately at this time you have to do your own post-filter.
5. If you are using the top-k layers, as long as they are indexed with a dataset that does not contain duplicates, the responses will not have duplicates.
6. There may be solutions in the wider TensorFlow ecosystem to help with this. TFRS itself does not offer this functionality.
7. That's really dataset-dependent.
8. Hyperparameter tuning is crucial. (And RMSE is absolutely the wrong metric to look at - top-K metrics are much more useful.)
9. TensorFlow has extensive distributed training capabilities. Have a look at this tutorial for a starting point.
10. Set the `verbose` argument in Keras `fit`/`evaluate` calls.
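
To illustrate the answer about ID values: a vocabulary lookup maps arbitrary IDs such as 123 and 998 to small contiguous indices, so the embedding table needs only one row per known ID plus a row for unknown IDs. The pure-Python sketch below mimics that idea. `build_vocabulary` and `lookup` are hypothetical helpers, not TFRS or Keras APIs, and reserving index 0 for unknown IDs is an assumption made for the illustration.

```python
from typing import Dict, Hashable, Iterable

OOV_INDEX = 0  # assumption: index 0 is reserved for IDs not in the vocabulary


def build_vocabulary(raw_ids: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign contiguous indices 1..N to the distinct raw IDs, in first-seen order."""
    vocabulary: Dict[Hashable, int] = {}
    for raw_id in raw_ids:
        if raw_id not in vocabulary:
            vocabulary[raw_id] = len(vocabulary) + 1
    return vocabulary


def lookup(vocabulary: Dict[Hashable, int], raw_id: Hashable) -> int:
    """Map a raw ID to its embedding row, falling back to the unknown-ID row."""
    return vocabulary.get(raw_id, OOV_INDEX)


user_ids = [123, 998, 123]  # raw IDs as they appear in the interaction data
vocabulary = build_vocabulary(user_ids)

# One row per known ID plus one for unknown IDs.
embedding_rows = len(vocabulary) + 1
# Rows needed if the raw integer IDs were used directly as embedding indices.
rows_without_lookup = max(user_ids) + 1

print(vocabulary)
print(lookup(vocabulary, 555))
print(embedding_rows, "rows instead of", rows_without_lookup)
```

With the lookup in place, the two users need 3 embedding rows; indexing the table directly by raw ID would need 999.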

dgoldenberg-audiomack commented 3 years ago

Hi @maciejkula Maciej, thanks for your prompt and thorough responses. Could we keep this ticket open for now? I think I'm going to have a couple more questions as I work through my POC. Thanks.

dgoldenberg-audiomack commented 3 years ago

Hi @maciejkula Maciej

Toning down the amount of prints/logging coming from TF/TFRS. This has taken a few steps for me:
- Set the verbose argument to Keras fit/evaluate calls to 0.
- This env setting must come before the TF imports:
```
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
if True:  # noqa: E402
    import tensorflow as tf
    import tensorflow_datasets as tfds
    import tensorflow_recommenders as tfrs
```
- Also these:
```
import logging

tf.get_logger().setLevel(logging.FATAL)
logging.getLogger("tensorflow").setLevel(logging.FATAL)
logging.getLogger("tensorflow_recommenders").setLevel(logging.FATAL)
```
- Use `@tf.autograph.experimental.do_not_convert` on functions to turn off Autograph prints/logging.
- I'd still like to turn off the below kind of output. Any idea as to how to turn these off?

```
1/5 [=====> ... ] - ETA: 0s - factorized_top_k/top_1_categorical_accuracy: 4.8828e-04
factorized_top_k/top_5_categorical_accuracy: 0.0078
factorized_top_k/top_10_categorical_accuracy: 0.0215
factorized_top_k/top_50_categorical_accuracy: 0.1299
factorized_top_k/top_100_categorical_accuracy: 0.2314
loss: 32467.8496 - regularization_loss: 0.0000e+00
total_loss: 32467.8496
```

> Hyperparameter tuning is crucial.

- Which parameters should be tuned? If we're looking at the MovieLens dataset and the 'retrieval' sample, then the evaluate call returns the following. How good or bad are these, and what are the criteria for judging them?

  ```
  'factorized_top_k/top_1_categorical_accuracy': 0.000699999975040555,
  'factorized_top_k/top_5_categorical_accuracy': 0.009349999949336052,
  'factorized_top_k/top_10_categorical_accuracy': 0.022549999877810478,
  'factorized_top_k/top_50_categorical_accuracy': 0.12484999746084213,
  'factorized_top_k/top_100_categorical_accuracy': 0.23270000517368317,
  'loss': 28244.7734375,
  'regularization_loss': 0,
  'total_loss': 28244.7734375
  ```

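
For reading these numbers, it may help to recall what a top-k accuracy measures: roughly, the fraction of test examples whose true item appears among the k highest-scored candidates. The sketch below computes that quantity in plain Python on made-up scores. `top_k_accuracy` is a hypothetical helper and a simplification, not the TFRS `FactorizedTopK` implementation.

```python
from typing import Sequence


def top_k_accuracy(
    score_rows: Sequence[Sequence[float]],
    true_indices: Sequence[int],
    k: int,
) -> float:
    """Fraction of rows whose true candidate is among the k highest scores."""
    hits = 0
    for scores, true_index in zip(score_rows, true_indices):
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(true_indices)


# Made-up scores: one row per test example, one column per candidate.
score_rows = [
    [0.9, 0.1, 0.4, 0.3],  # true candidate 0 is ranked 1st
    [0.2, 0.8, 0.5, 0.1],  # true candidate 2 is ranked 2nd
    [0.7, 0.6, 0.2, 0.1],  # true candidate 3 is ranked 4th
    [0.3, 0.5, 0.9, 0.4],  # true candidate 3 is ranked 3rd
]
true_indices = [0, 2, 3, 3]

for k in (1, 2, 3, 4):
    print(k, top_k_accuracy(score_rows, true_indices, k))
```

Top-k accuracy can only stay the same or grow as k grows, which matches the pattern in the output above (top-1 below top-5 below top-100). What counts as a good value depends on the dataset and the number of candidates.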
Scaling/distributing.

> Have a look at this tutorial for a starting point.

These look like nice strategies, but they all seem targeted toward specific machines. What are some of the approaches if you're running on a 'utility' cluster, say, in AWS EMR? How does one integrate and scale TFRS there?

dgoldenberg-audiomack commented 3 years ago

Any idea on this?

Trying the MirroredStrategy with the MovieLens/retrieval type of sample, on Spark (AWS EMR).

```
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    movies, test, train, unique_movie_titles, unique_user_ids = prepare_data(movies, ratings)
    model = create_and_train_model(movies, test, train, unique_movie_titles, unique_user_ids)
    generate_recommendations(model, movies)
```

where

```
import tensorflow as tf
import tensorflow_recommenders as tfrs


def create_and_train_model(movies, test, train, unique_movie_titles, unique_user_ids):
    embedding_dimension = 32

    # The query tower
    user_model = tf.keras.Sequential(
        [
            tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=unique_user_ids, mask_token=None),
            # We add an additional embedding to account for unknown tokens.
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension),
        ]
    )

    # The candidate tower
    movie_model = tf.keras.Sequential(
        [
            tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=unique_movie_titles, mask_token=None),
            tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension),
        ]
    )

    # Metrics
    metrics = tfrs.metrics.FactorizedTopK(candidates=movies.batch(128).map(movie_model))

    # Loss
    task = tfrs.tasks.Retrieval(metrics=metrics)

    cached_train = train.shuffle(100_000).batch(8192).cache()
    cached_test = test.batch(4096).cache()

    # MovielensModel is the tfrs.Model subclass defined in the retrieval tutorial.
    model = MovielensModel(user_model, movie_model, task)
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

    # Train the model
    model.fit(cached_train, epochs=3, verbose=0)
    eval_results = model.evaluate(cached_test, return_dict=True)
    return model
```

Got the below error:

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation sequential/embedding/embedding_lookup/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node sequential/embedding/embedding_lookup/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
GatherV2: GPU CPU
Cast: GPU CPU
Const: GPU CPU
ResourceSparseApplyAdagradV2: CPU
_Arg: GPU CPU
ReadVariableOp: GPU CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
  sequential_embedding_embedding_lookup_readvariableop_resource (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  adagrad_adagrad_update_1_update_0_resourcesparseapplyadagradv2_accum (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  sequential/embedding/embedding_lookup/ReadVariableOp (ReadVariableOp)
  sequential/embedding/embedding_lookup/axis (Const)
  sequential/embedding/embedding_lookup (GatherV2)
  gradient_tape/sequential/embedding/embedding_lookup/Shape (Const)
  gradient_tape/sequential/embedding/embedding_lookup/Cast (Cast)
  Adagrad/Adagrad/update_1/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0

     [[{{node sequential/embedding/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_1206667]
```

davidcereal commented 3 years ago

@dgoldenberg-audiomack, it appears Adagrad is missing some GPU implementation aspects in TF, as pointed out in issue #138. Perhaps try the SGD or Adam optimizer.

dgoldenberg-audiomack commented 3 years ago

Thanks, @davidcereal! This has definitely fixed that issue for me. I'd peeked through the issue list for TF but didn't see anything related to this particular thing. (OK, I see #138 here, thanks.)

tensorflow / recommenders

Some issues not covered in the tutorials #190