tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Feature preprocessing doc needs more info #207

Open dgoldenberg-audiomack opened 3 years ago

dgoldenberg-audiomack commented 3 years ago

I think others may be looking for the same kinds of info as me.

Working through the featurization doc, several questions arise (I can't seem to find relevant info on Stack Overflow):

  1. If you have user/item events with associated numeric ratings (assigned explicitly or computed implicitly), how do you featurize the ratings? The examples focus on featurizing user IDs and item IDs (e.g. the movie titles), but what about event ratings in addition to those two?
  2. Text feature processing. Are there any how-tos or recipes for dealing with multiple languages? There doesn't seem to be anything about multi-language support in the Keras TextVectorization doc. This Stack Overflow post talks about using NLTK or the like. It would be very helpful if TFRS had a write-up on how to handle multi-language text features.
  3. "Turning categorical features into embeddings" talks about translating raw tokens to embedding IDs and uses the adapt method, but the Retrieval tutorial does not use adapt. How important is it to convert, for example, string user IDs into integers? Is this step a must? "During model training, the value of that vector is adjusted to help the model predict its objective better." -- does this mean that without the conversion to integers, prediction accuracy will be much worse? If this step is done, can I still use the string user IDs when looking up predictions, or do I need to use the integers? How does one go back and forth between the two forms?
  4. When instrumenting features for a model, how can I indicate the relative importance of one feature vs. another? For instance, if I want genre to be treated as N times more important than, say, the duration of the movie, how do I specify that? Or do we leave it up to TF itself to learn what's more important?
  5. User geo-location featurization. I have the following types of geolocation data on events: geoname_id (a unique geolocation integer), country, administrative division (e.g. state), latitude, and longitude. Does it make sense to featurize the location IDs, or something less precise such as country + state (if any)?
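To make question 3 concrete, here is a pure-Python sketch of the round trip between raw string IDs and integer embedding indices. The class name `ToyStringLookup` is made up for illustration; the real Keras layer is `tf.keras.layers.StringLookup`, whose `adapt` method builds exactly this kind of vocabulary, and whose `invert=True` mode provides the reverse lookup used when serving predictions:

```python
# Conceptual sketch (plain Python, NOT the Keras API) of what a
# vocabulary layer does: adapt() builds a string -> integer
# vocabulary, lookup() maps raw IDs to embedding-table indices,
# and an inverse table maps indices back to raw IDs for serving.

class ToyStringLookup:
    OOV = 0  # index reserved for out-of-vocabulary IDs

    def __init__(self):
        self.vocab = {}    # string ID -> integer index
        self.inverse = {}  # integer index -> string ID

    def adapt(self, raw_ids):
        """Build the vocabulary from the training data."""
        for raw in raw_ids:
            if raw not in self.vocab:
                idx = len(self.vocab) + 1  # 0 is reserved for OOV
                self.vocab[raw] = idx
                self.inverse[idx] = raw

    def lookup(self, raw_id):
        """Raw string ID -> integer index (a row in the embedding table)."""
        return self.vocab.get(raw_id, self.OOV)

    def reverse_lookup(self, idx):
        """Integer index -> raw string ID, for reading results back out."""
        return self.inverse.get(idx, "[UNK]")

lookup = ToyStringLookup()
lookup.adapt(["user_42", "user_7", "user_42", "user_99"])
print(lookup.lookup("user_7"))                          # a small integer index
print(lookup.reverse_lookup(lookup.lookup("user_7")))   # back to "user_7"
print(lookup.lookup("user_unknown"))                    # 0 (out of vocabulary)
```

The integer step matters because an embedding table is indexed by position; the string form is only for humans, and a lookup layer (or a pair of tables like the above) is how you move between the two.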
dgoldenberg-audiomack commented 3 years ago

Hi, is there any update on this issue? Thanks.

maciejkula commented 3 years ago

As a piece of general advice, I strongly recommend starting with general deep learning resources or courses. They will have decent answers to all of your questions; none of this is specific to TensorFlow Recommenders.

  1. You can use the raw number; you can embed it. Or both!
  2. Deep NLP is a very big topic. If you're just tokenizing and embedding, however, this might not matter - tokens are tokens, and it doesn't matter what language they come from. TensorFlow Hub might have some pretrained multilingual models for you if you want to go deeper.
  3. I recommend learning how embeddings work.
  4. In general the model will try to figure that out.
  5. It might, depending on the sparsity of your data.
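Answer 1 ("use the raw number; embed it; or both") can be sketched in plain Python. The rating range and bucket boundaries below are made-up illustration values; in Keras, the corresponding preprocessing layers would be `Normalization` (or a manual rescale) for the raw number and `Discretization` followed by an `Embedding` for the categorical route:

```python
# Three ways to featurize a numeric rating (plain-Python sketch,
# not a Keras API; all constants below are illustrative).

RATING_MIN, RATING_MAX = 1.0, 5.0
BUCKET_BOUNDARIES = [2.0, 3.0, 4.0]  # yields 4 buckets

def normalize(rating):
    """Option 1: feed the raw number, rescaled to [0, 1]."""
    return (rating - RATING_MIN) / (RATING_MAX - RATING_MIN)

def bucketize(rating):
    """Option 2: map the rating to a bucket index, which can then
    index an embedding table like any other categorical feature."""
    for i, boundary in enumerate(BUCKET_BOUNDARIES):
        if rating < boundary:
            return i
    return len(BUCKET_BOUNDARIES)

def featurize(rating):
    """Option 3: both - the continuous value plus an embedding index."""
    return normalize(rating), bucketize(rating)

print(featurize(4.5))  # (0.875, 3)
```

Which option works best depends on the data: the raw number preserves ordering cheaply, while the embedded bucket lets the model learn non-linear effects per rating band.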
maciejkula commented 3 years ago

I struggle to find a good explanation of the raw features -> vocabularies -> embeddings flow, but this is a reasonable article.

dgoldenberg-audiomack commented 3 years ago

Thanks, @maciejkula. I'll check out the link. It feels like more developers will run into the same questions; I think the library would benefit from recipes/samples that address them.

numeric ratings

Looking into embeddings, what's not clear to me is this: the TFRS models focus on two towers, query/users and items. Since ratings are tuples of { user ID, item ID, rating }, it's not clear which tower to graft them onto. Might there need to be a "third tower"?
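If I understand the ranking tutorials correctly, the rating is not grafted onto either tower: the user tower and item tower each produce an embedding, a score is computed from the pair, and the rating serves as the training *label* that score is regressed against (in TFRS, via something like a `tfrs.tasks.Ranking` task with an MSE loss). A toy plain-Python sketch of that arrangement, with made-up numbers:

```python
# Sketch of a rating-prediction setup: two towers, no third tower.
# Each tower outputs an embedding; the predicted score is their dot
# product (a real model might instead run an MLP over both); the
# observed rating is the LABEL the score is trained toward.
# All values below are made up for illustration.

user_embedding = [0.2, 0.8, -0.1]  # output of the query/user tower
item_embedding = [1.0, 0.5, 0.3]   # output of the item tower

def score(u, v):
    """Predicted affinity: dot product of the two tower outputs."""
    return sum(a * b for a, b in zip(u, v))

observed_rating = 4.0  # from the {user ID, item ID, rating} tuple

prediction = score(user_embedding, item_embedding)
squared_error = (observed_rating - prediction) ** 2  # per-example loss term
print(round(prediction, 2), round(squared_error, 2))  # 0.57 11.76
```

So the third element of the tuple lives in the loss, not in a tower.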