tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Ranking products #543

Open Ullar-Kask opened 2 years ago

Ullar-Kask commented 2 years ago

Hi,

In a ranking model for a web store, what is customary to use as the product rating? In the movie database it's the movie ratings given by users; in a web store there are no product ratings, just transaction data (who purchased what and when).

(Essentially the same question as in https://github.com/tensorflow/recommenders/issues/355 and https://github.com/tensorflow/recommenders/issues/389)

hkristof03 commented 2 years ago

@Ullar-Kask

You should have data about what was visible to the user each time on the page. Products that were visible to the user and did not result in a click serve as implicit negative examples, while products that were clicked on serve as explicit positive examples. Given additional features for both the user and the items, you predict the probability of each item being clicked on. Then you sort the items by the predicted probabilities.
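As a rough illustration of the labelling and sorting steps above, here is a minimal sketch. The function names, the `(user_id, item_id)` log format, and the score dictionary are assumptions made for the example, not part of TensorFlow Recommenders; the predicted probabilities would come from whatever model you train.

```python
def label_impressions(impressions, clicks):
    """Turn impression and click logs into (user_id, item_id, label) rows.

    impressions, clicks: iterables of (user_id, item_id) pairs (hypothetical format).
    Shown-and-clicked pairs get label 1, shown-but-not-clicked pairs get label 0.
    """
    clicked = set(clicks)
    return [(u, i, 1 if (u, i) in clicked else 0) for u, i in impressions]


def rank_items(scores):
    """Sort item ids by predicted click probability, highest first.

    scores: dict mapping item_id -> predicted probability (from any model).
    """
    return sorted(scores, key=scores.get, reverse=True)


labelled = label_impressions(
    impressions=[("u1", "a"), ("u1", "b")],
    clicks=[("u1", "b")],
)
ranking = rank_items({"a": 0.1, "b": 0.7, "c": 0.4})
```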

Ullar-Kask commented 2 years ago

Thanks for your thoughts! We do not have data about what was visible to the user each time on the page, nor click data. As a solution to the problem, I am thinking of generating synthetic transaction data as negative samples. Namely, for each transaction record I would generate one (or two, or say, N) synthetic records by combining the true transaction data with a randomly selected item from the set of items the customer has not purchased, and setting label=0 for each such record. What do you think of this approach? Might it work? What is a reasonable value of N? Does the purchase frequency of an item play a role when it is used as such a negative sample?
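A minimal sketch of the negative-sampling idea described above, in plain Python. The function name, the `(customer_id, product_id)` record format, and the uniform sampling over unpurchased items are assumptions for illustration; sampling in proportion to purchase frequency would be an alternative worth comparing.

```python
import random
from collections import defaultdict


def add_negative_samples(transactions, catalog, n_negatives=2, seed=0):
    """Return (customer_id, product_id, label) rows.

    Real purchases get label 1. For each purchase, up to n_negatives products
    the customer has never purchased are sampled uniformly and get label 0.
    """
    rng = random.Random(seed)

    # Collect the set of products each customer has actually bought.
    purchased = defaultdict(set)
    for customer_id, product_id in transactions:
        purchased[customer_id].add(product_id)

    # Per customer, the candidate negatives are all unpurchased catalog items.
    candidates = {
        customer_id: sorted(set(catalog) - items)
        for customer_id, items in purchased.items()
    }

    rows = []
    for customer_id, product_id in transactions:
        rows.append((customer_id, product_id, 1))
        pool = candidates[customer_id]
        k = min(n_negatives, len(pool))
        for negative_id in rng.sample(pool, k):
            rows.append((customer_id, negative_id, 0))
    return rows


transactions = [("alice", "p1"), ("alice", "p2"), ("bob", "p3")]
catalog = ["p1", "p2", "p3", "p4", "p5"]
rows = add_negative_samples(transactions, catalog, n_negatives=2)
```

Each positive record is followed by its sampled negatives, so the output can be fed to a model that predicts the label from customer and product features.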

JV-Nunes commented 1 year ago

@Ullar-Kask Do you have any updates on how your approach succeeded? I am facing the same scenario. If you could provide some code showing how you pre-process your data into the described format, that would be very helpful.

Ullar-Kask commented 1 year ago

The approach works, as our testing shows. We generate N negative samples for each positive sample (as mentioned above). The larger the value of N, the better the results. Currently we have N=40, and it's limited by the amount of memory in the server. I am not displaying the complete code because it's pretty technical, but in principle we loop over customers, `for customer_id, df_customer in df.groupby('customer_id', sort=False)[['product_id']]:`, generate negative samples for each customer and "rate" them in the following way (label="rating"):

1) label=0: randomly picked unpurchased, unclicked, unrecommended product from the product catalog (a large source of random products);
2) label=1: recommended but unclicked and unpurchased product (unpurchased within the timeframe of T-365...T-90 days);
3) label=2: clicked but unpurchased product;

and then for positive samples we take

4) label=3: purchased product.

Products with labels 0..2 sum up to N for each label=3 record. You may experiment by switching labels 0 and 1.
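The graded labelling described above could be sketched roughly as follows for a single customer. This is not the code used in the thread: the function name and inputs are hypothetical, the time window is ignored, and the way the N slots are split (clicked products first, then recommended ones, then random catalog items) is an assumption, since the thread does not say how the split is made.

```python
import random


def graded_samples(purchased, clicked, recommended, catalog, n, seed=0):
    """Return (product_id, label) rows for one customer.

    label 3: purchased
    label 2: clicked but not purchased
    label 1: recommended but neither clicked nor purchased
    label 0: random catalog product that was none of the above
    Each label-3 row is followed by up to n rows with labels 0..2.
    """
    rng = random.Random(seed)
    purchased = set(purchased)
    clicked_only = sorted(set(clicked) - purchased)
    recommended_only = sorted(set(recommended) - purchased - set(clicked))
    random_pool = sorted(
        set(catalog) - purchased - set(clicked) - set(recommended)
    )

    rows = []
    for product_id in sorted(purchased):
        rows.append((product_id, 3))
        remaining = n
        # Assumed fill order: strongest signal first, random items last.
        for pool, label in ((clicked_only, 2), (recommended_only, 1), (random_pool, 0)):
            k = min(remaining, len(pool))
            rows.extend((p, label) for p in rng.sample(pool, k))
            remaining -= k
    return rows


rows = graded_samples(
    purchased={"p1"},
    clicked={"p1", "p2"},
    recommended={"p2", "p3"},
    catalog=[f"p{i}" for i in range(1, 9)],
    n=4,
)
```

The resulting labels play the role of the user ratings in the movie example, so the rows can be used wherever a ranking model expects a rating column.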

Here you have the "movie ratings" database ;)

BTW, using cudf instead of pandas speeds up negative sample generation 2x.

JV-Nunes commented 1 year ago

@Ullar-Kask great approach! Thanks for sharing. At the moment I don't have click information, but I do know if the product has been recommended in the past. I'm going to try a scoring system similar to the one you use. As for the model itself, is the one developed in this tutorial a good starting point?