Open koaning opened 2 years ago
Interesting.. Would every cell in one table need to be computed with all others?
I don't think so unless every user has interacted with every item.
I've started with an item-item count table though.
import polars as pl

def item_item_counts(dataf, user_col="user", item_col="item"):
    """
    Computes item-item overlap counts from user-item interactions, useful for recommendations.
    This function is meant to be used in a `.pipe()`-line.

    Arguments:
    - dataf: polars dataframe
    - user_col: name of the column containing the user id
    - item_col: name of the column containing the item id
    """
    return (dataf
      .with_columns([
        # pair every item with each item the same user interacted with
        pl.col(item_col).list().over(user_col).explode().alias("item_rec"),
      ])
      # drop pairs of an item with itself
      .filter(pl.col(item_col) != pl.col("item_rec"))
      .with_columns([
        pl.col(user_col).count().over(item_col).alias("n_item"),
        pl.col(user_col).count().over("item_rec").alias("n_item_rec"),
        pl.col(user_col).count().over([item_col, "item_rec"]).alias("n_both"),
      ])
      .select([item_col, "item_rec", "n_item", "n_item_rec", "n_both"])
      .drop_duplicates()
    )
Something is telling me these kinds of queries are gonna benchmark reaaaal well.
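To make the shape of the output table concrete, here is a minimal plain-Python sketch (no polars) on a made-up interaction log. The function name `item_item_counts_py` and the toy data are hypothetical. It also assumes that `n_item` means the number of distinct users per item and `n_both` the number of distinct users per ordered item pair, which may not match exactly what the window counts in the snippet above produce after the filter.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy interaction log: (user, item) pairs.
interactions = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "b"), ("u3", "c"),
]

def item_item_counts_py(pairs):
    """Return sorted rows of (item, item_rec, n_item, n_item_rec, n_both)."""
    # Collect the set of items each user interacted with.
    items_per_user = {}
    for user, item in pairs:
        items_per_user.setdefault(user, set()).add(item)

    n_item = Counter()  # distinct users per item
    n_both = Counter()  # distinct users per ordered (item, item_rec) pair
    for items in items_per_user.values():
        n_item.update(items)
        # permutations(..., 2) yields ordered pairs and never pairs an item with itself
        n_both.update(permutations(sorted(items), 2))

    return sorted(
        (a, b, n_item[a], n_item[b], n) for (a, b), n in n_both.items()
    )

rows = item_item_counts_py(interactions)
for row in rows:
    print(row)
```

The number of pairs grows with the square of the number of items per user, so only item pairs that actually co-occur for some user ever get a row.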
Got it.
It's something like this:
result = (df
  .pipe(remove_outliers)
  .with_column(
    pl.col('item').list().over('user').explode().alias("item_rec")
  )
  .filter(pl.col("item") != pl.col("item_rec"))
  .with_columns([
    pl.col('user').count().over('item').alias("n_item"),
    pl.col('user').count().over('item_rec').alias("n_item_rec"),
    pl.col('user').count().over(['item', 'item_rec']).alias("n_both")
  ])
)

(result
  .with_column((pl.col('n_both')/pl.col('n_item')).alias('rating'))
  .filter(pl.col('n_both') > 10)
  .sort(['item', 'rating'], reverse=True))
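The last step (rating, minimum-support filter, sort) can be sketched in plain Python as well. The counts below are made up, and `top_recommendations` and `min_both` are hypothetical names. Unlike the query above, which sorts both columns in reverse, this sketch sorts items ascending and ratings descending within each item, purely to make the output easier to read.

```python
# Hypothetical counts: (item, item_rec) -> (n_item, n_both)
counts = {
    ("a", "b"): (40, 30),
    ("a", "c"): (40, 12),
    ("a", "d"): (40, 4),   # too little support, gets filtered out
    ("b", "a"): (60, 30),
}

def top_recommendations(counts, min_both=10):
    """Rate each pair as n_both / n_item, keeping pairs with n_both > min_both."""
    rated = [
        (item, item_rec, n_both / n_item)
        for (item, item_rec), (n_item, n_both) in counts.items()
        if n_both > min_both
    ]
    # Group by item; within each item, highest rating first.
    return sorted(rated, key=lambda r: (r[0], -r[2]))

recs = top_recommendations(counts)
for item, item_rec, rating in recs:
    print(item, item_rec, round(rating, 2))
```

The threshold on `n_both` is there because a ratio built from a handful of co-occurrences is noisy; the value 10 is simply the one used in the query above.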
@ritchie46 does polars support `log`?
Given a log of weighted `user`-`item` interactions, can we generate an `item`-`item` recommendation table and a `user`-`item` recommendation table?

Kind of! We can calculate `p(item_a | item_b)` and `p(item_a)`, which can be reweighed into a table with recommendations. We can also do something similar for users. After all, a user that interacted with items `a`, `b` and `c` will have a score for item `x` defined via;