pola-rs / valves

general functions for your data .pipe()-lines.

user_item and item_item recommender tables #12

Open koaning opened 2 years ago

koaning commented 2 years ago

Given a log of weighted user-item interactions, can we generate an item-item recommendation table and a user-item recommendation table?

Kind of! We can calculate p(item_a | item_b) and p(item_a), which can be reweighted into a table with recommendations. We can also do something similar for users. After all, a user that interacted with items a, b and c will have a score for item x defined via;

p(item_x | user) = p(item_x | item_a, item_b, item_c)
                 \propto p(item_x | item_a) p(item_x | item_b) p(item_x | item_c)
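
To make the scoring rule above concrete, here is a minimal pure-Python sketch. The probability values, the `p_given` dictionary and the `score` helper are all made up for illustration and are not part of valves; the product of conditionals is the heuristic from the comment, not a normalised probability.

```python
from math import prod

# Hypothetical conditional probabilities p(candidate | seen_item).
# Keys are (candidate, seen_item); the numbers are invented for the example.
p_given = {
    ("x", "a"): 0.5, ("x", "b"): 0.4, ("x", "c"): 0.5,
    ("y", "a"): 0.1, ("y", "b"): 0.2, ("y", "c"): 0.5,
}


def score(candidate, seen_items, p_given):
    """Unnormalised score: product of p(candidate | seen_item) over seen items."""
    # A pair that never co-occurred contributes 0.0 and zeroes the whole score.
    return prod(p_given.get((candidate, s), 0.0) for s in seen_items)


seen = ["a", "b", "c"]
scores = {c: score(c, seen, p_given) for c in ("x", "y")}

# Rank candidates from best to worst score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the scores are only proportional to the probability, they are useful for ranking candidates per user rather than for reading off actual probabilities.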

ritchie46 commented 2 years ago

Interesting.. Would every cell in one table need to be computed with all others?

koaning commented 2 years ago

I don't think so unless every user has interacted with every item.

I've started with an item-item count table though.

import polars as pl


def item_item_counts(dataf, user_col="user", item_col="item"):
    """
    Computes item-item overlap counts from user-item interactions, useful for recommendations.

    This function is meant to be used in a `.pipe()`-line.

    Arguments:
        - dataf: polars dataframe
        - user_col: name of the column containing the user id
        - item_col: name of the column containing the item id
    """
    return (dataf
        .with_columns([
            # Collect each user's items and explode them into a candidate column.
            pl.col(item_col).list().over(user_col).explode().alias("item_rec"),
        ])
        # Drop pairs of an item with itself.
        .filter(pl.col(item_col) != pl.col("item_rec"))
        .with_columns([
            pl.col(user_col).count().over(item_col).alias("n_item"),
            pl.col(user_col).count().over("item_rec").alias("n_item_rec"),
            pl.col(user_col).count().over([item_col, "item_rec"]).alias("n_both"),
        ])
        .select([item_col, "item_rec", "n_item", "n_item_rec", "n_both"])
        .drop_duplicates()
    )

Something is telling me these kinds of queries are gonna benchmark reaaaal well.

koaning commented 2 years ago

Got it.

It's something like this;

result = (df
  .pipe(remove_outliers)
  .with_column(
      pl.col('item').list().over('user').explode().alias("item_rec")
  )
  .filter(pl.col("item") != pl.col("item_rec"))
  .with_columns([
    pl.col('user').count().over('item').alias("n_item"),
    pl.col('user').count().over('item_rec').alias("n_item_rec"),
    pl.col('user').count().over(['item', 'item_rec']).alias("n_both")
  ])
)

(result
  .with_column((pl.col('n_both')/pl.col('n_item')).alias('rating'))
  .filter(pl.col('n_both') > 10)
  .sort(['item', 'rating'], reverse=True))

koaning commented 2 years ago

@ritchie46 does polars support log?