pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Implement Series similarity functions #5969

Open mzaks opened 1 year ago

mzaks commented 1 year ago

Problem description

Given two Series (column or array), I would like to be able to compute a similarity score with the Jaccard index.

Furthermore, other hash-based similarity functions like MinHash, SimHash and LSH could be considered when accuracy can be sacrificed in favour of performance.
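
To illustrate the hash-based idea, here is a minimal MinHash sketch in plain NumPy; this is not a Polars feature, and the function names (minhash_signature, estimated_jaccard) are purely illustrative:

import numpy as np

PRIME = (1 << 31) - 1  # Mersenne prime for the random affine hash family

def minhash_signature(values, num_hashes: int = 128, seed: int = 0) -> np.ndarray:
    # One random hash function h_i(x) = (a_i * x + b_i) mod PRIME per signature
    # slot; each slot keeps the minimum hash value over all set members.
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=num_hashes, dtype=np.uint64)
    b = rng.integers(0, PRIME, size=num_hashes, dtype=np.uint64)
    x = np.fromiter((hash(v) % PRIME for v in set(values)), dtype=np.uint64)
    return ((a[:, None] * x[None, :] + b[:, None]) % PRIME).min(axis=1)

def estimated_jaccard(sig1: np.ndarray, sig2: np.ndarray) -> float:
    # The fraction of slots on which two signatures agree is an unbiased
    # estimator of the Jaccard index of the underlying sets.
    return float((sig1 == sig2).mean())

sig_a = minhash_signature([1, 2, 3, 4, 5])
sig_b = minhash_signature([4, 5, 6, 7, 8, 9, 10])
estimated_jaccard(sig_a, sig_b)  # ≈ 0.2 (the exact Jaccard index), up to sampling noise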

alexander-beedie commented 1 year ago

This would be a good question for StackOverflow. Assuming you're looking to compute it over binarised data, I think it's probably as simple as this:

import polars as pl

def jaccard(s1: pl.Series, s2: pl.Series) -> float:
    if len(s1) != len(s2):
        raise ValueError("The given Series must be the same length")

    # Binarised Jaccard: |s1 AND s2| / |s1 OR s2| over the boolean-cast values.
    s1, s2 = s1.cast(pl.Boolean), s2.cast(pl.Boolean)
    return (s1 & s2).sum() / (s1 | s2).sum()

Example:


s1 = pl.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1])
s2 = pl.Series([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0])

jaccard(s1, s2)  # => 0.625

mzaks commented 1 year ago

Sorry, but to compute a Jaccard score (or set similarity in general) the data does not have to be binary and the Series do not have to be the same length, so although your solution computes a score, it is hardly a real Jaccard index function.

E.g. [1, 2, 3] and [1, 3] have a Jaccard index of 0.5. I fail to see how your solution would be suitable to compute this score.

alexander-beedie commented 1 year ago

Sorry, but to compute a Jaccard score (or set similarity in general) the data does not have to be binary and the Series do not have to be the same length, so although your solution computes a score, it is hardly a real Jaccard index function.

Indeed; that's why I said "assuming" ;) The binarised similarity score is actually one of the most common Jaccard metrics across many disciplines, though. However, if you need one of the more generalised scores then the above is obviously not suitable.

alexander-beedie commented 1 year ago

Try the following more general (non-binarised) Jaccard metrics instead (note: I think your 0.5 value may be incorrect; I'd expect 0.333 or 0.666 depending on whether you are looking at similarity or dissimilarity/distance).

import polars as pl

def jaccard_similarity(s1: pl.Series, s2: pl.Series) -> float:
    # General set-based Jaccard: |s1 ∩ s2| / |s1 ∪ s2|.
    s1, s2 = set(s1.unique()), set(s2.unique())
    s3 = s1.intersection(s2)
    return len(s3) / (len(s1) + len(s2) - len(s3))

def jaccard_distance(s1: pl.Series, s2: pl.Series) -> float:
    return 1 - jaccard_similarity(s1, s2)

Example:

s1 = pl.Series([1, 2, 3, 4, 5])
s2 = pl.Series([4, 5, 6, 7, 8, 9, 10])

jaccard_similarity(s1, s2)  # => 0.2
jaccard_distance(s1, s2)    # => 0.8

(We don't currently have a native intersection operation for Series).
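
As a workaround sketch (not a native Series API), the intersection can be emulated with Series.is_in; the helper name series_intersection below is purely illustrative, reusing the s1/s2 from the example above:

def series_intersection(s1: pl.Series, s2: pl.Series) -> pl.Series:
    # Keep the unique values of s1 that also occur in s2.
    s1 = s1.unique()
    return s1.filter(s1.is_in(s2))

series_intersection(s1, s2)  # => Series with values {4, 5} (order not guaranteed)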

cbilot commented 1 year ago

We don't currently have a native intersection operation for Series

Just a small suggestion: one optimization would be to use a join instead of Python sets. For example:

def jaccard_similarity(s1: pl.Series, s2: pl.Series) -> float:
    s1, s2 = s1.unique(), s2.unique()
    # An unnamed Series gets the empty string "" as its column name, so the
    # inner join on "" yields the intersection without Python-level sets.
    s3 = s1.to_frame().join(s2.to_frame(), how="inner", on="").to_series()
    return s3.len() / (s1.len() + s2.len() - s3.len())

def jaccard_distance(s1: pl.Series, s2: pl.Series):
    return 1 - jaccard_similarity(s1, s2)

Benchmarking these:

import numpy as np
import time

rng = np.random.default_rng()
size = 10_000_000
s1 = pl.Series(rng.integers(0, 10_000_000, size=size))
s2 = pl.Series(rng.integers(0, 10_000_000, size=size))

start = time.perf_counter()
jaccard_similarity(s1, s2)
print("Using join:", time.perf_counter() - start)

Both implementations return the same similarity, 0.4620962633223721. Timings:

Using join: 0.3586309409997739
Using python sets: 23.385584267000013

alexander-beedie commented 1 year ago

Just a small suggestion: one optimization would be to use a join instead of Python sets.

I am kicking myself - that's a much better implementation ;)

ritchie46 commented 1 year ago

Good to see you again @cbilot. 🙂 Happy new year!

mzaks commented 1 year ago

:) Thanks for the suggestions, folks. I think I caused some confusion because I did not communicate the context of the ticket clearly in the description.

I will try to do better now :).

This issue was created after a discussion I had with Ritchie on Discord. The join-based solution works fine for a Series, but it is less convenient when it needs to be applied to a column of list values. In that case we would need to explode, join, and then group by again to get the count of the overlapping members of the two sets (a sketch follows the example below). That also works, but it is quite memory intensive as far as I can tell.

Example:

df = pl.DataFrame({
    "A": [1, 2, 3],
    "B": [[1, 2, 3], [3, 4], [5]],
    "C": [[1, 3, 5], [3, 4, 1], [6, 7]],
})

where we want to compute the per-row similarity score between B and C.
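
As a rough sketch of that explode / join / group-by approach, using the df above. Note that the exact API used here (group_by, pl.len, the list namespace) follows recent Polars releases and may differ from what was available when this thread was written:

# Explode each list column to (row id, value) pairs; the inner join then
# yields one row per element of each per-row intersection.
b = df.select("A", "B").explode("B").unique()
c = df.select("A", "C").explode("C").unique()
intersection = (
    b.join(c, left_on=["A", "B"], right_on=["A", "C"], how="inner")
    .group_by("A")
    .agg(pl.len().alias("n_common"))
)

# Per-row Jaccard: |B ∩ C| / (|B| + |C| - |B ∩ C|).
result = (
    df.with_columns(
        pl.col("B").list.unique().list.len().alias("n_b"),
        pl.col("C").list.unique().list.len().alias("n_c"),
    )
    .join(intersection, on="A", how="left")
    .with_columns(pl.col("n_common").fill_null(0))
    .with_columns(
        (pl.col("n_common") / (pl.col("n_b") + pl.col("n_c") - pl.col("n_common"))).alias("jaccard")
    )
)
print(result)  # jaccard column: 0.5, 0.666..., 0.0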

Given this wrinkle, and the general usefulness of set similarity functions, Ritchie asked me to create an issue.

ritchie46 commented 1 year ago

@mzaks I think I have something better than multiprocessing:

https://github.com/pola-rs/pyo3-polars

This allows you to write UDFs in Rust and seamlessly convert a Python DataFrame to a Rust DataFrame.

mzaks commented 1 year ago

Thanks @ritchie46! I will have a look.