mzaks opened this issue 1 year ago
This would be a good question for StackOverflow. Assuming you're looking to compute it over binarised data, I think it's probably as simple as this:
```python
import polars as pl

def jaccard(s1: pl.Series, s2: pl.Series):
    if len(s1) != len(s2):
        raise ValueError("The given Series must be the same length")
    s1, s2 = s1.cast(pl.Boolean), s2.cast(pl.Boolean)
    return (s1 & s2).sum() / (s1 | s2).sum()
```
Example:

```python
s1 = pl.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1])
s2 = pl.Series([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0])
jaccard(s1, s2)  # => 0.625
```
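For illustration, the same binarised score can be reproduced with plain Python, which makes the underlying formula |A∩B| / |A∪B| explicit (this sketch is not part of the original answer):

```python
# Binarised Jaccard with plain Python: the intersection counts positions
# where both vectors are 1, the union counts positions where at least one is 1.
v1 = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]
v2 = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0]

inter = sum(a & b for a, b in zip(v1, v2))  # 5 positions where both are 1
union = sum(a | b for a, b in zip(v1, v2))  # 8 positions where either is 1
score = inter / union
print(score)  # => 0.625, matching the Polars result above
```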
Sorry, but to compute a Jaccard score (or general set similarity) the data does not have to be binary, and the two inputs do not have to be the same length. So although your solution computes a score, it is hardly a real Jaccard index function. E.g. `[1, 2, 3]` and `[1, 3]` have a Jaccard index of 0.5. I fail to see how your solution would be suitable to compute this score.
> Sorry but in order to compute a jaccard score or general set similarity, the data does not have to be binary, it doesn't have to be of the same length, so although your solution computes a score it is hardly a real jaccard index function.
Indeed; that's why I said "assuming" ;) The binarised similarity score is actually one of the most common types of Jaccard metric across many disciplines, though. However, if you need one of the more generalised scores then the above is obviously not suitable.

Try the following more general (not binarised) Jaccard metrics instead (note: I think your 0.5 value may be incorrect - I'd expect 0.666 for similarity or 0.333 for dissimilarity/distance).
```python
import polars as pl

def jaccard_similarity(s1: pl.Series, s2: pl.Series):
    s1, s2 = set(s1.unique()), set(s2.unique())
    s3 = s1.intersection(s2)
    return len(s3) / (len(s1) + len(s2) - len(s3))

def jaccard_distance(s1: pl.Series, s2: pl.Series):
    return 1 - jaccard_similarity(s1, s2)
```
Example:

```python
s1 = pl.Series([1, 2, 3, 4, 5])
s2 = pl.Series([4, 5, 6, 7, 8, 9, 10])
jaccard_similarity(s1, s2)  # => 0.2
jaccard_distance(s1, s2)    # => 0.8
```
(We don't currently have a native intersection operation for Series.)
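As a quick sanity check of the disputed value for `[1, 2, 3]` vs `[1, 3]`, here is the set-based computation in plain Python (illustration only):

```python
# Set-based Jaccard of the two lists discussed above.
a, b = {1, 2, 3}, {1, 3}
similarity = len(a & b) / len(a | b)  # 2 / 3 ~= 0.667
distance = 1 - similarity             # 1 / 3 ~= 0.333
print(similarity, distance)
```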
> We don't currently have a native intersection operation for Series
Just a small suggestion: one optimization would be to use a join instead of Python sets. For example:
```python
def jaccard_similarity(s1: pl.Series, s2: pl.Series):
    s1, s2 = s1.unique(), s2.unique()
    # join on "" - the default column name of a frame built from an unnamed Series
    s3 = s1.to_frame().join(s2.to_frame(), how="inner", on="").to_series()
    return s3.len() / (s1.len() + s2.len() - s3.len())

def jaccard_distance(s1: pl.Series, s2: pl.Series):
    return 1 - jaccard_similarity(s1, s2)
```
Benchmarking these:

```python
import time

import numpy as np
import polars as pl

rng = np.random.default_rng()
size = 10_000_000
s1 = pl.Series(rng.integers(0, 10_000_000, size=size))
s2 = pl.Series(rng.integers(0, 10_000_000, size=size))

start = time.perf_counter()
jaccard_similarity(s1, s2)
print("Using join:", time.perf_counter() - start)
```
Using join:

```
>>> print("Using join:", time.perf_counter() - start)
Using join: 0.3586309409997739
```

Using Python sets:

```
>>> print("Using python sets:", time.perf_counter() - start)
Using python sets: 23.385584267000013
```
> Just a small suggestion: one optimization would be to use a join instead of using python set.

I am kicking myself - that's a much better implementation ;)
Good to see you again @cbilot. 🙂 Happy new year!
:) Thanks for the suggestions folks. I think I caused some confusion because I did not communicate the context of the ticket clearly in the description. I will try to do better now :).

This issue was created after a discussion I had with Ritchie on Discord. The solution with a join works fine for a Series, but it becomes less convenient when it needs to be applied to a column of array values. In that case we would need to explode, join, and then group by again to get the count of the overlapping members of the two sets. This also works, but is quite memory intensive as far as I can tell.
Example:

```python
df = pl.DataFrame({
    "A": [1, 2, 3],
    "B": [[1, 2, 3], [3, 4], [5]],
    "C": [[1, 3, 5], [3, 4, 1], [6, 7]],
})
```

where we want to compute the similarity score between `B` and `C`.
Given this wrinkle, and the general usefulness of set similarity functions, Ritchie asked me to create an issue.
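For reference, the per-row scores for the example above can be computed with plain Python sets (slow, but it shows the values a native implementation would be expected to produce; the helper name here is mine):

```python
def jaccard(a, b):
    """Set-based Jaccard similarity of two iterables."""
    a, b = set(a), set(b)
    union = a | b
    # Convention: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0

# The list columns from the example DataFrame above.
B = [[1, 2, 3], [3, 4], [5]]
C = [[1, 3, 5], [3, 4, 1], [6, 7]]

scores = [jaccard(b, c) for b, c in zip(B, C)]
print(scores)  # => [0.5, 0.666..., 0.0]
```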
@mzaks I think I have something better than multiprocessing:
https://github.com/pola-rs/pyo3-polars
This allows you to write UDFs in Rust and seamlessly convert a Python DataFrame to a Rust DataFrame.
Thanks @ritchie46! I will have a look.
Problem description
Given two Series (column or array) I would like to be able to compute a similarity score with the Jaccard index.
Furthermore, other hash-based similarity functions like MinHash, SimHash and LSH could be considered when accuracy can be sacrificed in favour of performance.
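To sketch what such a hash-based approximation could look like, here is a minimal MinHash estimate in plain Python. This is illustrative only: the function names and parameters are mine, and a real Polars feature would implement this natively rather than via Python hashing.

```python
import random

def minhash_signature(items, seeds):
    # One min-hash per seed: the minimum hash of (seed, item) over all items.
    return [min(hash((seed, x)) for x in items) for seed in seeds]

def estimate_jaccard(a, b, k=512, seed=0):
    # The fraction of matching signature positions estimates the Jaccard index;
    # accuracy improves as the number of hashes k grows.
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(k)]
    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / k

s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7, 8, 9, 10}
est = estimate_jaccard(s1, s2)
# The true Jaccard index is 2 / 10 = 0.2; est should land in that neighbourhood.
```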