trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Incorrect Jaccard Index Calculation in Trino #21331

Open Akanksha-kedia opened 6 months ago

Akanksha-kedia commented 6 months ago

Title: Incorrect Jaccard Index Calculation in Trino

Description:

I've encountered an issue with the jaccard_index function in Trino where the output does not match the expected result according to the Jaccard index formula.

Here are the queries I ran:

SELECT jaccard_index(make_set_digest(value), make_set_digest(value1)) FROM (VALUES ('abc', 'def'), ('ee', 'abc')) T(value, value1);

The expected Jaccard index for this query is 0.3333333333333333, but the output is 0.5.

SELECT jaccard_index(make_set_digest(value), make_set_digest(value1)) FROM (VALUES (1, 4), (2, 5), (3, 6), (4, 7), (5, 8)) T(value, value1);

For this query, the sets are s1 = {1, 2, 3, 4, 5} and s2 = {4, 5, 6, 7, 8}. The expected Jaccard index is 0.25, but the output is 0.4.

The Jaccard index is a measure of the similarity between two sets and is calculated as the size of the intersection divided by the size of the union of the two sets. Based on this, the outputs of the above queries should be 0.3333333333333333 and 0.25 respectively.
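For reference, the expected values can be checked outside Trino with exact set arithmetic. The following is a minimal Python sketch, not Trino's implementation: the `jaccard` helper and the variable names are hypothetical, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard(a, b):
    """Exact Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: treat two empty sets as having index 0.0.
        return 0.0
    return len(a & b) / len(union)


# Columns from the first query: value = {'abc', 'ee'}, value1 = {'def', 'abc'}
strings_left = {"abc", "ee"}
strings_right = {"def", "abc"}

# Columns from the second query: s1 = {1..5}, s2 = {4..8}
ints_left = {1, 2, 3, 4, 5}
ints_right = {4, 5, 6, 7, 8}

print(jaccard(strings_left, strings_right))  # 1 shared / 3 distinct
print(jaccard(ints_left, ints_right))        # 2 shared / 8 distinct
```

In the first case the intersection is {'abc'} and the union has three elements, giving 1/3. In the second the intersection is {4, 5} and the union has eight elements, giving 2/8 = 0.25.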

This seems to be a bug in the jaccard_index function in Trino.

Can someone look into this?

wendigo commented 6 months ago

@Akanksha-kedia I've checked the implementation and it seems that you are correct. A fix is on the way.

martint commented 6 months ago

This is a duplicate of https://github.com/trinodb/trino/issues/18995

Akanksha-kedia commented 6 months ago

I have closed this, as it is a duplicate of https://github.com/trinodb/trino/issues/18995.