monarch-initiative / semsimian

Simple rust implementation of semantic similarity
BSD 3-Clause "New" or "Revised" License

implement cosine similarity #56

Open justaddcoffee opened 1 year ago

justaddcoffee commented 1 year ago

We'd like to have a function that measures cosine similarity between terms, e.g.

pub fn cosine_similarity(
    embeddings: &DataFrame, // probably a Polars DF
    entity1: &str,
    entity2: &str,
) -> f64 {
    /* Returns cosine similarity between the two terms */
    let entity1_embedding = get_embedding(entity1, embeddings);
    let entity2_embedding = get_embedding(entity2, embeddings);
    // Compare entity1 against entity2 (not entity1 against itself)
    let cosine_sim = calculate_cosine_similarity(&entity1_embedding, &entity2_embedding);
    cosine_sim
}

I'd suggest we let the caller bring their own embeddings in GRAPE (i.e. Pandas) format, then we can calculate cosine sim efficiently in Rust (possibly using Polars?)
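
The `calculate_cosine_similarity` and `get_embedding` helpers in the signature above are not defined in this thread. As a rough sketch of what the calculation could look like, the block below uses a plain `HashMap` as a hypothetical stand-in for the embeddings table (the real input might be a Polars DataFrame instead) and returns `Option<f64>` rather than `f64`, so that missing terms and zero vectors do not panic. Both choices are assumptions, not part of the proposal.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the embeddings table: term ID -> embedding vector.
pub type Embeddings = HashMap<String, Vec<f64>>;

/// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
/// Returns `None` if the lengths differ, the vectors are empty,
/// or either vector has zero norm (the similarity is undefined there).
pub fn calculate_cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Looks up both terms and compares their embeddings.
/// Returns `None` if either term has no embedding.
pub fn cosine_similarity(embeddings: &Embeddings, entity1: &str, entity2: &str) -> Option<f64> {
    let e1 = embeddings.get(entity1)?;
    let e2 = embeddings.get(entity2)?;
    calculate_cosine_similarity(e1, e2)
}
```

Returning `Option` keeps the "term not found" and "zero vector" cases explicit; a version that returns a bare `f64`, as in the signature above, would need to pick a sentinel value or panic instead.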

Obviously we'd want to build cosine similarity into all_by_all_similarity() too eventually
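
For the all-by-all case, one possible approach is to normalize every vector once up front, so that each pair costs only a dot product. The sketch below is illustrative only: `all_by_all_cosine` is a hypothetical name, it works on row indices rather than term IDs, and it does not reflect the actual signature of `all_by_all_similarity()`.

```rust
/// Hypothetical helper: cosine similarity for every unordered pair of rows.
/// Returns (i, j, similarity) with i < j. Rows with zero norm, and pairs
/// whose lengths differ, are skipped because the similarity is undefined.
pub fn all_by_all_cosine(vectors: &[Vec<f64>]) -> Vec<(usize, usize, f64)> {
    // Normalize each vector to unit length once; zero vectors become None.
    let unit: Vec<Option<Vec<f64>>> = vectors
        .iter()
        .map(|v| {
            let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
            if norm == 0.0 {
                None
            } else {
                Some(v.iter().map(|x| x / norm).collect())
            }
        })
        .collect();

    let mut out = Vec::new();
    for i in 0..unit.len() {
        for j in (i + 1)..unit.len() {
            if let (Some(a), Some(b)) = (&unit[i], &unit[j]) {
                if a.len() == b.len() {
                    // For unit vectors the dot product is the cosine similarity.
                    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
                    out.push((i, j, dot));
                }
            }
        }
    }
    out
}
```

This is still quadratic in the number of terms; the pre-normalization only removes the repeated norm computation from the inner loop.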

per discussion with @iQuxLE

cc @cmungall @julesjacobsen @matentzn

justaddcoffee commented 1 year ago

Also Luca/Tommy have implemented an efficient cosine sim function in Ensmallen that we possibly could crib

cmungall commented 1 year ago

Let's be careful to distinguish two things:

1. the similarity/distance metric
2. what the metric is operating over
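
One way to picture that separation in code, as a sketch only: the metric is a function over two vectors, and whatever produces those vectors (graph embeddings, term profiles, or something else) is decided by the caller. The names `Metric`, `cosine`, `euclidean_distance`, and `compare` are all hypothetical and do not come from semsimian.

```rust
/// (1) The metric: any function from two vectors to a score.
pub type Metric = fn(&[f64], &[f64]) -> f64;

/// Cosine similarity. Assumes equal lengths; yields NaN if either norm is zero.
pub fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}

/// Euclidean distance, shown only to illustrate that the metric is swappable.
pub fn euclidean_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

/// (2) What the metric operates over: the caller supplies the vectors,
/// so this function does not care how they were produced.
pub fn compare(metric: Metric, a: &[f64], b: &[f64]) -> f64 {
    metric(a, b)
}
```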

On Fri, May 26, 2023 at 9:24 AM Justin Reese @.***> wrote:

Also Luca/Tommy have implemented an efficient cosine sim function https://github.com/AnacletoLAB/ensmallen/blob/master/graph/express_measures/src/cosine_similarity.rs that we possibly could crib

justaddcoffee commented 1 year ago

I think 2 would be handled entirely on the Python side, by filling out the appropriate entries in a TermPairwiseSimilarity, right? This issue is just to implement (hopefully efficiently) the calculation of cosine similarity itself.