koaning / embetter

just a bunch of useful embeddings
https://koaning.github.io/embetter/
MIT License

Add `similarity` utility. #64

Closed: koaning closed this issue 1 year ago

koaning commented 1 year ago

Something like this:

import numpy as np
from sklearn.metrics import pairwise_distances

def calc_distances(inputs, anchors, pipeline, anchor_pipeline=None, metric="cosine", aggregate=np.max, n_jobs=None):
    """
    Shortcut to compare a sequence of inputs to a set of anchors. 

    The available metrics are: `cityblock`, `cosine`, `euclidean`, `haversine`, `l1`, `l2`, `manhattan` and `nan_euclidean`.

    You can read a verbose description of the metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).

    Arguments:
        - inputs: sequence of inputs to calculate scores for
        - anchors: set/list of anchors to compare against
        - pipeline: the pipeline to use to calculate the embeddings
        - anchor_pipeline: the pipeline to apply to the anchors, meant to be used if the anchors should use a different pipeline
        - metric: the distance metric to use 
        - aggregate: function used to reduce the distances to the different anchors down to a single score per input; numpy functions that accept `axis=1`, such as `np.max` and `np.mean`, can be used
        - n_jobs: set to -1 to use all cores for calculation
    """
    X_input = pipeline.transform(inputs)
    if anchor_pipeline:
        X_anchors = anchor_pipeline.transform(anchors)
    else:
        X_anchors = pipeline.transform(anchors)

    X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
    return aggregate(X_dist, axis=1)
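
For illustration, a hypothetical call could look like the following. The `SentenceEncoder` pipeline and the model name are assumptions for the example; any transformer that maps text to vectors would work:

from embetter.text import SentenceEncoder

# Assumed encoder for the example; any text-to-vector transformer works here.
pipeline = SentenceEncoder("all-MiniLM-L6-v2")

anchors = ["refund request", "billing problem"]
inputs = ["I want my money back", "great product, love it"]

# np.min gives the distance to the *nearest* anchor, so a low score
# means an input is close to at least one anchor.
scores = calc_distances(inputs, anchors, pipeline, aggregate=np.min)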
koaning commented 1 year ago

Then the Prodigy recipe might use something like:

from itertools import islice

from prodigy.sorters import ExpMovingAverage, prefer_low_scores

def batched(stream, size=10):
    # Yield lists of examples from the stream, `size` at a time.
    stream = iter(stream)
    while batch := list(islice(stream, size)):
        yield batch

def make_scored_stream(stream, anchors):
    # `pipeline` is assumed to be available in the enclosing scope.
    for batch in batched(stream):
        batch_text = [b["text"] for b in batch]
        distances = calc_distances(batch_text, anchors, pipeline)
        for score, ex in zip(distances, batch):
            yield score, ex

def sorted_stream(stream):
    return prefer_low_scores(ExpMovingAverage(stream))
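
A minimal sketch of how this could plug into a recipe; the recipe name, loader, and view_id below are assumptions for illustration, not part of the proposal:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("textcat.anchor-sort")  # hypothetical recipe name
def textcat_anchor_sort(dataset, source, anchors):
    stream = JSONL(source)
    # `anchors` is assumed to arrive as a comma-separated CLI string.
    scored = make_scored_stream(stream, anchors.split(","))
    return {
        "dataset": dataset,
        "stream": sorted_stream(scored),
        "view_id": "classification",
    }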

Worth rethinking though. Something about recalculating the anchor embeddings on every batch feels a bit wasteful.
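
One way around that, sketched here as a possibility rather than a settled design, is to embed the anchors once and close over the result:

def make_distance_scorer(anchors, pipeline, anchor_pipeline=None,
                         metric="cosine", aggregate=np.max, n_jobs=None):
    # Embed the anchors a single time and reuse the result for every batch.
    X_anchors = (anchor_pipeline or pipeline).transform(anchors)

    def score(inputs):
        X_input = pipeline.transform(inputs)
        X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
        return aggregate(X_dist, axis=1)

    return score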