Note: this has been modified to use faiss-gpu, giving a ~4x speed-up across the whole pipeline (the speed of individual searches depends on the size of the index).
This flow takes cluster outputs from the ClusterGlass flow and reassigns cluster labels to companies based on their nearest neighbours, using embeddings from GlassEmbed.
The input is a dictionary whose keys are clustering assignment parameters and values are lists of (<cluster label>, <glass org id>) tuples.
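To make the input shape concrete, here is a minimal sketch. The org ids and cluster labels are invented for illustration, and the `label_of` lookup is a hypothetical helper, not something the flow is known to build.

```python
# Hypothetical input: keys are clustering assignment parameters,
# values are lists of (<cluster label>, <glass org id>) tuples.
clusters = {
    "assigned_10": [(0, "org_a"), (0, "org_b"), (1, "org_c")],
}

# Invert each list into an org id -> cluster label lookup,
# which is convenient when labelling nearest neighbours later.
label_of = {
    param: {org_id: label for label, org_id in pairs}
    for param, pairs in clusters.items()
}
```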
For each organisation, this flow finds the K nearest neighbour organisations by embedding cosine similarity, groups those neighbours by their cluster labels and calculates the average similarity for each label. The best result is the closest label, i.e. the one with the highest average similarity. No threshold is applied; this is left up to the user of the results. Note that this flow finds nearest neighbours for all Glass orgs, not just the ones that were clustered.
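The reassignment logic described above can be sketched roughly as follows. This is an illustration in plain NumPy with brute-force search, not the flow's faiss-gpu implementation. The function name `reassign`, the exclusion of an org from its own neighbours, and the skipping of unclustered neighbours are all assumptions.

```python
from collections import defaultdict

import numpy as np


def reassign(embeddings, org_ids, label_of, k=3):
    """Return (nearest label, mean similarity) per org. Illustrative only."""
    # Normalise rows so that the dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # assumption: an org is not its own neighbour

    results = []
    for i in range(len(org_ids)):
        neighbours = np.argsort(-sims[i])[:k]  # indices of the k most similar orgs
        by_label = defaultdict(list)
        for j in neighbours:
            label = label_of.get(org_ids[j])
            if label is not None:  # assumption: unclustered neighbours are skipped
                by_label[label].append(sims[i, j])
        if not by_label:
            results.append((None, None))
            continue
        # Average similarity per label; the highest average wins.
        means = {label: float(np.mean(s)) for label, s in by_label.items()}
        best = max(means, key=means.get)
        results.append((best, means[best]))
    return results


# Toy data: org "e" was never clustered but still receives a label.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]])
org_ids = ["a", "b", "c", "d", "e"]
label_of = {"a": 0, "b": 0, "c": 1, "d": 1}
results = reassign(embeddings, org_ids, label_of, k=2)
```

Brute-force similarity is quadratic in the number of orgs, which is presumably why the flow relies on an index-based search instead.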
The results, stored in the artifact clusters_reassigned, are in dictionary form, with keys for the org id, the original cluster label (if applicable), the cluster label of the nearest cluster and its average distance. Each key holds a list of values.
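Since no threshold is applied by the flow, a consumer of the results might filter them along these lines. The key names, the values and the threshold below are all hypothetical, and the score is assumed to behave like a similarity (higher is better); if it is a true distance, the comparison would be reversed.

```python
# Hypothetical shape of clusters_reassigned for one parameter.
# Key names and values are invented for illustration.
reassigned = {
    "org_id": ["a", "b", "e"],
    "original_cluster": [0, 0, None],  # None where the org was not clustered
    "assigned_cluster": [0, 0, 1],
    "average_score": [0.95, 0.91, 0.42],
}

# Turn the dictionary of lists into one tuple per org.
rows = list(zip(*reassigned.values()))

# User-chosen cut-off (assumption: higher score means a closer match).
threshold = 0.8
confident = [row for row in rows if row[3] >= threshold]
```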
As an example, these are the first 5 rows of reassignment outputs for clusters under the assigned_10 parameter:
closes #21
Checklist:
- notebooks/
- flake8 and addressed any linter errors
- pre-commit and addressed any issues not automatically fixed
- dev (or merged any new changes from dev)
- READMEs
- output/reports/