21 reassign clusters

closes #21

Note: this has been modified to use faiss-gpu, giving a ~4x speedup across the whole pipeline (the speed of individual searches depends on the size of the index).

This flow takes cluster outputs from the ClusterGlass flow and reassigns cluster labels to companies based on their nearest neighbours, using embeddings from GlassEmbed.

The input is a dictionary whose keys are clustering assignment parameters and values are lists of (<cluster label>, <glass org id>) tuples.
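To make the input shape concrete, here is a minimal sketch. The labels and org ids are borrowed from the example output further down; the helper name `build_label_lookup` and the variable names are hypothetical and not part of the flow.

```python
# Hypothetical input in the shape described above:
# {<clustering parameter>: [(<cluster label>, <glass org id>), ...]}
cluster_outputs = {
    "assigned_10": [
        ("9609_0", 782101),
        ("7022_0", 4557116),
        ("4329_0", 3290485),
    ],
}


def build_label_lookup(assignments):
    """Turn a list of (cluster label, org id) tuples into {org id: cluster label}."""
    return {org_id: label for label, org_id in assignments}


# One lookup per clustering parameter.
lookups = {
    param: build_label_lookup(pairs) for param, pairs in cluster_outputs.items()
}
```

An org id lookup like this makes it cheap to find the original label of any neighbour, and `lookups[param].get(org_id)` returns `None` for orgs that were never clustered.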

For each organisation, this flow finds the K nearest neighbour organisations by embedding cosine similarity, groups those neighbours by their cluster labels and calculates the average similarity for each group. The best cluster is the one whose neighbours are closest on average. No threshold is applied; this is left up to the user of the results. Note that this flow finds nearest neighbours for all Glass orgs, not just the ones that were clustered.
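The reassignment logic described above can be sketched roughly as follows. This is an illustration only: it uses a brute-force NumPy search in place of faiss-gpu, and the function name `reassign_clusters`, excluding an org from its own neighbours, and skipping unclustered neighbours are all assumptions rather than details confirmed by this PR.

```python
from collections import defaultdict

import numpy as np


def reassign_clusters(embeddings, org_ids, cluster_of, k=2):
    """For each org, pick the cluster whose members among the k nearest
    neighbours have the highest mean cosine similarity."""
    emb = np.asarray(embeddings, dtype=float)
    # L2-normalise so that the dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Assumption: an org is not counted as its own neighbour.
    np.fill_diagonal(sims, -np.inf)

    results = []
    for i, org in enumerate(org_ids):
        neighbours = np.argsort(-sims[i])[:k]
        by_cluster = defaultdict(list)
        for j in neighbours:
            label = cluster_of.get(org_ids[j])
            # Assumption: neighbours without a cluster label are ignored.
            if label is not None:
                by_cluster[label].append(sims[i, j])
        if by_cluster:
            best = max(by_cluster, key=lambda c: np.mean(by_cluster[c]))
            score = float(np.mean(by_cluster[best]))
        else:
            best, score = None, None
        results.append(
            {
                "org_id": org,
                "original_cluster": cluster_of.get(org),
                "best_cluster": best,
                "best_cluster_mean_sim": score,
            }
        )
    return results


# Toy data: orgs 1-2 sit in cluster "a", orgs 3-4 in "b", org 5 is unclustered.
toy_embeddings = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95], [0.9, 0.1]]
toy_org_ids = [1, 2, 3, 4, 5]
toy_clusters = {1: "a", 2: "a", 3: "b", 4: "b"}
results = reassign_clusters(toy_embeddings, toy_org_ids, toy_clusters, k=2)
```

In the toy data the unclustered org 5 lies closest to orgs 1 and 2, so it is assigned to cluster "a" while its original cluster stays `None`.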

The results, stored as the artifact clusters_reassigned, are in dictionary form for each clustering parameter, with keys for the org id, the original cluster label (if applicable), the cluster label of the nearest cluster and its average distance. The results are in list form under each of these keys.

As an example, these are the first 5 rows of reassignment outputs for clusters under the assigned_10 parameter:

{'assigned_10': {'best_cluster': 
  0    8299_34
  1    6920_15
  2     3319_0
  3    7490_27
  4    4110_10
  Name: best_cluster, dtype: object,
  'best_cluster_mean_dist': 
  0    0.680860
  1    0.753023
  2    0.648635
  3    0.508862
  4    0.649104
  Name: best_cluster_mean_dist, dtype: float32,
  'org_id': 
  0     782101
  1    4557116
  2    3290485
  3    2702523
  4     756259
  Name: org_id, dtype: int64,
  'original_cluster': 
  0    9609_0
  1    7022_0
  2    4329_0
  3      None
  4    7490_0
  Name: original_cluster, dtype: object}}
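Because no threshold is applied by the flow, a user of the results might filter them along these lines. This is a sketch under assumptions: it treats the values under best_cluster_mean_dist as mean cosine similarities (higher means closer), uses plain lists populated with the five example rows above, and the helper name `filter_reassignments` and the cut-off of 0.65 are purely illustrative.

```python
def filter_reassignments(results, min_similarity):
    """Keep only rows whose mean similarity to the best cluster meets the cut-off."""
    keep = [
        i
        for i, score in enumerate(results["best_cluster_mean_dist"])
        if score >= min_similarity
    ]
    return {key: [values[i] for i in keep] for key, values in results.items()}


# The five example rows shown above, written as plain lists.
assigned_10 = {
    "best_cluster": ["8299_34", "6920_15", "3319_0", "7490_27", "4110_10"],
    "best_cluster_mean_dist": [0.680860, 0.753023, 0.648635, 0.508862, 0.649104],
    "org_id": [782101, 4557116, 3290485, 2702523, 756259],
    "original_cluster": ["9609_0", "7022_0", "4329_0", None, "7490_0"],
}

# Hypothetical cut-off; an appropriate value depends on the use case.
confident = filter_reassignments(assigned_10, min_similarity=0.65)
```

With this illustrative cut-off only the first two rows are kept. If the stored values turn out to be true distances rather than similarities, the comparison would need to be reversed.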

Checklist:

[x] I have refactored my code out from notebooks/
[x] I have run flake8 and addressed any linter errors
[x] I have checked the code runs
[ ] I have tested the code
[x] I have run pre-commit and addressed any issues not automatically fixed
[x] I have rebased onto dev (or merged any new changes from dev)
[x] I have documented the code
- [x] Major functions have docstrings
- [x] Appropriate information has been added to READMEs
[x] I have explained the feature in this PR or (better) in output/reports/
[x] I have requested a code review

nestauk / industrial_taxonomy

21 reassign clusters #29