nestauk / industrial_taxonomy

Refactor of nestauk/industrial-taxonomy which upon completion will replace it.
MIT License

18 Vectorise matched glass descriptions #26

Closed georgerichardson closed 2 years ago

georgerichardson commented 2 years ago

closes #18



georgerichardson commented 2 years ago

I haven't run any sanity checks on the results beyond checking the shape and that the order is retained. I will do a manual check of the results before we merge. There is a small but not insignificant chance that another model would work noticeably better.

georgerichardson commented 2 years ago

There is now a quality assurance flow, flow_qa.py, which produces a chart and a figure showing the percentage of descriptions that are not truncated by the maximum input length of the sentence encoder. It also produces a random sample of 100 company descriptions, each matched to its nearest neighbour by cosine distance between the embeddings produced.
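
For anyone reading along, here is a rough sketch of the two checks described above. It is not the code in flow_qa.py. The function names, the whitespace tokeniser, and the plain-list embeddings are all assumptions made for illustration; the real flow would presumably count tokens with the sentence encoder's own tokeniser and work on the stored embedding matrix.

```python
import math
import random


def share_not_truncated(descriptions, max_tokens, tokenize=str.split):
    """Percentage of descriptions that fit within max_tokens tokens.

    `tokenize` defaults to whitespace splitting, which is only a stand-in
    (assumption) for the encoder's real tokeniser.
    """
    if not descriptions:
        raise ValueError("descriptions must not be empty")
    kept = sum(len(tokenize(text)) <= max_tokens for text in descriptions)
    return 100.0 * kept / len(descriptions)


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v). Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbours(embeddings, sample_size=100, seed=0):
    """Map each sampled row index to the index of its nearest other row.

    Brute force over all rows, so only suitable for small inputs.
    Requires at least two embeddings.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(range(n), min(sample_size, n))
    return {
        i: min(
            (j for j in range(n) if j != i),
            key=lambda j: cosine_distance(embeddings[i], embeddings[j]),
        )
        for i in sample
    }
```

Pairing the returned indices back to the original descriptions only works if the embeddings are in the same order as the descriptions, which is why the order check mentioned earlier matters.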