gu-gridh / norfam-frontend

Vue frontend for Nordisk familjebok
Publish term similarities for download on dh.gu.se #10

Open arildm opened 2 years ago

arildm commented 2 years ago

There are 10 similarity scores for each term: 1M in total for Idun and 1.5M for Ugglan. We can get them with this query:

SELECT
  t1.term_term AS term1,
  t2.term_term AS term2,
  similarity
FROM termsim
JOIN term AS t1 ON t1.term_id = termsim.term1_id
JOIN term AS t2 ON t2.term_id = termsim.term2_id;
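
For publishing, the result of this query could be streamed to a CSV file. The following is only a minimal sketch: it assumes a Python DB-API connection to the database, and it uses an in-memory SQLite database with made-up ids as a stand-in, since the actual database engine is not stated here. The function name `export_similarities` is hypothetical.

```python
import csv
import io
import sqlite3

# The query from this issue, unchanged apart from the trailing semicolon.
QUERY = """
SELECT
  t1.term_term AS term1,
  t2.term_term AS term2,
  similarity
FROM termsim
JOIN term AS t1 ON t1.term_id = termsim.term1_id
JOIN term AS t2 ON t2.term_id = termsim.term2_id
"""


def export_similarities(conn, out):
    """Write all (term1, term2, similarity) rows to `out` as CSV; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["term1", "term2", "similarity"])
    cursor = conn.cursor()
    cursor.execute(QUERY)
    count = 0
    for row in cursor:  # stream row by row instead of loading 1M+ rows at once
        writer.writerow(row)
        count += 1
    return count


# Stand-in database (assumption): same table and column names as the query,
# but SQLite and invented ids, only so that the sketch can run on its own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, term_term TEXT);
CREATE TABLE termsim (term1_id INTEGER, term2_id INTEGER, similarity REAL);
INSERT INTO term VALUES (1, 'aachen'), (2, 'regensburg');
INSERT INTO termsim VALUES (1, 2, 0.8070785999), (2, 1, 0.8070785999);
""")

buffer = io.StringIO()
n_rows = export_similarities(conn, buffer)
csv_text = buffer.getvalue()
```

In practice `out` would be a file opened with `newline=""` and UTF-8 encoding, so that terms such as böhmen survive the export.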

Adding the restriction WHERE t1.term_term = 'aachen' OR t2.term_term = 'aachen' gives the illustrative excerpt below. It shows that the relation is not symmetric: A is not necessarily one of the 10 most similar terms to B, even if B is one of the 10 most similar terms to A. For instance, fästningsfyrkanten has aachen in its top 10, but it is a very unusual word and is relatively unlikely to be in the top 10 for any other term.

| term1 | term2 | similarity |
| --- | --- | --- |
| aachen | regensburg | 0.8070785999 |
| aachen | baden | 0.7718966007 |
| aachen | schlesien | 0.7522234917 |
| aachen | mainz | 0.7426618338 |
| aachen | hannover | 0.7299875021 |
| aachen | böhmen | 0.7168704271 |
| aachen | prag | 0.7159190178 |
| aachen | westfalen | 0.7155983448 |
| aachen | salzburg | 0.7147209644 |
| aachen | magdeburg | 0.7133902311 |
| danzigs | aachen | 0.4972274303 |
| fästningsfyrkanten | aachen | 0.6193811893 |
| kesselsdorf | aachen | 0.6016947627 |
| kongresserna | aachen | 0.6046762466 |
| lechfältet | aachen | 0.4901075363 |
| ostende | aachen | 0.6780105829 |
| regensburg | aachen | 0.8070785999 |
| rhenprovinserna | aachen | 0.4839214087 |
| tillskapade | aachen | 0.5965193510 |
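
The asymmetry can be checked mechanically on the excerpt above. The sketch below copies the rows from the excerpt and lists which terms are mutual top-10 neighbours of aachen and which only point at it in one direction. The helper names `neighbour_sets` and `mutual_neighbours` are hypothetical, and the check only sees the rows of this excerpt, not the full table.

```python
# Rows copied from the excerpt: (term1, term2, similarity).
rows = [
    ("aachen", "regensburg", 0.8070785999),
    ("aachen", "baden", 0.7718966007),
    ("aachen", "schlesien", 0.7522234917),
    ("aachen", "mainz", 0.7426618338),
    ("aachen", "hannover", 0.7299875021),
    ("aachen", "böhmen", 0.7168704271),
    ("aachen", "prag", 0.7159190178),
    ("aachen", "westfalen", 0.7155983448),
    ("aachen", "salzburg", 0.7147209644),
    ("aachen", "magdeburg", 0.7133902311),
    ("danzigs", "aachen", 0.4972274303),
    ("fästningsfyrkanten", "aachen", 0.6193811893),
    ("kesselsdorf", "aachen", 0.6016947627),
    ("kongresserna", "aachen", 0.6046762466),
    ("lechfältet", "aachen", 0.4901075363),
    ("ostende", "aachen", 0.6780105829),
    ("regensburg", "aachen", 0.8070785999),
    ("rhenprovinserna", "aachen", 0.4839214087),
    ("tillskapade", "aachen", 0.5965193510),
]


def neighbour_sets(rows):
    """Map each term1 to the set of term2 values listed for it."""
    neighbours = {}
    for term1, term2, _similarity in rows:
        neighbours.setdefault(term1, set()).add(term2)
    return neighbours


def mutual_neighbours(rows, term):
    """Terms listed by `term` that also list `term` themselves."""
    neighbours = neighbour_sets(rows)
    return sorted(
        other
        for other in neighbours.get(term, set())
        if term in neighbours.get(other, set())
    )


mutual = mutual_neighbours(rows, "aachen")
# Terms that list aachen without being listed by aachen in return.
one_way = sorted(t1 for t1, t2, _ in rows if t2 == "aachen" and t1 not in mutual)

print(mutual)        # ['regensburg']
print(len(one_way))  # 8
```

Only regensburg appears in both directions; the other eight terms that list aachen, fästningsfyrkanten among them, are not in aachen's own top 10.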

arildm commented 1 year ago

In fact, it probably makes more sense to publish the word embeddings, from which the similarities are calculated. Embeddings are created by the word2vec.py script.
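
To illustrate why the embeddings would be enough, here is a rough sketch of deriving a top-k similarity list from word vectors. It assumes cosine similarity, which is the usual measure for word2vec vectors but is not confirmed in this issue. The vectors, term names and the functions `cosine` and `top_k` are made up for illustration and do not reflect the output format of word2vec.py.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(embeddings, term, k=10):
    """Return the k terms most similar to `term` as (other, similarity) pairs."""
    scores = [
        (other, cosine(embeddings[term], vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up 2-dimensional vectors; real embeddings have many more dimensions.
toy = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.8, 0.2],
    "d": [0.0, 1.0],
}

nearest_to_a = top_k(toy, "a", k=1)
nearest_to_c = top_k(toy, "c", k=1)
nearest_to_d = top_k(toy, "d", k=1)
```

With these toy vectors the nearest term to d is c, while the nearest term to c is b, which is the same one-directional pattern as in the aachen excerpt. Anyone downloading the embeddings could choose their own k, or a different similarity measure, rather than being limited to a fixed top 10.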