LinguList opened 2 years ago
There is this collection, which might be helpful: https://www.marekrei.com/projects/vectorsets/
And we don't need to bother about Spanish word2vec, we only want to see if this works or not, so we can start with English.
So the scope here is really just to develop a few similarity metrics and demonstrate their use and usefulness, to be integrated into our larger project of incorporating a similarity measure into cognate matching and borrowing detection. Sorry, I had lost sight of this for a moment!
We can start with word2vec in English (as @LinguList suggested) and consider glove, fasttext, or the more recent vector sets if this seems promising [or even start with fasttext]. An advantage of fasttext is that it works at the subword level and so might be more forgiving of small discrepancies in word form.
I previously developed a simple class with functions for finding the closest match and performing vector operations to emulate the logical relationships noted by Mikolov in his original papers. Although this was for similarity of IPA segments in my phonology research project, it at least gives me a starting point.
Here are references for word2vec (https://www.tensorflow.org/tutorials/text/word2vec) and fasttext (https://fasttext.cc/docs/en/support.html). I like the API provided by fasttext, with some functions already defined, but its 4.5GB English download seems excessive compared to the 127MB representation of English distributed by Marekrei (above).
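To make the comparison concrete, here is a minimal sketch (untested; the file name is a placeholder for whichever vector set we pick) of loading the smaller text-format vectors with gensim instead of fasttext's full binary model:

```python
# Minimal sketch: load text-format word vectors with gensim
# ("nnet_vectors.100.txt" is a placeholder file name).
from gensim.models import KeyedVectors

# For files without a "vocab_size dim" header line, add no_header=True.
vectors = KeyedVectors.load_word2vec_format("nnet_vectors.100.txt", binary=False)

print(vectors.similarity("hand", "arm"))      # cosine similarity of two words
print(vectors.most_similar("hand", topn=3))   # nearest neighbours

# The official fasttext package covers out-of-vocabulary forms via subwords,
# at the cost of the large binary download:
#   import fasttext
#   ft = fasttext.load_model("cc.en.300.bin")
#   ft.get_word_vector("earwax")  # works even if "earwax" is not in the vocabulary
```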
Hm, I am wondering how to start. What seems best is to test an approach by which one shows, in an examples repository here, how a matrix with pairwise similarities between n words for a given language can be computed. This would provide instructions for the local download of the data and a requirements.txt file. We can then apply this to some master word list that we automatically extract from all we have in Concepticon at a certain point for a certain language, or we even use some other master list (pysem does not depend on Concepticon). We then discuss how to integrate the data by making a zip file and accessing the matrix, so one would have a lookup function that looks, e.g., like:
```python
from pysem.fasttext import similarity
print(similarity("mano", "dio", "Spanish"))
```
Of course, one may think of different "Spanish" forms, but we can discuss this later. In the long run, this may contribute to norare, but we need to start with some examples for now.
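As a minimal sketch of what the matrix computation could look like (assuming the vectors are loaded with gensim as above; the word list and file name are placeholders):

```python
# Sketch: compute an n x n cosine-similarity matrix for a small wordlist
# (the word list and file name are illustrative only).
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("nnet_vectors.100.txt", binary=False)
words = [w for w in ["hand", "arm", "god", "dog"] if w in vectors]

emb = np.array([vectors[w] for w in words])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
sims = emb @ emb.T                                       # pairwise cosine similarities

print(dict(zip(words, sims[words.index("hand")])))
```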
If our product is for use by a semantic matching program with a limited vocabulary, then we could use any of the available embeddings in Spanish or Portuguese (staying with English for prototyping) to develop a resource of semantic distances between words without retaining the embeddings themselves.
Even if we use something big like FastText, as long as we can limit our vocabulary to something like the list given above, we could have an efficient similarity measure.
I am okay with testing several variants. We can later also zip the data and then unzip it with Python (though I don't know how much compression helps here). This is what we have in pysem now: we zip the Concepticon data and unzip it when loading the library, which I find fine for this very purpose and for performance.
But that would mean we compute, for example, these similarities for all Concepticon entries in Spanish. The link to Concepticon can later be inferred or also stored along with the data.
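As a sketch of how the zipped lookup could work (the file layout, per-language JSON files, and the function name are just assumptions at this point, not the final pysem API):

```python
# Sketch: look up a precomputed similarity from a zipped JSON dictionary.
# The layout ("<language>.json" inside "similarities.zip") is an assumption.
import json
import zipfile

def similarity(word_a, word_b, language, path="similarities.zip"):
    with zipfile.ZipFile(path) as zf:
        with zf.open(f"{language}.json") as handle:
            sims = json.load(handle)
    return sims[word_a][word_b]

# print(similarity("mano", "dio", "Spanish"))
```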
BTW: we can then also compare with pysem's STARLING similarities, which are rarely used but interesting.
Some words for concepts, and also the concepts themselves with their English glosses, are multi-word expressions. We wouldn't look these up directly, but rather add together the corresponding word embeddings (dropping function words such as 'the', 'is', ...), as long as there is no negation as part of the phrase. A negation would result in a subtraction instead of an addition.
This would need a bit of special handling, since it is not part of the embeddings themselves.
This just warrants some special treatment on our side, as we then use these metrics to provide concept-based similarities for a given language. I guess that even makes it nicer.
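A rough sketch of the add/subtract idea (the stop-word list and negation markers below are placeholders, and subtracting everything that follows a negation is just one possible reading):

```python
# Rough sketch of composing multi-word glosses from single-word embeddings.
# STOP and NEG are illustrative placeholders, not a worked-out list.
STOP = {"the", "a", "an", "is", "be", "of"}
NEG = {"not", "no"}

def phrase_vector(phrase, vectors):
    """Add content-word embeddings; subtract the words that follow a negation."""
    vec, sign = None, 1.0
    for token in phrase.lower().split():
        if token in NEG:
            sign = -1.0
            continue
        if token in STOP or token not in vectors:
            continue
        part = sign * vectors[token]
        vec = part if vec is None else vec + part
    return vec

# phrase_vector("old woman", vectors)  ->  vectors["old"] + vectors["woman"]
# phrase_vector("be silent", vectors)  ->  vectors["silent"]
```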
Possible use case: a Peruvian language with possible Spanish borrowings.
OK, enough for today on this!!
Maybe a bit redundant with my comment above:
OK, I've been trying to go directly to a nice solution here for our similarity function, similar to what @LinguList describes, but maybe it's better if I try it in a few steps. Just in case I am going down a garden path to nowhere, this will make sure I don't get too lost! So this week at least I'll have 1 or 2 iterations on developing such measures.
Prototype - used the English wordlist from WOLD (n=1480) and embeddings from Word2Vec. Limited cleaning of the words from the wordlist... since we want this to work for other languages too. Calculated all pairwise similarities and then reported the top similarities for all words in the wordlist. Due to the small size of the wordlist, there was no problem with space or time.
With a larger wordlist, space and time increase too, approximately as the square of the wordlist size.
For a vocabulary of 1,500 it's about 25 MB for all the similarities, so going to 15,000 words would take us to about 2.5 GB.
But if we just set a threshold at, say, the top 100 similarities per word, then we are back to 25 MB again!
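The truncation could look like this (a sketch, assuming a dense n x n cosine matrix as computed earlier):

```python
# Sketch: keep only the top-k neighbours per word instead of the full matrix.
import numpy as np

def top_k_neighbours(words, sims, k=100):
    """words: list of n strings; sims: dense n x n cosine-similarity matrix."""
    table = {}
    for i, word in enumerate(words):
        order = np.argsort(sims[i])[::-1]          # most similar first
        order = [j for j in order if j != i][:k]   # drop the word itself
        table[word] = [(words[j], float(sims[i, j])) for j in order]
    return table
```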
Multi-word expressions were broken up into separate words and their embeddings added together, since Word2Vec provides single-word embeddings only. This seems OK in general.
lightning bolt, [('lightning', 0.8545579398629692), ('bolt', 0.8294791950147321), ('rattle', 0.7111561418515908)]
old woman, [('old man', 0.9438529208904961), ('woman', 0.8419059624634133), ('old', 0.8200590160797763)]
More earthy expressions seem OK too.
fart, [('surprised', 0.6738966824249598), ('shit', 0.6663192026222906), ('stupid', 0.6431720347809629)]
intestines, [('pus', 0.7338402318487939), ('lung', 0.6745549328497172), ('vagina', 0.6719459557266357)]
Sample of report on top 3 similarities:
% python examples/makesimsforwl.py en --input nnet_vectors.100.txt
Embeddings vocab and vector size: b'132430 100\n'
Embeddings: vocabulary shape (132430, 100).
Size: vocab 10,485,856, embeddings 105,944,128.
Size wordlist 1,483.
Emb is None for netbag.
Emb is None for earwax.
Emb is None for goitre.
Emb is None for ridgepole.
Emb is None for fishhook.
Size: 2950, (1475, 100)
1475
fry, [('butter', 0.8042854349842384), ('mushroom', 0.7832237533109371), ('bake', 0.779592179181041)]
thousand [[numword]], [('zero [[numword]]', 0.9999999999999999), ('five [[numword]]', 0.9999999999999999), ('eleven [[numword]]', 0.9999999999999999)]
cow, [('pig', 0.8156664193500531), ('goat', 0.7868192683936832), ('sheep', 0.7684552210883713)]
fingernail, [('pubic hair', 0.8165453047136368), ('eyelash', 0.8163303040424418), ('forehead', 0.7971889532811858)]
tobacco, [('cigarette', 0.6017402928379825), ('beer', 0.5865598741226794), ('sugar cane', 0.5685409505745453)]
adze, [('spindle', 0.7199174818346881), ('chisel', 0.7089135671344956), ('scythe', 0.6888713543774874)]
same, [('all', 0.5434899643577488), ('certain', 0.5130263619592609), ('this', 0.4873846372182848)]
hour, [('day', 0.7139206736020832), ('week', 0.674579471160923), ('month', 0.6230577859429078)]
four [[numword]], [('three [[numword]]', 0.9263888505161936), ('two [[numword]]', 0.9136727320972422), ('thousand [[numword]]', 0.9087765795191391)]
rich, [('poor', 0.5997822231183239), ('beautiful', 0.5426400096267676), ('farmer', 0.5304052826318136)]
sweets, [('vegetables', 0.7424877220261447), ('oat', 0.727479115233296), ('chili', 0.6851475577781612)]
feather, [('beak', 0.7660150859872727), ('fur', 0.7457001005339128), ('grass-skirt', 0.7364235366406179)]
molar tooth, [('molar', 0.866042299130694), ('tooth', 0.7450526467968077), ('jaw', 0.6625907030130996)]
mad, [('stupid', 0.6886028664386614), ('hell', 0.5959723895311018), ('fuck', 0.5900788603563875)]
quiet, [('calm', 0.6108550346577949), ('silent', 0.5786421637006947), ('happy', 0.5343008682566188)]
row, [('square', 0.5814702629263717), ('fence', 0.5766758836803482), ('corner', 0.5570105393974787)]
stall, [('shop', 0.6325477955329423), ('shed', 0.5645848221567649), ('basket', 0.5410588588616182)]
sour, [('sweet', 0.6617518076736664), ('sweet potato', 0.6498828271965637), ('bitter', 0.6437554500148425)]
cormorant, [('gull', 0.7726610987877149), ('elk', 0.7323552200753589), ('vulture', 0.7312522635456679)]
kill, [('injure', 0.783690821448602), ('dead', 0.7546870853575245), ('attack', 0.7372081098673927)]
be silent, [('silent', 0.8290224765531314), ('alive', 0.5187246123662198), ('remain', 0.5018385311063606)]
have, [('be', 0.6024557496250549), ('that', 0.6001841380569435), ('now', 0.5584250693894303)]
bread, [('cheese', 0.913873037065727), ('butter', 0.9117029890600533), ('soup', 0.8814869916880362)]
debt, [('pay', 0.5594818992876731), ('tax', 0.5593297139708442), ('bank', 0.5555438402428177)]
citizen, [('surrender', 0.5397944548207891), ('govern', 0.5245086218522322), ('country', 0.5121272200289608)]
wall, [('roof', 0.781861785536561), ('stone', 0.744606718460037), ('brick', 0.7338536697433112)]
pay, [('earn', 0.6218248064656049), ('money', 0.6159720907476571), ('tax', 0.5776560505378086)]
horse, [('ride', 0.7273398447119427), ('dog', 0.7055156016290415), ('donkey', 0.702881748835211)]
harvest, [('barley', 0.6656591601710853), ('wheat', 0.657589181841732), ('sow', 0.6266762806539685)]
north, [('south', 0.9061287971740142), ('west', 0.8422640462874527), ('east', 0.7556908191003076)]
thresh, [('earlobe', 0.7033262878150472), ('scythe', 0.6989489765427765), ('hammock', 0.6947732468168636)]
dolphin, [('porpoise', 0.8673096237321372), ('whale', 0.8246386692137381), ('shark', 0.7339749092418)]
ride, [('horse', 0.7273398447119427), ('fly', 0.6522835865819004), ('sail', 0.6307780989824938)]
silent, [('be silent', 0.8290224765531314), ('quiet', 0.5786421637006947), ('darkness', 0.5652802961995096)]
straight, [('go down', 0.6109444386264571), ('pull', 0.6107878782156458), ('walk', 0.6066075947977537)]
dirty, [('rag', 0.6769145691305547), ('wet', 0.5831675490232032), ('bore', 0.5806651588786926)]
earn, [('pay', 0.6218248064656049), ('borrow', 0.5949876604119079), ('lend', 0.5577967991243604)]
bamboo, [('cone', 0.7425212510586381), ('tree trunk', 0.7200045010482233), ('grass', 0.7146550130433396)]
--- 8.09983491897583 seconds ---
Nice start. I am travelling and teaching this week, but when I find time and get back, I should share some info on a bachelor's project I supervised, where we discussed various metrics and tried to compare them. Not all of it was necessarily successful, but we could build a bit on that.
It could be an entire study on comparing these similarities, as this has not often been done so far, as far as I know.
Now we should decide on a common format (I propose to use zipped JSON files, with some dictionary structure for the similarities) and see how we can put this into a simple function.
In parallel, one can check how well these similarities integrate into an SVM in our borrowing detection approach.
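As a first check, the similarity could simply be added as one more feature; a minimal sketch with scikit-learn (the feature columns and numbers are toy values for illustration, not our actual borrowing-detection setup):

```python
# Minimal sketch: semantic similarity as one extra feature for an SVM.
# Feature columns and numbers are toy values for illustration only.
import numpy as np
from sklearn.svm import SVC

# Each row: [phonetic distance, semantic similarity]; label 1 = borrowing.
X = np.array([[0.20, 0.91], [0.75, 0.12], [0.30, 0.85], [0.80, 0.05]])
y = np.array([1, 0, 1, 0])

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.25, 0.88]]))
```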
We should add some of these in a folder data/ for now, with additional information and code there. The resulting network file should then be accessible from within the pysem library.