LinguList opened 2 years ago
There is this collection, which might be helpful: https://www.marekrei.com/projects/vectorsets/
And we don't need to bother about Spanish word2vec, we only want to see if this works or not, so we can start with English.
So the scope here is really just to develop a few similarity metrics and demonstrate their use and usefulness, to be integrated into our larger project of incorporating a similarity measure into cognate matching and borrowing detection. Sorry, I had lost sight of this for a moment!
We can start with word2vec in English (as @LinguList suggested) and consider glove, fasttext, or the more recent vector sets if this seems promising [or even start with fasttext]. An advantage of fasttext is that it works at the subword level and so might be more forgiving of small discrepancies in word form.
I previously developed a simple class with functions for finding the closest match and performing vector operations to emulate the logical relationships noted by Mikolov in his original papers. Although this was for similarity of IPA segments in my phonology research project, it at least gives me a starting point.
Here are references for word2vec (https://www.tensorflow.org/tutorials/text/word2vec) and fasttext (https://fasttext.cc/docs/en/support.html). I like the API provided by fasttext, with some functions already defined, but its 4.5GB English download seems excessive compared to the 127MB representation of English distributed by Marekrei (above).
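To make the comparison concrete, here is a minimal sketch (untested; the file name is a placeholder for whichever vector set we pick) of loading the smaller text-format vectors with gensim instead of fasttext's full binary model:

```python
# Minimal sketch: load text-format word vectors with gensim
# ("nnet_vectors.100.txt" is a placeholder file name).
from gensim.models import KeyedVectors

# For files without a "vocab_size dim" header line, add no_header=True.
vectors = KeyedVectors.load_word2vec_format("nnet_vectors.100.txt", binary=False)

print(vectors.similarity("hand", "arm"))      # cosine similarity of two words
print(vectors.most_similar("hand", topn=3))   # nearest neighbours

# The official fasttext package covers out-of-vocabulary forms via subwords,
# at the cost of the large binary download:
#   import fasttext
#   ft = fasttext.load_model("cc.en.300.bin")
#   ft.get_word_vector("earwax")  # works even if "earwax" is not in the vocabulary
```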
Hm, I am wondering how to start. What seems best is to test an approach by which one shows, in an examples repository here, how a matrix with pairwise similarities between n words for a given language can be computed. This would provide instructions for the local download of the data and a requirements.txt file. We can then apply this to some master word list that we automatically extract from all we have in Concepticon at a certain point for a certain language, or we even use some other master list (pysem does not depend on Concepticon). We then discuss how to integrate the data by making a zip file and accessing the matrix, so one would have a lookup function that looks, e.g., like:
```python
from pysem.fasttext import similarity
print(similarity("mano", "dio", "Spanish"))
```
Of course, one may think of different "Spanish" forms, but we can discuss this later. In the long run, this may contribute to norare, but we need to start with some examples for now.
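As a minimal sketch of what the matrix computation could look like (assuming the vectors are loaded with gensim as above; the word list and file name are placeholders):

```python
# Sketch: compute an n x n cosine-similarity matrix for a small wordlist
# (the word list and file name are illustrative only).
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("nnet_vectors.100.txt", binary=False)
words = [w for w in ["hand", "arm", "god", "dog"] if w in vectors]

emb = np.array([vectors[w] for w in words])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
sims = emb @ emb.T                                       # pairwise cosine similarities

print(dict(zip(words, sims[words.index("hand")])))
```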
If our product is for use by a semantic matching program with a limited vocabulary, then we could use any of the available embeddings in Spanish or Portuguese (staying with English for prototyping) to develop a resource of semantic distances between words without retaining the embeddings themselves.
Even if we use something big like FastText, as long as we can limit our vocabulary to something like the list given above, we could have an efficient similarity measure.
I am okay with testing several variants. We can later also zip the data and then unzip it with Python (though I don't know how much compression helps here). This is what we have in pysem now: we zip the Concepticon data and unzip it when loading the library, which I find fine for this very purpose and for performance.
But that would mean we compute, for example, these similarities for all Concepticon entries in Spanish. The link to Concepticon can later be inferred or also stored along with the data.
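As a sketch of how the zipped lookup could work (the file layout, per-language JSON files, and the function name are just assumptions at this point, not the final pysem API):

```python
# Sketch: look up a precomputed similarity from a zipped JSON dictionary.
# The layout ("<language>.json" inside "similarities.zip") is an assumption.
import json
import zipfile

def similarity(word_a, word_b, language, path="similarities.zip"):
    with zipfile.ZipFile(path) as zf:
        with zf.open(f"{language}.json") as handle:
            sims = json.load(handle)
    return sims[word_a][word_b]

# print(similarity("mano", "dio", "Spanish"))
```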
BTW: we can then also compare with pysem's STARLING similarities, which are rarely used but interesting.
Some words for concepts, and also the concepts themselves with their English glosses, are multi-word expressions. We wouldn't look these up directly, but rather add together the corresponding word embeddings (dropping function words such as 'the', 'is', ...), as long as there is no negation as part of the phrase. A negation would result in a subtraction instead of an addition.
This would need a bit of special handling, since it is not part of the embeddings themselves.
This just warrants some special treatment on our side, as we then use these metrics to provide concept-based similarities for a given language. I guess that even makes it nicer.
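A rough sketch of the add/subtract idea (the stop-word list and negation markers below are placeholders, and subtracting everything that follows a negation is just one possible reading):

```python
# Rough sketch of composing multi-word glosses from single-word embeddings.
# STOP and NEG are illustrative placeholders, not a worked-out list.
STOP = {"the", "a", "an", "is", "be", "of"}
NEG = {"not", "no"}

def phrase_vector(phrase, vectors):
    """Add content-word embeddings; subtract the words that follow a negation."""
    vec, sign = None, 1.0
    for token in phrase.lower().split():
        if token in NEG:
            sign = -1.0
            continue
        if token in STOP or token not in vectors:
            continue
        part = sign * vectors[token]
        vec = part if vec is None else vec + part
    return vec

# phrase_vector("old woman", vectors)  ->  vectors["old"] + vectors["woman"]
# phrase_vector("be silent", vectors)  ->  vectors["silent"]
```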
Possible use case: a Peruvian language with possible Spanish borrowings.
OK, enough for today on this!!
Maybe a bit redundant with my comment above:
OK, I've been trying to go directly to a nice solution here for our similarity function, similar to what @LinguList describes, but maybe it's better if I try it in a few steps. Just in case I am going down a garden path to nowhere, this will make sure I don't get too lost! So this week at least I'll have 1 or 2 iterations on developing such measures.
Prototype - used the English wordlist from WOLD (n=1480) and embeddings from Word2Vec. Limited cleaning of the words from the wordlist... since we want this to work for other languages too. Calculated all pairwise similarities and then reported the top similarities for all words in the wordlist. Due to the small size of the wordlist, there was no problem with space or time.
With a larger wordlist, space and time increase too, approximately as the square of the wordlist size.
For a vocabulary of 1,500 it's about 25 MB for all the similarities, so going to 15,000 words would take us to about 2.5 GB.
But if we just set a threshold at, say, the top 100 similarities per word, then we are back to 25 MB again!
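The truncation could look like this (a sketch, assuming a dense n x n cosine matrix as computed earlier):

```python
# Sketch: keep only the top-k neighbours per word instead of the full matrix.
import numpy as np

def top_k_neighbours(words, sims, k=100):
    """words: list of n strings; sims: dense n x n cosine-similarity matrix."""
    table = {}
    for i, word in enumerate(words):
        order = np.argsort(sims[i])[::-1]          # most similar first
        order = [j for j in order if j != i][:k]   # drop the word itself
        table[word] = [(words[j], float(sims[i, j])) for j in order]
    return table
```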
Multi-word expressions were broken up into separate words and their embeddings added together, since Word2Vec provides single-word embeddings only. This seems OK in general.
lightning bolt, [('lightning', 0.8545579398629692), ('bolt', 0.8294791950147321), ('rattle', 0.7111561418515908)]
old woman, [('old man', 0.9438529208904961), ('woman', 0.8419059624634133), ('old', 0.8200590160797763)]
More earthy expressions seem OK too.
fart, [('surprised', 0.6738966824249598), ('shit', 0.6663192026222906), ('stupid', 0.6431720347809629)]
intestines, [('pus', 0.7338402318487939), ('lung', 0.6745549328497172), ('vagina', 0.6719459557266357)]
Sample of report on top 3 similarities:
% python examples/makesimsforwl.py en --input nnet_vectors.100.txt
Embeddings vocab and vector size: b'132430 100\n'
Embeddings: vocabulary shape (132430, 100).
Size: vocab 10,485,856, embeddings 105,944,128.
Size wordlist 1,483.
Emb is None for netbag.
Emb is None for earwax.
Emb is None for goitre.
Emb is None for ridgepole.
Emb is None for fishhook.
Size: 2950, (1475, 100)
1475
fry, [('butter', 0.8042854349842384), ('mushroom', 0.7832237533109371), ('bake', 0.779592179181041)]
thousand [[numword]], [('zero [[numword]]', 0.9999999999999999), ('five [[numword]]', 0.9999999999999999), ('eleven [[numword]]', 0.9999999999999999)]
cow, [('pig', 0.8156664193500531), ('goat', 0.7868192683936832), ('sheep', 0.7684552210883713)]
fingernail, [('pubic hair', 0.8165453047136368), ('eyelash', 0.8163303040424418), ('forehead', 0.7971889532811858)]
tobacco, [('cigarette', 0.6017402928379825), ('beer', 0.5865598741226794), ('sugar cane', 0.5685409505745453)]
adze, [('spindle', 0.7199174818346881), ('chisel', 0.7089135671344956), ('scythe', 0.6888713543774874)]
same, [('all', 0.5434899643577488), ('certain', 0.5130263619592609), ('this', 0.4873846372182848)]
hour, [('day', 0.7139206736020832), ('week', 0.674579471160923), ('month', 0.6230577859429078)]
four [[numword]], [('three [[numword]]', 0.9263888505161936), ('two [[numword]]', 0.9136727320972422), ('thousand [[numword]]', 0.9087765795191391)]
rich, [('poor', 0.5997822231183239), ('beautiful', 0.5426400096267676), ('farmer', 0.5304052826318136)]
sweets, [('vegetables', 0.7424877220261447), ('oat', 0.727479115233296), ('chili', 0.6851475577781612)]
feather, [('beak', 0.7660150859872727), ('fur', 0.7457001005339128), ('grass-skirt', 0.7364235366406179)]
molar tooth, [('molar', 0.866042299130694), ('tooth', 0.7450526467968077), ('jaw', 0.6625907030130996)]
mad, [('stupid', 0.6886028664386614), ('hell', 0.5959723895311018), ('fuck', 0.5900788603563875)]
quiet, [('calm', 0.6108550346577949), ('silent', 0.5786421637006947), ('happy', 0.5343008682566188)]
row, [('square', 0.5814702629263717), ('fence', 0.5766758836803482), ('corner', 0.5570105393974787)]
stall, [('shop', 0.6325477955329423), ('shed', 0.5645848221567649), ('basket', 0.5410588588616182)]
sour, [('sweet', 0.6617518076736664), ('sweet potato', 0.6498828271965637), ('bitter', 0.6437554500148425)]
cormorant, [('gull', 0.7726610987877149), ('elk', 0.7323552200753589), ('vulture', 0.7312522635456679)]
kill, [('injure', 0.783690821448602), ('dead', 0.7546870853575245), ('attack', 0.7372081098673927)]
be silent, [('silent', 0.8290224765531314), ('alive', 0.5187246123662198), ('remain', 0.5018385311063606)]
have, [('be', 0.6024557496250549), ('that', 0.6001841380569435), ('now', 0.5584250693894303)]
bread, [('cheese', 0.913873037065727), ('butter', 0.9117029890600533), ('soup', 0.8814869916880362)]
debt, [('pay', 0.5594818992876731), ('tax', 0.5593297139708442), ('bank', 0.5555438402428177)]
citizen, [('surrender', 0.5397944548207891), ('govern', 0.5245086218522322), ('country', 0.5121272200289608)]
wall, [('roof', 0.781861785536561), ('stone', 0.744606718460037), ('brick', 0.7338536697433112)]
pay, [('earn', 0.6218248064656049), ('money', 0.6159720907476571), ('tax', 0.5776560505378086)]
horse, [('ride', 0.7273398447119427), ('dog', 0.7055156016290415), ('donkey', 0.702881748835211)]
harvest, [('barley', 0.6656591601710853), ('wheat', 0.657589181841732), ('sow', 0.6266762806539685)]
north, [('south', 0.9061287971740142), ('west', 0.8422640462874527), ('east', 0.7556908191003076)]
thresh, [('earlobe', 0.7033262878150472), ('scythe', 0.6989489765427765), ('hammock', 0.6947732468168636)]
dolphin, [('porpoise', 0.8673096237321372), ('whale', 0.8246386692137381), ('shark', 0.7339749092418)]
ride, [('horse', 0.7273398447119427), ('fly', 0.6522835865819004), ('sail', 0.6307780989824938)]
silent, [('be silent', 0.8290224765531314), ('quiet', 0.5786421637006947), ('darkness', 0.5652802961995096)]
straight, [('go down', 0.6109444386264571), ('pull', 0.6107878782156458), ('walk', 0.6066075947977537)]
dirty, [('rag', 0.6769145691305547), ('wet', 0.5831675490232032), ('bore', 0.5806651588786926)]
earn, [('pay', 0.6218248064656049), ('borrow', 0.5949876604119079), ('lend', 0.5577967991243604)]
bamboo, [('cone', 0.7425212510586381), ('tree trunk', 0.7200045010482233), ('grass', 0.7146550130433396)]
--- 8.09983491897583 seconds ---
Nice start. I am travelling and teaching this week, but when I find time and get back, I should share some info on a bachelor's project I supervised, where we discussed various metrics and tried to compare them. Not all of it was necessarily successful, but we could build a bit on that.
It could be an entire study on comparing these similarities, as this has not often been done so far, as far as I know.
Now we should decide on a common format (I propose to use zipped JSON files, with some dictionary structure for the similarities) and see how we can put this into a simple function.
In parallel, one can check how well these similarities integrate into an SVM in our borrowing detection approach.
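As a first check, the similarity could simply be added as one more feature; a minimal sketch with scikit-learn (the feature columns and numbers are toy values for illustration, not our actual borrowing-detection setup):

```python
# Minimal sketch: semantic similarity as one extra feature for an SVM.
# Feature columns and numbers are toy values for illustration only.
import numpy as np
from sklearn.svm import SVC

# Each row: [phonetic distance, semantic similarity]; label 1 = borrowing.
X = np.array([[0.20, 0.91], [0.75, 0.12], [0.30, 0.85], [0.80, 0.05]])
y = np.array([1, 0, 1, 0])

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.25, 0.88]]))
```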
We should add some of these in a folder data/ for now, with additional information and code there. The resulting network file should then be accessible from within the pysem library.