ryderwishart / ancient-greek-word2vec

Global vector modelling notebooks for Ancient Greek
MIT License

Addition #1

Open l0d0v1c opened 1 year ago

l0d0v1c commented 1 year ago

Hi!

I tried your model because mine is not very good at semantic addition, like king + woman - man = queen, the classic word2vec example.

So I tried, with my model: βασιλεύς + γυνή - ἀνήρ = γνόφον σπέρματί εἰσελεύσῃ σπεύσουσιν αἴτησαι κατασφάζω θυγάτηρ παῖς, and with yours: βασιλεύς + γυνή - ἀνήρ = βασιλίζω μαρδοχαῖος ἐζεκίας ἀδελφή ἀρταξέρξης γαμετή καλλιρρόη ἡρῴδης.

I chose a window of 8. Did you succeed with vector addition?

For reference, my repo is https://github.com/l0d0v1c/Ancient-greek-word2vec

ryderwishart commented 1 year ago

Hi @l0d0v1c, you're right that the classic example doesn't transfer directly to the Ancient Greek data, but I think there is more going on here than might first appear. For one thing, the Greek data is more morphologically complex. I also get quite different (and interesting) results depending on how I mix and match the vectors, on the corpus, and on the model type (skip-gram vs. CBOW, FastText vs. Word2Vec, window size, etc.). For instance, with the model ft_papyri&corpus_cbow_hs_2_to_5_size300_window5_mincount2.model (FastText, papyri plus literary texts, CBOW with hierarchical softmax, 2-5 character n-grams, vector size 300, window of 5, minimum corpus count of 2 per vocabulary item):

# βασιλεύς + γυνή - ἀνήρ

word_set_1 = ['βασιλεύς', 'γυνή']
word_set_2 = ['ἀνήρ']

# Finding the most similar words using vector arithmetic
similar_words = model.most_similar_cosmul(positive=word_set_1, negative=word_set_2, topn=10)

# Print the most similar words and their similarity scores
for word, similarity in similar_words:
    print(word, similarity)

Yields

φιλοβασιλεύς 0.8599472045898438 # φιλοβασιλεύς = 'royalist' or... maybe, 'king lover'? That seems a bit like a queen in some way.
βασιλειάω 0.8345795273780823 # βασιλειάω = 'to reign'
βασιλίσκος 0.8285424709320068 # βασιλίσκος = 'little king'
βασιλίζω 0.8045151829719543 # βασιλίζω = 'to rule as queen'
γαμβρός 0.7965137958526611 # γαμβρός = 'son-in-law'
βασιλίς 0.7935773134231567 # βασιλίς = 'queen'
βασιλεύω 0.7927768230438232 # βασιλεύω = 'to reign'
βασιληΐς 0.7872297763824463 # βασιληΐς = 'queen'
βασίλη 0.7827581763267517 # βασίλη = 'queen'
συμβασιλεύω 0.7825499176979065 # συμβασιλεύω = 'to reign jointly'
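
For reference, the hyperparameters listed above map roughly onto gensim's FastText constructor as in the sketch below. This is not the exact training script: sentences stands in for the tokenized papyri + literary corpus, and in gensim 4.x the similarity queries shown in this thread go through model.wv.

from gensim.models import FastText

# Rough correspondence to the settings encoded in the model filename:
# CBOW -> sg=0, hierarchical softmax -> hs=1, 2-5 character n-grams -> min_n=2 / max_n=5,
# vector size 300, window 5, minimum corpus count 2.
model = FastText(
    sentences=sentences,  # tokenized corpus (placeholder)
    vector_size=300,
    window=5,
    min_count=2,
    sg=0,
    hs=1,
    min_n=2,
    max_n=5,
)
model.save('ft_papyri&corpus_cbow_hs_2_to_5_size300_window5_mincount2.model')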

However, if I use the Word2Vec nov2022 model,

# βασιλεύς + γυνή - ἀνήρ (same query, with model now the Word2Vec nov2022 model)

word_set_1 = ['βασιλεύς', 'γυνή']
word_set_2 = ['ἀνήρ']

# Finding the most similar words using vector arithmetic
similar_words = model.most_similar_cosmul(positive=word_set_1, negative=word_set_2, topn=10)

# Print the most similar words and their similarity scores
for word, similarity in similar_words:
    print(word, similarity)

Yields the expected results

βασίλισσα 0.8620691299438477 # NOTE: βασίλισσα = 'queen'
παιδίσκη 0.8539372682571411 # παιδίσκη = 'maid'
ἡρώδης 0.8510062098503113 # ἡρώδης = 'Herod'
ἰσραηλίτης 0.8509696125984192 # ἰσραηλίτης = 'Israelite'
ἀαρών 0.8409229516983032 # ἀαρών = 'Aaron'
κλεοπάτρα 0.8370845913887024 # κλεοπάτρα = 'Cleopatra'
ἀδελφή 0.8362259864807129 # ἀδελφή = 'sister'
βασιλίς 0.8341107964515686 # βασιλίς = 'queen'
ἀριστόβουλος 0.8217142224311829 # ἀριστόβουλος = 'Aristobulus'
γύναιον 0.8093928694725037 # γύναιον = 'woman'

In other words, model hyperparameters matter A LOT. The FastText model, because it breaks words down into character n-grams, finds a lot more similarity between words that share a derivational stem (like 'βασιλ-'). It's interesting to observe and ponder how drastically the results change with a bit of hyperparameter tweaking. With transformer models you lose this transparency in the relationship between the algorithm and 'semantic similarity': everything is just 'attention'. That's part of the reason I find these more basic algorithms extremely important and suspect they still have a key role to play in lexical modelling.
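
A quick way to see the subword effect is to compare how the two model types handle a rare or unseen form: FastText composes a vector out of the form's character n-grams, while plain Word2Vec has no entry for it at all. A minimal sketch, with ft_model and w2v_model as placeholder names for the two loaded models and βασιλικώτατος as an arbitrary example form:

# FastText: nearest neighbours are dominated by forms sharing the βασιλ- n-grams
print(ft_model.wv.most_similar('βασιλεύς', topn=5))

# FastText can still return a vector for a form even if it was absent from training,
# built from the vectors of its character n-grams
print(ft_model.wv['βασιλικώτατος'][:5])

# Plain Word2Vec has no subword information, so an out-of-vocabulary form raises KeyError
try:
    w2v_model.wv['βασιλικώτατος']
except KeyError:
    print('not in the Word2Vec vocabulary')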

By the way, I love the web app you made to view the graph of similar words! Are you doing any academic work on Greek lexical semantics? Would love to talk more.

l0d0v1c commented 1 year ago

Hi, thanks a lot! That's clear. You've got the point: hyperparameters are the key. Anyway, in the dataset I used, βασίλισσα appears only 4 times, which may explain why I get παῖς instead, even with your hyperparameters.

With your nov2022 model, παῖς and βασίλισσα are closer, but in mine παῖς is closer to βασιλεύς.
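
For reference, the raw frequency and the pairwise similarities can be checked directly in gensim 4.x, as in this sketch (model stands for whichever trained model is loaded; get_vecattr reads the count recorded when the vocabulary was built):

# how many times the token occurred in the training corpus
print(model.wv.get_vecattr('βασίλισσα', 'count'))

# pairwise cosine similarities
print(model.wv.similarity('παῖς', 'βασίλισσα'))
print(model.wv.similarity('παῖς', 'βασιλεύς'))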

I didn't get your email, but I saw you are on ResearchGate, so I'll send you more details about my projects that way.

PS: Greek is complex, but nothing compared to the 4000 rules of Pāṇini's grammar of Sanskrit ;-)