lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in South American languages

Calculate Semantic Similarity #36

Open LinguList opened 2 years ago

LinguList commented 2 years ago

We have a full semantic network from the CLICS data, which covers all concepts that are linked to Concepticon. But the problem with semantic similarity here is that it is not always clear how to interpret it.

There are several approaches to calculate pairwise similarities, but it is a bit tricky, since we have weights, and the higher a weight, the more similar two concepts are. So we are not looking for shortest paths, but rather for paths with the most "flow", like a system where you want to pump water from x to y.

One could compute shortest paths by normalizing the edge weights (the number of languages in which a colexification is attested) into distances, but I am not a real fan of this, as it is difficult to normalize the weights.
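For concreteness, here is a minimal sketch of the weight-to-distance idea, assuming networkx and made-up colexification counts; the -log transform is just one possible normalization, not a settled choice:

```python
# Turn colexification counts (higher = more similar) into distances
# (lower = closer), then run an ordinary shortest-path search.
# Graph and counts are illustrative, not the actual CLICS data.
import math
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("ARM", "HAND", 12),   # colexified in 12 languages (made up)
    ("HAND", "FINGER", 5),
    ("ARM", "WING", 3),
])

max_w = max(w for _, _, w in G.edges.data("weight"))
for u, v, w in G.edges.data("weight"):
    # High counts become short distances; the epsilon keeps the most
    # frequent colexification from having length exactly zero.
    G[u][v]["distance"] = -math.log(w / max_w) + 1e-9

print(nx.shortest_path_length(G, "ARM", "FINGER", weight="distance"))
```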

So another possibility is to compute random walks, following the procedure laid out in Jackson et al. 2019. I have implemented this already. We could write another implementation in another repo, so we have it at our disposal, or use my existing implementation to get a graph.

The alternative is to just limit the analysis to direct colexifications, but that is not very attractive.

Yet another solution: do semantic similarity with vector space models.

If we have a full graph of semantic similarities for all concepts vs. all concepts (using whatever approach), we could use an updated SVM approach:

  1. train the model (using randomized training data, with n cross-semantic matches per concept)
  2. when applying the approach, restrict the search to the n closest semantic matches and compare only those with the SVM

This workflow could then be applied to many more data points.
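As an illustration of step 2, a hedged sketch: `sem_sim`, `features`, and the fitted sklearn-style classifier `svm` are all hypothetical names standing in for whatever we end up using:

```python
# Restrict the donor search to the n semantically closest concepts,
# then let the SVM judge only those candidates.
import heapq

def borrowing_candidates(target, donor_entries, sem_sim, features, svm, n=10):
    # target and donor_entries are dicts with "concept" and "form" keys
    # (an assumption about the data model, for illustration only).
    closest = heapq.nlargest(
        n, donor_entries,
        key=lambda entry: sem_sim(target["concept"], entry["concept"]),
    )
    return [
        entry for entry in closest
        if svm.predict([features(target["form"], entry["form"])])[0] == 1
    ]
```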

@fractaldragonflies, how well do you know vector space models? Do you think it would be possible to extract similarities for, say, Spanish words, for our concepts in a network? I'd then work on my end from the CLICS model, so we have two models to compare.

fractaldragonflies commented 2 years ago

So this gets us from our current 3-valued similarity measure (same concept, central concept, any concept) to a real-valued similarity measure amenable to use in an SVM or similar classification method. Great.

I haven't read the 'random walk' paper for computing similarity, but I can imagine the computation of conditional probabilities at each concept node based on the CLICS weights, and from those, simulated random walks to estimate similarity or distance. Sounds good to me.
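To make the idea concrete, a small sketch (not the actual implementation) where transition probabilities at each node are proportional to the CLICS weights and similarity is estimated from visit frequencies:

```python
# Simulate weighted random walks and read off how often each concept
# is visited from a given starting concept. Adjacency data is made up.
import random
from collections import Counter

graph = {
    "ARM": {"HAND": 12, "WING": 3},
    "HAND": {"ARM": 12, "FINGER": 5},
    "FINGER": {"HAND": 5},
    "WING": {"ARM": 3},
}

def walk_similarity(graph, start, walks=1000, steps=5):
    visits = Counter()
    for _ in range(walks):
        node = start
        for _ in range(steps):
            neighbours, weights = zip(*graph[node].items())
            # Transition probability proportional to the edge weight.
            node = random.choices(neighbours, weights=weights)[0]
            visits[node] += 1
    total = sum(visits.values())
    return {concept: count / total for concept, count in visits.items()}

print(walk_similarity(graph, "ARM"))
```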

I don't fully understand the point about training the SVM approach with randomized training data. But sure, the idea of an SVM restricted to the same concept, versus an SVM using the estimated semantic similarity or distance as a feature, sounds practical. We could also censor candidates to those within some distance x to improve run time, if that does not negatively impact the results.

vector space models - using Spanish words

This gets pretty complicated I think, but here are my thoughts. OK, maybe I'm overthinking it!!

  1. Could use current vector representations such as Word2Vec, FastText, or even more recent ones (XLM-R).
  2. Could use the Spanish words from Concepticon as a glossary.
     2.1 Not all donor words are in this glossary. [maybe concepts though]
     2.2 Some concepts use multiple words. Do we do vector arithmetic to estimate the concept?
     2.3 Do we index by the Concepticon gloss of the concept, but use similarities based on the Spanish embeddings?
  3. Matrix of all possible relations between concepts. [this gets pretty big, but is still doable by today's standards, since there are just a few thousand concepts]
     3.1 Or use a virtual matrix (compute each relation on the fly), or a lazy matrix (compute each relation once and enter it in a dictionary).
  4. Use: for a target concept, get the similarity or distance to the candidate donor word/concept.

The fuzzy part is getting from a Spanish concept to a candidate donor word, especially when the donor word may not be mapped to a concept.

Hybrid approach? Vectors for Spanish concepts [maybe estimated from multiple words], plus vectors directly from FastText (or similar), which will accept just about any Spanish word. Compute similarity on the fly or lazily, as in the sketch below.
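A sketch of this hybrid, assuming gensim and the pretrained Spanish FastText vectors from fasttext.cc (the file name is an assumption); concept vectors are the mean over the words of the Spanish gloss, and pairwise similarities are computed lazily and cached:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Pretrained Spanish FastText binary; path/file name are assumptions.
vectors = load_facebook_vectors("cc.es.300.bin")
_cache = {}

def concept_vector(gloss):
    # "brazo, mano" -> mean of the word vectors; FastText builds
    # subword-based vectors even for words it has never seen.
    words = [w.strip() for w in gloss.split(",")]
    return np.mean([vectors[w] for w in words], axis=0)

def similarity(gloss_a, gloss_b):
    # Lazy: compute each pair once, then reuse the cached value.
    key = tuple(sorted((gloss_a, gloss_b)))
    if key not in _cache:
        va, vb = concept_vector(gloss_a), concept_vector(gloss_b)
        _cache[key] = float(
            np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        )
    return _cache[key]
```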

LinguList commented 2 years ago

Quick response on randomized training data: I meant that if we feed in all combinations, training will take quite a long time, so we could make a selection for training. But this is not that important, and we'll check later which way we prefer.

LinguList commented 2 years ago

Yes, it is a bit complex. I prefer to have everything computed in a stable way, with no real surprises. If we assume that we start from a Spanish lookup list (as we had before) with all items from IDS linked to Concepticon, all we'd have to do is get scores for this master list, all words against all words. Here, we can tokenize the Spanish gloss of a Concepticon entry by comma-splitting and the like, and take the mean of the scores.
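As a tiny sketch of the comma-splitting idea: `word_sim` is a hypothetical word-level scorer (it could be the cosine similarity from the FastText sketch above), and the concept-level score is the mean over all cross-gloss word pairs:

```python
from itertools import product
from statistics import mean

def gloss_score(gloss_a, gloss_b, word_sim):
    # Tokenize each Spanish gloss by comma-splitting, then average
    # the word-level scores over all cross-gloss word pairs.
    words_a = [w.strip() for w in gloss_a.split(",")]
    words_b = [w.strip() for w in gloss_b.split(",")]
    return mean(word_sim(a, b) for a, b in product(words_a, words_b))
```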

I'd say that computing similarity metrics for these cases warrants a new concept list of its own, which we should curate in an extra repository, so as not to have too much in this repo. We could make a new repository for semantic similarity metrics. Or maybe just place it in LingPy, where we store data? We could also render the similarity data in CLDF formats, but we may also restrict ourselves to rendering similarities in JSON or other formats for now.

LinguList commented 2 years ago

A good place to start with similarity metrics could be pysem. What we'd do there is add a folder data/ to the repository, in which we add subfolders for certain similarities, like word2vec_spanish or similar. PySem already offers one more semantic similarity metric, based on an approach by Starostin: https://calc.hypotheses.org/2465

So adding more similarity metric modules to pysem is not the worst idea, I think.