lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Discussion on workflows, methods, etc... toward clear definitions, names, scope, ... #4

Closed fractaldragonflies closed 2 years ago

fractaldragonflies commented 2 years ago

Issue to capture discussion on workflows, methods, etc., with the aim of arriving at clear definitions, naming, and scope for this effort. This initial comment is just to get started; the discussion will continue in the comments below. [Suggested by @LinguList]

fractaldragonflies commented 2 years ago

Based on email discussion with @LinguList, it seems that we should exclude the PyBor workflow from our discussion of SaBor. Published results are still available, and we can cite those publications if we choose, but without suggesting that the results can be duplicated within SaBor.

LinguList commented 2 years ago

My impression is that we have some confusion about the terms we use for the methods we came up with, so we need a summary of these methods, which we can write into their documentation.

My other impression is that the workflow should ideally split analysis and evaluation.

Then, I think that the Transformer or LSTM workflow CAN and maybe SHOULD be used here as well, to have some alternative assessment for a nice study, but that it should be then implemented here, so that we can run all code at once.

LinguList commented 2 years ago

By now, we have the following methods, as far as I see:

  1. with lists of candidates, search for words with low distance using pairwise alignments and report them as potential matches (currently called "pairwise"; see the sketch after this list)
  2. search for cognates in all the data, using traditional cognate detection methods, and later report the matches involving our candidate languages (currently called "multiple")
  3. the lingrex-workflow which searches for cognates inside a language family first and then across language families (currently called ?, maybe "two thresholds"?)
  4. train a recurrent NN (LSTM, Transformer) with borrowings from the candidate languages and measure the entropies to identify borrowings (call it "transformer"?); this method is not implemented here, so, so far, results were reported based on code by @fractaldragonflies
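
For illustration, here is a minimal sketch of method 1 using lingpy's `Pairwise` alignments; the function name, threshold value, and toy forms are assumptions for this example, not part of the sabor code base:

```python
# Minimal sketch of the "pairwise" method, assuming lingpy is installed.
# `pairwise_candidates`, the threshold, and the toy forms are illustrative.
from lingpy import Pairwise

def pairwise_candidates(donor_forms, recipient_forms, threshold=0.4):
    """Report donor/recipient pairs whose alignment distance is low."""
    matches = []
    for donor in donor_forms:
        for recipient in recipient_forms:
            pw = Pairwise(donor, recipient)
            pw.align(distance=True)              # align and compute distance
            alm_a, alm_b, dist = pw.alignments[0]
            if dist < threshold:                 # low distance -> candidate match
                matches.append((donor, recipient, dist))
    return matches

print(pairwise_candidates(["kasa"], ["kasa", "mopu"]))
```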

We can think of enhancing all methods further (apart from 4, which would need to be added here first). We might need one method like an SVM that wraps up different kinds of evidence, to make this more interesting for computational linguistics people.

fractaldragonflies commented 2 years ago

With respect to workflow split:

> My other impression is that the workflow should ideally split analysis and evaluation.

Pairwise is so fast that there is little penalty in pairing evaluation and other reporting options with the analysis, especially since there isn't an established intermediate file format. Maybe Pairwise defaults to evaluation as its report output, with options to report the pairwise counts of shared cognates and a detailed diagnostic report by entry/form.

Other analysis methods (what is called Multiple now, but we could call Cluster or ?), which do Cluster or Partial or LingRex (Cluster or Partial with sequential internal and cross-family analyses), take more time, so it makes sense to at least save the intermediate analysis file as a LexStat wordlist. As for Pairwise, we could readily include just evaluation as the default report, but shared cognate counts and detailed diagnostic reports could be included initially as an option, or performed subsequently using the intermediate analysis file.
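
A hedged sketch of that split with lingpy's `LexStat`; the file names, threshold, and column name here are illustrative assumptions:

```python
# Analysis step: cluster forms into cognate sets with LexStat and save the
# wordlist as an intermediate file (names and threshold are assumptions).
from lingpy import LexStat, Wordlist

lex = LexStat("sabor-wordlist.tsv")
lex.get_scorer(runs=1000)
lex.cluster(method="lexstat", threshold=0.55, ref="cogid")
lex.output("tsv", filename="analysis-intermediate", ignore="all", prettify=False)

# Evaluation and report steps can then reload the intermediate file
# independently of the analysis run.
wl = Wordlist("analysis-intermediate.tsv")
```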

I have been rewriting the Multiple (or Cluster or ?) module without Pandas and with better separation of functions (less coupling and more cohesion), which also supports separate analysis, evaluation, and report commands.

> Then, I think that the Transformer or LSTM workflow CAN and maybe SHOULD be used here as well, to have some alternative assessment for a nice study, but that it should be then implemented here, so that we can run all code at once.

How would we do this?

fractaldragonflies commented 2 years ago

Some comments on methods and naming:

> By now, we have the following methods, as far as I see:

> 1. with lists of candidates, search for words with low distance using pairwise alignments and report them as potential matches (currently called "pairwise")

If we like naming by analytical method, then this seems a good name. It seems a natural name, even.

> 2. search for cognates in all the data, using traditional cognate detection methods, and later report the matches involving our candidate languages (currently called "multiple")

'Multiple' seemed more descriptive of the analytical method than just 'analysis', since it pays tribute to clustering and alignment with multiple forms across languages. But it isn't much more informative than 'analysis', given the common use of the word. So maybe 'cluster' and 'partial', corresponding to the different analysis methods? As per my response above, we could default to also performing the evaluation after the analysis -- time-wise it has negligible cost.

> 3. the lingrex-workflow which searches for cognates inside a language family first and then across language families (currently called ?, maybe "two thresholds"?)

Currently called 'analyzelingrex' (so as not to shadow the name 'lingrex'), along with 'lingrexfscore' and 'lingrexreport'. 'Two thresholds' would move away from naming the analytical method, but it does capture an important part of the optimization of the model. 'Internal-external' as in the LingRex code, or 'within-across', captures the clustering domains; or 'dual-scope', 'dual-domain'? (A short brainstorm on names.)
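
To make the 'two thresholds' idea concrete, here is a conceptual sketch; this is not the lingrex API, just two lingpy clustering passes with a strict and a loose threshold, and the file name, thresholds, and column names are assumptions:

```python
# Conceptual sketch of the two-threshold idea (not the lingrex API): a
# strict pass for cognates, then a looser pass whose additional
# cross-family matches become borrowing candidates.
from lingpy import LexStat

STRICT, LOOSE = 0.45, 0.60   # illustrative thresholds

lex = LexStat("sabor-wordlist.tsv")
lex.get_scorer(runs=1000)

# Pass 1: strict threshold; in the real workflow this pass operates
# inside each language family only.
lex.cluster(method="lexstat", threshold=STRICT, ref="internalcogid")

# Pass 2: looser threshold across all languages; clusters that span
# families without a shared strict cognate set flag potential borrowings.
lex.cluster(method="lexstat", threshold=LOOSE, ref="externalcogid")
```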

> 4. train a recurrent NN (LSTM, Transformer) with borrowings from the candidate languages and measure the entropies to identify borrowings (call it "transformer"?); this method is not implemented here, so, so far, results were reported based on code by @fractaldragonflies

The question is how and when we do that.

> We can think of enhancing all methods further (apart from 4, which would need to be added here first). We might need one method like an SVM that wraps up different kinds of evidence, to make this more interesting for computational linguistics people.

I read about the use of an SVM by Alina Cristea (on borrowed Latin words in Romance languages) using just tokens, but with a much richer feature set than we used previously. She found it superior to a direct classification RNN. But yes, an SVM or another appropriate method could work. By 'wraps up different kinds of evidence', are you talking about features based on forms, or additional problem features?

fractaldragonflies commented 2 years ago

Right now we have - well not all cleanly separated yet, but getting there:

Some issues:

LinguList commented 2 years ago
  1. I think splitting evaluation from analysis makes the code more modular: analysis yields output x, evaluation takes x to produce an evaluation report. That is what I think, to avoid (by accident) using different evaluation code for different methods.

  2. I am not necessarily saying we need to add the PyBor NN, etc., but rather that, since it was often reported as an alternative result in our discussions (if I understood this correctly), we should stop comparing to these methods if we do not want to use them. It makes sense to say: the PyBor-and-beyond transformer code does not belong here, as it uses a supervised setting with test-train splits, while our pairwise, multiple, etc. methods take a different approach.

  3. SVMs can be used to learn how to aggregate results: we feed the alignment scores of local alignments, semi-global alignments, etc., along with many other scores, into a vector, and then have the SVM learn an output (1 or 0). That's what I meant; I can try to show how this could work (in these binary classification tasks, SVMs typically work very well, indeed; see the sketch below). I also read this study by Cristea, and I could independently confirm their results, as I was also experimenting with NNs for classification as provided by SKLEARN (see https://arxiv.org/abs/2204.04619).
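
A minimal sketch of that aggregation idea, assuming scikit-learn; the feature layout and toy data are illustrative:

```python
# Sketch of SVM-based score aggregation (toy data; the feature layout is
# an assumption): each word pair is a vector of scores, label 1 = borrowed.
import numpy as np
from sklearn.svm import SVC

# Columns could be e.g. [global alignment distance, local alignment distance].
X = np.array([
    [0.20, 0.15],
    [0.80, 0.75],
    [0.25, 0.30],
    [0.90, 0.85],
])
y = np.array([1, 0, 1, 0])  # 1 = borrowed, 0 = not borrowed

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.22, 0.20]]))  # low distances -> predicted borrowed
```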

LinguList commented 2 years ago

Yes, the evaluation is also a problem. Since there are no full cognates, we cannot use b-cubed scores.

LinguList commented 2 years ago

How about I double-check and run the code, method by method, next week, and we then discuss again? I'd start with pairwise and get back to you. We need to make sure that the code is modular and easy to handle. Typically, if a simple method does not work, nothing works, so we should aim for simple methods for now, and also for simple, clear code.

fractaldragonflies commented 2 years ago

I hope to finish an iteration of better-partitioned code and push it to the repo this weekend -- well, maybe Monday if I get bogged down. I can do a PR and you can check it; merge if OK; then it's available to run. Evaluation for the Cluster, Partial, and LingRex (dual) methods would be the same function; likewise for reports.

Pairwise is a different animal from the others at the moment. Reporting of detail diagnostics is similar, but evaluation and cognate sharing are strictly versus the donor languages, so there is no issue of excluding cognates that do not include donors. Of course, the precision, recall, F1-score, and accuracy calculation is the same function once the predicted status (match with a donor) and the actual borrowed status are determined.
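
For concreteness, a sketch of that shared scoring function, assuming binary vectors of predicted and true borrowed status per form (the function name and toy data are illustrative):

```python
# Shared evaluation sketch: precision, recall, F1, and accuracy from
# binary predicted and true borrowed-status vectors.
def binary_scores(pred, true):
    tp = sum(p and t for p, t in zip(pred, true))
    fp = sum(p and not t for p, t in zip(pred, true))
    fn = sum(not p and t for p, t in zip(pred, true))
    tn = sum(not p and not t for p, t in zip(pred, true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pred)
    return precision, recall, f1, accuracy

print(binary_scores([1, 1, 0, 0], [1, 0, 1, 0]))  # -> (0.5, 0.5, 0.5, 0.5)
```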

LinguList commented 2 years ago

For me, the workflow should always split analysis from evaluation, since we can then think of an application in OTHER contexts, where evaluation is not possible or needed, you see? If you do a clean-up of things and a PR, I'll check that and then go through all of the code next week and see where we are. From there, we can then advance and discuss how to implement an SVM model that learns to pick from the best hints of multiple methods taken together.

fractaldragonflies commented 2 years ago

**Donor-focused evaluation.** Having implemented donor-focused evaluation for the cluster/multiple alignment methods -- no small effort, but now quite concise -- I realize that we could better capture the distinctions by creating 3 categories for borrowing prediction and borrowing truth. With 3 categories, we could also directly use the scikit-learn f1_score function to extract 'micro', 'macro', and 'weighted' scores, where 'micro' reflects a particular view such as donor-focused.

Variables could be:

With this 3-way classification of predicted and true status, we could create a confusion matrix and various F1-related scores.
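
A small sketch of the three-category scoring with scikit-learn; the category coding here (0 = not borrowed, 1 = borrowed from donor, 2 = borrowed from elsewhere) and the toy labels are assumptions for illustration:

```python
# Three-category confusion matrix and averaged F1 scores with scikit-learn.
from sklearn.metrics import confusion_matrix, f1_score

true = [0, 1, 2, 1, 0, 2, 1]  # 0 = not borrowed, 1 = from donor, 2 = other
pred = [0, 1, 1, 1, 0, 2, 0]

print(confusion_matrix(true, pred))
print(f1_score(true, pred, average="micro"))
print(f1_score(true, pred, average="macro"))
print(f1_score(true, pred, average="weighted"))
```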

I'll review this more carefully after completing our pass of cleaning code, reducing code, and implementing the workflow.

Here is an example from Medium

LinguList commented 2 years ago

I have not really given it much thought so far. But if you run Spanish vs. N languages to find Spanish borrowings, it is a binary decision, and a four-field table, right? The method gives 1 for a word in our non-Spanish list if it judges the word to be borrowed, and 0 if it judges the word not to be borrowed. Then you can do the same with Portuguese. If we stick to this schema, it is easy to follow for readers, right?

LinguList commented 2 years ago

And being easy to follow for readers is one of the most important aspects of evaluations. A combination can be done later, but for now, I'd report one score for Spanish, one for Portuguese for all methods, and for the methods that theoretically can identify borrowings from other language families, I'd report a general classification "is borrowed" vs. "is not borrowed".

LinguList commented 2 years ago

But that's probably exactly what your averaging procedure suggests, right?