lingpy / pybor

A Python library for borrowing detection based on lexical language models
Apache License 2.0

Define the primary goal of the library #6

Closed LinguList closed 4 years ago

LinguList commented 4 years ago

We have some problems with the code, since it was never stated what it is supposed to do. By now, it:

  1. plots distributions
  2. supposedly calculates things like entropy from some data
  3. requires rather specific wordlists and implicitly hacks certain of their elements (like the charlist, which is supposed to delete things like question marks; this is not really discussed, and it is no longer needed, since lexibank data has already deleted these characters from all forms explicitly)
  4. something more?
LinguList commented 4 years ago

I think:

  1. the entropy distribution plots are typical plotting tasks that are hard to test and should not be a core part of the library's functions. They are placed in a script plot.py, which is called from the command line in commands, but they can also be ignored; distributions could just as well be plotted in R or with other libraries

  2. as we talk about monolingual borrowing detection, we may want to revise this, as we are not really detecting borrowings right now, right? I would very much like to have one function TRYING to do supervised borrowing detection, but so far I do not know where this function has been proposed within the library. The plots are for illustrating problems, but the core part for understanding whether it works or not is to have a detection function, supervised or unsupervised, and to see how it performs (!)

  3. having code that evaluates a word in the context of other words seems useful and is the primary thing happening now, based on ngram and other code. So we have a bunch of words, we have some extra information for supervised learning, and then we can train the model and do something with it; but so far this is mostly done for plotting, while I want to see the results, and examples of what it can do.

So what is the basic structure here? It seems to me this was never really discussed.

Maybe it is like this:

  1. some dataset is taken from somewhere (lexibank), and the data needs to be annotated for borrowed words
  2. if the data is annotated for borrowed words, we can train a method (a markov model or similar) for supervised borrowing detection (whether it works or not), based on the annotation
  3. based on the method we have trained, we can make direct judgments on words and identify, e.g., more borrowings for new words fed to the method, or by having only a fraction of the data annotated, we can annotate the rest
  4. we can also analyze our models (this is what most of the code is doing now), e.g., by comparing the distributions to see if the method would work well for supervised borrowing detection
  5. we can visualize the analysis by plotting parts (highly specific, not tested)
  6. we need to provide code to evaluate how well borrowings are detected

So I propose to have some longer thoughts on exactly this and to discuss how we manage the code. I think much of the analysis part (4 here) has so far been the major aspect of the code, but we need to specify it again: we are trying to do supervised detection of borrowings from monolingual data. This is also the evaluation experiment we are doing here.

Maybe "mobor" is not the best name. It is distribution-based borrowing detection, not necessarily monolingual, although we work with monolingual models for now.

I think it is crucial to use the next week to think from the start again about the major goals of this library and of the study. The goal cannot solely be to have nice plots or to analyze distributions. It must consist in providing direct results.

tresoldi commented 4 years ago

The plotting of distributions could be separated from the computation, and in a way it already is (see how the results are written to a tabular file), but for code elegance the split should be clearer. Not so urgent, however, as the raw entropies don't really give much information on their own, and the plot is as much a final and mandatory product as the scores.

As for the data, I agree that we could separate it a bit from the Lexibank WOLD dataset, but there are already improvements in this sense. If you look at my changes, while it still requires some WOLD-specific features (like the "borrowed" column), it is more like a general tool -- it should work out-of-the-box with datasets like ASJP and NorthEuralex that have the same annotation (and you can in fact call them from the command line).

I would wait for a minimal integration of the RNN code before reorganizing, but I follow @LinguList here: we should specify/discuss in a clearer way what this package is supposed to do. In general terms, I'd say it allows one to compare distributions of entropy, both within a single collection of words (like a single TOKENS column) and across different collections (either by splitting on BORROWED/NATIVE or by different languages, as a full-powered language identifier).
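
For concreteness, here is a minimal sketch of such a comparison, with made-up toy data and a naive character-unigram model (illustrative only, not the package's actual code; the real models are the ngram/Markov and RNN ones discussed above):

import math
from collections import Counter

def unigram_model(words):
    # add-one-smoothed character-unigram model fitted on a list of words
    counts = Counter(symbol for word in words for symbol in word)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve mass for unseen symbols
    return lambda symbol: (counts[symbol] + 1) / (total + vocab)

def word_entropy(word, prob):
    # average negative log2 probability per symbol
    return -sum(math.log2(prob(s)) for s in word) / len(word)

native = ["hand", "stone", "water"]        # toy data
borrowed = ["garage", "ballet", "pizza"]   # toy data

prob = unigram_model(native)
print(sum(word_entropy(w, prob) for w in native) / len(native))
print(sum(word_entropy(w, prob) for w in borrowed) / len(borrowed))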

LinguList commented 4 years ago

One thing that is currently not done yet is the possibility to filter for one type of borrowing in the annotated part of supervised borrowing detection. If one tags only WOLD's French borrowings in English, for example, the supervised learning of borrowed words is not that bad.

So what it breaks down to, in terms of abstraction, is in my opinion to have (1) a cldf/lexibank dataset as a minimal requirement for the input data, to filter (2) by a specific language within this dataset, and to derive (3) the supervised part of the data from one of the columns in this dataset (which should be flexible; e.g., the "borrowed" annotation in WOLD is very specific, but one can also have a column only providing the source language, etc.).

In fact, with the Kluge data, which is extensively annotated and which we used last time, one could make a smaller subset of German words with high annotation quality as a new cldf-dataset that one can use for testing.

All custom and data-specific functions should be added to an examples/ folder, or a code/ folder, not inside src/ but at the same level.

Note that this is not necessarily what I did, so I would also encourage writing the code differently, but when I worked with the package for the first time, I did this to sort the different things happening there, in order to understand what is going on. Now that I see it more clearly, I think all WOLD-specific things should disappear later on, and we should concentrate very strictly on the basic tasks:

  1. load data from cldf and filter it to make a language model that can discriminate between a "normal" word and a non-normal word, based on criteria you guys choose (as this is not my area of expertise)
  2. allow analyzing the data in some form, and visualizing the analysis (as an extra part of the library)
  3. allow testing the performance of a language model in discriminating words

I don't know if more is really needed.

fractaldragonflies commented 4 years ago

I'll respond in several parts. We can compile/synthesize our primary goal based on our discussion.

My short term motivation and goal was to get what I had done for my research, and specifically for the draft paper, into a format and repository that you (Mattis and Tiago) and César could accept as sound, usable, and hopefully useful code that supports the results and findings of the draft paper. And furthermore that we could use to publish as additional material accompanying the paper.

I didn't know what that entailed at the time, but figured that having someone else look at what I've done and be able to vouch for the results and findings of the paper was well worth the investment. Furthermore, I knew I would be learning new ways of working and new tools (for me) that would help me in the short and long term. And also, working with folks from Max Planck is unlikely to be perceived as a minus in my career experience.

So my original purpose was:

Put code in a format that would be readable and understandable to others, conform with standards, support the results and findings of the paper, and allow others to verify this.

--- More to come ---

fractaldragonflies commented 4 years ago

A somewhat longer-term motivation and goal was to make this library an assembly/offering of the good parts of the research that we (individually, @tresoldi and @LinguList) are doing on loan word detection and discrimination, leveraging all the other advantages of LingPy and open data resources (@xrotwang), to make the library useful in general for ourselves and anyone else working on problems of loan word detection, or who might benefit from having loanword detection tools available for related historical linguistic work.

To that end, the factors that @LinguList and @tresoldi elaborated above, and others, are what I see as part of this library:

I haven't discussed much with @tresoldi about his work on entropy modeling or loan word detection, so I am underrepresenting what we can gain from his work. Similarly, I haven't discussed all the benefits of using the infrastructure of open data, LingPy, other tools, or the various Max Planck repositories.

OK, that was more a list of stuff than a coherent primary goal. But maybe I'm corralling the big beast as part of the discussion!

LinguList commented 4 years ago

So I think we can basically split this into:

  1. borrowing detection, three models, based on a unified model for lexical data
  2. language model analysis, potentially three separate parts for three models
  3. evaluation of borrowing detection, one part (may include k-fold things; is this the same as cross-validation? I am helpless with the terminology, but it helps me to understand which aspect belongs to which part)
  4. visualization

I think this will help us to make for ourselves clearer what part of code is doing what.

Pragmatically speaking, the analysis part is what is used to explain why something is not working, right? So the starting point would be the methods for borrowing detection, as they need to be there to give a very rough, crude, and direct exposure of the problems. In a paper, they come first, in the materials and methods part. They take one word as input and return, for the criterion chosen, whether it is a borrowing or not.

The analysis of the probability distributions is a smaller part in the evaluation section that may be needed to explain why things are not working, or where they work better. But the primary evaluation would be in F-scores, false positives and negatives, as these are easy to understand, also for normal linguists.

Plots can be treated as custom aspects, depending on how well they can be generalized, but they also belong to the analysis, which I would put behind the primary goal of having a function for supervised borrowing detection.

The call signature should be unified, with parameters being:

  1. a set of words (in different encodings)
  2. a criterion to distinguish between the native words and what to compare with them (borrowing true-false, but also a specific donor language, or semantic classes)
  3. a threshold by which a word is judged to belong to one of the two classes

Data selection would be done in a specific module that accesses cldf, applies some operations to it, and brings the data into the form we need for input (either two lists of words or a list of words tagged for the native-vs-something-else distinction). As it is the first step, it can also take just a part of the data, to allow for evaluation and the like.
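
A minimal sketch of what such a data module could provide (the function and field names here are hypothetical, only to illustrate the interface):

import random

def select_data(rows, language, criterion, test_ratio=0.2, seed=42):
    # rows: dicts with fields such as "Language", "Segments", "Borrowed"
    # criterion: callable deciding the native-vs-something-else tag per row
    tagged = [[row["Segments"], int(criterion(row))]
              for row in rows if row["Language"] == language]
    rnd = random.Random(seed)
    rnd.shuffle(tagged)
    cut = int(len(tagged) * (1 - test_ratio))
    return tagged[:cut], tagged[cut:]

# e.g.: train, test = select_data(rows, "English", lambda row: row["Borrowed"] == "yes")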

Plots and analyses can be offered in a generic way, where this makes sense.

So here's the question: how far away are we from achieving the detection part with evaluation only against f-scores (false negatives, false positives, etc.)?

I can easily design the data module (we have it already more or less). The evaluation is also rather trivial. So can we have the detection illustrated for Markov models?

I think it is best we start from these parts, if we agree on this procedure. For a paper it would make most sense, as it would also be more straightforward and easy to understand for linguists.

LinguList commented 4 years ago

In terms of work division, I think we can do it like this: We start from a function that helps to DETECT borrowings as described. Assuming uniform data as input, with a function call like:

class Method:
    def __init__(self, data, **kw):
        ...
    def estimate(self, word, threshold=None, **kw):
        ...
        return True if ... else False  # alternatively, a score between 0 and 1 or similar

And the data structure is:

[[word, 1], [word2, 0], [word3, 1]]
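
To make the idea concrete, here is a self-contained dummy implementation of this interface, using a naive per-class character-unigram model (illustrative only; the actual Markov, RNN, and sound-based methods would replace the internals):

import math
from collections import Counter

class DummyMethod:
    def __init__(self, data, **kw):
        # data: [[word, 1], [word2, 0], ...] with 1 = borrowed, 0 = native
        self.counts = {0: Counter(), 1: Counter()}
        for word, label in data:
            self.counts[label].update(word)

    def _logprob(self, word, label):
        # add-one-smoothed log probability of the word under one class
        counts = self.counts[label]
        total = sum(counts.values())
        vocab = len(counts) + 1
        return sum(math.log((counts[s] + 1) / (total + vocab)) for s in word)

    def estimate(self, word, threshold=0.0, **kw):
        # positive score: the word looks more like the borrowed class
        score = self._logprob(word, 1) - self._logprob(word, 0)
        return score > threshold

method = DummyMethod([["hand", 0], ["stone", 0], ["garage", 1], ["ballet", 1]])
print(method.estimate("pizza"))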

As @tresoldi did the Markov models and should port them from lingpy, I suggest that @tresoldi can start here, @fractaldragonflies could do the same for the RNNs, and I will do it for the strange-sound approach.

We should use one unified test dataset for this. We can pull out some 500 (half of them) words from English and tag them for borrowed or not, in this form, and add this as a Python module, a list.

In this way, we can work on this during the next weeks, using this as development data, and then see where we get?

I can also produce the list from the German data, maybe easier to do. So should we do this? We have three scripts inside our "mobor" package (I'd probably consider renaming it to pybor or something similar): rnn.py, markov.py, and sounds.py

fractaldragonflies commented 4 years ago

I largely agree with...

So I think we can basically split this into:

  1. borrowing detection, three models, based on a unified model for lexical data
  2. language model analysis, potentially three separate parts for three models
  3. evaluation of borrowing detection, one part (may include k-fold things; is this the same as cross-validation? I am helpless with the terminology, but it helps me to understand which aspect belongs to which part)
  4. visualization.

I had lumped borrowing detection together with distribution analysis at the high level (analysis), but OK with breaking them into separate modules. The distribution analysis that is currently in the command script could move to the analysis module.

The borrowing detection (discrimination) currently focuses on training and then evaluating borrowing for a validation dataset... estimation is part of this of course. Maybe the current emphasis on training the model, and evaluating the model on validation data still pertains to analysis, but the focus on estimation/decision for individual words and application to new text belongs in a separate estimation module for ease of use?

Evaluation by k-fold validation is currently treated as analysis, but it can be moved to a separate module, where it would invoke individual analyses from the analysis or discrimination module.
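
As an illustration, a sketch of what such a separate evaluation step could look like, assuming a Method-style class as discussed above (the helper name k_fold is hypothetical):

def k_fold(tagged_words, method_class, k=5):
    # returns one accuracy score per fold
    scores = []
    fold_size = len(tagged_words) // k
    for i in range(k):
        test = tagged_words[i * fold_size:(i + 1) * fold_size]
        train = tagged_words[:i * fold_size] + tagged_words[(i + 1) * fold_size:]
        method = method_class(train)
        correct = sum(int(method.estimate(word)) == label for word, label in test)
        scores.append(correct / len(test))
    return scores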

Visualization is already separated, so good. Nothing controversial!

When @LinguList says 3 models, do you mean ngram/Markov entropy, recurrent neural net entropy, and the individual 'strange' sound basis? We can also include a direct recurrent neural net as well, on a non-entropy basis.

fractaldragonflies commented 4 years ago

This from @LinguList is right-on:

The call signature should be unified, with parameters being:

  1. a set of words (in different encodings)
  2. a criterion to distinguish between the native words and what to compare with them (borrowing true-false, but also a specific donor language, or semantic classes)
  3. a threshold by which a word is judged to belong to one of the two classes

In the practice (initial experiment) I did, the availability of 'borrowed' from WOLD and a flexible criterion was beneficial. It is a simple matter to provide a function instead of a numerical criterion for deciding loan versus native on the training data. Furthermore, this allows the possibility of finer-grained discrimination -- Norman French borrowed by English, Greek, ... .
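
For example, such criterion functions could look like this (the field names are only illustrative and depend on the dataset):

# treat anything with a high WOLD-style borrowed score as a loan
is_loan = lambda row: float(row["Borrowed_score"]) >= 0.75

# finer-grained: only borrowings from a specific donor language
is_french_loan = lambda row: row["Donor_language"] == "French"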

@LinguList asks:

So here's the question: how far away are we from achieving the detection part with evaluation only against f-scores (false negatives, false positives, etc.)?

This is already a part of the analysis module, it's more a question of porting it to the new structure.

LinguList commented 4 years ago

The borrowing detection (discrimination) currently focuses on training and then evaluating borrowing for a validation dataset... estimation is part of this of course. Maybe the current emphasis on training the model, and evaluating the model on validation data still pertains to analysis, but the focus on estimation/decision for individual words and application to new text belongs in a separate estimation module for ease of use?

I am not sure I understand this completely. For me, the workflow for detection is like this:

  1. load the annotated data by instantiating the class (this is the training data, from which the model is generated; the model is then used to predict for so-far unobserved data)
  2. offer a function that estimates or classifies, for a new word, to which of the two classes it belongs (it is essentially what I call detect)

This two-step workflow can be done with different methods and different parameters. But each method should have this call signature (maybe let's call it "Method.classify(word)" to avoid confusion), with the classification being between 0 and 1.

And each method we have follows one approach to modeling a language. E.g., the strange-sound model can be done with a support vector machine, where I break each word down into its sound spectrum as a simple vector and have an SVM learn from my training data what is more likely, etc.
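
A minimal sketch of this bag-of-sounds idea with an SVM (not the code referenced later; it assumes scikit-learn and segmented words, and the toy data is made up):

from sklearn.svm import SVC

def sound_vectors(words, inventory):
    # one count per sound in the inventory, per word
    return [[word.count(sound) for sound in inventory] for word in words]

train = [["h a n d".split(), 0], ["s t o n e".split(), 0],
         ["g a r a g e".split(), 1], ["b a l l e t".split(), 1]]
inventory = sorted({s for word, _ in train for s in word})

X = sound_vectors([word for word, _ in train], inventory)
y = [label for _, label in train]
clf = SVC().fit(X, y)
print(clf.predict(sound_vectors(["p i z z a".split()], inventory)))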

I'd say: a method is one approach to modeling words, like bunch of sounds, Markov, or RNN, so I'd not split these further, and start with three.

fractaldragonflies commented 4 years ago

We have similar concepts of the work flow, just different focuses.

While the bunch of sounds or strange sound approach comes separately at first, it ideally gets integrated with either of the entropy approaches, or even the direct RNN approach, since it is complementary. (Maybe subsequent version?)

fractaldragonflies commented 4 years ago

I propose an experiment for today. Separate branch of course.

In discussion with @tresoldi, I will try to organize the system into modules (I or @tresoldi ... no need to do this if it is already being done):

This would not get us to the Method class, but maybe it would give us a better picture of the next step of integration and improvement. OK, that means I leave off RNN integration for a bit longer.

fractaldragonflies commented 4 years ago

Yes, with consideration of the very short term experiment proposed above:

As @tresoldi did the Markov models and should port them from lingpy, I suggest that @tresoldi can start here, @fractaldragonflies could do the same for the RNNs, and I will do it for the strange-sound approach.

Sounds good:

I can also produce the list from the German data, maybe easier to do. So should we do this?

Or pyloan, or just loanword, or... Is it a convention to prefix with 'py'? Or common sense for ease of discovery?

We have three scripts inside our "mobor" package (I'd probably consider renaming it to pybor or something similar): rnn.py, markov.py, and sounds.py

LinguList commented 4 years ago

I don't like the term "loan" too much, as a loan is a static thing, while borrowing is the process. We often prefix with py, and for a later publication of the library it is important to have a name that is not taken.

I am now updating our library and adding a new empty package, with instructions, where you can then follow from the beginning, and with test-based code, to avoid having too many things in one place at the same time and too much custom code. I use "pybor" as a working title for now.

Furthermore, the term for the function that I want should be "predict", following what they do in SVM code, where it is also the term in scikit-learn, etc. So we train a model and ask it to predict; I think that is fair. I'll see how far I get today; maybe I can manage to have a first version to illustrate the bag of symbols, but maybe it has to wait till tomorrow. In any case, I just prepared our dataset for getting started.

LinguList commented 4 years ago

I just sent you all an email, @fractaldragonflies and @tresoldi. I find it crucial to consider this as a basic part of the whole library, and the starting point, namely the predict part of a given model. I use an SVM, very simple, and I classify by sounds. I did not test this any further; I just made a sparse matrix and used it. The results are not good for the German sample, but then, borrowings are few there, so we cannot tell anything yet.

In any case: this method could now be evaluated on the whole of WOLD and compared with other methods, and this is exactly where I think we need to start.

The analysis parts that were discussed are probably much more elaborate than this simple estimate, but as I mentioned many times: if you want this to pass peer review, you have to expect that people want to know exactly this: how well prediction works.

So I suggest, as mentioned before (and maybe this is just very trivial, but then it is actually all the better, isn't it?), that you two, @tresoldi and @fractaldragonflies, now do the same for the Markov models and RNNs.

Check the code here, and the usage example here.

LinguList commented 4 years ago

Once we have this, we can discuss analysis, etc., but this is a condition sine qua non.

fractaldragonflies commented 4 years ago

Reports of accuracy, precision, recall, and F-score are an integral part of the analysis module that we were to port.

LinguList commented 4 years ago

Okay, can we single this out? This is evaluation, right? It can be applied to all methods for classification, and all you need to do is specify the input format to arrive at the scores.

I did not have time to add the evaluation module, so if you already have it there, all the better. In this way, we can use it to evaluate the accuracy of the three methods more consistently.

So could you add an evaluate.py script to the pybor/ folder and update the code that I used in example/svm_example.py to show how it can be used?
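
To illustrate, such an evaluate.py could compute the scores roughly like this, given gold labels and a method's binary predictions (the function name is hypothetical):

def evaluate(gold, predicted):
    # gold, predicted: parallel lists of 0/1 labels
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}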