lingpy / pybor

A Python library for borrowing detection based on lexical language models
Apache License 2.0

update code #1

Closed LinguList closed 4 years ago

LinguList commented 4 years ago

@fractaldragonflies and @tresoldi, please check this example illustrating how I imagine that this code package should work.

  1. we add a convenient function to load the data as a wordlist (no saving of data to separate tables anywhere)
  2. we separate plots from analysis
  3. we also use the ngrams code from lingpy that @tresoldi wrote (as I know this better than nltk)

The result is:

1 convenient loading of a wordlist (see mobor.data.Wordlist.from_lexibank)

>>> from mobor.data import Wordlist
>>> wl = Wordlist.from_lexibank('wold', fields=['loan', 'borrowed'], fieldfunctions={"borrowed": lambda x: (int(x[0])*-1+5)/4})
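>>> # (note, assumption for illustration: the fieldfunction rescales WOLD's "Borrowed" score, 1 = clearly borrowed ... 5 = no evidence, onto a 0-1 scale)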

2 simple extraction of a table

import tabulate

table = wl.get_language(
        'English',
        ['concept', 'form', 'tokens', 'sca', 'borrowed', 'loan'],
        dtypes=[str, str, str, lambda x: ' '.join(x),
                lambda x: '{0:.2f}'.format(x), str]
        )
print(tabulate.tabulate(
        table[:20],
        headers=['id', 'concept', 'form', 'tokens', 'sca', 'borrowed', 'loan'],
        tablefmt='pipe'))

3 a new class Markov that can retrieve data from a wordlist

from mobor.markov import Markov
mk = Markov(
        wl, 
        'English', 
        ['concept', 'form', 'tokens', 'sca', 'borrowed', 'loan'],
        dtypes = [str, list, list, list, float, bool]
        )
mk.add_sequences(
        [row['tokens'] for row in mk.now['dicts']])
mk.train(normalize='laplace')

4 a specific plotting module that only does this: plotting data

from mobor.plots import plot_word_distributions
# retrieve distribution for borrowed words
borrowed, unborrowed = [], []
for row in mk.now['dicts']:
    if row['loan']:
        borrowed += [mk.entropy(row['tokens'])]
    else:
        unborrowed += [mk.entropy(row['tokens'])]

# plot the distribution
plot_word_distributions(borrowed, unborrowed, 'test.pdf')
LinguList commented 4 years ago

I would say: please let us work in this direction, to make the code cleaner and library-based. We can even add a command-line interface that handles parameters. But the important part is: let us determine what the core tests are that we want to do (e.g., compare word distributions) and see that we do this with a minimal amount of effort and minimal reliance on heavy third-party libraries.

Note that the setup.py will allow for full replicability, as we can add all third-party libs there, but we should reduce them (ideally not using nltk, but @tresoldi and @fractaldragonflies, you need to see if the NgramModel by @tresoldi is enough to account for the Markov experiments).
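For illustration, a minimal sketch of what such a setup.py could look like; the package name, version, dependency list, and CLI entry point below are assumptions, not the actual file:

# setup.py -- hypothetical sketch; the real dependency list is still to be decided
from setuptools import setup, find_packages

setup(
    name="mobor",
    version="0.1.0",
    description="Borrowing detection based on lexical language models",
    packages=find_packages(),
    install_requires=[
        "lingpy",      # ngram / sequence handling
        "pylexibank",  # access to the WOLD data (assumption)
        "matplotlib",  # plotting
        "tabulate",    # table output
    ],
    entry_points={
        # assumed entry point so "mobor ..." works on the command line
        "console_scripts": ["mobor=mobor.cli:main"],
    },
)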

LinguList commented 4 years ago

Update, I just added command line functionality:

mobor plot_entropy --language=Swahili --file=swahili-entropy.pdf --sequence=sca

will plot the entropies for sca, and the like.

In this spirit, I think, we should carry on with the rest of the code.
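To illustrate the idea (a minimal sketch only; apart from the options shown above, the module and function names are assumptions, not the actual mobor CLI code), such a subcommand could be wired up roughly like this:

# hypothetical sketch of a "plot_entropy" subcommand
import argparse

def main():
    parser = argparse.ArgumentParser(prog="mobor")
    subparsers = parser.add_subparsers(dest="command")

    plot = subparsers.add_parser("plot_entropy", help="plot entropy distributions")
    plot.add_argument("--language", default="English")
    plot.add_argument("--file", default="entropy.pdf")
    plot.add_argument("--sequence", default="tokens", help="column to analyze, e.g. 'sca'")

    args = parser.parse_args()
    if args.command == "plot_entropy":
        # the real command would load the wordlist, train the Markov model,
        # and call mobor.plots.plot_word_distributions as in the example above
        print(args.language, args.file, args.sequence)

if __name__ == "__main__":
    main()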

LinguList commented 4 years ago

@tresoldi, if you look at the markov code by @fractaldragonflies: are there things that nltk offers which we can't handle with your ngrams? If not: how difficult would it be to reimplement them, or how important would these be? I think I'd suggest adding them to the new wrapper (so we don't add them to lingpy, which would require more testing), so we can make our experiments now with this new library.

LinguList commented 4 years ago

I will leave this open until @fractaldragonflies has had a look; you can in fact just merge then, and see in which way more code could be integrated.

tresoldi commented 4 years ago

The ngram functions in lingpy were written as an extension of the ones in nltk: everything in there should be compatible (perhaps with minor differences in the calling paradigm and things like that). So much so that, at the time, I was confirming the output against the one provided by nltk.

Granted, nltk might have changed in these two years, but it seems everything should work out of the box. @fractaldragonflies , could you confirm? The lingpy functions are pretty well documented.
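For reference, a minimal sketch of the nltk side of such a check (assuming nltk's nltk.lm interface, which is what the current experiments rely on; the segmented forms here are made up for illustration):

# toy check of trigram entropy with nltk's Laplace-smoothed language model
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

train_forms = [list("linda"), list("casa"), list("fogo")]  # made-up segmented forms
order = 3

train, vocab = padded_everygram_pipeline(order, train_forms)
lm = Laplace(order)
lm.fit(train, vocab)

# per-ngram cross-entropy of a held-out form; lingpy's ngram code would need
# to produce comparable numbers for the Markov experiments
test = list(ngrams(pad_both_ends(list("lago"), n=order), order))
print(lm.entropy(test))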

fractaldragonflies commented 4 years ago

A bit overwhelming at first. And a bit at second glance as well. Also, I will need a bit of schooling with respect to GitHub interaction -- I think you are awaiting some action/approval on my part.

Comments on Mattis' proposal results:

  1. Convenient load from word list. Agreed. I am already learning more about Wordlist's capabilities just from the example.
  2. Simple extraction of a table. Agreed. Replaces construction and storage of individual tables. Again, learning more about Wordlist's capabilities.
  3. New class Markov that can retrieve data from a Wordlist. Agreed. Would invoke the simple extraction from 2. This doesn't prevent future functionality where the data source is something other than the Wordlist format. Use of Wordlist does tie users more to the current ecosystem.
  4. Specific plotting module that does just plotting. Agreed, with a consideration. Invoked by an instance of Markov, which makes sense (so far) as the plot is of Markov-computed entropy distributions. Consideration: a neural net entropy model would also generate such plots. So the use of the plotting function is more of a 'protocol', implemented for both Markov and neural models.

Comments on how it should work:

  1. Convenient function for loading data. Agreed. Illustrated well in the results.
  2. Separate plots. Agreed, with the consideration that it would likely be a 'protocol' implemented by various entropy models.
  3. Use ngram model from @tresoldi. Tiago vouches that the model should be consistent. In the example from 3 above, I see the ngram model using 'laplace' normalization, whereas in my modeling I used Kneser-Ney. Are both options available? NLTK 3.5 corrects an error where computing the entropy of new tokens failed for the 3-gram (2nd-order conditional) model when the 3-gram hadn't been seen before. Some verification of the model from @tresoldi is required, but I am not opposed as long as we get similar results. A future possibility is to support other models.

With respect to supporting other models, this becomes a more general theme if we consider the neural network model as well. For discussion; a rough sketch of what such a shared interface could look like follows below.
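As a discussion aid, a minimal sketch of such a shared 'protocol'; the class and method names here are hypothetical, not existing mobor code:

# hypothetical sketch of a shared entropy-model interface; names are illustrative only
from typing import List, Protocol

class EntropyModel(Protocol):
    def add_sequences(self, sequences: List[List[str]]) -> None:
        """Register the training sequences (e.g. segmented word forms)."""
        ...

    def train(self) -> None:
        """Fit the model (Markov/ngram, neural, ...)."""
        ...

    def entropy(self, sequence: List[str]) -> float:
        """Per-word entropy used for the borrowed/unborrowed distributions."""
        ...

def word_entropies(model: EntropyModel, words: List[List[str]]) -> List[float]:
    # any model implementing the protocol can feed the same plotting function
    return [model.entropy(word) for word in words]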

LinguList commented 4 years ago

I will merge this then for now. So you should run

$ git pull

in the folder, and you will have the updates.

The plotting function can be modularized further. It is only a first example to get things running. The advantage is that it will now check for dependencies, so it is much more replicable; we really separate data and code, will be able to add other wordlists in the future, etc.

@tresoldi, please, can you look into the question of normalization (Laplace, etc.)? I didn't find the one that @fractaldragonflies used (Kneser-Ney).
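For context, a minimal sketch of what the 'laplace' option amounts to (standard add-one smoothing); this is illustrative only, not the actual lingpy or mobor implementation:

# Laplace (add-one) smoothing of an ngram probability, for illustration only
def laplace_probability(ngram_count: int, context_count: int, vocab_size: int) -> float:
    """P(w | h) = (c(h, w) + 1) / (c(h) + |V|)."""
    return (ngram_count + 1) / (context_count + vocab_size)

# e.g. an unseen trigram still gets non-zero probability:
print(laplace_probability(0, 10, 25))  # 1 / 35

Kneser-Ney, by contrast, discounts observed counts and redistributes the mass according to how many distinct contexts a unit occurs in, so it is a different and more involved scheme than simple add-one normalization.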