lingpy / lingpy3

LingPy 3.0
GNU General Public License v3.0

Proposal for a wordlist implementation #5

Closed xrotwang closed 7 years ago

xrotwang commented 7 years ago

The proposed implementation is somewhat close to the current one in that it stores the data as a list of lists. The API to access data has changed considerably, though. While sometimes more verbose, I think it is also more powerful.

Below is a translation of the old API to the new one.

- LingPy 2:
```python
wl = Wordlist('test.tsv')
print wl[1]
print wl[1, 'concept']
print wl.ipa
print wl.language
print wl.concept
print wl.get_dict(col="l1", entry="ipa")
print wl.get_list(row="foot", entry="cognates", flat=True)
```


- LingPy3:
```python
from clldutils.dsv import reader
from lingpy3.basic.wordlist import Wordlist

rows = list(reader('../lingpy/test.tsv', delimiter='\t'))
wl = Wordlist(rows[0], rows[1:])
print wl[1]
print wl[1, 'concept']
print [rows for _, rows in wl.get_by_concept('ipa')]
print wl.languages
print wl.concepts
print wl.get_dict_by_concept('ipa', language='l1')
print wl.get_slices('cognates', rows=wl.filter(concept='foot'))
```

The API to add columns has changed, too. The callable passed into `add_col` receives the data of a row as a `dict` and must return the value for the new column:

```python
wl.add_col('x', lambda i: i['ipa'] + i['cognates'])
print wl.get_slices('x')
```
LinguList commented 7 years ago

This looks very promising to me. To avoid us working on the same things at the same time, I propose that I focus first on the sound_classes.py script, which contains code like ipa2tokens as well as other useful string functions. I'll split it in two: one sound-class-related script and one for "utility functions". E.g., the clean_string function will be useful for lexibank, while the other functions will be more important for lingpy-internal things.

LinguList commented 7 years ago

Feel free to merge anytime. I won't merge immediately (in case you still want to add things).

xrotwang commented 7 years ago

Just added an implementation of get_etymdict, which, in my opinion, shows the power of the new API.
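
To make the discussion below concrete, here is a minimal sketch of what an etymological dictionary boils down to: grouping the rows of a wordlist by cognate-set ID, with one list of forms per language. The column layout, the toy data, and the helper name `make_etymdict` are assumptions for illustration, not the actual lingpy3 API.

```python
# Minimal sketch of an etymological dictionary, NOT the actual lingpy3 API:
# group wordlist rows by cognate-set ID, one list of forms per language.
rows = [
    # (id, language, concept, ipa, cognateset) -- hypothetical toy data
    (1, 'l1', 'foot', 'fut', '1'),
    (2, 'l2', 'foot', 'fotos', '1'),
    (3, 'l1', 'hand', 'hant', '2'),
]
languages = ['l1', 'l2']

def make_etymdict(rows, languages):
    etymdict = {}
    for _, language, _, ipa, cogset in rows:
        # every language gets a list, so "no counterpart" is simply []
        per_language = etymdict.setdefault(
            cogset, {lg: [] for lg in languages})
        per_language[language].append(ipa)
    return etymdict

etd = make_etymdict(rows, languages)
# etd['1'] == {'l1': ['fut'], 'l2': ['fotos']}
# etd['2'] == {'l1': ['hant'], 'l2': []}
```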

xrotwang commented 7 years ago

Btw.: Is there a good reason to have 0 as the marker for "no counterpart in language" in an etymdict, rather than []? I'd think it's weird to work with a data structure where you also have to check types before doing anything.

LinguList commented 7 years ago

No, you're right, there's no reason for this; it's just a leftover from my limited knowledge of Python a few years ago. And since the output for each taxon regarding a cognate set is always a list, as we may have synonymous entries which are cognate (or cross-semantic cognate models), an empty list is much more natural.
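
A quick illustration of the type-checking problem, using toy values rather than lingpy code: with a 0 sentinel every consumer has to branch on the type, while with an empty list the same code works uniformly.

```python
# Toy illustration, not lingpy code: a cognate set with no counterpart
# in language 'l2', marked either with 0 or with an empty list.
with_zero = {'l1': ['hant'], 'l2': 0}
with_list = {'l1': ['hant'], 'l2': []}

# With the 0 sentinel, every consumer needs a type check first:
forms = [f for entry in with_zero.values()
         for f in (entry if isinstance(entry, list) else [])]

# With [] as the marker, the same code needs no special case:
forms2 = [f for entry in with_list.values() for f in entry]

assert forms == forms2 == ['hant']
```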

LinguList commented 7 years ago

BTW: for the partial cognate set case, an etymological dictionary would need to split the content of the reference field. Given what we discussed yesterday, it seems more straightforward to have a second function called partial_etymdict (or similar) do this job, right?
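
A sketch of the splitting step such a function might perform, assuming partial cognate IDs are stored as a space-separated string with one ID per morpheme; the storage format, toy data, and the name `partial_etymdict` are all assumptions here, not the proposed implementation.

```python
# Hypothetical sketch: partial cognate IDs stored as a space-separated
# string, one ID per morpheme; group (row id, language) pairs by each ID.
rows = [
    # (id, language, partial cognate IDs) -- toy data
    (1, 'l1', '1 2'),
    (2, 'l2', '1'),
    (3, 'l1', '2 3'),
]

def partial_etymdict(rows):
    etymdict = {}
    for id_, language, partial_ids in rows:
        for cogid in partial_ids.split():  # split the reference field
            etymdict.setdefault(cogid, []).append((id_, language))
    return etymdict

petd = partial_etymdict(rows)
# petd['2'] == [(1, 'l1'), (3, 'l1')]
```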