Wordlist Specification - Githubissues

LinguList commented 8 years ago

My earlier thoughts and reports on functionality, which have not changed in large parts, are given here:

http://lingpy.org/tutorial/lingpy.basic.wordlist.html

Generally, one should be able to trigger output like this:

>>> wl.ipa
[['wɔldemɔrt', 'valdemar', 'vladimir', 'volodimir'],
 ['hæri', 'haralt', 'gari', 'hari'],
 ['lɛg', 'bain', 'noga', 'noha'],
 ['hænd', 'hant', 'ruka', 'ruka']]
>>> wl.cognate
[[6, 6, 6, 6], [7, 7, 7, 7], [4, 3, 5, 5], [1, 1, 2, 2]]

Based on a file like this:

ID   CONCEPT     COUNTERPART   IPA         DOCULECT     COGID
1    hand        Hand          hant        German       1
2    hand        hand          hænd        English      1
3    hand        рука          ruka        Russian      2
4    hand        рука          ruka        Ukrainian    2
5    leg         Bein          bain        German       3
6    leg         leg           lɛg         English      4
7    leg         нога          noga        Russian      5
8    leg         нога          noha        Ukrainian    5
9    Woldemort   Waldemar      valdemar    German       6
10   Woldemort   Woldemort     wɔldemɔrt   English      6
11   Woldemort   Владимир      vladimir    Russian      6
12   Woldemort   Володимир     volodimir   Ukrainian    6
13   Harry       Harald        haralt      German       7
14   Harry       Harry         hæri        English      7
15   Harry       Гарри         gari        Russian      7
16   Harry       Гаррi         hari        Ukrainian    7

But now that I know better than I did in the past, I would not make this a class attribute, as it has led to inconsistencies in the current lingpy. It is also not consistent, as language and concept so far return one-dimensional lists.
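As a rough illustration of the two-dimensional output above, the following sketch pivots rows of such a file into a concept-by-doculect matrix through a plain function instead of a class attribute. The name `get_matrix` and the list-of-dicts row layout are assumptions made for this example, not lingpy API. Rows follow first-seen concept order and columns follow sorted doculect order, which differs from the ordering in the output shown above.

```python
def get_matrix(rows, column, missing=None):
    """Pivot `column` into a concept-by-doculect matrix (hypothetical helper)."""
    doculects = sorted({r["DOCULECT"] for r in rows})
    concepts = []
    for r in rows:
        if r["CONCEPT"] not in concepts:
            concepts.append(r["CONCEPT"])
    # Map (concept, doculect) pairs to the requested cell value.
    lookup = {(r["CONCEPT"], r["DOCULECT"]): r[column] for r in rows}
    return [[lookup.get((c, d), missing) for d in doculects] for c in concepts]

# A subset of the example file, already parsed into dicts of strings.
rows = [
    {"ID": "1", "CONCEPT": "hand", "IPA": "hant", "DOCULECT": "German", "COGID": "1"},
    {"ID": "2", "CONCEPT": "hand", "IPA": "hænd", "DOCULECT": "English", "COGID": "1"},
    {"ID": "3", "CONCEPT": "hand", "IPA": "ruka", "DOCULECT": "Russian", "COGID": "2"},
    {"ID": "5", "CONCEPT": "leg", "IPA": "bain", "DOCULECT": "German", "COGID": "3"},
    {"ID": "6", "CONCEPT": "leg", "IPA": "lɛg", "DOCULECT": "English", "COGID": "4"},
    {"ID": "7", "CONCEPT": "leg", "IPA": "noga", "DOCULECT": "Russian", "COGID": "5"},
]

ipa = get_matrix(rows, "IPA")
cognates = get_matrix(rows, "COGID")
```

Because the column name is an argument, the same function serves any column, and the one-dimensional concept and doculect lists stay separate from it.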

LinguList commented 8 years ago

In retrospect, I would also say now that there are only two basic types of cells in a wordlist (but both need to be supported): string and list of strings. Lists of strings would be needed for segments (tokens) and partial cognate sets, and would be produced by the function lambda x: x.split(' '). Strings would be used for all the rest.

While we could easily fix the behaviour by column name, we need to be aware that we still need some flexibility, as people may want to work with two different kinds of partial cognates (expert vs. machine), and also with different kinds of phonetic segments. So the __init__ function should allow the user to specify the content type of a column.
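A minimal sketch of what such an __init__ could look like, assuming the user passes the names of the list-typed columns. The class `MiniWordlist`, its `list_columns` parameter and the column names are hypothetical and only serve to illustrate the idea; they are not the lingpy Wordlist.

```python
def split_cell(value):
    """Turn a space-separated string into a list of strings."""
    return value.split(" ")

class MiniWordlist:
    """Hypothetical stand-in, not the lingpy Wordlist class."""

    def __init__(self, header, rows, list_columns=("TOKENS",)):
        self.header = list(header)
        self.list_columns = set(list_columns)
        # Convert list-typed columns on load; everything else stays a string.
        self.rows = [
            [split_cell(v) if h in self.list_columns else v
             for h, v in zip(self.header, row)]
            for row in rows
        ]

wl = MiniWordlist(
    ["ID", "DOCULECT", "TOKENS", "EXPERT_IDS", "AUTO_IDS"],
    [["1", "German", "h a n t", "1 2", "1 3"]],
    list_columns=("TOKENS", "EXPERT_IDS", "AUTO_IDS"),
)
```

Here two competing partial-cognate columns (expert vs. machine) are both declared as lists, without relying on fixed column names.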

Integers are used in the current lingpy for various operations, but one can potentially ignore them. They are, however, useful in creating cognate identifiers, as one can make sure that they are unique by just incrementing a number. This is done in a couple of functions, e.g. wordlist.renumber('column'), which play an important role at the moment. The question is whether integers should be added as a third cell data type (with a corresponding list of integers for partial cognates), or whether a similar behaviour could be triggered by using strings. An advantage of integer cognate IDs is that one can easily add a new one which has not yet been used by just incrementing max(cogids). If there's a workaround for similar behaviour in strings, this would spare us from adding integers as a cell data format.
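One possible workaround, sketched under the assumption that cognate IDs are stored as strings of decimal digits: convert to integers only at the moment a fresh identifier is needed. Both helpers are hypothetical; `renumber_strings` only mimics the idea of renumbering and is not the lingpy wordlist.renumber method.

```python
def next_cogid(cogids):
    """Return an unused cognate ID as a string (hypothetical helper)."""
    # Only decimal strings take part in the maximum; other labels are ignored.
    numeric = [int(c) for c in cogids if c.isdecimal()]
    return str(max(numeric, default=0) + 1)

def renumber_strings(values):
    """Map arbitrary string labels to consecutive numeric strings."""
    mapping = {}
    renumbered = []
    for value in values:
        if value not in mapping:
            mapping[value] = str(len(mapping) + 1)
        renumbered.append(mapping[value])
    return renumbered

fresh = next_cogid(["1", "1", "2", "5"])
labels = renumber_strings(["hand-a", "hand-a", "hand-b"])
```

The cost is one integer conversion per lookup, which may or may not matter for large wordlists.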

SimonGreenhill commented 8 years ago

@LinguList this strikes me as a good case for a lightweight class with inheritance e.g.:


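class BaseCell(object):
    """Base class: wraps the raw string value of a wordlist cell."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value

    def __bytes__(self):
        # encode (not decode) the string form to get bytes
        return str(self).encode('utf8')

class StringCell(BaseCell):
    pass

class ListCell(BaseCell):
    def __init__(self, value):
        # split on commas, dropping empty items
        self.value = [_.strip() for _ in value.split(",") if len(_.strip())]

    def __str__(self):
        # value is a list here, so join it back into a string
        return ", ".join(self.value)

    __repr__ = __str__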

This would then allow people to subclass and make their own cell types (and an IntegerCell is easy to define).

Adding __lt__ etc. would make it sortable (and a check for integer-icity could happen in there if wanted). Sorting a wordlist is then just a matter of using sorted().
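A sketch of how such a comparison could look, using functools.total_ordering to derive the remaining comparison methods. The class `SortableCell` and its rule that integer-like values sort numerically before other strings are assumptions made for illustration.

```python
from functools import total_ordering

@total_ordering
class SortableCell:
    """Hypothetical cell type; not part of lingpy."""

    def __init__(self, value):
        self.value = value

    def _key(self):
        # Integer-like values sort numerically and ahead of other strings.
        if self.value.isdecimal():
            return (0, int(self.value), "")
        return (1, 0, self.value)

    def __eq__(self, other):
        if not isinstance(other, SortableCell):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, SortableCell):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())

cells = [SortableCell(v) for v in ["10", "9", "b", "a", "2"]]
ordered = [c.value for c in sorted(cells)]
```

With plain string comparison "10" would sort before "9"; the integer check inside the key avoids that.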

Alternatively, the data could be analysed row-wise e.g.:

DEFAULT_CELLS = {
    'ID': StringCell,
    'CONCEPT': StringCell,
    'COUNTERPART': StringCell,
    'IPA': StringCell,
    'DOCULECT': StringCell,
    'COGID': StringCell,
}

NECESSARY_CELLS = ['ID', 'CONCEPT', 'COUNTERPART']

class Record(object):

    def __init__(self, **values):
        for k, v in values.items():
            # find cell type and instantiate. Default to StringCell if we don't know
            cell = DEFAULT_CELLS.get(k, StringCell)(v)
            setattr(self, k, cell)

        for k in NECESSARY_CELLS:
            assert getattr(self, k, None) is not None, 'Record needs a %s' % k

LinguList commented 8 years ago

This looks very convincing with the basic cell hierarchy. I wonder if this might slow down the processing, as @xrotwang has now confirmed that defaultdicts really slow down processing compared to plain lists (which is the reason why I originally used the strange dictionary-with-lists structure for lingpy). But if this is only used for initialization and updating, it might be the approach of choice. And I think if we agree on either the two (strings + list of strings) or the four (strings + list of strings, ints + list of ints) basic types, maybe also with respect to cldf, it should already be a great improvement over lingpy's current flexibility, which allows one to define everything and nothing.

SimonGreenhill commented 8 years ago

premature optimisation is the root of all evil :)

Sure, the hierarchy here could be overkill. Perhaps losing the BaseCell/StringCell/ListCell distinction and just having a single object 'Cell' with logic in it could be enough, or just using python base objects e.g. string/list?

Thinking more about it though, this would be a natural place to have transcoders and validators, e.g.:


class CLPACell(StringCell):
    def __init__(self, value):
        self._is_valid(value)  # raises CLPAValidationError on invalid SAMPA input
        self.value = self.to_clpa(value)

The costly steps are at the initialisation stage (anything else expensive can be cached or memoized). In terms of speed, wordlist and all sound class converters etc just need to know how to get the value from a cell (mycell.value).
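To make the "validate and convert once, then read .value" pattern concrete, here is a toy sketch. The symbol inventory `ALLOWED`, the `normalize` function and the `CellValidationError` class are all invented for this example and have nothing to do with the actual CLPA rules or the pyclpa package; functools.lru_cache stands in for the memoization mentioned above.

```python
from functools import lru_cache

# Hypothetical toy inventory; real validation would come from elsewhere.
ALLOWED = set("abdehilmnortuvwæɔɛg ")

class CellValidationError(ValueError):
    """Raised when a cell contains symbols outside the toy inventory."""

@lru_cache(maxsize=None)
def normalize(value):
    """Hypothetical transcoder: collapse runs of whitespace (memoized)."""
    return " ".join(value.split())

class ValidatedCell:
    def __init__(self, value):
        # Validation and conversion happen once, at initialisation.
        bad = sorted(set(value) - ALLOWED)
        if bad:
            raise CellValidationError("unexpected symbols: %s" % "".join(bad))
        self.value = normalize(value)

cell = ValidatedCell("h æ  n d")
```

Downstream code only ever touches `cell.value`, so the cost of validation is not paid again during alignment or sound-class conversion.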

LinguList commented 8 years ago

Seems convincing to me. In the beginning, I was also thinking of having a "sequence" class, which I discarded at some point, but validators are surely needed, also for the CLPA enterprise. Currently, validation of normal sound classes in lingpy is also not carried out in an explicit and transparent way, so having one loose lingpy evaluation and one stricter CLPA evaluation (the latter as an external function from pyclpa) would definitely be useful.

xrotwang commented 8 years ago

If we would really want to follow this path, i.e. build functionality into row objects or even cell objects, I'd use the attrs library. This would give us lightweight objects with validation, conversion and representation as dicts. But I'm not sure people would want to customize wordlists on the cell level, rather than provide completely new wordlist implementations.

LinguList commented 8 years ago

So, the most important aspect is the two (str + list) or four (str + list, int + list) types. These are essential; without them, it won't work. Users need to be able to specify themselves which columns belong to which type when they load a wordlist, potentially assuming standard datatypes for column names such as "segments", "doculect", "concept", "cognate_set", etc.

xrotwang commented 8 years ago

I'm a bit confused regarding what kind of flexibility is required here. From the point of view of LingPy I guess the type of data in a cell is specified by the way the cell is used; or stating it the other way round: If a user calls a LingPy method specifying column tokens as the one to look for segmented data, then LingPy will run value.split() to get the tokens, right?

So the way I see it, we should not attach any behaviour to assumed standard column names (I think the current practice of passing the column name for specific kinds of data into functions is good (and default values for these names - like segments='segments' - are ok, too)). And we should also not bloat the wordlist spec by adding type information. What we should have is a single place where the conversion functions for the four types are defined, but it should be the responsibility of each method operating on wordlists to call these conversion functions.
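This proposal could be sketched roughly as follows: one registry holding the conversion functions for the four types, and a method that takes the column name as an argument and calls the converter itself. The registry keys, the function `segment_counts` and the dict-based row layout are hypothetical names chosen for this example.

```python
# Single place where the four conversions are defined (hypothetical names).
CONVERTERS = {
    "str": str,
    "strs": lambda value: value.split(),
    "int": int,
    "ints": lambda value: [int(x) for x in value.split()],
}

def segment_counts(rows, segments="segments"):
    """Count segments per row.

    Expects the `segments` column to hold space-separated strings.
    """
    to_list = CONVERTERS["strs"]
    return [len(to_list(row[segments])) for row in rows]

# The wordlist itself stores plain strings and carries no type information.
rows = [{"segments": "h a n t"}, {"segments": "r u k a"}, {"segments": "l ɛ g"}]
counts = segment_counts(rows)
```

The wordlist spec stays free of type declarations; the knowledge about types lives in the converters and in the methods that use them.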

LinguList commented 8 years ago

Okay, but we have one problem here with the diversity of column names due to different datatypes. For cognate detection, we create cognate sets, be it the traditional "cogid" or the partial identifiers. This defaults to:

cogid: user-defined
scaid: sca-algorithm
lexstatid: lexstat algorithm
editid: edit-distance algorithm
turchinid: turchin algorithm

But this is not the only use case, as we need to allow the column name to be specified here: we may want to (and we do, and have done so) run several analyses with different thresholds, in which these "references" (keyword "ref" in Alignments and LexStat) are usually flexibly modified, to allow for n different columns filled with alternative cognate sets.

Normally, we can live with having strings for these values, but for partial cognates we cannot use strings, and similarly for segment-like data, like

prostrings
tokens (classical segments)
classes
weights (actually: a float-separated value, needed for the weights in lexstat calculations) etc.

I don't really know how best to handle this. One could use a wildcard: all that ends in ID is integer, all that ends in IDS is [int(x) for x in y.split(' ')]. Any other idea how to handle this in a principled way which leaves us enough flexibility, especially to test cognate detection algorithms?
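For reference, the wildcard idea can be sketched in a few lines. The function name `convert_by_suffix` is hypothetical, and the case-insensitive matching is an assumption made here so that both "COGID" and "lexstatid" are covered.

```python
def convert_by_suffix(column, value):
    """Pick a conversion from the column name's suffix (hypothetical helper)."""
    name = column.upper()
    # Check the longer suffix first, since "IDS" does not end in "ID".
    if name.endswith("IDS"):
        return [int(x) for x in value.split(" ")]
    if name.endswith("ID"):
        return int(value)
    return value

single = convert_by_suffix("lexstatid", "7")
partial = convert_by_suffix("cogids", "1 2 3")
other = convert_by_suffix("doculect", "German")
```

One consequence of such a rule is that the row identifier column "ID" is converted to an integer as well, and any unrelated column whose name happens to end in "id" would be caught too.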

xrotwang commented 8 years ago

I don't follow. From my point of view it is solely the caller's responsibility to make sure the data in the wordlist cells matches the requirements of the algorithm it is passed to. So if I want to run an algorithm that expects a cell of list type, I have to make sure to pass a suitable column name. Basically, I think this is simply a documentation issue, i.e. all methods operating on wordlists must make it explicit which type of data they expect in which cells.

LinguList commented 8 years ago

Ah, I see, this makes much more sense: so this would mean that the data is, e.g., in ANY format in the wordlist, but if you call it, constructing, for example, a partial etymology dictionary, the caller will assume that this is some specific list format, right?

I never looked at it from this side, but it makes the whole thing much, much easier, I think...

xrotwang commented 8 years ago

The callee, i.e. the called method makes the assumption, the caller makes the promise - but yes, I think we are now on the same page.
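A small sketch of this contract, using the partial etymology dictionary mentioned above as the example. The function `partial_etymdict`, its `ref` default and the row layout are assumptions for illustration; the docstring states what the callee assumes, and a broken promise by the caller surfaces as an error.

```python
def partial_etymdict(rows, ref="cogids"):
    """Map each partial cognate ID to the IDs of the rows containing it.

    Assumption made by this function (the callee): every row[ref] is a
    string of space-separated integers. The caller promises to pass a
    column name for which this holds.
    """
    etym = {}
    for row in rows:
        try:
            ids = [int(x) for x in row[ref].split()]
        except ValueError as err:
            raise ValueError(
                "column %r of row %s is not a list of integers" % (ref, row["ID"])
            ) from err
        for cogid in ids:
            etym.setdefault(cogid, []).append(row["ID"])
    return etym

rows = [
    {"ID": "1", "cogids": "1 2", "doculect": "German"},
    {"ID": "2", "cogids": "1 3", "doculect": "English"},
]
etym = partial_etymdict(rows)
```

The wordlist stores the cells in whatever form it likes; the conversion happens only inside the function that needs list-typed data.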

LinguList commented 8 years ago

Yes, callee, I tend to confuse these things. Makes complete sense, and will probably incredibly clean up the current mess ;-)

lingpy / lingpy3

Wordlist Specification #2