lingpy / lingpy3

LingPy 3.0
GNU General Public License v3.0

Test methods for correct data structure #14

Open. LinguList opened this issue 7 years ago

LinguList commented 7 years ago

We have this in rudimentary form in lexstat in lingpy2, but it should be more principled, as I often run into errors in other approaches. For example, when using partial cognate annotations, the number of morphemes needs to be the same as the number of cognate ids in a given row.

I'd suggest that each major class, be it Alignments, LexStat, or Partial, defines its own explicit routine for checking the input. In Partial, this would be the above-mentioned problem of partial cognates and morphemes. In LexStat, we would also check coverage, how well segments have been recognized, etc.
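A minimal sketch of such a check for Partial, assuming the common convention that '+' marks morpheme boundaries in the segments and that partial cognate ids come as one list per row (the function name and data layout are illustrative, not lingpy API):

def check_partial_cognates(rows):
    """Check that each row has as many partial cognate ids as morphemes.

    rows: iterable of (row_id, tokens, cogids) triples, where tokens is a
    list of segments with '+' marking morpheme boundaries, and cogids is a
    list of partial cognate ids.
    """
    errors = []
    for row_id, tokens, cogids in rows:
        # One morpheme more than there are '+' boundary markers.
        n_morphemes = tokens.count('+') + 1
        if n_morphemes != len(cogids):
            errors.append((row_id, n_morphemes, len(cogids)))
    if errors:
        raise ValueError('morpheme/cognate-id mismatch in rows: {0}'.format(errors))

# Two morphemes ('t a' and 'k o') but only one cognate id: this raises ValueError.
check_partial_cognates([(1, ['t', 'a', '+', 'k', 'o'], [5])])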

xrotwang commented 7 years ago

From my point of view, LexStat isn't really an object, but rather an operation on a Wordlist. For the IOperation interface I was already thinking of a validate method. So you'd get the operation, validate the input, then call the operation, all in a uniform way across objects and operations.

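One possible shape for this, sketched with an abstract base class; only the name IOperation comes from this thread, the method names are assumptions:

from abc import ABC, abstractmethod

class IOperation(ABC):
    """Sketch of the proposed interface; method names are assumed."""

    @abstractmethod
    def validate(self, wordlist):
        """Raise an error if wordlist is not valid input for this operation."""

    @abstractmethod
    def __call__(self, wordlist, **kw):
        """Run the operation on (already validated) input."""

def run(operation, wordlist, **kw):
    # The uniform flow: get the operation, validate the input, then call it.
    operation.validate(wordlist)
    return operation(wordlist, **kw)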

LinguList commented 7 years ago

Makes more sense. Also with respect to Partial: Partial re-uses two methods of LexStat and is a child of LexStat only because it was more convenient not to re-program them. I think those inheritance tutorials on object-oriented programming corrupted my thinking.

That way, the whole LexStat procedure will also become more transparent, I hope, and it will probably be much easier to expand on it.

The only thing we need to think about is the rather large amount of underlying data that is produced, including scoring functions, sequence representations, etc., which is currently conveniently stored in the LexStat class attributes and does not need to be passed around with each operation. I'd also insist that we be able to serialize all of this data, as it guarantees replicability and re-usability of code.

xrotwang commented 7 years ago

Regarding convenience: A middle-ground between re-computing all the time and storing in obscure class attributes could be using the cache more aggressively, e.g. like this:

After computing anything, pickle the object and put it in the cache, using the md5 sum as key. Re-computation could get a keyword argument md5, which would be used to look things up in the cache first. Then you wouldn't need to re-compute all the time, nor pass around the objects. You'd have to keep track of and pass around the md5 sums, though.
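A rough sketch of that mechanism, with an in-memory dict standing in for the cache (the names cache_add and compute are made up; only the md5 keyword argument is from the description above):

import hashlib
import pickle

_CACHE = {}  # in-memory stand-in for a persistent cache

def cache_add(obj):
    """Pickle obj and store it under the md5 sum of the pickle."""
    blob = pickle.dumps(obj)
    key = hashlib.md5(blob).hexdigest()
    _CACHE[key] = blob
    return key

def compute(expensive_func, *args, md5=None, **kw):
    """Try the cache first if an md5 sum is given, else re-compute."""
    if md5 is not None and md5 in _CACHE:
        return pickle.loads(_CACHE[md5])
    result = expensive_func(*args, **kw)
    cache_add(result)  # the caller would keep the returned md5 sum
    return result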

xrotwang commented 7 years ago

The above could be put into the implementations of the IOperation and IResult base classes and thus be pretty transparent for the user. Since knowing the md5 sum also means knowing the object, i.e. you can't get a stale result because it would have a different checksum, this procedure doesn't introduce the hard-to-debug behaviour we have now when retrieving stale cache results.

xrotwang commented 7 years ago

So here's my vision for the basic operation of lingpy3: All a user needs to know is lingpy3.io, lingpy3.ops and lingpy3.interfaces. Then an explorative session could look as follows:

>>> from lingpy3 import io, ops
>>> from lingpy3.interfaces import IWordlist
>>> from clldutils.path import Path
>>> p = Path('../lingpy/test2.tsv')
>>> for name, doc in io.list_readers_doc(p, IWordlist):
...   print(name, doc)
... 
csv 
    Read a CSV (or TSV) file into a Wordlist object.

    :param path:
    :param kw:
    :return:

>>> wl = io.read(p, IWordlist, 'csv', delimiter='\t')
>>> for name, doc in ops.list_ops_doc(wl):
...   print(name, doc)
... 
distances 
    Compute a distance matrix from a wordlist.

    :param ref:
    :param refB:
    :param mode:
    :param ignore_missing:
    :param kw:
    :return:

>>> dists = ops.run(wl, 'distances')
>>> print(dists)
[[0, 1.0, 0.6666666666666667, 1.0, 0.6666666666666667], [1.0, 0, 0.6666666666666667, 0.0, 0.6666666666666667], [0.6666666666666667, 0.6666666666666667, 0, 0.6666666666666667, 0.0], [1.0, 0.0, 0.6666666666666667, 0, 0.6666666666666667], [0.6666666666666667, 0.6666666666666667, 0.0, 0.6666666666666667, 0]]

Now the user would play around a bit, tuning the parameters for operations, and once happy, would cache the results:

>>> from lingpy3 import cache
>>> print(cache.add(dists))
'this-would-output-the-md5-checksum'

Back in the lingpy3 script, the user could then replace the ops.run line with:

>>> dists = ops.run(wl, 'distances', _checksum='this-would-output-the-md5-checksum')

and thus make sure that the next time the script is run in the same environment, the actual computation won't have to happen.

If the script is run in a different environment, no results would be found in the cache, thus all computations would be run.
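A toy end-to-end version of this workflow, with an in-memory dict as the cache and a made-up operation registry (not the actual lingpy3 modules):

import hashlib
import pickle

_CACHE = {}
_OPS = {'distances': lambda wl, **kw: [[0, 1.0], [1.0, 0]]}  # toy operation

def cache_add(result):
    key = hashlib.md5(pickle.dumps(result)).hexdigest()
    _CACHE[key] = result
    return key

def run(wl, opname, _checksum=None, **kw):
    # Same environment: the checksum hits the cache, computation is skipped.
    if _checksum is not None and _checksum in _CACHE:
        return _CACHE[_checksum]
    # Different environment: nothing in the cache, so everything is re-run.
    return _OPS[opname](wl, **kw)

dists = run(None, 'distances')
checksum = cache_add(dists)
assert run(None, 'distances', _checksum=checksum) is dists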

LinguList commented 7 years ago

Yep, this should work nicely. For publication purposes, or for interactive sessions on different computers, etc., the user would of course also have the possibility to write things to file, right? This was important in some earlier publications, where I would share the scoring functions obtained from randomization. In a proper LexStat analysis, for example, at least the scoring dictionaries, the frequencies, and, possibly, trees will be created (usually automatically, but they COULD also be passed manually).

I guess serializing is straightforward in most cases: the matrix format for scoring dictionaries is already in the SoundClass models, JSON for the frequencies, and, if trees are used, Newick for the trees. But serialization (maybe zipping each LexStat analysis, if the user wishes an output) would make sure that we can replicate across machines, etc., and also look into the actual data without running a Python script.

In scikit-learn, they don't offer serialization for SVMs. This is extremely annoying, since users often spend a week creating them on their cluster. Our LingPy analyses will be harmless and easy to run one more time, but I would still prefer to have all the crucial things in, e.g., a zip file, now that the wordlist format only allows for TSV.

But it won't be a real problem to allow for zipped I/O, right?
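For illustration, zipped output along these lines needs nothing beyond the standard library; a sketch (the function signature and member names are made up):

import json
import zipfile

def write_analysis(path, wordlist_tsv, scorer_matrix, frequencies, tree_newick=None):
    """Bundle the artifacts of one analysis into a single zip archive."""
    with zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('wordlist.tsv', wordlist_tsv)    # the TSV wordlist
        zf.writestr('scorer.matrix', scorer_matrix)  # scoring dictionaries
        zf.writestr('frequencies.json', json.dumps(frequencies))
        if tree_newick:                              # optional guide tree
            zf.writestr('tree.nwk', tree_newick)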

xrotwang commented 7 years ago

Ok, so when you say "serialization" you also mean "de-serialization", right? I guess JSON should be good enough for most of our needs. So I'd propose a simple protocol: for an object to be serializable, it must implement a way to write itself to JSON and to read itself back from it.

So for a Wordlist instance, this JSON could just be:

{
    "header": [...],
    "rows": [...],
    "kwargs": {...}
}

If we go for this approach, I'd do away with pickling for the cache entirely (or only provide it as a fallback for non-JSON-serializable stuff) and instead put JSON in the cache.
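A possible reading of this protocol, with method names like __json__ and from_json assumed for illustration, round-tripping an object through the JSON structure shown above:

import json

class Wordlist(object):
    def __init__(self, header, rows, **kwargs):
        self.header, self.rows, self.kwargs = header, list(rows), kwargs

    def __json__(self):
        # Everything needed to reconstruct the instance.
        return {'header': self.header, 'rows': self.rows, 'kwargs': self.kwargs}

    @classmethod
    def from_json(cls, data):
        return cls(data['header'], data['rows'], **data['kwargs'])

wl = Wordlist(['id', 'concept'], [[1, 'hand']], delimiter='\t')
roundtripped = Wordlist.from_json(json.loads(json.dumps(wl.__json__())))
assert roundtripped.rows == wl.rows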

xrotwang commented 7 years ago

Actually, considering the undefined order of dict keys, I'd propose to always use lists of pairs instead of dicts for serialization. So this would become:

[
    ["header", [...]],
    ["rows", [...]],
    ["kwargs", [["id_", "..."], ["concept", "..."]]]
]

If reading and writing of these files is centralized, we can still combine this with the checksum approach.

LinguList commented 7 years ago

Yep, that makes sense, and there probably won't be any instance where JSON would not do.

xrotwang commented 7 years ago

Scrap the "list of pairs" idea, though. It would make it impossible to read the JSON back generically, because we cannot simply always read lists of pairs as dicts. So instead, we should probably sort any ordinary dict for serialization and serialize it as an OrderedDict.

LinguList commented 7 years ago

Is that possible from within the json API? It should be, I guess, as JSON is a linear format...

xrotwang commented 7 years ago

Yes, one can simply pass sort_keys=True to json.dump.
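For illustration, combining this with the checksum idea: sorted keys make the serialized text, and hence its md5 sum, deterministic:

import hashlib
import json

obj = {'rows': [[1, 'hand']], 'header': ['id', 'concept']}
# Sorted keys make the serialized text independent of dict insertion order ...
text = json.dumps(obj, sort_keys=True)
print(text)  # {"header": ["id", "concept"], "rows": [[1, "hand"]]}
# ... and hence the checksum stable across runs and machines.
print(hashlib.md5(text.encode('utf8')).hexdigest())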