lingpy / lingpy3

LingPy 3.0
GNU General Public License v3.0

basic wordlist statistics: external functions or built into the main class? #11

Open LinguList opened 7 years ago

LinguList commented 7 years ago

There are a couple of interesting metrics we want to have at hand when dealing with wordlists.

Should we create some extra class or a script that offers these metrics and can be applied to any wordlist object, or should we build them into the wordlist base class itself? Note also that we will always need to define both a normal and a partial version of each metric, as partial cognate sets are becoming increasingly available.
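To illustrate the normal-vs-partial distinction, here is a minimal sketch; the representation of partial cognate judgements as tuples of morpheme-level IDs is an assumption for illustration, not lingpy3 API:

```python
# Hypothetical data: a "full" cognacy judgement assigns one cognate-set
# ID per word, a "partial" one assigns an ID per morpheme (assumed here
# to be stored as a tuple of IDs).
full_codes = {"hand_de": "1", "hand_en": "1", "hand_fr": "2"}
partial_codes = {"sun_de": ("5", "12"), "sun_en": ("5",), "sun_fr": ("9",)}

def num_cognate_sets(codes):
    """Count distinct cognate sets, flattening partial (tuple) codes."""
    sets = set()
    for value in codes.values():
        if isinstance(value, tuple):  # partial: one ID per morpheme
            sets.update(value)
        else:                         # full: one ID per word
            sets.add(value)
    return len(sets)
```

Any metric defined on full cognate codes would need an analogous branch (or a second implementation) for the partial case.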

LinguList commented 7 years ago

I suppose we use the Wordlist Metrics project to discuss the metrics we want in more detail. Ideally, each metric should have a concise description in the documentation, or we could provide a full summary of the metrics.

xrotwang commented 7 years ago

My current thinking - also with regard to other calculations on wordlists, like distances - is to do this with adapters again :) I.e. there would be adapters which adapt IWordlist to the interface IOperation:

class IOperation(Interface):
    def __call__(self, *args, **kw):
        """Apply the operation to the adapted object and return the result."""

One of the advantages would again be the central registry aspect: It would be easy to enumerate all operations available on a lingpy3 object; but also the pluggability aspect: Any operation could easily be swapped with a different implementation.
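A minimal sketch of what such a central registry could look like; the names `REGISTRY`, `register`, and `iter_operations` below are illustrative assumptions, not actual lingpy3 API:

```python
# Hypothetical central registry of operations, keyed by the type of
# object they adapt and the operation name.
REGISTRY = {}

def register(adapted_type, name, operation):
    """Register (or swap in) an operation for objects of a given type."""
    REGISTRY[(adapted_type, name)] = operation

def iter_operations(obj):
    """Enumerate all operations registered for obj's type."""
    for (adapted_type, name), op in sorted(REGISTRY.items(),
                                           key=lambda item: item[0][1]):
        if isinstance(obj, adapted_type):
            yield name, op

class Wordlist:
    """Stand-in for an object providing IWordlist."""

register(Wordlist, 'distances', lambda wl: 'distance matrix')
register(Wordlist, 'density', lambda wl: 0.5)
```

Because the registry is central, `iter_operations(wl)` enumerates everything that is available for a given object, and calling `register` again under the same name swaps in a different implementation.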

LinguList commented 7 years ago

Sounds convincing. A further advantage is that not all wordlist objects may be amenable to all of those calculations, so the Wordlist would be kept lightweight and controllable, and we can put more effort into a sound rationale for the metrics. This is, by the way, important, as there are not many metrics which have been standardized for wordlists, but many of them may have important implications for linguistic reconstruction, such as synonymity (a current problem of IELex vs. COBL), coverage (a general problem with Hittite), but also colexification.

xrotwang commented 7 years ago

Yes, a lightweight wordlist is one of the goals. So I would never store any results of operations within the wordlist object - if the user wants to store intermediate results across sessions, the cache is there! So I'd imagine a more exploratory workflow of the form:

>>> wl = read(p, IWordlist, 'csv')
>>> for name, op in lingpy3.ops.iter_operations(wl):
...    print(name, op.__doc__)
>>> op = lingpy3.ops.get_operation(wl, 'distances')
>>> dists = op(...)
>>> for name, op in lingpy3.ops.iter_operations(dists):
...    print(name, op.__doc__)
xrotwang commented 7 years ago

I.e. each operation returns an object, which you can use to look up the next batch of operations (or any registered writers) for.
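That chaining pattern might look roughly like this; the `Distances` class, the dummy operations, and the `OPS` lookup table are assumptions for illustration only:

```python
# Hypothetical sketch: each operation returns a new object, which in
# turn serves as the key for looking up the next batch of operations.
class Distances:
    """Stand-in result object holding a language-by-language matrix."""
    def __init__(self, matrix):
        self.matrix = matrix

def distances_op(wordlist):
    # Dummy computation; a real version would compare the languages.
    return Distances([[0.0, 0.5], [0.5, 0.0]])

def neighbor_joining_op(dists):
    # Dummy tree-building step operating on a Distances object.
    return "(A,B);"

# Operations keyed by the type of object they apply to (assumed layout).
OPS = {
    dict: {"distances": distances_op},
    Distances: {"neighbor_joining": neighbor_joining_op},
}

def iter_operations(obj):
    """Look up the operations available for obj's type."""
    return OPS.get(type(obj), {}).items()
```

Here a plain `dict` stands in for the wordlist: running `distances` on it yields a `Distances` object, for which `iter_operations` then offers `neighbor_joining`.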

LinguList commented 7 years ago

Okay, so while it seems straightforward to compute, for example, the cognate density or diversity, I suppose it won't do us any harm to think carefully about those different operations. The distances, for example, are a classical language-to-language comparison and return a distance matrix, which can then be written to Nexus or PHYLIP, or used to calculate a neighbor-joining tree. The metrics, like the cognate density, are intuitively quite different, as they are supposed to characterize the data rather than interpret it, and they usually return only a score. I suppose we should collect the metrics we want using the project on Wordlist Metrics (I'll have quite a lot to say on this and will also try to point out in each case what lingpy2 is doing). Implementing this should be rather straightforward, especially with the new Wordlist class, which has really convinced me when testing it. The documentation should have one chapter on wordlist metrics (I just assigned this to myself in issue #12).
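For instance, a score-style metric could be as simple as the following toy cognate density; the formula, scaling between 0 (every word its own cognate set) and 1 (all words in one set), is just one plausible definition, not necessarily the one lingpy2 uses:

```python
def cognate_density(cognate_codes):
    """Toy density score over a list of cognate-set IDs (assumed
    definition): 1.0 if all words share one cognate set, 0.0 if every
    word has its own set."""
    words = len(cognate_codes)
    sets = len(set(cognate_codes))
    if words < 2:
        return 0.0
    return 1.0 - (sets - 1) / (words - 1)
```

Unlike the distance operations, this returns a single score characterizing the data, which is exactly the distinction drawn above.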