glottolog / glottolog-legacy

DEPRECATED. See https://github.com/clld/glottolog

Cookbook entry showing how to extract newick trees from glottolog #106

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

Having reflected, and given that TreeMaker is an easy way to create trees from hierarchies, the easiest way with the API would be to just create the input format that TreeMaker requires. This could be done as follows, assuming a list of glottocodes named "mycodes" for convenience:

>>> from pyglottolog.api import Glottolog
>>> gll = Glottolog()
>>> with open('languages.txt', 'w') as f:
...     for code in mycodes:
...         f.write(code + '\t' + ', '.join([a.glottocode for a in gll.languoid(code).ancestors]) + '\n')

Then, once the file is created, this can be easily converted with @simongreenhill's TreeMaker tool (linked above):

$ pip install TreeMaker
$ treemaker languages.txt > languages.nwk

The output is the newick file languages.nwk.
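
To make the intermediate file format concrete, here is a minimal sketch of the lines that the snippet above writes: one taxon per line, then a tab, then its ancestors joined by ", ". The taxon and group names below are hypothetical placeholders, not real glottocodes, and the helper format_line is introduced only for illustration.

```python
# Hypothetical placeholder data: each taxon maps to its ancestors,
# listed from the top-level family downwards. In practice these lists
# would come from Glottolog rather than being typed by hand.
lineages = {
    "lang_a": ["family", "branch_x"],
    "lang_b": ["family", "branch_x"],
    "lang_c": ["family", "branch_y"],
}


def format_line(taxon, ancestors):
    """Return one input line: taxon, a tab, then comma-separated ancestors."""
    return taxon + "\t" + ", ".join(ancestors)


# Build the lines in a stable (sorted) order and show them.
lines = [format_line(taxon, lineages[taxon]) for taxon in sorted(lineages)]
print("\n".join(lines))
```

Writing these lines to languages.txt (each followed by a newline) would give a file of the same shape as the one produced with the Glottolog API above.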

Given that many potential users are still not really aware of the power of the glottolog api, it seems like a good idea to start a little cookbook in the github repo, where things like this example (and others) are discussed and illustrated.

xrotwang commented 7 years ago

Yeah, it comes down to a documentation problem. E.g. this repository isn't the correct one :) So, this issue should be an issue in https://github.com/clld/glottolog and given how short the required code is, it may actually be appropriate in the FAQ.

xrotwang commented 7 years ago

Actually, in a Python script it would probably make more sense to use TreeMaker programmatically, i.e. construct a tree by calling TreeMaker.add and then write it to file.

LinguList commented 7 years ago

Yes, I agree, I was just too lazy to read how treemaker actually works...

xrotwang commented 7 years ago

What I came up with now is the following script:

from __future__ import print_function

from pyglottolog.api import Glottolog
from treemaker import TreeMaker
from newick import loads

def tree(*taxa):
    # We create a dict to look up Glottolog languoids by name, ISO code or Glottocode.
    langs = {}
    for lang in Glottolog().languoids():
        if lang.iso:
            langs[lang.iso] = lang
        langs[lang.name] = lang
        langs[lang.id] = lang

    t = TreeMaker()
    for taxon in taxa:
        if taxon not in langs:
            print('unknown taxon: {0}'.format(taxon))
            continue
        t.add(taxon, ', '.join(l[1] for l in langs[taxon].lineage))
    return t

if __name__ == '__main__':
    import sys
    print(loads(tree(*sys.argv[1:]).write())[0].ascii_art())

which works as follows:

$ python tree.py deu eng Welsh Pali scot1243
           ┌─Welsh
           │          ┌─deu
           ├──────────┤
───────────┤          │          ┌─eng
           │          └──────────┤
           │                     └─scot1243
           └─Pali

This is what we want, right?
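
For readers who want to see the underlying idea without installing anything, the following is a rough pure-Python sketch of how taxa with classification paths can be turned into a Newick string. It is not TreeMaker's implementation. The functions build_tree, render and to_newick are hypothetical helpers, the classification paths are hand-written and heavily simplified rather than Glottolog's actual lineages, and the sketch assumes every taxon is a leaf. Groups with a single child are collapsed, which matches the shape of the tree shown above.

```python
def build_tree(classifications):
    """Nest each taxon under its classification path (top-level group first)."""
    root = {}
    for taxon, path in classifications.items():
        node = root
        for group in path:
            node = node.setdefault(group, {})
        node.setdefault(taxon, {})  # leaves are represented by empty dicts
    return root


def render(name, children):
    """Render one node; a group with a single child collapses into that child."""
    if not children:
        return name  # leaf: just its label
    rendered = [render(n, c) for n, c in sorted(children.items())]
    if len(rendered) == 1:
        return rendered[0]
    return "(" + ",".join(rendered) + ")"


def to_newick(root):
    """Serialize the nested dicts built by build_tree as a Newick string."""
    if not root:
        raise ValueError("at least one taxon is required")
    return render(None, root) + ";"


# Simplified, hand-written paths for the taxa used in the thread.
classifications = {
    "deu": ["ie", "germanic"],
    "eng": ["ie", "germanic", "anglic"],
    "scot1243": ["ie", "germanic", "anglic"],
    "Welsh": ["ie", "celtic"],
    "Pali": ["ie", "indo-iranian"],
}

newick = to_newick(build_tree(classifications))
print(newick)
```

Children are sorted alphabetically here only to make the output deterministic; the branch order in a real TreeMaker tree may differ.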

xrotwang commented 7 years ago

Done: https://github.com/clld/glottolog/tree/master/cookbook/treemaker