glottolog / glottolog-cldf

Glottolog data as CLDF StructureDataset
https://glottolog.org
Creative Commons Attribution 4.0 International

Add homeland coords for language groups ... #14

Closed xrotwang closed 1 year ago

xrotwang commented 1 year ago

... computed as "minimal distance" as described in "Testing methods of linguistic homeland detection using synthetic data" https://doi.org/10.1098/rstb.2020.0202

xrotwang commented 1 year ago

@d97hah This might be a useful addition as far as

  • it adds geocoords for each family languoid
  • it does so in a cheap yet principled way

Thoughts?

xrotwang commented 1 year ago

Implementation:

"""
Testing methods of linguistic homeland detection using synthetic data
Søren Wichmann and Taraka Rama
https://doi.org/10.1098/rstb.2020.0202
"""
import random
import collections

import pyproj
from pycldf import Dataset

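def geodist(geod, p1, p2):
    """Geodesic distance in metres between two (lat, lon) points."""
    # Geod.inv expects lon/lat order and returns (fwd azimuth, back azimuth, distance).
    return geod.inv(p1[1], p1[0], p2[1], p2[0])[2]

def md(g, coords):
    """Return the coordinate with the minimal summed distance to all others."""
    if len(coords) == 1:
        return coords[0]
    if len(coords) == 2:
        # Both points have the same summed distance, so pick one at random.
        return random.choice(coords)
    # Shuffle (in place) so that ties are not always resolved in input order.
    random.shuffle(coords)
    mindist, mincoord = None, None

    for i, coord in enumerate(coords):
        dist = sum(geodist(g, coord, p) for j, p in enumerate(coords) if i != j)
        if (mindist is None) or (dist < mindist):
            mindist, mincoord = dist, coord
    return mincoord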

if __name__ == '__main__':
    cldf = Dataset.from_metadata('cldf/cldf-metadata.json')
    subgroups = collections.defaultdict(list)
    vals = [v for v in cldf['ValueTable'] if v['Parameter_ID'] in ['level', 'classification']]

    # Find language-level languoids with coordinates:
    languages = {
        v['Language_ID'] for v in vals
        if v['Parameter_ID'] == 'level' and v['Code_ID'] == 'level-language'}
    languages_with_coords = {
        l['ID']: (l['Latitude'], l['Longitude'])
        for l in cldf['LanguageTable'] if l['ID'] in languages and l['Latitude'] is not None}
    names = {l['ID']: l['Name'] for l in cldf['LanguageTable']}

    # Collect sets of languages per language family/subgroup:
    for v in vals:
        if (v['Language_ID'] in languages_with_coords) and (v['Parameter_ID'] == 'classification'):
            clf = v['Value'].split('/')
            for i, _ in enumerate(clf, start=1):
                subgroups[' '.join(clf[:i])].append(languages_with_coords[v['Language_ID']])

    # Compute minimal distances per group:
    g = pyproj.Geod(ellps='WGS84')

    def label(group):
        # Comma-separated languoid names for a space-separated glottocode path.
        return ','.join(names[gc] for gc in group.split())
    for group, coords in sorted(subgroups.items(), key=lambda i: label(i[0])):
        lat, lon = md(g, coords)
        print('{}\t{}\t{}\t{}'.format(label(group), lat, lon, len(coords)))
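
To make the selection step above concrete, here is a minimal, self-contained sketch on toy data. It assumes a spherical haversine distance (`haversine_km`, a hypothetical helper) in place of pyproj's WGS84 geodesic; the names `minimal_distance` and `toy` are illustrative only, and this is not the script used for the dataset.

```python
import math

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points on a sphere."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p1[0], p1[1], p2[0], p2[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def minimal_distance(coords):
    """Return the coordinate with the smallest summed distance to all others."""
    # The distance of a point to itself is 0, so including it does not change the sum.
    return min(coords, key=lambda c: sum(haversine_km(c, p) for p in coords))

# Hypothetical toy group on the equator: four nearby points and one outlier.
toy = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 20.0)]
homeland = minimal_distance(toy)
print(homeland)  # (0.0, 2.0)
```

The outlier at longitude 20 barely moves the result, which stays at one of the attested locations. Unlike `md` above, this sketch resolves ties deterministically (the first candidate wins) instead of shuffling.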

d97hah commented 1 year ago

The classic principle of max diversity seems to have these properties but with better principles, i.e., in practice:

all the best, H

On Thu, 13 Oct 2022 at 09:18, Robert Forkel wrote:

@d97hah https://github.com/d97hah This might be a useful addition as far as

  • it adds geocoords for each family languoid
  • it does so in a cheap yet principled way

Thoughts?


xrotwang commented 1 year ago

The "take the nearest point on land" bit seems non-trivial, though. The "minimal distance" approach gets around this in a somewhat crude way - but it comes with a citation to justify the approach :)

xrotwang commented 1 year ago

Arguably, though, having a coordinate for a language group on water isn't completely nonsensical. It comes with the "the ancestors passed through here" connotation.

Anyway, I think it wouldn't be a problem to add more ways to infer homelands later on. (We'd like to get coordinates for proto-languages right now for Steve's bdproto - and the "minimal distance" ones would do the trick :) )

xrotwang commented 1 year ago

I'll try to whip up a "max diversity" algorithm accepting locations on water and plot the results.

xrotwang commented 1 year ago

Found a fairly simple (and amazingly fast) solution to the "nearest point on land" problem. Will send around results soon.

d97hah commented 1 year ago

Great, I have something similar implemented somewhere but you, as usual, were faster! H

On Fri, 14 Oct 2022 at 09:53, Robert Forkel wrote:

Found a fairly simple (and amazingly fast) solution to the "nearest point on land" problem. Will send around results soon.


xrotwang commented 1 year ago

My implementation is basically this:

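    from shapely.geometry import MultiPoint
    from shapely.ops import nearest_points

    def homeland(coords, land):
        # coords are (lat, lon) pairs. Negative longitudes are shifted by 360 so
        # that groups spanning the antimeridian stay contiguous; land is assumed
        # to use the same 0..360 longitude convention.
        cent = MultiPoint([(p[1] + 360 if p[1] < 0 else p[1], p[0]) for p in coords]).convex_hull.centroid
        nearest = nearest_points(land, cent)[0]
        return (nearest.y, nearest.x - 360 if nearest.x > 180 else nearest.x)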

where land is the MultiPolygon from https://geojson-maps.ash.ms/
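
The longitude shift in the snippet above matters for groups that straddle the antimeridian. The following minimal sketch shows the effect, assuming a plain arithmetic mean instead of shapely's convex hull centroid so that it runs without extra dependencies; the function names and the `pacific` sample are hypothetical.

```python
def mean_lat_lon(coords):
    """Arithmetic mean of (lat, lon) points with longitudes shifted to 0..360."""
    shifted = [(lat, lon + 360 if lon < 0 else lon) for lat, lon in coords]
    lat = sum(p[0] for p in shifted) / len(shifted)
    lon = sum(p[1] for p in shifted) / len(shifted)
    # Map the result back to the -180..180 range.
    return (lat, lon - 360 if lon > 180 else lon)

def naive_mean_lat_lon(coords):
    """The same mean without the shift, for comparison."""
    n = len(coords)
    return (sum(p[0] for p in coords) / n, sum(p[1] for p in coords) / n)

# Hypothetical Pacific group with one point on each side of the antimeridian.
pacific = [(-10.0, 175.0), (-10.0, -175.0)]
print(naive_mean_lat_lon(pacific))  # (-10.0, 0.0): on the wrong side of the globe
print(mean_lat_lon(pacific))        # (-10.0, 180.0): between the two points
```

Note that the shift moves the discontinuity from the antimeridian to the prime meridian, so a group with longitudes on both sides of 0° would presumably need separate handling; the thread does not say how that case is treated.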

xrotwang commented 1 year ago

@d97hah to get rid of the weirdest effects of languages like Hunsrik, one could limit the eligible languages for homeland computation to certain combinations of macroareas. Do you think that makes sense? E.g. exclude the Americas whenever there are other macroareas involved?

xrotwang commented 1 year ago

Looking over the families, a simple blacklist of macroareas for Atlantic-Congo, Austronesian and Indo-European would probably be enough, though.
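
A per-family blacklist of this kind could be sketched roughly as follows. The contents of `EXCLUDED_MACROAREAS` are purely hypothetical (the thread does not list the actual exclusions), and the record layout and the `eligible` helper are assumptions made for illustration.

```python
# Hypothetical per-family exclusions; the thread does not specify the actual lists.
EXCLUDED_MACROAREAS = {
    'Indo-European': {'North America', 'South America'},
}

def eligible(languages, family):
    """Drop languages located in a macroarea excluded for the given family."""
    excluded = EXCLUDED_MACROAREAS.get(family, set())
    return [lang for lang in languages if lang['macroarea'] not in excluded]

# Hypothetical records with only the fields needed for the filter.
langs = [
    {'name': 'German', 'macroarea': 'Eurasia'},
    {'name': 'Hunsrik', 'macroarea': 'South America'},
]
print([lang['name'] for lang in eligible(langs, 'Indo-European')])  # ['German']
```

Families without an entry in the mapping pass through unfiltered, so only the few problematic families need to be listed.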

xrotwang commented 1 year ago

@d97hah do you have a citation for this method of computing homelands? It isn't exactly the "diversity" method presented in Wichmann 2015.

xrotwang commented 1 year ago

Wikipedia seems to cite Campbell 2013 for this.

d97hah commented 1 year ago

It goes back to Sapir and there's also a botanical analogue by someone called Vavilov (from memory). Neither has the 'in water' twist. It's enough to cite Campbell 2013 I think, which should contain refs to the history.

On Fri, 14 Oct 2022 at 15:55, Robert Forkel wrote:

Wikipedia seems to cite Campbell 2013 https://glottolog.org/resource/reference/id/547852 for this.


xrotwang commented 1 year ago

Ok, I'll update #15 to include these homelands.