glottolog / pyglottolog

Python API to access glottolog/glottolog
https://glottolog.org
Apache License 2.0
Create human readable classification diff #34

Open xrotwang opened 4 years ago

xrotwang commented 4 years ago

[...] maybe it would be nice to have a browsable list of clf changes between versions. It's cheap since it can be generated automatically. In what follows, V1 and V2 are the older and later versions, respectively.

  1. First generate the list of language-level inventory changes, which is (a) lgs(V1) \ lgs(V2)

    "The following languages were removed from the language inventory"

    Lg; Family(V1, lg); Macro_Area; Comment

Where comment is one of three possibilities: (i) if moved to bookkeeping: "Spurious, see link to lg in V2"; (ii) if promoted to subfamily: "Rendered as subfamily, see link to lg in V2"; (iii) if demoted to dialect: "Rendered as dialect, see link to lg in V2"

and (b) lgs(V2) \ lgs(V1)

"The following languages were added to the language inventory"

Lg; Family(V2, lg); Macro_Area; Comment

Where comment is one of three possibilities: (i) if non-existent (at any level) in V1: "Added, see link to lg in V2"; (ii) if demoted from subfamily: "Previously rendered as subfamily, see link to lg in V2"; (iii) if promoted from dialect: "Previously rendered as dialect, see link to lg in V2"
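A minimal sketch of step 1, not tied to the pyglottolog API. It assumes each version has already been reduced to a plain dict mapping glottocode to one of the hypothetical status labels "language", "dialect", "family" or "bookkeeping"; the function names and the glottocodes in the example are made up for illustration. The "see link to lg in V2" part of each comment is left out, and cases not covered by the three possibilities above get an empty comment.

```python
# Comment texts from the proposal, keyed by the status of the languoid
# in the *other* version (None means: not present at any level).
REMOVED_COMMENTS = {
    "bookkeeping": "Spurious",
    "family": "Rendered as subfamily",
    "dialect": "Rendered as dialect",
}
ADDED_COMMENTS = {
    None: "Added",
    "family": "Previously rendered as subfamily",
    "dialect": "Previously rendered as dialect",
}


def languages(status):
    """Return lgs(V): the glottocodes with language-level status."""
    return {code for code, level in status.items() if level == "language"}


def inventory_changes(status_v1, status_v2):
    """Return (removed, added) as dicts mapping glottocode -> comment."""
    lgs1, lgs2 = languages(status_v1), languages(status_v2)
    removed = {
        lg: REMOVED_COMMENTS.get(status_v2.get(lg), "")
        for lg in sorted(lgs1 - lgs2)  # lgs(V1) \ lgs(V2)
    }
    added = {
        lg: ADDED_COMMENTS.get(status_v1.get(lg), "")
        for lg in sorted(lgs2 - lgs1)  # lgs(V2) \ lgs(V1)
    }
    return removed, added


# Hypothetical toy data, not real Glottolog content.
status_v1 = {
    "aaaa1234": "language",
    "bbbb1234": "language",
    "cccc1234": "dialect",
    "dddd1234": "language",
}
status_v2 = {
    "aaaa1234": "language",
    "bbbb1234": "dialect",
    "cccc1234": "language",
    "dddd1234": "bookkeeping",
    "eeee1234": "language",
}
removed, added = inventory_changes(status_v1, status_v2)
```

The Family and Macro_Area columns would be looked up separately in V1 (for removals) or V2 (for additions).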

  2. For classification rearrangements, for each lg in the intersection of lgs(V1) and lgs(V2), consider its parent paths p1 and p2 in V1 and V2, respectively. E.g., the parent path for Yaroame [yaro1235] is (yano1268, nina1239, yano1266). For each lg where p1 != p2, group on the tuple (p1, p2) and show

    "The following languages were moved"

    Lg; Family(V1/V2, lg); From-To(p1, p2); Macro_Area; Reference

Where Family(V1/V2, lg) is the family of the lg, or the string family1/family2 if they are not the same; Reference is just a link to the clf reference (in V2); and From-To(p1, p2) can be computed as follows. Align p1 and p2 by Levenshtein distance to get an aligned sequence a_1 ... a_n, where each a_i = (x, y) pairs a path element x from p1 (or None) with a path element y from p2 (or None). (To break ties among alignments with the same Levenshtein distance, prefer the one with the minimal number of substitutions.) From the a_i sequence, form the sequence whose i-th entry is

    "..." if x == y
    "x" if y == None
    "y" if x == None
    "x->y" otherwise

From-To(p1, p2) is then the comma-separated concatenation of this latter sequence, but with any run of consecutive "..." entries replaced by just one "..."

E.g., the paths

    (yano1268, nina1239, yano1266)
    (yano1268, nina1239, yano1266, aaaa1234)

would get From-To: ..., aaaa1234

    (yano1268, nina1239, yano1266)
    (yano1268, yano1266, aaaa1234)

would get From-To: ..., nina1239, ..., aaaa1234

    (yano1268, nina1239, yano1266)
    (yano1268, nina1239, aaaa1234)

would get From-To: ..., yano1266->aaaa1234
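The From-To computation and the grouping on (p1, p2) could be sketched roughly as follows. This is only an illustration under stated assumptions: paths are plain tuples of glottocodes, the function names `align`, `from_to` and `group_moves` are hypothetical, and the tie-break is implemented by minimising the pair (edit distance, number of substitutions) lexicographically in a standard dynamic-programming table. Nothing here uses the pyglottolog API.

```python
from collections import defaultdict


def align(p1, p2):
    """Align two paths, minimising (edit distance, substitutions)."""
    n, m = len(p1), len(p2)
    # cost[i][j] = (distance, substitutions) for p1[:i] versus p2[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i and j:  # match or substitution
                d, s = cost[i - 1][j - 1]
                if p1[i - 1] != p2[j - 1]:
                    d, s = d + 1, s + 1
                candidates.append(((d, s), (i - 1, j - 1)))
            if i:  # element only in p1
                d, s = cost[i - 1][j]
                candidates.append(((d + 1, s), (i - 1, j)))
            if j:  # element only in p2
                d, s = cost[i][j - 1]
                candidates.append(((d + 1, s), (i, j - 1)))
            cost[i][j], back[i][j] = min(candidates)
    # Walk back from the bottom-right cell to recover the (x, y) pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        x = p1[i - 1] if pi < i else None
        y = p2[j - 1] if pj < j else None
        pairs.append((x, y))
        i, j = pi, pj
    pairs.reverse()
    return pairs


def from_to(p1, p2):
    """Render the aligned paths, collapsing runs of unchanged elements."""
    parts = []
    for x, y in align(p1, p2):
        if x == y:
            part = "..."
        elif y is None:
            part = x
        elif x is None:
            part = y
        else:
            part = f"{x}->{y}"
        if part == "..." and parts and parts[-1] == "...":
            continue  # keep just one "..." per run
        parts.append(part)
    return ", ".join(parts)


def group_moves(paths_v1, paths_v2):
    """Group languages present in both versions by their (p1, p2) pair."""
    groups = defaultdict(list)
    for lg in sorted(paths_v1.keys() & paths_v2.keys()):
        p1, p2 = tuple(paths_v1[lg]), tuple(paths_v2[lg])
        if p1 != p2:
            groups[(p1, p2)].append(lg)
    return dict(groups)


old = ("yano1268", "nina1239", "yano1266")
print(from_to(old, ("yano1268", "nina1239", "yano1266", "aaaa1234")))
print(from_to(old, ("yano1268", "yano1266", "aaaa1234")))
print(from_to(old, ("yano1268", "nina1239", "aaaa1234")))
```

The three calls at the end reproduce the three example From-To strings above. Minimising the tuple rather than the distance alone is what makes the second example come out as a removal plus an addition instead of two substitutions, which have the same Levenshtein distance.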