lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License

Adding geography support #55

Closed lmaurits closed 8 years ago

lmaurits commented 8 years ago

Now that non-strict clocks are looking largely complete, the next major functionality hurdle is to add support for geography to BEASTling.

The plan is for this to be largely powered by yet more integration with Glottolog, which will be able to provide a latitude/longitude point and a macro-area for a great many languages. Of course it should be possible for users to provide their own file of latitude/longitude points, but if no such file is provided then the default is to use Glottolog data.

I envisage two aspects to the geography support.

The first, which is fairly straightforward, is the ability to use geography for filtering the languages in an analysis. At the moment, we can specify a list of families by name or glottocode, or just specify a list of languages. It would be great if filtering could also be achieved by supplying a list of macro-areas. I am open to further filtering support, e.g. filtering by Tsammalex ecoregion, or even allowing the user to define their own latitude/longitude-based polygon to select languages from (assuming there is adequate and sufficiently lightweight Python library support to make this possible), but filtering by macro-area should be quick and easy, so it makes sense as a good first target.

The second is to actually add a phylogeographic component to the inference on whatever languages make it through the filter. It does not seem like this will be too hard to do once we have access to a list of latitude/longitude features.

As per the clock situation, I guess most of the planning/discussion which needs to happen is surrounding configuration.

It seems sensible to me that language filtering by macro-area should happen in the [languages] section, i.e. in the same place as all other filtering takes place. A simple macro-areas = option should get the job done.
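A minimal sketch of how such an option might be read and applied, assuming a comma-separated `macro-areas` value. The function name, the `macroareas` dictionary and the language identifiers are all hypothetical and not taken from BEASTling itself:

```python
import configparser


def filter_by_macroarea(config_text, macroareas):
    """Return the sorted language ids whose macro-area is listed in [languages].

    macroareas: hypothetical dict mapping language id -> macro-area name.
    If no macro-areas option is given, no filtering is applied.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("languages", "macro-areas", fallback="")
    wanted = {name.strip().lower() for name in raw.split(",") if name.strip()}
    if not wanted:
        return sorted(macroareas)
    return sorted(
        lang for lang, area in macroareas.items() if area.lower() in wanted
    )


example_config = """
[languages]
macro-areas = Africa, Papunesia
"""

# Made-up identifiers; real data would come from Glottolog or a user file.
example_macroareas = {
    "lang_a": "Africa",
    "lang_b": "Eurasia",
    "lang_c": "Papunesia",
}

selected = filter_by_macroarea(example_config, example_macroareas)
```

Matching is case-insensitive here purely as a convenience; whether BEASTling should be that lenient is a separate question.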

As for adding phylogeography, a new [geography] section seems appropriate. I had imagined this containing a type= option to choose between spherical or planar diffusion (with BEASTling making the decision for the user based on the size of the region the languages are spread over). However, I've now been told that the planar model is never faster than the spherical model, so there is essentially no reason ever to use it. So for now the only option that needs to be supported is a clock= option, so that the geographic part of the likelihood can be associated with a clock, exactly as per the current handling of models. I imagine further options will appear as support expands, though.

An open question is, if the user wants to specify their own file of location data, where should this be done? The information would be used for both language filtering (done in [languages]) and for phylogeography (done in [geography]), so there's some freedom in where it would go. I suppose [languages] makes the most sense. Actually, it's the only thing that makes sense, as we want to support geographic filtering for analyses which have no phylogeographic component (and hence no [geography] section).

Does anybody know if there is a widely-used and supported standard file format for exchanging latitude and longitude data? Or, let me guess, there are seven such standards?

Anaphory commented 8 years ago

Conceptually, a [geography] section is a [model] section with different defaults (“Data source: Get location data from glottolog, supplemented/overridden by the data given in the [languages] section; substitution model: random walk on spheres”), is it not?

lmaurits commented 8 years ago

Yes, pretty much. BEASTling's [model]s more-or-less correspond to BEAUti's "partitions", and geography is implemented in BEAUti as an additional partition.

Are you suggesting we do something like

[model mygeomodel]
model = geography

?

Anaphory commented 8 years ago

I'm suggesting that the config file parser handles

[geography]

as a short way of writing something like

[model geography]
data = $(beastling-data)/glottolog-languoid.csv
traits = latitude,longitude
model = geosphere

or whatever configuration options we end up with for it.
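As a rough illustration of that shorthand, here is a sketch using Python's standard `configparser`. The `expand_geography` helper and the `GEOGRAPHY_DEFAULTS` table are hypothetical, and the option names simply follow the proposal above rather than any final BEASTling syntax:

```python
import configparser

# Assumed defaults, copied from the proposal above; not a settled syntax.
GEOGRAPHY_DEFAULTS = {
    "traits": "latitude,longitude",
    "model": "geosphere",
}


def expand_geography(parser):
    """Rewrite a bare [geography] section as [model geography] with defaults.

    Options the user wrote in [geography] override the defaults.
    """
    if not parser.has_section("geography"):
        return parser
    options = dict(GEOGRAPHY_DEFAULTS)
    options.update(parser.items("geography"))
    parser.remove_section("geography")
    parser.add_section("model geography")
    for key, value in options.items():
        parser.set("model geography", key, value)
    return parser


parser = configparser.ConfigParser()
parser.read_string("[geography]\nclock = geoclock\n")
expand_geography(parser)
```

After the call, the parser holds a `[model geography]` section with the user's `clock` option plus the default `traits` and `model` values, so the rest of the code would only ever see ordinary model sections.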

Anaphory commented 8 years ago

Glottolog distributes lat/lon data as part of http://glottolog.org/static/download/glottolog-languoid.csv.zip, in two columns "latitude" and "longitude" with decimal-point number values each.

lmaurits commented 8 years ago

Currently, yes. @xrotwang and I have spoken about adding new URLs for per-release lat/lon and macro-area data in unzipped .csv format, so it can be easily distributed with and/or downloaded by BEASTling, just like the Newick data currently is.

xrotwang commented 8 years ago

@Anaphory rather than data = .../glottolog-languoid.csv, this should already be inferred from the global glottolog-release option. Maybe a data option would make sense for other, less integrated providers of geographic data.

Anaphory commented 8 years ago

Yes!

I think it should set default values that can be overridden, and given that we already have support for the glottolog trees, getting the glottolog-languoid.csv should be similar.

lmaurits commented 8 years ago

I did a little bit of work on this today. In order to be able to test things, I added a mock load_glotto_geo() method to Configuration (cf. load_glotto_classification). The current implementation assigns random (lat,lon) points and random macro-areas to all languages. @xrotwang, when you end up adding the Glottolog integration, as long as you assign the data dictionaries to the same attribute names as I have used in the mock method, then filtering by macro-area and [geography] sections should still work.
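For readers who want to picture what such a mock might look like, here is a sketch. It is not the actual load_glotto_geo() code: the function name, the returned dictionaries and the macro-area list (which reflects my understanding of Glottolog's macro-areas) are assumptions made for illustration.

```python
import random

# Macro-area names as I understand Glottolog's list; treat as an assumption.
MACROAREAS = [
    "Africa",
    "Australia",
    "Eurasia",
    "North America",
    "Papunesia",
    "South America",
]


def mock_glotto_geo(languages, seed=None):
    """Assign a random (lat, lon) point and a random macro-area to each language."""
    rng = random.Random(seed)
    locations = {
        lang: (rng.uniform(-90.0, 90.0), rng.uniform(-180.0, 180.0))
        for lang in languages
    }
    macroareas = {lang: rng.choice(MACROAREAS) for lang in languages}
    return locations, macroareas


# Made-up language identifiers, used only to exercise the mock.
locations, macroareas = mock_glotto_geo(["lang_a", "lang_b", "lang_c"], seed=1)
```

Passing a seed keeps test runs reproducible, which matters when the mock data feeds into filtering tests.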

lmaurits commented 8 years ago

The spherical diffusion "substitution" model needs a clock associated with it, just like all our more conventional substitution models. The current behaviour is exactly as per a normal model, i.e. you can explicitly assign any clock you like with a clock= option, but if you do nothing it will use the (possibly implicit) [clock default].
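The fallback described here can be sketched as follows. The `resolve_clock` function and the dictionary layout are hypothetical, and the assumption that the implicit default clock is strict is mine, not something stated in this thread:

```python
def resolve_clock(model_options, clocks):
    """Return the name of the clock a model (or geography) section should use.

    model_options: dict of options from one config section.
    clocks: dict mapping clock names to their settings; updated in place.
    """
    name = model_options.get("clock", "default")
    if name == "default":
        # Create the implicit default clock if the user never declared it.
        # Assumption: the implicit default is a strict clock.
        clocks.setdefault("default", {"type": "strict"})
    elif name not in clocks:
        raise ValueError("Unknown clock: %s" % name)
    return name


clocks = {"geoclock": {"type": "relaxed"}}
explicit = resolve_clock({"clock": "geoclock"}, clocks)
implicit = resolve_clock({}, clocks)
```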

Do we want to keep this behaviour? It is "the simplest thing that could possibly work", but it is also rather a bad approach in general to share a clock between geography and "real data", even if that clock is relaxed or random. There is no reason to expect the variation in rate for these two types to be correlated, and Remco has informed me that when independent clocks are used for data and geography in his Indo-European analyses, there is no evidence of correlation (and much more variation in the geographic clock than in the other).

Then again, we have not so far pursued a policy of the default model being a good model, just a simple model. I'm also wary of the alternative - a second implicit, overridable clock [clock geo], just because I think it's a bad idea to have too much invisible stuff happening behind the scenes.

Thoughts?

lmaurits commented 8 years ago

Geographic clock handling is now a bit smarter. If the geography model shares a clock with a data model, the geography's precision parameter is scaled to compensate. Otherwise the data and geography would fight to set the mean of the shared clock, assuming that mean is estimated; if the clock is fixed, things would just fit really, really poorly. But if geography gets its own clock, the precision is fixed and the clock is scaled instead.
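The decision rule can be summarised in a few lines. This is only a sketch of the logic as described in this comment; `plan_geo_clock` and the keys of the returned dictionary are hypothetical names, not BEASTling's:

```python
def plan_geo_clock(geo_clock, data_model_clocks):
    """Decide which geographic parameter is scaled.

    geo_clock: name of the clock assigned to the geography model.
    data_model_clocks: iterable of clock names used by the data models.
    """
    shared = geo_clock in set(data_model_clocks)
    return {
        "shared_clock": shared,
        # Shared clock: scale the precision so geography does not fight
        # the data over the mean of the clock.
        "scale_precision": shared,
        # Dedicated clock: fix the precision and scale the clock instead.
        "scale_clock": not shared,
    }


shared_plan = plan_geo_clock("default", ["default", "default"])
own_plan = plan_geo_clock("geoclock", ["default"])
```

Exactly one of the two quantities is scaled in either case, which avoids having the precision and the clock rate compete to explain the same variation.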

lmaurits commented 8 years ago

I have just committed a really quick and dirty solution for importing location data. You can specify location_data = <filename> in the [languages] section. The expected format is CSV with a header row (its content is ignored entirely, but it must be present, otherwise your first language's data will vanish), followed by rows of identifier, lat, lon, where identifier is an ISO code or Glottocode and lat and lon are floats.
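To make the expected file layout concrete, here is a sketch of a reader for that format using the standard `csv` module. It is not the committed implementation; `read_location_data` is a hypothetical name and the sample identifiers are made up:

```python
import csv
import io


def read_location_data(handle):
    """Read rows of identifier, lat, lon from a CSV file with a header row.

    The header row is skipped without being inspected. Returns a dict
    mapping identifier -> (lat, lon) as floats.
    """
    reader = csv.reader(handle)
    next(reader, None)  # discard the header, whatever it contains
    locations = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        identifier = row[0].strip()
        locations[identifier] = (float(row[1]), float(row[2]))
    return locations


# Made-up identifiers and coordinates, standing in for a real file.
sample = io.StringIO(
    "anything,goes,here\n"
    "abcd1234,12.5,-45.25\n"
    "xyz,-3.0,101.0\n"
)
locations = read_location_data(sample)
```

With a real file the call would be wrapped in `with open(filename, newline="") as handle:`, as the `csv` documentation recommends.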

I do not propose this as anything remotely like the final solution for this, I just wanted something in there so that I could run analyses using real Glottolog data (instead of the random mock data which is automatically used right now) to make sure that they appear to behave sensibly (so far, they do!).

Discussion welcome on what this option should be called, which section it should live in, and what the accepted CSV file format(s) should look like.

Anaphory commented 8 years ago

Side note: something similar to the geographic clock handling may be necessary for stochastic Dollo models, if I understood Remco's comments correctly.

Anaphory commented 8 years ago

Do whatever you need to do for generating results, but what is the argument against specifying it as

[model geography]
data = glottolog-languoid.csv
traits = latitude,longitude
model = geosphere

or then in your case

[model geography]
data = only-latlon.csv
traits = lat,lon
#can be left out, if you assume that it has got only these two columns
model = geosphere

? Have you implemented geographical inference in a different way from other substitution models?

xrotwang commented 8 years ago

I'm ok with the config syntax proposed by @Anaphory, except that the default case of using glottolog location data should be dealt with differently. In that case it should be sufficient to specify

data = glottolog

and the selection of the data file will then be done respecting the glottolog-release specified in the admin section.

Anaphory commented 8 years ago

Yes. I would even consider that the default for a [geography] section; I was just thinking of @lmaurits testing geography support with an explicit file.

lmaurits commented 8 years ago

Re: using [geography] instead of [model geography] with model=geosphere. At the moment, the geographic model is handled somewhat differently from all the other models, even though conceptually (from BEAST's perspective) it is the very same kind of thing. In fact, the GeoModel class currently does not even subclass BaseModel!

There are two reasons for this:

  1. I want the structure of config files not to reflect BEAST's conceptual view of the world, but to reflect a typical working linguist's conceptual view of the world (or, rather, my best guess at what that is). I feel like users are likely to think of geography as "a different kind of thing" from "real data".
  2. Most of the logic associated with either the BaseModel class or the way non-geographic models are treated in Configuration.process() is either useless or actively unwanted for the geographic model (this played the stronger role in leading me to do things the way they are currently done). E.g.:
    • We don't want to filter the features based on some user-specified list, or bother to check for missing data or constant values.
    • We don't want to set up a userDataType with a custom codeMap.
    • We don't want to set the languages in an analysis to the union or intersection of the languages in the datafiles with the languages that Glottolog has lat/lon data for.
    • The clock logic is different (detailed above).

lmaurits commented 8 years ago

Re: specifying that we want to use Glottolog data. I want this to be an implicit default, i.e. it should not be necessary to ever write anything like data = glottolog. Rather, data = should only be necessary for providing an alternative source of data.

lmaurits commented 8 years ago

@xrotwang, do you have even a rough idea on when you might be able to find the time to work on the Glottolog geography integration? I'm not trying to rush you and am happy to wait, I just want to calibrate my expectations for when a 1.2.0 release might be possible (so I know how hard to work on docs/testing!).

xrotwang commented 8 years ago

I'm on holidays this week, back on Tuesday. I should be able to work on this next week. One day should be enough, I hope.

lmaurits commented 8 years ago

Okay, no problem! Enjoy your holidays.

lmaurits commented 8 years ago

Closing this as all the geography stuff is now in develop and seems to be working nicely.