Conceptually, a [geography] section is a [model] section with different defaults (“Data source: get location data from Glottolog, supplemented/overridden by the data given in the [languages] section; substitution model: random walk on spheres”), is it not?
Yes, pretty much. BEASTling's [model]s more-or-less correspond to BEAUti's "partitions", and geography is implemented in BEAUti as an additional partition.
Are you suggesting we do something like
[model mygeomodel]
model = geography
?
I'm suggesting that the config file parser handles [geography] as a short way of writing something like
[model geography]
data = $(beastling-data)/glottolog-languoid.csv
traits = latitude,longitude
model = geosphere
or whatever configuration options we end up with for it.
Glottolog distributes lat/lon data as part of http://glottolog.org/static/download/glottolog-languoid.csv.zip, in two columns, "latitude" and "longitude", both containing decimal values.
Currently, yes. @xrotwang and I have spoken about adding new URLs for per-release lat/lon and macro-area data in unzipped .csv format, so it can be easily distributed with and/or downloaded by BEASTling, just like the Newick data currently is.
@Anaphory rather than data = .../glottolog-languoid.csv, this should already be inferred from the global glottolog-release option. Maybe a data option would make sense for other, less integrated providers of geographic data.
Yes!
I think it should set default values that can be overridden, and given that we already have support for the Glottolog trees, getting the glottolog-languoid.csv should be similar.
I did a little bit of work on this today. In order to be able to test things, I added a mock load_glotto_geo() method to Configuration (cf load_glotto_classification). The current implementation assigns random (lat,lon) points and random macro-areas to all languages. @xrotwang, when you end up adding the Glottolog integration, as long as you assign the data dictionaries to the same attribute names as I have used in the mock method, then filtering by macro-area and [geography] sections should still work.
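For concreteness, here is a rough sketch of what such a mock could look like, written as a standalone function rather than a Configuration method; the attribute names (locations, macroareas), the languages list, and the macro-area strings are illustrative assumptions, not necessarily what is in the actual code:

import random

# Illustrative macro-area names (Glottolog uses a similar fixed set).
MACRO_AREAS = ["Africa", "Australia", "Eurasia",
               "North America", "South America", "Papunesia"]

def load_glotto_geo(config):
    """Mock loader: assign a random (lat, lon) point and a random
    macro-area to every language in the analysis."""
    config.locations = {}   # language identifier -> (latitude, longitude)
    config.macroareas = {}  # language identifier -> macro-area name
    for lang in config.languages:
        config.locations[lang] = (random.uniform(-90.0, 90.0),
                                  random.uniform(-180.0, 180.0))
        config.macroareas[lang] = random.choice(MACRO_AREAS)

The real Glottolog loader just needs to populate the same two dictionaries for the macro-area filtering and [geography] handling to keep working.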
The spherical diffusion "substitution" model needs a clock associated with it, just like all our more conventional substitution models. The current behaviour is exactly as per a normal model, i.e. you can explicitly assign any clock you like with a clock= option, but if you do nothing it will use the (possibly implicit) [clock default].
Do we want to keep this behaviour? It is "the simplest thing that could possibly work", but it is also rather a bad approach in general to share a clock between geography and "real data", even if that clock is relaxed or random: there is no reason to expect the rate variation for these two kinds of data to be correlated, and Remco has informed me that when independent clocks are used for data and geography in his Indo-European analyses, there is no evidence of correlation (and much more variation in the geographic clock than in the other).
Then again, we have not so far pursued a policy of the default model being a good model, just a simple model. I'm also wary of the alternative, a second implicit, overridable clock [clock geo], just because I think it's a bad idea to have too much invisible stuff happening behind the scenes.
Thoughts?
Geographic clock handling is now a bit smarter. If the geography model is sharing a clock with a data model, then the geography's precision parameter is scaled to compensate (otherwise the data and geography will fight to set the mean of the shared clock, assuming said mean is estimated; if the clock is fixed, things will just fit really, really poorly). But if geography gets its own clock, then the precision is fixed and the clock is scaled instead.
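To make that concrete, a dedicated geography clock could be wired up with something like the following (assuming the clock= option and named clock sections discussed above; the exact section and option spellings are illustrative, not final syntax):

[clock geo]
type = relaxed

[geography]
clock = geo

With a shared clock, the precision compensation kicks in; with a dedicated clock like this, the precision stays fixed and the clock itself is scaled.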
I have just committed a really quick and dirty solution for importing location data. You can specify location_data = <filename> in the [languages] section. The expected format is CSV, with a header (which is totally ignored and may have any content at all, but should exist if you don't want your first language's data to vanish), and then rows of identifier, lat, lon, where identifier should be an ISO code or Glottocode and lat and lon are floats.
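A tiny example of the expected file, with a throwaway header and made-up values:

identifier,lat,lon
abcd1234,47.5,9.75
eng,52.0,-1.5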
I do not propose this as anything remotely like the final solution; I just wanted something in there so that I could run analyses using real Glottolog data (instead of the random mock data which is automatically used right now) to make sure that they appear to behave sensibly (so far, they do!).
Discussion welcome on what this option should be called, which section it should live in, and what the accepted CSV file format(s) should look like.
Sidenote: Something similar to the geographic clock handling may be necessary for stochastic Dollo models, if I got Remco's comments correctly.
Do whatever you need to do to generate results, but what speaks against specifying it as
[model geography]
data = glottolog-languoid.csv
traits = latitude,longitude
model = geosphere
or, in your case,
[model geography]
data = only-latlon.csv
traits = lat,lon
# can be left out if the file is assumed to have only these two columns
model = geosphere
? Have you implemented geographical inference in a different way from other substitution models?
I'm ok with the config syntax proposed by @Anaphory, except that the default case of using glottolog location data should be dealt with differently. In that case it should be sufficient to specify
data = glottolog
and the selection of the data file will then be done respecting the glottolog-release specified in the admin section.
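For illustration, that proposal would look something like this (the release number and exact option spellings are just examples):

[admin]
glottolog-release = 2.7

[model geography]
data = glottolog
model = geosphere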
Yes. I would even consider making that the default for a [geography] section; I was just thinking of @lmaurits testing geography support with an explicit file.
Re: using [geography] instead of [model geography] with model=geosphere. At the moment, the geographic model is handled somewhat differently from all the other models, even though conceptually (from BEAST's perspective) it is the very same kind of thing. In fact, the GeoModel class currently does not even subclass BaseModel!
There are two reasons for this:
Re: specifying that we want to use Glottolog data. I want this to be an implicit default, i.e. it should not be necessary to ever write anything like data = glottolog. Rather, data = should only be necessary for providing an alternative source of data.
@xrotwang, do you have even a rough idea on when you might be able to find the time to work on the Glottolog geography integration? I'm not trying to rush you and am happy to wait, I just want to calibrate my expectations for when a 1.2.0 release might be possible (so I know how hard to work on docs/testing!).
I'm on holidays this week, back on Tuesday. Should be able to work on this next week. 1 day should be enough I hope.
Okay, no problem! Enjoy your holidays.
Closing this as all the geography stuff is now in develop and seems to be working nicely.
Now that non-strict clocks are looking largely complete, the next major functionality hurdle is to add support for geography to BEASTling.
The plan is for this to be largely powered by yet more integration with Glottolog, which will be able to provide a latitude/longitude point and a macro-area for a great many languages. Of course it should be possible for users to provide their own file of latitude/longitude points, but if no such file is provided then the default is to use Glottolog data.
I envisage two aspects to the geography support.
The first, which is fairly straightforward, is the ability to use geography for filtering the languages in an analysis. At the moment, we have the ability to specify a list of families by name or glottocode, and also the ability to just specify a list of languages. It would be great if filtering could also be achieved by supplying a list of macro-areas. I am open to further filtering support, e.g. filtering by Tsammalex ecoregion, or even allowing the user to define their own latitude/longitude-based polygon from which to select languages (assuming there is adequate and lightweight enough Python library support to make this possible), but filtering by macro-area should be quick and easy and so makes sense as a good first target.
The second is to actually add a phylogeographic component to the inference on whatever languages make it through the filter. It does not seem like this will be too hard to do once we have access to a list of latitude/longitude features.
As per the clock situation, I guess most of the planning/discussion which needs to happen is surrounding configuration.
It seems sensible to me that language filtering by macro-area should happen in the [languages] section, i.e. in the same place as all other filtering takes place. A simple macro-areas = option should get the job done (see the sketch at the end of this post).
As for adding phylogeography, a new [geography] section seems appropriate. I had imagined this containing a type= option to choose between spherical or planar diffusion (with BEASTling making the decision for the user based on the size of the region the languages are spread over); however, I've now been told that the planar model is never faster than the spherical model and there is essentially no reason to ever use it. So for now the only option that needs to be supported is a clock= option, so that the geographic part of the likelihood can be associated with a clock, exactly as per the current handling of models. I imagine further options will certainly appear as support expands, though.
An open question is: if the user wants to specify their own file of location data, where should this be done? The information would be used for both language filtering (done in [languages]) and for phylogeography (done in [geography]), so there's some freedom in where it could go. I suppose [languages] makes the most sense - actually, it's the only thing that makes sense, as we want to support geographic filtering for analyses which have no phylogeographic component (and hence no [geography] section).
Does anybody know if there is a widely-used and supported standard file format for exchanging latitude and longitude data? Or, let me guess, there are seven such standards?
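As a rough sketch of how I imagine the two pieces fitting together (all option names and values here are placeholders for whatever syntax we settle on):

[languages]
families = Austronesian
macro-areas = Papunesia, Australia

[geography]
# clock = some_clock   (optional; falls back to the default clock)

The [languages] part would drive the macro-area filtering, and the mere presence of a [geography] section would switch on the phylogeographic component.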