lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License

Excluding languages without location data #167

Closed: Anaphory closed this issue 6 years ago

Anaphory commented 7 years ago

What goes wrong when we don't exclude languages without location data?

https://github.com/lmaurits/BEASTling/blob/bd7c6c155ae482f9693534b12b5a16f44007166c/beastling/configuration.py#L571

lmaurits commented 7 years ago

I don't remember the details, but as-is the resulting analysis will not run. I do know that it is possible to sample tip locations, so in principle an analysis that includes languages with missing location data is perfectly possible. If we can get it working in time, I'd be happy to include that ability in 1.4. This might actually be very easy; I've simply never looked into it because it hasn't yet been an itch for me. If it's causing you problems, I fully encourage you to look into it.

I probably have some XML files lying around somewhere that do geographic tip sampling. If I can find them, I'll share the details with you.

Anaphory commented 7 years ago

I had to change the bit where BEASTling tries to convert "?" into a floating-point number, but otherwise the analysis seems to run perfectly fine with missing entries.

I haven't tried sampling tips.
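For illustration, a change of this kind might look roughly like the following. This is a minimal sketch, not BEASTling's actual code: `parse_coordinate`, `parse_location`, and the `MISSING` marker are hypothetical names, and it assumes a missing entry is written as "?" or left empty.

```python
MISSING = "?"  # assumed marker for a missing coordinate


def parse_coordinate(value):
    """Return the coordinate as a float, or None if it is missing."""
    text = value.strip()
    if text in ("", MISSING):
        return None
    return float(text)


def parse_location(lat, lon):
    """Return a (latitude, longitude) tuple, or None if either part is missing."""
    latitude = parse_coordinate(lat)
    longitude = parse_coordinate(lon)
    if latitude is None or longitude is None:
        return None
    return (latitude, longitude)
```

Returning `None` instead of raising keeps the language in the analysis and leaves the decision about what to do with it to the caller.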

lmaurits commented 7 years ago

Huh, that's a pleasant surprise! Not sure why I got the idea that it didn't "just work". Feel free to push that change.

I guess we should add an option to drop languages with missing locations in case somebody really wants that?
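Such an option could be sketched as below. The function name `select_languages` and the `drop_missing` flag are hypothetical, chosen only for illustration; this does not describe an existing BEASTling setting. It assumes locations are held in a dict that maps each language name to a coordinate pair, or to `None` when the location is unknown.

```python
def select_languages(locations, drop_missing=False):
    """Return the sorted language names to keep in the analysis.

    `locations` maps a language name to a (lat, lon) tuple, or to None
    when the location is unknown. With drop_missing=False every language
    is kept; with drop_missing=True those without a location are dropped.
    """
    if not drop_missing:
        return sorted(locations)
    return sorted(name for name, loc in locations.items() if loc is not None)
```

Defaulting `drop_missing` to `False` would match the behaviour discussed here, where languages with missing locations stay in the analysis unless the user asks otherwise.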

Anaphory commented 7 years ago

I have inspected the geography nexus file generated by BEAST and found, to my surprise, that it lists geolocations for all nodes, both tips and internal nodes. The coordinates for the one tip language I checked were constant over the first few steps I looked at. I have not yet checked which location that language is assigned.

lmaurits commented 7 years ago

Yes, those internal node locations are not sampled; they are estimates of the mean location under some kind of quick-and-dirty approximation to the diffusion process. There is an option to do a much better job of estimating them using some kind of particle filter, but it slows things down substantially and is not exposed by BEASTling.

I'm curious as to whether or not the location assigned to the tip with the missing data happens to be the exact location of some other node. I think that's a distinct possibility judging from my recently gained understanding of how TraitSets work...
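One way to check this, once the logged coordinates for a single step have been read into a dict, is sketched below. The helper `nodes_sharing_location` and the dict layout (node name mapped to a latitude/longitude pair) are assumptions made for illustration; reading the values out of the BEAST log is not shown.

```python
import math


def nodes_sharing_location(locations, target, tol=1e-9):
    """Return the sorted names of nodes whose coordinates match `target`'s.

    `locations` maps a node name to a (lat, lon) tuple. Coordinates are
    compared with a small tolerance to allow for rounding in the log.
    """
    lat, lon = locations[target]
    return sorted(
        name
        for name, (other_lat, other_lon) in locations.items()
        if name != target
        and math.isclose(lat, other_lat, abs_tol=tol)
        and math.isclose(lon, other_lon, abs_tol=tol)
    )
```

A non-empty result for the tip with missing data would suggest that it was given another node's location rather than an independently sampled one.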

Anaphory commented 7 years ago

Yes, that's what I wanted to check as well.

lmaurits commented 6 years ago

These languages now have their locations sampled, rather than being excluded.