lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License

Make overwriting, rather than replacing locations, possible #132

Closed lmaurits closed 6 years ago

lmaurits commented 7 years ago

It's possible to specify a file of user-provided latitudes and longitudes. At the moment this replaces the Glottolog location data entirely, which means that e.g. if you want to just add location data for one or two languages which are missing it in Glottolog, you have to make a CSV file with N+2 rows (where N is number of Glottolog langs). And that file becomes out of date if the next Glottolog release changes some locations.

It would be better if the user could choose whether to do this or whether to use the file to selectively patch Glottolog by adding or overwriting. That way the above scenario requires only a one or two line CSV file.
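To make the two modes concrete, here is a minimal sketch of the difference between replacing and patching. The function name `resolve_locations`, the `patch` flag and the coordinates are illustrative assumptions, not BEASTling's actual code or data.

```python
def resolve_locations(glottolog, user, patch=True):
    """Combine Glottolog and user-provided locations.

    Both arguments map a language code to a (latitude, longitude) tuple.
    patch=True overlays the user entries on top of Glottolog;
    patch=False discards Glottolog and uses only the user entries.
    """
    if not patch:
        return dict(user)
    merged = dict(glottolog)
    merged.update(user)  # user entries add to or overwrite Glottolog entries
    return merged


# Made-up coordinates, for illustration only.
glottolog = {"fin": (61.0, 24.0), "deu": (48.0, 12.0)}
user = {"xyz": (1.0, 2.0)}

print(resolve_locations(glottolog, user))
print(resolve_locations(glottolog, user, patch=False))
```

With patching, the user file only needs a row for each language it adds or corrects, so it does not go stale when Glottolog changes the other locations.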

Anaphory commented 7 years ago

For the syntax:

Can we just use our two established syntaxes of property = value1,value2,... and property = %(property)s,value0 and use

location_data = "/path/to/locations-leiden.csv"

for overwriting and

location_data = "/path/to/locations-auckland.csv","/path/to/locations-leiden.csv"

for the bits specified in /path/to/locations-leiden.csv having precedence over the ones specified in /path/to/locations-auckland.csv (This behaviour should be familiar from the command line)?

location_data = ,"/path/to/locations-leiden.csv"

for “patch glottolog with /path/to/locations-leiden.csv”?

Note the comma in the beginning – this is the bit I have doubts about. I would want to be able to use location_data = %(location_data)s,"/path/to/locations-leiden.csv" and assume that location_data starts out empty, because that's never going to be a file name.

Other data files

For consistency, the same syntax should work for several data files in a model.

Then I can just build a file

[languages]
location_data = %(location_data)s,alorese_locations.csv
[model lexicon]
data = %(data)s,alorese_vocabulary.csv

to add some languages to somebody else's analysis. (Or to mine, but if it was mine, I would have an easier time putting them in my files to start with and using exclude)

lmaurits commented 6 years ago
  1. I like the idea of being able to specify multiple comma-separated location data files, with order implying precedence.
  2. I don't like the idea of using a blank string in this comma-separated list to represent the Glottolog locations. It's too magical (if you saw it in a config without having read the documentation you couldn't even begin to guess what it did) and too subtle (if quickly skimming a config it could be easily overlooked). That said, I don't have a better suggestion right now.
  3. I agree that for consistency the same syntax should work for data inputs, and actually probably everywhere else that you can specify things using a filename (e.g. feature exclusions). Perhaps this should be a separate Issue, though.

Anaphory commented 6 years ago

Empty string is quite magical. It could well be an explicit entry like glottolog-3.0, but that entry would need to have a very low chance of ever being a filename in the current directory. There would also need to be a reasonable parser that checks for a Glottolog specifier before it tries to open any files, and the location_data property would need to start with such a default value so that

location_data = %(location_data)s,"/path/to/locations-leiden.csv"

works
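A rough sketch of such a parser follows. It assumes the specifier looks like `glottolog-<version>`; the names `GLOTTOLOG_SPEC` and `parse_sources` are hypothetical and not part of BEASTling. Each entry is matched against the specifier pattern first, so the filesystem is never consulted for it.

```python
import csv
import re

# Assumed specifier format: "glottolog-" followed by a dotted version number.
GLOTTOLOG_SPEC = re.compile(r"^glottolog-\d+(\.\d+)*$")


def parse_sources(value):
    """Split a comma-separated location_data value into typed sources.

    Returns a list of ("glottolog", version) or ("file", path) tuples,
    in the order given. Quoted entries are unquoted by the csv module.
    """
    entries = next(csv.reader([value], skipinitialspace=True))
    sources = []
    for entry in entries:
        entry = entry.strip()
        if GLOTTOLOG_SPEC.match(entry):
            # Recognised as a Glottolog release; no file is opened for it.
            sources.append(("glottolog", entry.split("-", 1)[1]))
        else:
            sources.append(("file", entry))
    return sources


print(parse_sources('glottolog-3.0,"/path/to/locations-leiden.csv"'))
```

A file in the current directory that happens to be named like a release would still be shadowed by the specifier, which is the collision risk mentioned above.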

xrotwang commented 6 years ago

A really explicit way to do this would be URLs, i.e. using the file scheme file:// to mark filenames and probably an HTTP URL to specify the Glottolog locations. This may generally be the most platform-independent way to specify files in the config - but then, file names typically don't have to be platform independent, because they are exactly the platform or even machine-specific part of configurations ...

Anaphory commented 6 years ago

Forcing URL schemes would be unhelpful, because we will want to permit relative paths. Permitting them might be reasonable, but then caching needs to be explicitly discussed. So far, we have assumed that glottolog-3.0 stays glottolog-3.0 and any changes will lead to a new version.
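For illustration, permitting URL schemes without forcing them could look roughly like the sketch below, which uses the standard library's `urllib.parse.urlparse`. The function name `classify_entry` and the example URL are assumptions; anything without an `http`, `https` or `file` scheme is treated as an ordinary, possibly relative, path.

```python
from urllib.parse import urlparse


def classify_entry(entry):
    """Classify a config entry as a remote URL, a file:// URL or a plain path."""
    scheme = urlparse(entry).scheme
    if scheme in ("http", "https"):
        return "remote"
    if scheme == "file":
        return "file-url"
    # No recognised scheme: treat as a plain path. This branch also catches
    # Windows drive letters, which urlparse reports as a one-letter scheme.
    return "path"


for entry in (
    "locations-leiden.csv",
    "../data/locations-leiden.csv",
    "file:///path/to/locations-leiden.csv",
    "https://example.org/locations.csv",  # hypothetical URL
):
    print(entry, "->", classify_entry(entry))
```

This only covers recognising the entries. How and when a remote source would be cached is left open here, as it is in the discussion.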

lmaurits commented 6 years ago

I'm starting to wonder whether or not we are overthinking this. The requirement to be able to unambiguously specify "Glottolog" as one of the sources of location data only arises if we insist on being able to distinguish between "patch the Glottolog locations" and "overwrite Glottolog entirely". Do we actually need the total overwrite option? How is this ever practically different from patching Glottolog with a file which contains new locations for all the languages in my analysis?

Anaphory commented 6 years ago

You may be right. Currently, the only case I can think of is very constructed:

I have some vast data source which may be updated from time to time, with glottolog identifiers, but I don't trust it, so I want to specify my own location data which does not come from that source, and I want errors to happen because of missing locations (because languages may have been added) instead of falling back to Glottolog.

That's all, and it's very contrived and in no way good practice.

(That is, the practical difference is that overwriting could not use Glottolog as fallback, which is something only useful when I don't know my data and hate Glottolog more than data errors.)

xrotwang commented 6 years ago

@lmaurits I think I agree. So the actual process is this: Specified location_data will be supplemented with data from Glottolog for the languoids it lacks information for, right? So if you want to do away with all Glottolog data, you'd start your own location data with an empty copy of Glottolog's.

lmaurits commented 6 years ago

Hmm. Actually, that's not that awful. If I really dislike Glottolog's location for some language and would rather a language be dropped from a phylogeographic analysis than use the Glottolog location, I should be able to do that. Of course, I can do that using excludes, but this is a bit clunky. Then again, edge cases being clunky is perhaps to be expected...

lmaurits commented 6 years ago

@xrotwang Yes, that's what I had in mind (that specified location_data will be supplemented with data from Glottolog for the languoids it lacks information for). The only real question is how to prevent that, e.g. if I don't like Glottolog's location for Finnish but I don't have a better one of my own, what do I do? At the moment the only option is to use [languages]/exclude to drop Finnish. Perhaps our file format for specifying locations (which needs work anyway, see the very valid Issue #149) could have some way of doing this, e.g. fin = ?,?.

xrotwang commented 6 years ago

Does it make sense to run an analysis with a geographic component on incomplete location data? If so, what's the way to relay this info to BEAST?

lmaurits commented 6 years ago

At the moment this is unsupported by BEASTling (languages without locations in Glottolog will be dropped from phylogeographic analyses), but I am pretty sure (haven't tried it yet, though) that you can actually ask BEAST to just sample the location for those taxa (and you could constrain the sampling to some prior polygon if you wanted to). It probably does make sense to want to do this, as you might have good linguistic data for one of those taxa and it would be a shame to drop that out of the analysis just because of missing geographic data.

lmaurits commented 6 years ago

Default behaviour is now to patch rather than replace Glottolog.

Still no support for multiple files.

lmaurits commented 6 years ago

Forgot to tag this Issue in the commit messages, but recent commits mean that:

  1. Multiple files are now supported, via comma-separated lists of filenames. Files specified later in the list override locations from files specified earlier.
  2. Either the latitude, longitude or both of a location may be ?s, in which case the language will be dropped from a geographic analysis. This essentially provides a way to remove Glottolog locations which the user disagrees with.
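The resulting behaviour can be sketched as follows. This is an illustration, not BEASTling's implementation: it assumes each location file is a headerless CSV with rows of the form `code,latitude,longitude`, and the function name `load_locations` and all coordinates are made up.

```python
import csv
import io


def load_locations(glottolog, *files):
    """Patch Glottolog locations with user files; later files take precedence.

    glottolog maps a language code to (latitude, longitude).
    Each file is an open text handle with rows "code,latitude,longitude".
    A "?" in either coordinate removes the language's location, so the
    language would be dropped from a geographic analysis.
    """
    locations = dict(glottolog)
    for handle in files:
        for code, lat, lon in csv.reader(handle):
            if "?" in (lat.strip(), lon.strip()):
                locations.pop(code, None)  # unknown coordinate: no location
            else:
                locations[code] = (float(lat), float(lon))
    return locations


glottolog = {"fin": (61.0, 24.0), "deu": (48.0, 12.0)}
first = io.StringIO("fin,?,?\nxyz,1.5,2.5\n")
second = io.StringIO("xyz,3.0,4.0\n")

print(load_locations(glottolog, first, second))
```

In this sketch a later file can also restore a location that an earlier file marked with ?s, since each file is simply applied on top of the result so far.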