lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License

Do we want shorter alignment lines? #231

Open Anaphory opened 5 years ago

Anaphory commented 5 years ago

I just noticed that the sequence data is in the value attribute of <sequence>, which (according to BEAST's principle that “There should be three vastly different ways to specify anything you might want to specify in a configuration file”) means that

<sequence id="data_vocabulary:abui1241-fuime" taxon="abui1241-fuime" value="01??" />

can also be specified as

<sequence id="data_vocabulary:abui1241-fuime" taxon="abui1241-fuime">
01??
</sequence>

or even as

<sequence id="data_vocabulary:abui1241-fuime" taxon="abui1241-fuime">
01
??
</sequence>

which, given Emacs's dislike of long lines and the frequency of Emacs users in our user base (do I remember correctly, @lmaurits?), might be a useful thing to do. It might encourage editing configuration files by hand, and it means one cannot just put lines under each other to see the alignments. But the former is sometimes necessary, and the latter does not work anyway if taxon names have different lengths, so I suggest that we consider doing this.
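To make the claimed equivalence concrete, here is a minimal Python sketch that reads all three forms and reduces them to the same string. It only mimics the behaviour described above: the rule that the `value` attribute wins over element text, and that whitespace in the text is insignificant, are assumptions of this sketch and not a statement about BEAST's actual parser. The helper name `sequence_value` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The three spellings from the issue text.
FORMS = [
    '<sequence id="data_vocabulary:abui1241-fuime" taxon="abui1241-fuime" value="01??" />',
    '<sequence id="data_vocabulary:abui1241-fuime" taxon="abui1241-fuime">\n01??\n</sequence>',
    '<sequence id="data_vocabulary:abui1241-fuime" taxon="abui1241-fuime">\n01\n??\n</sequence>',
]


def sequence_value(xml_snippet: str) -> str:
    """Return the sequence data of one <sequence> element, whitespace removed.

    Assumption (hypothetical, for illustration): use the `value` attribute
    if it exists, otherwise the element text, and ignore all whitespace.
    """
    element = ET.fromstring(xml_snippet)
    raw = element.get("value")
    if raw is None:
        raw = element.text or ""
    # str.split() with no arguments splits on any run of whitespace,
    # including the newlines introduced by the multi-line forms.
    return "".join(raw.split())


values = [sequence_value(form) for form in FORMS]
print(values)  # ['01??', '01??', '01??']
```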

Anaphory commented 5 years ago

We could also consider spaces between different features in the alignment, in case one does want to look at alignments column-by-column, actually.
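A rough sketch of what that could look like on the writing side, assuming we know how many alignment columns each feature contributes. The function name `space_by_feature` and the example widths are hypothetical and not part of BEASTling.

```python
def space_by_feature(sequence: str, widths: list[int]) -> str:
    """Insert one space between the column blocks of consecutive features.

    `widths` gives the number of alignment columns per feature, in order
    (an assumption of this sketch; BEASTling would have to supply it).
    """
    if sum(widths) != len(sequence):
        raise ValueError("feature widths do not add up to the sequence length")
    blocks = []
    start = 0
    for width in widths:
        blocks.append(sequence[start:start + width])
        start += width
    return " ".join(blocks)


# Three features with 4, 5 and 3 columns respectively.
spaced = space_by_feature("0100010011??", [4, 5, 3])
print(spaced)  # 0100 01001 1??
```

Removing the spaces again recovers the original string, so this only helps if the reader of the XML really does ignore whitespace in the sequence data.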

lmaurits commented 5 years ago

Well, let's get the important things out of the way first: I'm a vi(m) user and always have been!

Anyway, my philosophy for a while now has been to develop this aspect of BEASTling with the mentality of a compiler designer. Source code (BEASTling config files) should be optimised for human readability and editability, even if it makes life difficult for the machine (random aside, do our config files support comments? They should), and machine code (BEAST XMLs) should be optimised for speed of execution and/or file size, with no regard for how sensible it looks to crusty old assembly wizards with a hex editor.

Of course the reality is that hand-hacking of BEAST XML still happens, so I'm happy to make small, reasonable deviations from this principle to keep people like us happy, but we should keep the ideal in mind when making these kinds of decisions.

In this case, speed is not really a concern - none of your proposed forms will influence the speed of MCMC sampling, only the initial XML parsing, which is epsilon percent of total run time - so it comes down to the question of how much this impacts file size. The removal of value="" and the trailing / almost entirely balances the addition of </sequence> - not quite, but then this happens only once per long alignment string, so as a percentage of the size of the overall <data> block the increase is negligible. So, I think it can be considered a "reasonable" deviation. Then again, by the same argument, the decrease in line length is going to be pretty negligible as well. This isn't likely to bring all the <data> lines to under 80 columns in any realistic analysis, is it?
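As a back-of-the-envelope check of the size argument, the following Python sketch builds both spellings of one sequence and compares their lengths. The exact whitespace (a single space before `/>`, one newline before and after the data) is an assumption copied from the examples at the top of this issue, and the two helper functions are hypothetical, not BEASTling code.

```python
def attribute_form(seq_id: str, taxon: str, data: str) -> str:
    """Sequence data stored in the value attribute (the current layout)."""
    return f'<sequence id="{seq_id}" taxon="{taxon}" value="{data}" />'


def element_form(seq_id: str, taxon: str, data: str) -> str:
    """Sequence data stored as element text on a line of its own."""
    return f'<sequence id="{seq_id}" taxon="{taxon}">\n{data}\n</sequence>'


data = "01??" * 250  # a made-up 1000-column alignment row
attr = attribute_form("data_vocabulary:abui1241-fuime", "abui1241-fuime", data)
elem = element_form("data_vocabulary:abui1241-fuime", "abui1241-fuime", data)

# Extra characters per sequence; independent of the length of `data`.
overhead = len(elem) - len(attr)
# Longest physical line in each layout.
longest_attr_line = len(attr)
longest_elem_line = max(len(line) for line in elem.splitlines())

print(overhead)           # 2
print(longest_elem_line)  # 1000
print(longest_attr_line - longest_elem_line)
```

Under these formatting assumptions the element form costs two extra characters per sequence (the two newlines), and the data line is shorter only by the length of the opening tag and the attribute syntax, so it is still far beyond 80 columns.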

If we do go ahead with this (and I'm really not opposed, if this does somehow make life as an emacs user easier then go for it) I think I'd prefer your first proposal over your second. The second is going to make the <data> section hundreds of times longer than it already is, which is an impediment to navigation of the file by hand.

I must admit to being seriously intrigued by the idea of putting spaces between features. It encourages extremely naughty hand-editing of data, but it would also make it a lot easier to do quick sanity checks on XML (e.g. do I have ascertainment columns in, do I see the expected number of cognate classes for different meanings, etc.). It goes a long way to transforming the alignment from "opaque blob" to something you can kind of make sense of...
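A minimal sketch of the kind of quick sanity check this would enable, in Python. It assumes the space-separated layout discussed above and, purely for illustration, that the first column of every feature block is an ascertainment column that should contain only `0` or `?`. That layout, the function name `check_alignment` and the example rows are all assumptions, not a description of what BEASTling writes.

```python
def check_alignment(rows: dict[str, str]) -> list[int]:
    """Check space-separated alignment rows and return the columns per feature.

    Raises ValueError if two taxa disagree on the block widths, or if the
    assumed ascertainment column (first column of each block) holds
    anything other than '0' or '?'.
    """
    widths = None
    for taxon, spaced in rows.items():
        blocks = spaced.split()
        current = [len(block) for block in blocks]
        if widths is None:
            widths = current
        elif current != widths:
            raise ValueError(f"{taxon}: block widths {current} != {widths}")
        for index, block in enumerate(blocks):
            if block[0] not in "0?":
                raise ValueError(f"{taxon}: feature {index} lacks an ascertainment column")
    return widths or []


# Hypothetical rows: three features with 3, 3 and 2 columns.
rows = {
    "abui1241-fuime": "010 0?1 01",
    "taxon-b": "001 010 0?",
}
print(check_alignment(rows))  # [3, 3, 2]
```

Each width minus one would then be the number of cognate classes for that meaning, which can be compared against what one expects from the data.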

Anaphory commented 5 years ago

> random aside, do our config files support comments?

They do. I use them, so I could even just check one of my configs to see whether they appear inside the XML file; they don't.

> Well, let's get the important things out of the way first: I'm a vi(m) user and always have been!

Oh, I thought you were dabbling with Java development in Emacs when we were both in Auckland, which led me to assume you might use Emacs.

My own aside:

For file size optimizations, we might put the FilteredAlignments somewhere else and then wrap the tree likelihoods in a <plate> at some point.

Anyway,

> Then again, by the same argument, the decrease in line length is going to be pretty negligible as well.

True. It's largely an irrational thing, I guess: overly long XML attribute values feel weird, whereas overly long lines outside attributes feel reasonable and like something that might be allowed to be split across multiple lines if necessary (thus addressing the issue). Multiline attribute values just look weird.

> I must admit to being seriously intrigued by the idea of putting spaces between features.

Yes, I was quite happy to notice whitespace was ignored so I could suggest this.