Better handling of clusters with fewer than 3 seqs

Right now there's a step early in the processing where we read through the input files and filter out those with fewer than 3 sequences (actually 2, prior to the addition of the naive sequence in the process_partis.py step), since we can't make trees out of 2 seqs. However, this doesn't take into account that the process_partis.py script removes sequences it suspects of having serious frameshift mutations (which muck up the alignments and are generally suspected to be sequencing errors). So we still see some clusters getting through with only 2 seqs in them, and this has thwarted my efforts to get Laura's data running.

For now, I have a workaround, which is to simply let tree construction and subsequent processing fail (see 38cd9a6) and ignore these outputs in the final metadata.json aggregation step. I don't particularly like this solution, because it makes it difficult to catch more serious errors that might crop up. It also leaves us having to Wait on the target to complete (to avoid race conditions in the non-degenerate case), which takes a while to release control, since obviously these targets will never get built. However, SCons doesn't give us a lot of flexibility over the flow control here; everything has to be static. So the only way forward is to either beef up our processing/analysis of the sequence files at the point of the initial filtering of the input data (the cleanest route), or wrap our ancestral state reconstruction and downstream stuff in something that explicitly checks for these issues and leaves tombstones without erroring (a little more work, and it feels a little dirtier).
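To make the two options concrete, here's a rough sketch of what the check could look like. This is not the current cft code: `read_fasta`, `usable_seqs`, `build_or_tombstone`, the tombstone JSON layout, and especially the `is_suspect` predicate are all hypothetical names invented for illustration, and the real frameshift detection lives in process_partis.py and may look nothing like this. `usable_seqs` is the piece that could move into the initial filtering; `build_or_tombstone` is the wrapper variant.

```python
import json

MIN_SEQS = 3  # threshold from this issue: 2 observed seqs plus the naive sequence


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def usable_seqs(records, is_suspect):
    """Keep only records that the (hypothetical) is_suspect predicate does not flag."""
    return [(name, seq) for name, seq in records if not is_suspect(seq)]


def build_or_tombstone(records, is_suspect, build, tombstone_path):
    """Call build(kept) if enough sequences survive filtering.

    Otherwise write a small JSON tombstone to tombstone_path and return None,
    so the step finishes without erroring and downstream code can skip it.
    """
    kept = usable_seqs(records, is_suspect)
    if len(kept) < MIN_SEQS:
        with open(tombstone_path, "w") as fh:
            json.dump(
                {"skipped": True, "reason": "too few sequences", "n_seqs": len(kept)},
                fh,
            )
        return None
    return build(kept)
```

If something like this were adopted, the metadata.json aggregation step could look for the tombstone explicitly instead of ignoring every failure, which would let real errors surface again.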

This isn't high priority right now, given that ignoring errors does keep us moving. But it's something we should resolve sooner rather than later, before it bites someone.

matsengrp / cft

Better handling of clusters with fewer than 3 seqs #154