Closed metasoarous closed 7 years ago
This is now more or less resolved. The process_partis.py script (where we do this filtering) is now set up to not remove "bad" sequences if doing so would put the resulting cluster size below 3 (including seed and naive). This resolves all of the technical concerns mentioned above. It will only affect a small handful of very small clusters, and should be more or less obvious when looking at the alignments.
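For anyone reading along, the guard described above can be sketched roughly as follows. This is not the actual code in process_partis.py: the function names, the (name, sequence) tuple layout, the all-or-nothing fallback, and the length-based frameshift check are all assumptions made for illustration.

```python
MIN_CLUSTER_SIZE = 3  # threshold from the comment above; counts seed and naive


def looks_frameshifted(seq):
    """Hypothetical stand-in for the real "bad sequence" check.

    Assumes a sequence is suspect when its ungapped length is not a
    multiple of 3. The real script may use a different criterion.
    """
    return len(seq.replace("-", "")) % 3 != 0


def filter_bad_seqs(seqs, is_bad=looks_frameshifted,
                    protected=("seed", "naive"), min_size=MIN_CLUSTER_SIZE):
    """Drop "bad" sequences unless that would shrink the cluster below min_size.

    seqs is a list of (name, sequence) tuples. Sequences named in
    `protected` are never dropped (an assumption of this sketch).
    """
    kept = [(name, seq) for name, seq in seqs
            if name in protected or not is_bad(seq)]
    if len(kept) < min_size:
        # Too few sequences would remain to build a tree, so keep everything.
        return list(seqs)
    return kept
```

With this shape, a cluster of naive, seed, and one suspect sequence comes back unchanged, while a larger cluster loses only its suspect members.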
Does this sound ok with you @lauranoges? I'm going to close for now, but feel free to reopen if you have concerns.
Right now there's a step early on in the processing where we read through input files and filter out those with fewer than 3 sequences (actually 2, prior to addition of the naive sequence in the process_partis.py step), since we can't make trees out of 2 seqs. However, this doesn't take into account that the process_partis.py script removes sequences it suspects of serious frameshift mutations (which muck up the alignments and are generally suspected of being sequencing errors, etc.). So we do still see some clusters getting through with only 2 seqs in them, and this has thwarted my efforts to get Laura's data running.

For now, I have a solution to this, which is to simply let tree construction and subsequent processing fail (see 38cd9a6), and ignore these outputs in the final metadata.json aggregation step. I don't particularly like this solution, because it makes it difficult to catch more serious errors that might crop up. It also leaves us having to Wait on the target to complete (to avoid race conditions in the non-degenerate case), which takes a while to release control since obviously these targets will never get built. However, SCons doesn't give us a lot of flexibility over the flow control here; everything has to be static. So the only way to go here is to either beef up our processing/analysis of the sequence files at the point of the initial filtering of the input data (the cleanest route), or wrap our ancestral state reconstruction and downstream stuff in something that explicitly checks for these issues and leaves tombstones without erroring (a little more work, and feels a little dirtier).

This isn't high priority at the moment, given that ignoring errors does keep us moving for now. But it's something we should try to resolve sooner or later before it bites someone.
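To make the second option (the tombstone wrapper) concrete, here is a minimal sketch. Everything in it is hypothetical: the SKIPPED.json filename, the FASTA input assumption, and the build_tree callable standing in for the ancestral state reconstruction step are illustrative only and not taken from the pipeline.

```python
import json
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def build_or_tombstone(fasta_path, out_dir, build_tree, min_seqs=3):
    """Run build_tree only if the cluster is large enough.

    Otherwise write a small tombstone file so the target still "builds"
    and the aggregation step can skip it explicitly. Returns True if the
    tree was built, False if a tombstone was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    n_seqs = count_fasta_records(fasta_path)
    if n_seqs < min_seqs:
        # Hypothetical tombstone format: records why the cluster was skipped.
        tombstone = out_dir / "SKIPPED.json"
        tombstone.write_text(json.dumps({
            "reason": "too_few_sequences",
            "n_seqs": n_seqs,
            "min_seqs": min_seqs,
        }))
        return False
    build_tree(fasta_path, out_dir)
    return True
```

The aggregation step could then look for the tombstone file rather than ignoring failures wholesale, so genuine errors would still surface as errors.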