matsengrp / cft

Clonal family tree

Problem with dnaml sequence name trimming reversal #192

Closed by metasoarous 7 years ago

metasoarous commented 7 years ago

It appears that some of our sequence names are not getting corrected in the process_asr.py script. Because dnaml and dnapars spit out Phylip-formatted data, sequence names are truncated to at most 10 characters. The process_asr.py script is supposed to reverse this truncation, but it seems to be missing some cases. In particular, the following sequence name (from laura-mb BF520.1-igh, with minadcl) should have a number after the dash:

image

This is causing various sequences to end up without any duplicity or timepoint information in the output trees, and thus, they can't be clicked on, which has stymied some debugging efforts on other issues. See below:

selection_114

It's a little hard to say right now why some sequences appear to be getting corrected while others aren't. I'll have to dig in a bit further to figure out why.
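
For reference, here is a minimal sketch of the kind of name restoration described above. It is not the actual process_asr.py logic. It assumes the full names are still available from the original input, and that a truncated name can be mapped back only when exactly one full name shares its 10-character prefix. The function names and example sequence names are hypothetical.

```python
PHYLIP_NAME_WIDTH = 10  # strict Phylip output keeps at most 10 characters per name


def build_name_map(full_names, width=PHYLIP_NAME_WIDTH):
    """Group the full names by the truncated form Phylip would emit."""
    by_short = {}
    for name in full_names:
        by_short.setdefault(name[:width], []).append(name)
    return by_short


def restore_name(short_name, by_short):
    """Return the unique full name for a truncated one.

    Raises KeyError when the truncated name is unknown or ambiguous,
    rather than silently leaving it uncorrected.
    """
    # Phylip pads short names with spaces, so strip before looking up.
    candidates = by_short.get(short_name.strip(), [])
    if len(candidates) != 1:
        raise KeyError(
            "cannot restore %r: %d candidate full names"
            % (short_name, len(candidates))
        )
    return candidates[0]


# Hypothetical names, for illustration only.
name_map = build_name_map(["sample-seq-17", "naive"])
restored = restore_name("sample-seq", name_map)
```

Failing loudly on a missing or ambiguous name is a design choice here: it would surface uncorrected names at processing time rather than later in the output trees.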

metasoarous commented 7 years ago

I believe I've figured out what's going on here. This was actually a subtle bug in the SConstruct. The command to build the ASR doesn't depend directly on the input sequence file, but on an input config file containing a line that points to the sequence file. As a result, SCons didn't know that it needed to update the ASR when the sequences changed: the config file would get regenerated, but since its contents would end up the same, SCons would see everything downstream as a no-op. I've added a manual dependency on the input sequence file, so this shouldn't be a problem any longer.
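
To illustrate the failure mode, here is a toy model of a content-based up-to-date check. It is not SCons's actual implementation, and the file names and contents are made up. It only shows why a change to the sequence file goes unnoticed when the config file is the sole declared dependency.

```python
import hashlib


def signature(declared_deps, contents):
    """Hash the names and contents of the declared dependencies only."""
    digest = hashlib.sha256()
    for path in declared_deps:
        digest.update(path.encode())
        digest.update(contents[path].encode())
    return digest.hexdigest()


def needs_rebuild(declared_deps, old_contents, new_contents):
    """A target is rebuilt only if its declared dependencies changed."""
    old_sig = signature(declared_deps, old_contents)
    new_sig = signature(declared_deps, new_contents)
    return old_sig != new_sig


# Hypothetical files: the config only *names* the sequence file,
# so its own contents stay identical when the sequences change.
before = {"dnaml.cfg": "seqs.phy\n", "seqs.phy": "old alignment"}
after = {"dnaml.cfg": "seqs.phy\n", "seqs.phy": "new alignment"}

config_only = needs_rebuild(["dnaml.cfg"], before, after)
with_seqs = needs_rebuild(["dnaml.cfg", "seqs.phy"], before, after)
```

In this model `config_only` is False (the stale ASR is kept) and `with_seqs` is True. In an SConstruct, an explicit dependency of this kind is typically declared with `Depends(target, dependency)`, though the exact form of the fix isn't shown in this thread.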

metasoarous commented 7 years ago

To be clear, this manual dependency appears to have been present in an earlier version of the SConstruct. It seems that it somehow got taken out at some point, presumably when we were mucking around with different tree/ASR methods.