This PR intends to solve #113, which Duncan also has had to deal with a bit lately. Sometimes when dnapars finds thousands of trees, it just stops writing the outfile randomly, and although gctree will usually parse the outfile, the last tree is badly formed in a way that causes a confusing error later on in the inference process.
Adds a TreeParseError class in phylip_parse.py
Adds checks in parse_outfile that
all parsed node sequences have the same length. This handles the situation where truncation occurs in the middle of a line of the sequences section of the outfile.
all parsed node sequences' lengths match the sequence lengths of the first parsed tree. This handles the situation where truncation occurs between two phylip-format blocks in the sequence section of the outfile, in which case the parsed sequences will be incorrect, but all of the same length. Obviously this doesn't handle the case where truncation occurs in the first tree, but that seems very unlikely.
the number of parsed trees matches the number reported to have been discovered by dnapars at the beginning of the outfile. If they do not match, we throw a warning but continue.
If parsing of a tree fails for any of the reasons above, parse_outfile does not attempt to parse any more trees from the outfile, and returns only the successfully parsed trees, but will always throw a warning if this happens.
These checks should be low-cost, but not totally free. Once per parsed tree, we iterate through the parsed sequence dictionary and accumulate the lengths as a set. Since this set should only have one element, this is about the same as checking that each sequence length is equal to the first. There are slightly faster ways of checking, but I don't think this should be a significant cost.
I tested these changes on a variety of truncated outfiles. I'm not certain they'll handle every case, but I think they should handle any case that I've seen happen. I didn't think it was necessary to commit additional tests, since this is handling a rare edge case.
Changes to CI tests:
It seems that the CI runner macos-latest has recently become Apple Silicon by default, and the phylip package on Bioconda is not available for that architecture. To run tests we don't need phylip, and that was the only reason we were running tests in a Conda environment at all, so I switched to a Pip install, and skip installing phylip altogether.
This PR intends to solve #113, which Duncan also has had to deal with a bit lately. Sometimes when dnapars finds thousands of trees, it just stops writing the outfile randomly, and although gctree will usually parse the outfile, the last tree is badly formed in a way that causes a confusing error later on in the inference process.
TreeParseError
class inphylip_parse.py
parse_outfile
thatparse_outfile
does not attempt to parse any more trees from the outfile, and returns only the successfully parsed trees, but will always throw a warning if this happens.These checks should be low-cost, but not totally free. Once per parsed tree, we iterate through the parsed sequence dictionary and accumulate the lengths as a set. Since this set should only have one element, this is about the same as checking that each sequence length is equal to the first. There are slightly faster ways of checking, but I don't think this should be a significant cost.
I tested these changes on a variety of truncated outfiles. I'm not certain they'll handle every case, but I think they should handle any case that I've seen happen. I didn't think it was necessary to commit additional tests, since this is handling a rare edge case.
Changes to CI tests:
It seems that the CI runner
macos-latest
has recently become Apple Silicon by default, and the phylip package on Bioconda is not available for that architecture. To run tests we don't need phylip, and that was the only reason we were running tests in a Conda environment at all, so I switched to a Pip install, and skip installing phylip altogether.