matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Handle truncated dnapars `outfile`s correctly #129

Closed willdumm closed 3 months ago

willdumm commented 3 months ago

This PR is intended to fix #113, which Duncan has also run into lately. Sometimes, when dnapars finds thousands of trees, it stops writing the outfile partway through. gctree will usually still parse the truncated outfile, but the last tree is malformed in a way that causes a confusing error later in the inference process.

These checks should be low-cost, but they are not free. Once per parsed tree, we iterate through the parsed sequence dictionary and collect the sequence lengths in a set. Since this set should contain exactly one element, this costs about the same as checking that each sequence length equals the first. There are slightly faster ways to check, but I don't expect this to be a significant cost.
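As a rough illustration of the length check described above (not the code in this PR), here is a minimal sketch. The function names `has_uniform_length` and `check_parsed_tree`, and the assumption that a parsed tree is a plain dict mapping sequence names to sequence strings, are hypothetical and chosen only for the example.

```python
def has_uniform_length(sequences):
    """Return True if every sequence in the dict has the same length.

    `sequences` is assumed (for this sketch) to map sequence names to
    sequence strings, one dict per tree parsed from the outfile.
    """
    # Collect the distinct sequence lengths; a well-formed tree yields one.
    lengths = {len(seq) for seq in sequences.values()}
    # An empty dict gives an empty set, which is also treated as malformed.
    return len(lengths) == 1


def check_parsed_tree(sequences, tree_index):
    """Raise a descriptive error early instead of failing later on."""
    if not has_uniform_length(sequences):
        raise ValueError(
            f"tree {tree_index} has sequences of differing lengths; "
            "the dnapars outfile may be truncated"
        )


# Hypothetical data: a complete tree and one cut off mid-sequence.
complete = {"naive": "ACGG", "1": "ACGT", "2": "ACGA"}
truncated = {"naive": "ACGG", "1": "ACGT", "2": "AC"}

check_parsed_tree(complete, tree_index=0)  # passes silently
try:
    check_parsed_tree(truncated, tree_index=1)
except ValueError as err:
    print(err)
```

Building the set touches each sequence once, so the cost is linear in the number of sequences per tree, in line with the estimate above.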

I tested these changes on a variety of truncated outfiles. I'm not certain they'll handle every case, but they should handle every case I've seen so far. I didn't think it was necessary to commit additional tests, since this handles a rare edge case.

Changes to CI tests:

It seems that the macos-latest CI runner recently switched to Apple Silicon by default, and the phylip package on Bioconda is not available for that architecture. We don't need phylip to run the tests, and it was the only reason we were running tests in a Conda environment at all, so I switched to a pip install and skip installing phylip altogether.