Closed bricoletc closed 3 years ago
Hi @mbhall88 @rmcolq, I think this should be merged to master. I can confirm that it worked for my gramtools paper analysis: it produced PRGs which, when genotyped with gramtools, gave good concordance with PacBio truth assemblies on two different datasets (TB and P. falciparum).
This PR has two main goals:
Added
`IntervalPartitioner` object and associated classes, in interval_partitioner.py. This object replaces the `partition_alignment_into_intervals` function, which was harder to understand, and makes the code easier to modify, both to address #17 and in future. Also added more unit tests for interval partitioning.
Before this PR, the result is:
AATAATAAT 5 AAATTTTGTATAAACT 6 AAATTTTATATAAACT 6 5 TTACCCTAG
(see the empty third allele)
After this PR, the result is:
AATAATAA 5 TAAATTTTGTATAAACT 6 TAAATTTTATATAAACT 6 T 5 TTACCCTAG
(the T got prepended) (max_nesting 5, min_match_len 7)
This does not mean #17 is fully resolved; I'll now run on real datasets.
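For readers unfamiliar with interval partitioning, here is a minimal sketch of the general idea, not the code in this PR. It assumes a per-column consensus string in which `*` marks a non-consensus column, and treats a run of consensus columns as a match interval only if it is at least `min_match_len` long. The names `Interval`, `IntervalType` and `partition_consensus` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import groupby
from typing import List


class IntervalType(Enum):
    MATCH = "match"
    NON_MATCH = "non_match"


@dataclass
class Interval:
    type: IntervalType
    start: int  # inclusive, 0-based alignment column
    stop: int   # inclusive, 0-based alignment column


def partition_consensus(consensus: str, min_match_len: int) -> List[Interval]:
    """Split a consensus string into match and non-match intervals.

    Assumption: '*' marks a column where the sequences disagree.
    """
    # Step 1: label every column. A run of consensus columns only counts
    # as a match if it is at least min_match_len columns long.
    labels: List[bool] = []
    for is_star, run in groupby(consensus, key=lambda char: char == "*"):
        length = len(list(run))
        is_match = (not is_star) and length >= min_match_len
        labels.extend([is_match] * length)

    # Step 2: merge adjacent columns carrying the same label into intervals,
    # so short consensus runs are absorbed into their non-match neighbours.
    intervals: List[Interval] = []
    start = 0
    for is_match, run in groupby(labels):
        length = len(list(run))
        interval_type = IntervalType.MATCH if is_match else IntervalType.NON_MATCH
        intervals.append(Interval(interval_type, start, start + length - 1))
        start += length
    return intervals
```

With `min_match_len=7`, the consensus `ACGTACGT*AC*TTTTTTT` gives a match interval over columns 0-7, a single non-match interval over columns 8-11 (the two-column run `AC` is too short to be a match), and a match interval over columns 12-18.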
Modified
An alignment column of all '-' is no longer counted as a match. I cannot find a good reason for it to be one, but I may be missing something?
Moved subcommand parsing to the subcommands/ dir, which results in a less cluttered main()
get_consensus function is now shorter and easier to understand; added unit tests for it
Refactored interval partition checks (checking that each position of the alignment is in one and only one interval, and switching non-match to match intervals where needed) into their own functions and unit tested them.
`-v` for verbose switch added at top-level command line
Overall I strove to make the main module `make_prg_from_msa` smaller and terser, for easier reading/modifying later: it went from 499 to 385 lines.
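Two of the changes listed above can be illustrated with a short sketch. This is not the PR's code: the function names, signatures, and the use of `*` for non-consensus columns are assumptions made for the example. The first function builds a consensus in which an all-gap column is not a match; the second checks that every alignment position falls in one and only one interval.

```python
from typing import List, Tuple


def consensus_of(alignment: List[str]) -> str:
    """Per-column consensus of equal-length aligned sequences.

    Assumption: a column is a match only when every sequence has the
    same non-gap character; anything else, including an all-'-' column,
    is written as '*'.
    """
    consensus = []
    for column in zip(*alignment):
        characters = set(column)
        if len(characters) == 1 and characters != {"-"}:
            consensus.append(column[0])
        else:
            consensus.append("*")
    return "".join(consensus)


def check_each_position_covered_once(
    intervals: List[Tuple[int, int]], alignment_length: int
) -> None:
    """Raise ValueError unless each position is in exactly one interval.

    Intervals are (start, stop) pairs with both ends inclusive and are
    assumed to lie within the alignment.
    """
    counts = [0] * alignment_length
    for start, stop in intervals:
        for position in range(start, stop + 1):
            counts[position] += 1
    bad_positions = [pos for pos, count in enumerate(counts) if count != 1]
    if bad_positions:
        raise ValueError(f"Positions not covered exactly once: {bad_positions}")
```

For the alignment `["AC-T", "AC-T", "AG-T"]`, `consensus_of` returns `A**T`: the second column disagrees and the third is all gaps, so neither is a match.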