This PR has two main goals:
Continue the refactoring effort, this time on the clustering code: it now lives in its own source files and has many more tests.
Address part of #15 by reducing ambiguous paths (distinct paths in the PRG that spell the same sequence). In empirical tests on TB and Pf datasets, this reduces the problem significantly and improves downstream genotyping in gramtools.
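To make "ambiguous paths" concrete, here is a minimal sketch on a toy model. It assumes a PRG can be simplified to a flat list of sites, each holding alternative alleles; this is not make_prg's actual data structure, and `ambiguous_sequences` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import product


def ambiguous_sequences(sites):
    """Return {sequence: [paths]} for sequences spelled by more than one path.

    `sites` is a toy PRG: a list of sites, each a list of alternative alleles.
    A path is a tuple holding one allele index per site.
    """
    paths_by_sequence = defaultdict(list)
    for path in product(*(range(len(alleles)) for alleles in sites)):
        sequence = "".join(sites[i][allele] for i, allele in enumerate(path))
        paths_by_sequence[sequence].append(path)
    # Keep only sequences that at least two distinct paths produce.
    return {seq: paths for seq, paths in paths_by_sequence.items() if len(paths) > 1}


# "A" + "CT" and "AC" + "T" both spell "ACT", so two paths are ambiguous.
toy_prg = [["A", "AC"], ["CT", "T"]]
ambiguous = ambiguous_sequences(toy_prg)
```

A read matching "ACT" cannot be assigned to one path over the other, which is why fewer such paths should help genotyping.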
One important point: I've renamed the command prg_from_msa to from_msa, so the call is now make_prg from_msa. This may be annoying, because scripts that use make_prg will need to change if we go ahead with this. I find the new name more descriptive and easier to type, but that has to be weighed against changing calls in existing scripts and pipelines.
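If the rename goes ahead, updating existing scripts could be mechanical. The following is only a sketch: `migrate_calls` is a hypothetical helper, not part of make_prg, and it only handles calls where the old subcommand directly follows make_prg.

```python
import re

# Matches the old subcommand only when it directly follows `make_prg`.
OLD_CALL = re.compile(r"\bmake_prg(\s+)prg_from_msa\b")


def migrate_calls(script_text):
    """Rewrite `make_prg prg_from_msa` to `make_prg from_msa`, keeping spacing."""
    return OLD_CALL.sub(r"make_prg\1from_msa", script_text)
```

Running it on already-migrated text leaves the text unchanged, so it is safe to apply more than once.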
Added
Unit tests for the refactored kmeans clustering functions.
Tests for clustering cases that previously failed: very similar sequences being split into clusters, and sequences forming three clusters being grouped into two.
New kmer clustering algorithm based on a 'one-ref' principle, replacing the inertia-halving criterion. The idea: if a group of sequences is similar to its majority sequence, it does not need to be clustered. Similarity is defined via a heuristic on percent identity. This fixes the failing cases above and does not break existing (e.g. integration) tests.
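The 'one-ref' idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: it takes the majority sequence to be the most frequent one, compares equal-length aligned sequences column by column, and uses a made-up default threshold (`min_identity=80.0`). The function names are hypothetical.

```python
from collections import Counter


def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 100.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def is_one_ref_like(sequences, min_identity=80.0):
    """True if every sequence is close enough to the group's majority sequence.

    `min_identity` is a hypothetical threshold, not the PR's actual heuristic.
    """
    if not sequences:
        return True  # nothing to cluster
    # Assumption: the majority sequence is the most frequent sequence.
    majority, _ = Counter(sequences).most_common(1)[0]
    return all(percent_identity(seq, majority) >= min_identity for seq in sequences)
```

A group for which this returns True would be left as a single cluster; only groups that fail the check would go on to kmeans clustering.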
Modified
Refactoring of kmeans clustering: clustering code moved out of the main AlignedSeq object and into its own source file.
Kmeans clustering split into more functions, e.g. for counting distinct kmers, producing the kmer count matrix, extracting clusters, and merging clusters.
PRG-from-MSA source files placed in their own from_msa directory.
Refactoring of the AlignedSeq object: it is now called PrgBuilder, and its source file is renamed to match. That it builds from an MSA is clear from its location in the from_msa directory.
Renamed the prg_from_msa subcommand to from_msa for less redundancy: it is now called as make_prg from_msa.
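As an illustration of two of the split-out steps (counting distinct kmers and producing the kmer count matrix), here is a minimal standard-library sketch. The function names and return shape are assumptions for illustration and may not match the refactored code; it also assumes alignment gaps have already been removed from the sequences.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_count_matrix(sequences, k):
    """Return (kmers, matrix): sorted distinct k-mers and per-sequence counts.

    The matrix has one row per sequence and one column per distinct k-mer.
    """
    counts = [count_kmers(seq, k) for seq in sequences]
    kmers = sorted(set().union(*counts))
    matrix = [[c[kmer] for kmer in kmers] for c in counts]
    return kmers, matrix


kmers, matrix = kmer_count_matrix(["ACAC", "ACGT"], 2)
```

Each row of the matrix is the feature vector that a kmeans step would cluster on.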
Test coverage
Goes from [figure not captured] in current master to [figure not captured] in this branch.