The filter module currently serves three purposes: merging interesting k-mer annotations from multiple files (such as when kevlar is run in banding mode), recomputing interesting k-mer abundances, and discarding masked k-mers and k-mers whose corrected abundances no longer meet required thresholds.
The first purpose can and should be separated from the other two purposes. The current implementation of kevlar filter can consume hundreds of gigabytes of memory, negating recent performance improvements based on masking at counting time and the use of error correction.
I suggest the following updates.
- a new `kevlar meld` command (or `merge`, `join`, or `unband`) that will take N augfastq files (with some reads present in multiple files) and create a single file with no duplicated reads and all ikmer annotations aggregated
  - suggested implementation: write reads to M temporary files, and determine which file a read goes to based on `hash(readname) % M`; then load each file into memory and resolve duplicate reads that way
- reimplement the `kevlar filter` module in two passes: first pass to populate the count table, second pass to recompute k-mer abundances and discard k-mers/reads.
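To make the bucketing idea for the proposed meld command concrete, here is a rough sketch. It is not kevlar code: the function name `meld`, the `(readname, sequence, ikmers)` tuples standing in for parsed augfastq records, the JSON-lines temporary files, and the use of `zlib.crc32` as the hash are all assumptions made for illustration.

```python
import json
import os
import tempfile
import zlib


def meld(record_streams, num_buckets=4):
    """Merge reads from several inputs, aggregating annotations per read.

    Each record is a (readname, sequence, ikmers) tuple. This is a
    hypothetical stand-in for a parsed augfastq record, not kevlar's
    actual data model.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [
            os.path.join(tmpdir, f"bucket{i}.jsonl")
            for i in range(num_buckets)
        ]

        # Step 1: scatter every record to a bucket chosen by
        # hash(readname) % M, so that all copies of a read land in
        # the same temporary file.
        handles = [open(path, "w") for path in paths]
        try:
            for stream in record_streams:
                for name, seq, ikmers in stream:
                    bucket = zlib.crc32(name.encode()) % num_buckets
                    record = [name, seq, sorted(ikmers)]
                    handles[bucket].write(json.dumps(record) + "\n")
        finally:
            for handle in handles:
                handle.close()

        # Step 2: load one bucket at a time and resolve duplicates,
        # so peak memory is roughly one bucket rather than all reads.
        for path in paths:
            merged = {}
            with open(path) as fh:
                for line in fh:
                    name, seq, ikmers = json.loads(line)
                    entry = merged.setdefault(name, (seq, set()))
                    entry[1].update(ikmers)
            for name in sorted(merged):
                seq, ikmers = merged[name]
                yield name, seq, sorted(ikmers)
```

With M buckets, peak memory should be on the order of 1/M of the total input, assuming the hash spreads read names evenly. A checksum such as `crc32` is used here instead of Python's built-in `hash` because the latter is salted per process for strings.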
I guess it's time to abandon the streaming interface for some of kevlar's decidedly non-streaming algorithms.
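For the two-pass filter, something along these lines is what I have in mind. This is only a sketch under assumptions: a plain `Counter` stands in for the count table, `open_reads` is a hypothetical callable that re-opens the input for each pass, ikmer annotations are reduced to bare k-mer strings, and `min_abund` and `mask` are illustrative parameter names rather than existing kevlar options.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def two_pass_filter(open_reads, k, min_abund, mask=frozenset()):
    """Recompute ikmer abundances and drop k-mers/reads that fail.

    open_reads() must return a fresh iterator of
    (readname, sequence, ikmers) tuples each time it is called,
    because the input is traversed twice.
    """
    # Pass 1: populate the count table from all reads.
    counts = Counter()
    for _name, seq, _ikmers in open_reads():
        counts.update(kmers(seq, k))

    # Pass 2: recompute each annotated k-mer's abundance, discard
    # masked k-mers and those below threshold, and discard reads
    # left with no annotations.
    for name, seq, ikmers in open_reads():
        kept = {
            kmer: counts[kmer]
            for kmer in ikmers
            if kmer not in mask and counts[kmer] >= min_abund
        }
        if kept:
            yield name, seq, kept
```

Reading the input twice costs extra I/O, but only the count table has to stay resident between passes; the reads themselves never need to be held in memory all at once.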