Human reads are actually outputted by nohuman

cpauvert commented 4 months ago

Hi @mbhall88, thanks for developing this tool; the approach is indeed very fast! However, when trying it out on bacterial genome assembly data (Nanopore and Illumina), I was puzzled to see that nohuman was throwing the baby out with the bathwater. So I investigated using the manual kraken2 approach described in https://github.com/mbhall88/classification_benchmark and got:

277081 sequences (1116.92 Mbp) processed in 16.112s (1031.8 Kseq/m, 4159.33 Mbp/m).
  2814 sequences classified (1.02%).
  274267 sequences unclassified (98.98%)

But the output file from nohuman contained 2814 sequences, i.e. the classified (human) reads rather than the unclassified ones. I traced this to a typo at: https://github.com/mbhall88/nohuman/blob/76a845663e22454c7a05815c1dda7939a2249f48/src/main.rs#L153

I submitted a PR so that the unclassified reads are the ones written out, but I'm not familiar with Rust, so I could not recompile or test it properly. Let me know if there is anything I can try out.

Best, Charlie

mbhall88 commented 4 months ago

Oh wow! This is embarrassing!! Amazing how two characters can make this a completely different program. Thank you so much for detecting this. It will be fixed in v0.1.1.
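
For readers who don't know kraken2: it has two options for writing reads to a file, `--classified-out` and `--unclassified-out`, which differ only by the `un` prefix. Assuming the database contains only human sequences, classified reads are the human ones and unclassified reads are everything else. The following is a minimal sketch of that distinction, not the actual nohuman source; the `Keep` enum and the `kraken_output_flag` and `kraken_args` helpers are hypothetical names used only for illustration.

```rust
/// Which set of reads to write out (hypothetical type, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Keep {
    /// Reads that matched the (assumed human-only) database.
    Human,
    /// Reads that did not match the database.
    NonHuman,
}

/// Return the kraken2 option that writes the requested set of reads.
fn kraken_output_flag(keep: Keep) -> &'static str {
    match keep {
        Keep::Human => "--classified-out",
        Keep::NonHuman => "--unclassified-out",
    }
}

/// Build an illustrative kraken2 argument list for single-end reads.
/// The database and file names passed in are placeholders.
fn kraken_args(db: &str, input: &str, output: &str, keep: Keep) -> Vec<String> {
    vec![
        "--db".to_string(),
        db.to_string(),
        kraken_output_flag(keep).to_string(),
        output.to_string(),
        input.to_string(),
    ]
}

fn main() {
    // A human-read remover wants the reads that did NOT match the database.
    let args = kraken_args("human_db", "reads.fq", "clean.fq", Keep::NonHuman);
    println!("kraken2 {}", args.join(" "));
}
```

Making the choice an explicit enum rather than a bare string literal is one way to make this kind of slip harder to miss in review, since the intent (`NonHuman`) sits next to the flag it maps to.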

cpauvert commented 4 months ago

You're welcome @mbhall88! Typos happen, and having the code available helps so much in these situations! Thanks for the quick reaction!

mbhall88 commented 4 months ago

bioconda package should now be updated btw
