mbhall88 / nohuman

Remove human reads from a sequencing run
https://doi.org/10.1093/gigascience/giae010
MIT License

New features: optional compression of output and optional saving of kraken2 log (from stderr) #7

Closed. charlesfoster closed this pull request 1 month ago.

charlesfoster commented 2 months ago

Hi,

nohuman seems like a great, speedy tool for simplifying human read removal. I thought it might be useful to allow the output reads to be gzip compressed, to save users from having to do this step separately. I implemented a simple way to optionally compress the output reads using the gzp crate: compression happens either (a) by default, if the input reads have the '.gz' extension, or (b) when the specified output reads have the '.gz' extension. Compression occurs in parallel using the same number of threads specified for kraken2.
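Roughly, the compression path looks like this (a minimal sketch rather than the exact code in the PR; the function name and error handling are illustrative):

```rust
// Sketch: write records to a gzip-compressed file using gzp's parallel
// compressor, reusing the thread count given to kraken2.
// (Illustrative only; not the PR's exact code.)
use std::fs::File;
use std::io::{BufWriter, Write};

use gzp::deflate::Gzip;
use gzp::par::compress::{ParCompress, ParCompressBuilder};
use gzp::ZWriter;

fn write_gzipped(path: &str, records: &[u8], threads: usize) -> Result<(), Box<dyn std::error::Error>> {
    let file = BufWriter::new(File::create(path)?);
    // ParCompress splits the stream into blocks and compresses them on a thread pool
    let mut writer: ParCompress<Gzip> = ParCompressBuilder::new()
        .num_threads(threads)?
        .from_writer(file);
    writer.write_all(records)?;
    writer.finish()?; // flush the remaining blocks and write the gzip trailer
    Ok(())
}
```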

Additionally, sometimes it's helpful to see the output log from kraken2, e.g.:

Loading database information... done.
255485 sequences (65.34 Mbp) processed in 0.875s (17523.7 Kseq/m, 4481.81 Mbp/m).
  4948 sequences classified (1.94%)
  250537 sequences unclassified (98.06%)

Accordingly, I added an option to the command line args (-l / --kraken2-log). When a log file destination is specified with this arg, the kraken2 log is written to that file; otherwise, no kraken2 log is written.
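The plumbing for this is straightforward; something along these lines (a sketch, not the exact code in the PR):

```rust
// Sketch: run kraken2 and, if the user passed --kraken2-log, save its stderr
// to that file; otherwise the log is simply discarded.
// (Illustrative only; argument handling in the PR differs.)
use std::fs::File;
use std::io::Write;
use std::process::Command;

fn run_kraken2(args: &[&str], log_path: Option<&str>) -> std::io::Result<()> {
    let output = Command::new("kraken2").args(args).output()?;
    if let Some(path) = log_path {
        // kraken2 writes its progress and summary messages to stderr
        File::create(path)?.write_all(&output.stderr)?;
    }
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "kraken2 exited with a non-zero status",
        ));
    }
    Ok(())
}
```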

Other changes include some slight refactoring of the command line args using 'verbatim_doc_comment', so that multiline help text is properly indented when nohuman --help is run, as well as an update to the README.md to reflect these changes.
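For reference, the clap pattern is just the verbatim_doc_comment attribute on the relevant args; a minimal example (not the actual nohuman CLI definition):

```rust
// Minimal example of clap's verbatim_doc_comment: the doc comment's own line
// breaks and indentation are kept verbatim in `--help` output.
// (Illustrative only; not the actual nohuman argument struct.)
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Destination for the kraken2 log (captured from stderr).
    /// If not given, the kraken2 log is discarded.
    #[arg(short = 'l', long = "kraken2-log", verbatim_doc_comment)]
    kraken2_log: Option<std::path::PathBuf>,
}

fn main() {
    let args = Args::parse();
    println!("kraken2 log: {:?}", args.kraken2_log);
}
```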

Some tests run locally, showing that it's still speedy even with compression:

[2024-09-05T02:33:26Z INFO ] Running kraken2...
[2024-09-05T02:33:29Z INFO ] Kraken2 finished. Organising output...
[2024-09-05T02:33:30Z INFO ] Output files written to: "input_1.nohuman.fq.gz" and "input_2.nohuman.fq.gz"
[2024-09-05T02:33:30Z INFO ] Done.

./nohuman -t8 --db db input_1.fq.gz input_2.fq.gz  
11.89s user 2.99s system 367% cpu 4.043 total
time ./nohuman -t8 --db db input_1.fq.gz input_2.fq.gz -o input_1.nohuman.fq -O input_2.nohuman.fq
[2024-09-05T02:37:33Z INFO ] Running kraken2...
[2024-09-05T02:37:36Z INFO ] Kraken2 finished. Organising output...
[2024-09-05T02:37:36Z INFO ] Output files written to: "input_1.nohuman.fq" and "input_2.nohuman.fq"
[2024-09-05T02:37:36Z INFO ] Done.

./nohuman -t8 --db db input_1.fq.gz input_2.fq.gz -o input_1.nohuman.fq -O   
7.71s user 2.79s system 334% cpu 3.144 total

Hope these changes are useful.

Cheers, Charles

mbhall88 commented 2 months ago

Thanks for this @charlesfoster. This is a good idea.

However, I would like to support other compression types such as zstd too. You are welcome to add these things to the PR, or I will try and get around to it next week.

charlesfoster commented 2 months ago

Hi again @mbhall88, no worries. I've used this as a chance to get more into Rust. I've added support for different compression formats based on the extensions of the input and output files.

Input: any input files with a .gz, .bgz, or .bz2 extension can be consumed directly by kraken2. The others are read in using niffler.
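niffler sniffs the compression format from the file's magic bytes, so reading the remaining formats transparently only takes a couple of lines; roughly (a sketch, not the PR's exact code):

```rust
// Sketch: open a possibly-compressed FASTQ with niffler and count its records.
// (Illustrative only; the PR does not use this exact helper.)
use std::io::{BufRead, BufReader};

fn count_fastq_reads(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    // niffler detects the compression format from the magic bytes and hands back a plain reader
    let (reader, _format) = niffler::from_path(path)?;
    let reader = BufReader::new(reader);
    // FASTQ records are four lines each
    Ok(reader.lines().count() / 4)
}
```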

Output: output files are written out in different ways depending on the extension. I did it this way to allow parallel compression, which I couldn't seem to enable directly with niffler.

Other: I also added checks on the input/output names to prevent unwanted overwriting, etc. Additionally, the user can now get a JSON-format stats file.
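The stats file is just a small serialisable struct; its shape is along these lines (field names here are illustrative, not necessarily those the PR writes):

```rust
// Sketch of a JSON stats file written with serde/serde_json.
// (Field names are illustrative, not necessarily those used in the PR.)
use serde::Serialize;

#[derive(Serialize)]
struct Stats {
    total_reads: usize,
    human_reads: usize,
    human_percent: f64,
}

fn write_stats(path: &str, total: usize, human: usize) -> Result<(), Box<dyn std::error::Error>> {
    let stats = Stats {
        total_reads: total,
        human_reads: human,
        human_percent: 100.0 * human as f64 / total as f64,
    };
    std::fs::write(path, serde_json::to_string_pretty(&stats)?)?;
    Ok(())
}
```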

Cheers

mbhall88 commented 1 month ago

Hi @charlesfoster. Sorry for the radio silence.

I like the additions you suggested in this PR, though I wanted to tweak them: make some of the code a little more robust, reduce a couple of dependencies, and add another feature or two, such as a flag to keep, rather than remove, the human reads.

I am going to close this PR in favour of #8, though I have attributed a couple of commits on that PR to you to ensure you are added to the list of contributors for this repository.

Again, thank you very much for your contributions.

charlesfoster commented 1 month ago

No problems at all! I took it as an opportunity to learn more Rust, so I'm not surprised that some parts could have been implemented more robustly :wink:. I'm glad the contributions were useful nonetheless! Cheers.