Closed charlesfoster closed 1 month ago
Thanks for this @charlesfoster. This is a good idea.
However, I would like to support other compression types such as zstd too. You are welcome to add these things to the PR, or I will try and get around to it next week.
Hi again @mbhall88, no worries. I've used this as a chance to get more into Rust. I've added in support for different compression formats based on the extensions of input and output files.
Input
Any input files with .gz, .bgz, or .bz2 extensions can be consumed directly by kraken2. The others are read in using niffler.
Output
Output files are written out in different ways depending on the extension. I did it this way to allow parallel decompression, which I couldn't seem to enable directly with niffler.
Other
I also added checks for the input/output names to prevent unwanted overwriting, etc. Additionally, the user can now get a JSON-format stats file.
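For anyone following along, the extension-based dispatch described in this comment could look roughly like the sketch below. It is not the code from the PR: the `Compression` enum, the function names, and the inclusion of `.xz` and `.zst` are assumptions made for illustration, and it uses only the standard library rather than niffler.

```rust
use std::path::Path;

/// Compression formats recognised in this sketch. The set is illustrative
/// and not necessarily the exact list handled by the PR.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    None,
}

/// Guess the compression format from the final extension of `path`.
/// Matching is case-insensitive here, which is an assumption.
pub fn compression_from_path(path: &Path) -> Compression {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("gz") | Some("bgz") => Compression::Gzip,
        Some("bz2") => Compression::Bzip2,
        Some("xz") => Compression::Xz,
        Some("zst") => Compression::Zstd,
        _ => Compression::None,
    }
}

/// True for compressed inputs that, per the comment above, would be read
/// with niffler instead of being handed straight to kraken2.
pub fn read_via_niffler(path: &Path) -> bool {
    matches!(
        compression_from_path(path),
        Compression::Xz | Compression::Zstd
    )
}
```

Keying everything off one small enum keeps the input and output paths consistent: the same lookup can decide both how to open a file and which writer to build for it.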
Cheers
Hi @charlesfoster. Sorry for the radio silence.
I like the additions you suggested in this PR, though I wanted to tweak them, make some of it a little more robust, and reduce a couple of dependencies. I also wanted to add another feature or two, such as a flag to keep, instead of remove, human reads.
I am going to close this PR in favour of #8. Though I have attributed a couple of commits on that PR to you to ensure you are added to the list of contributors for this repository.
Again, thank you very much for your contributions.
No problems at all! I took it as an opportunity to learn more Rust, so I'm not surprised that some parts could have been implemented more robustly :wink: . I'm glad the contributions were useful nonetheless! Cheers.
Hi,
nohuman seems like a great speedy tool to simplify human read removal. I thought it might be useful to allow the output reads to be gzip compressed to save users from having to do this step separately. I implemented a simple way to optionally compress the output reads using the gzp crate, which occurs either (a) by default if the input reads have the '.gz' extension, or (b) when the specified output reads have the '.gz' extension. Compression occurs in parallel using the same number of threads specified for kraken2.
Additionally, sometimes it's helpful to see the output log from kraken2. Accordingly, I added an option to the command line args (-l/--kraken2-log). When a log file destination is specified using this arg, the kraken2 log will be written to that file; otherwise no logging will occur.
Other changes include some slight refactoring of the command line args using 'verbatim_doc_comment' so that multiline help text will be properly indented when nohuman --help is run, as well as updating the README.md to reflect these changes.
Some tests run locally show that it's still speedy even with compression.
Hope these changes are useful.
Cheers, Charles
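As a rough illustration of rules (a) and (b) from the description above, the decision to gzip the output might be sketched as follows. This is not the PR's actual code: the function names are hypothetical, and the sketch assumes that an explicitly named output file takes precedence over the input's extension, which the description does not spell out.

```rust
use std::path::Path;

/// True when the final extension of `path` is exactly "gz".
fn has_gz_extension(path: &Path) -> bool {
    path.extension().and_then(|e| e.to_str()) == Some("gz")
}

/// Decide whether the output reads should be gzip compressed.
///
/// `output` is `None` when the user did not name an output file.
/// Assumption: rule (b) applies whenever an output name is given, and
/// rule (a) is the fallback when it is not.
pub fn should_gzip_output(input: &Path, output: Option<&Path>) -> bool {
    match output {
        // (b) the specified output has the '.gz' extension
        Some(out) => has_gz_extension(out),
        // (a) by default, follow the input's extension
        None => has_gz_extension(input),
    }
}
```

With this reading, `should_gzip_output(Path::new("in.fq.gz"), None)` is true, while naming an uncompressed output such as `out.fq` switches compression off even for a gzipped input.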