najoshi / sickle

Windowed Adaptive Trimming for fastq files using quality
MIT License
217 stars 95 forks source link

[WIP] implement a CLI option to output singles to same file as interleaved output #28

Closed kdm9 closed 10 years ago

kdm9 commented 10 years ago

G'day Nik,

Firstly, thanks for an amazing tool, gets used all the time here.

However, in my (very) humble opinion, sickle is currently not so easy to use in the middle of a pipeline. If one wanted to pipe output to other commands, it would be more convenient to allow downstream commands to remove single reads, and not split the pipeline three ways at the sickle step. I've implemened a flag, -M or --output-combo-only, that allows interleaved input and output, without splitting out the bad quality single reads. They are instead replaced with a single N, which @vsbuffalo's pairs recognises as a failed read.

Basically, this patch allows one to do:

pairs join 1.fq.gz 2.fq.gz | scythe ... | sickle pe -c /dev/stdin -t sanger -M /dev/stdout |\
    seqqs -i -e -p post_sickle /dev/stdin |\
    pairs split -1 >(gzip >out1.fq.gz) -2 >(gzip >out2.fq.gz) -u >(gzip >singles.fq.gz)

Preliminary testing suggests that output after splitting the resulting interleaved fastq is identical to before this option was implemented. I've also valgrind'd it, finding no issues. I'll do some more rigorous testing over the next week, and comment & remove WIP from the title when I've finished. Release early and often, or so they say :smile:

Cheers, and thanks again for a cool tool! Kevin

najoshi commented 10 years ago

Hi Kevin,

Thanks for your input. I will have to consider your idea after talking to my colleagues, but I'm not sure how useful it would be. I am really trying to tamp down on option creep, but we'll see. :)

kdm9 commented 10 years ago

Thanks for your consideration, Nik!

If option creep is your prime concern, would changing the behavior of the -m/--output-combo option to the behavior i propose above be a better option? I'm unaware of any use cases that this would break, but that's certainly not to say there aren't any. Happy to re-implement this, and resubmit a PR, if you desire.

K

najoshi commented 10 years ago

Hi Kevin,

I talked it over with my colleagues and we decided that it would probably be a good option to put in. I really needed to overhaul the code anyways, so it was a good excuse. :) So, because of that I couldn't really merge your fork and I just wrote it myself. I made the code much more compact and I changed a bunch of things to make it better. If you would be willing to be a tester of this version, that would be much appreciated. If you have the time, please clone the repo and test it out.

On Fri, Jul 18, 2014 at 3:44 AM, Kevin Murray notifications@github.com wrote:

Thanks for your consideration, Nik!

If option creep is your prime concern, would changing the behavior of the -m/--output-combo option to the behavior i propose above be a better option? I'm unaware of any use cases that this would break, but that's certainly not to say there aren't any. Happy to re-implement this, and resubmit a PR, if you desire.

K

— Reply to this email directly or view it on GitHub https://github.com/najoshi/sickle/pull/28#issuecomment-49418089.

Nikhil Joshi Bioinformatics Analyst/Programmer UC Davis Bioinformatics Core http://bioinformatics.ucdavis.edu/ najoshi -at- ucdavis -dot- edu 530.752.2698 (w)

najoshi commented 10 years ago

Just so you know, I fixed a few bugs already, so in case you cloned the repo yesterday, you probably want to get the latest one. :)

On Sat, Jul 19, 2014 at 3:34 AM, Nikhil Joshi najoshi@ucdavis.edu wrote:

Hi Kevin,

I talked it over with my colleagues and we decided that it would probably be a good option to put in. I really needed to overhaul the code anyways, so it was a good excuse. :) So, because of that I couldn't really merge your fork and I just wrote it myself. I made the code much more compact and I changed a bunch of things to make it better. If you would be willing to be a tester of this version, that would be much appreciated. If you have the time, please clone the repo and test it out.

  • Nik.

On Fri, Jul 18, 2014 at 3:44 AM, Kevin Murray notifications@github.com wrote:

Thanks for your consideration, Nik!

If option creep is your prime concern, would changing the behavior of the -m/--output-combo option to the behavior i propose above be a better option? I'm unaware of any use cases that this would break, but that's certainly not to say there aren't any. Happy to re-implement this, and resubmit a PR, if you desire.

K

— Reply to this email directly or view it on GitHub https://github.com/najoshi/sickle/pull/28#issuecomment-49418089.

Nikhil Joshi Bioinformatics Analyst/Programmer UC Davis Bioinformatics Core http://bioinformatics.ucdavis.edu/ najoshi -at- ucdavis -dot- edu 530.752.2698 (w)

Nikhil Joshi Bioinformatics Analyst/Programmer UC Davis Bioinformatics Core http://bioinformatics.ucdavis.edu/ najoshi -at- ucdavis -dot- edu 530.752.2698 (w)

kdm9 commented 10 years ago

Cheers for the heads-up Nik, I did notice a couple of things.

The only one you haven't already fixed is that when a pair is discarded, both reads are still printed to the output file.

I'm not sure if this is a bad thing though. Maybe the section at Line 454 in trim_paired.c could be removed, so that the pair is discarded. This is guaranteed not to break pairing, right?

I'll leave that up to you, but I can see arguments both for and against removing the pairs entirely.

Thanks for your speed in implementing this, having this implemented has made life way easier!

Cheers, K