mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License
203 stars 17 forks source link

Support for paired-end reads? #15

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

Are there plans to support Illumina paired reads?

mbhall88 commented 4 years ago

HA, I had an internal bet with myself about whether this would be the first issue raised.

I am happy to support any feature that users are likely to want/need and that stays within the scope of the project. This definitely seems to tick both of those boxes.

I haven't actually worked much with Illumina data previously so I will list some of the thoughts I had around this and it would be great to have your (and anyone else's) thoughts:

Please also throw-in any other ideas you have.

tseemann commented 4 years ago

Ideally you would take 4 parameters like --in1 --in2 --out1 --out2 because not enough tools support interleaved format.

But for elegance i suspect you don't want that, then you could just take interleaved PE (easily created with <(seqtk mergepe R1.fq.gz R2.fq.gz). The output is usually demerged with two steps (unfortunately) using seqtk seq -1 out.fq | gzip > R1.fq.gz and then with -2 for R2. I need to find a tool that can write both from an input stream to two files.

mbhall88 commented 4 years ago

Ok. I will have a think about how much to achieve this then. I am heading on holidays for 5 weeks this weekend though so it is unlikely I will get around to it before then.

Regarding demerging in one step, pyfastaq can do this.

So I could envision something like this in the future

rasusa -i interleaved.fq -g 4g -c 20 | fastaq deinterleave - r1.fq r2.fq 
mbhall88 commented 4 years ago

Alright. I am finally revisiting this. Sorry for the massive delay. Nothing like COVID-19 to make you revisit all the stuff you should have done ages ago.

On reflection, the interleaved file method is not ideal. I think one of two ideas should work

  1. Allow --input/-i to be passed twice
  2. Switch --input to be a positional parameter instead and allow passing up to two files

In both instances, if two input files are given, it will be assumed it is illumina data.
One issue I do foresee with option 2 is that it will break any pipelines that may be using rasusa so I would have to also support --input as well. The more I type the more I like the idea of option 1.

How does that sound?

tseemann commented 4 years ago

@mbhall88 drowning in COVID work too, but this looks good. thank you