linsalrob / fastq-pair

Match up paired end fastq files quickly and efficiently.
https://edwards.flinders.edu.au/sorting-and-paring-fastq-files/
MIT License

Support gzip file #15

Open alienzj opened 4 years ago

alienzj commented 4 years ago

Hi, crAssphage man,

Any plans to support gzip input?

https://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files

Good tool, thanks ~

linsalrob commented 1 year ago

I haven't given up on this idea, but had few spare cycles to explore it. Now, however, I need this for another project and was prompted to come back and think about it some more.

Following on from the @lh3 thread, it seems that, unfortunately, the BGZF extension will not be great for short read fastq sequences (which is really what fastq-pair is designed to work with).

Specifically, this comment

samtools fqidx should only be used on fastq files with a small number of entries. Trying to use it on a file containing millions of short sequencing reads will produce an index that is almost as big as the original file, and searches using the index will be very slow and use a lot of memory.

(on the fqidx documentation) suggests that this will not be a great idea, but I have not (yet) tested it on any real-life data.

alienzj commented 1 year ago

Currently I use multiple tools to solve this problem; you can refer to https://github.com/ohmeta/metapi/blob/dev/metapi/wrappers/preprocess_raw.py#L77 if you want an urgent solution.

For the feature implementation of fastq-pair, I really want to provide a PR if I can.

galaxy001 commented 1 year ago

If the fastq files are in the same order, an alternative is to read a chunk from each input file into memory and pair the reads within those two chunks. Then append the two processed chunks to the output and continue.

If the fastq files are not sorted, we can still read a number of lines into a cache, write out the paired reads, clear them from the cache, and continue. Note that unpaired reads will stay in the cache until the end.
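
A minimal Python sketch of this cache-based idea, for illustration only. The function names (`read_fastq`, `pair_streaming`) are hypothetical and are not part of fastq-pair, and the sketch assumes four-line FASTQ records whose IDs differ only by an optional `/1` or `/2` suffix. Because it only reads sequentially, the handles could come from `gzip.open(path, "rt")`.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (read_id, record_text) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return
        if not header.strip():
            continue  # tolerate stray blank lines
        body = handle.readline() + handle.readline() + handle.readline()
        # ID = first token after '@', without a trailing /1 or /2 (assumed convention)
        read_id = header[1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, header + body


def pair_streaming(left, right, out_left, out_right):
    """Write pairs as soon as both mates are seen; return the leftover caches."""
    cache_left, cache_right = {}, {}
    for l, r in zip_longest(read_fastq(left), read_fastq(right)):
        if l is not None:
            lid, lrec = l
            if lid in cache_right:
                # Mate already seen in the right file: emit and free the cache entry
                out_left.write(lrec)
                out_right.write(cache_right.pop(lid))
            else:
                cache_left[lid] = lrec
        if r is not None:
            rid, rrec = r
            if rid in cache_left:
                out_left.write(cache_left.pop(rid))
                out_right.write(rrec)
            else:
                cache_right[rid] = rrec
    # Whatever remains never found a mate
    return cache_left, cache_right
```

Memory use depends on how far apart mates are in the two files: if the files are in the same order the caches stay nearly empty, but in the worst case every record is held until the end.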

linsalrob commented 1 year ago

If you are reading the whole file into memory, there are several solutions to this problem. We are explicitly handling files too large for the sequence and qualities to be stored in memory.

Rob
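
For contrast, here is a minimal Python sketch of the general offset-index idea for files too large to hold in memory: only read IDs and byte offsets are kept in RAM, and records are fetched from disk with `seek`. This is an illustration under stated assumptions, not fastq-pair's actual implementation; the function names are hypothetical, and it assumes four-line records and seekable binary file handles.

```python
def _read_id(header: bytes) -> bytes:
    """First token after '@', without a trailing /1 or /2 (assumed convention)."""
    token = header[1:].split()[0]
    return token[:-2] if token.endswith((b"/1", b"/2")) else token


def index_offsets(handle):
    """Map each read ID to the byte offset where its record starts."""
    index = {}
    while True:
        pos = handle.tell()
        header = handle.readline()
        if not header:
            return index
        if not header.strip():
            continue  # tolerate stray blank lines
        for _ in range(3):  # skip sequence, '+', and quality lines
            handle.readline()
        index[_read_id(header)] = pos


def read_record_at(handle, pos):
    """Seek to a stored offset and read one four-line record."""
    handle.seek(pos)
    return b"".join(handle.readline() for _ in range(4))


def pair_by_index(left, right, out_left, out_right):
    """Index the left file, stream the right file, and write matched pairs."""
    index = index_offsets(left)
    paired = 0
    while True:
        header = right.readline()
        if not header:
            return paired
        if not header.strip():
            continue
        rest = b"".join(right.readline() for _ in range(3))
        pos = index.get(_read_id(header))
        if pos is not None:
            out_left.write(read_record_at(left, pos))
            out_right.write(header + rest)
            paired += 1
```

The sketch relies on cheap random access through `seek`. On an uncompressed file that is a constant-time operation, whereas a plain gzip stream generally has to be decompressed from the start to reach a given offset, which is why gzip input is awkward for this kind of design.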

On Thu, 17 Aug 2023 at 13:35, galaxy001 @.***> wrote:

An alt. way is to read a pair of chunks of the input files into memory, process two chunks for pairing. Then append the two chunk and continue.
