Open alienzj opened 4 years ago
I haven't given up on this idea, but had few spare cycles to explore it. Now, however, I need this for another project and was prompted to come back and think about it some more.
Following on from the @lh3 thread, it seems that, unfortunately, the BGZF extension will not be great for short read fastq sequences (which is really what fastq-pair is designed to work with).
Specifically, this comment (in the fqidx documentation):
"samtools fqidx should only be used on fastq files with a small number of entries. Trying to use it on a file containing millions of short sequencing reads will produce an index that is almost as big as the original file, and searches using the index will be very slow and use a lot of memory."
suggests that this will not be a great idea, but I have not (yet) tested it on any real-life data.
Currently I use multiple tools to solve this problem; see https://github.com/ohmeta/metapi/blob/dev/metapi/wrappers/preprocess_raw.py#L77 if you need an urgent solution.
As for implementing this feature in fastq-pair, I would really like to provide a PR if I can.
If the fastq files are in the same order, an alternative is to read a pair of chunks from the input files into memory and process the two chunks for pairing. Then append the two chunks and continue.
If the fastq files are not sorted, we can still read a number of lines into a cache, write out the paired reads, clear them from the cache, and continue. Only unpaired reads will stay in the cache until the end.
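To make the cache idea above concrete, here is a minimal Python sketch. It is not how fastq-pair itself works; the function names `read_fastq` and `pair_streams` are made up for illustration, and it assumes plain 4-line FASTQ records whose read IDs match once an optional trailing /1 or /2 is removed. Note that in the worst case (reads in very different order, or many singletons) the cache still grows towards the size of the input.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (read_id, record_text) for each 4-line FASTQ record (assumed layout)."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():
            return  # end of file (or trailing blank line)
        # Drop the leading '@', any description, and a trailing /1 or /2.
        name = lines[0][1:].split()[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name, "".join(lines)


def pair_streams(left, right, out_left, out_right, single_left, single_right):
    """Stream both files once, caching only reads whose mate has not been seen yet."""
    cache_l, cache_r = {}, {}
    n_pairs = 0
    for rec_l, rec_r in zip_longest(read_fastq(left), read_fastq(right)):
        if rec_l is not None:
            cache_l[rec_l[0]] = rec_l[1]
        if rec_r is not None:
            cache_r[rec_r[0]] = rec_r[1]
        # Check whether either newly read record completes a pair.
        for rec in (rec_l, rec_r):
            if rec is None:
                continue
            name = rec[0]
            if name in cache_l and name in cache_r:
                out_left.write(cache_l.pop(name))
                out_right.write(cache_r.pop(name))
                n_pairs += 1
    # Whatever is left in the caches never found a mate.
    for text in cache_l.values():
        single_left.write(text)
    for text in cache_r.values():
        single_right.write(text)
    return n_pairs


# Tiny in-memory demo; real use would pass open file handles instead.
demo_left = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nTTTT\n+\nIIII\n")
demo_right = io.StringIO("@b/2\nGGGG\n+\nIIII\n@c/2\nCCCC\n+\nIIII\n")
demo_out = [io.StringIO() for _ in range(4)]
demo_pairs = pair_streams(demo_left, demo_right, *demo_out)
```

If the two files are already in the same order, every read pairs as soon as it is read and both caches stay empty, so the same loop covers that case too.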
If you are reading the whole file into memory, there are several solutions to this problem. We are explicitly handling files too large for the sequence and qualities to be stored in memory.
Rob
On Thu, 17 Aug 2023 at 13:35, galaxy001 @.***> wrote:
An alt. way is to read a pair of chunks of the input files into memory, process two chunks for pairing. Then append the two chunk and continue.
Hi, crAssphage man,
Any plans to support gzip input?
https://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files
Good tool, thanks ~