lmdu / pyfastx

a python package for fast random access to sequences from plain and gzipped FASTA/Q files
https://pyfastx.readthedocs.io
MIT License
262 stars 23 forks source link

Pyfastx.read.raw mangling reads #75

Closed BioWilko closed 7 months ago

BioWilko commented 11 months ago

When iterating through a gzipped fastq file and dumping reads to various output filehandles using read.raw from records stored in a dict reads dumped to files are sometimes mangled. It seems that the way the raw string is generated is flawed in some way leading to weirdness like the below, this occurs predictably and the mangling can be changed by removing a read from the start of the file.

I managed to get around this by not indexing the fastq files in question and reconstructing the record from the name, seq, qual tuple.

An example of a mangled read record:

@SRR5382385.27877 27877 length=135
ATCCAACCGAGCAAGCCGTTCCCGGCTGGTACTCAAATGTAAAAATTTTTATCTGCCGTCTTGAAAATTTTTCCGGCGAACATAAATTTTTATCACGCGACGCCGAATGCGGATCGCGCACTTAATCACTTGGAT
+SRR5382385.27877 27877 length=135
AAF<FFFAFFFJFJFJJJ<7FFF<<<JJF<JJJJFJJFJJFJJJJJJJJJJ<FJFJ--7AF<F<F-FJ-JJJFF<A<-77JFJJF-FJJJJJFA<JFFAAA-<7A-FFAFFJ--7A7FFFJJJFJ77FA--FAFJ
@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTT@SRR5382385.27887 27887 length=135
ACTTTGCGCTCACCGAAAAACGCCGCCTGCAATGCCGGAATACCCAATTGCCGATAGGGGCTGATATTACGAAATTCGCGGGCAAAACTCGGCTCGCCAAACGGGCGGTGCATGGTGCGCGTCAGGTAGTTGCTC
+SRR5382385.27887 27887 length=135
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJ-7FFJJJJJJJJJJJJJJJJJJJ

A script which can be used to reproduce this issue is available here happy to provide any more information if needed to trace the problem.

lmdu commented 11 months ago

Thank you for reporting this issue. I will check and fix it.

lmdu commented 10 months ago

I have download the SRR5382385 to test. But I could not reproduce the issue. Could you provide a simple script for me to reproduce this issue. Thanks.

BioWilko commented 10 months ago

Sure no problem, I successfully replicated the issue with this script:

import pyfastx
import sys

input_fastq = pyfastx.Fastq(sys.argv[1])

for read in input_fastq:
    sys.stdout.write(read.raw)

Using version 2.0.2 installed from pip on an M2 arm64 macbook with a 697M gzipped fastq, whether or not the fastq is compressed seems to have no impact leading to identical mangling. I can provide the specific fastq if desired but I get the same issue with any large fastq.

lmdu commented 10 months ago

Thanks! I have no macbook with M2 arm64 to test. But i will try to do it. That may take a long time. You can try to use python setup.py install to install pyfastx from source code.

BioWilko commented 10 months ago

I just fully reproduced the problem in a fresh environment on an x86_64 cluster like:

mamba  env create -n pyfastx_test pip seqtk

wget {fastq_file}

seqtk size {fastq_file}

Returns 761975 379155685

pip install pyfastx

python pyfastx_minimal_example.py {fastq_file} > test.fastq

seqtk size test.fastq

Returns 646 378608

With the 647th read looking like (quality line for first read merged with read id line from next read):

@18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d runid=ca30319f25b536efe90b79a721af63532e657a6c read=4884 ch=56 start_time=2023-09-22T09:48:04.386162+00:00 flow_cell_id=FAW72829
ATGTCCTGTTCTTGGTTCAGTTATGCGACCAAGCCATATTCACGGGAAAGTGGCCTACCGTGACTCGATTCCGTTTGTAGTCGTCTGTCGTTTTTCGTGCGCCGCTTCAACTATGTCCCCCGCTCACGCGTCACCGACTGCCAGCGACGGCCGGGTATCGGCCCGACGCTCCAGCGCCATCCATTTTCAGGGCTAGTTGATCCGGCAGGTGAGTTGTTACACACTCCTTAGCGGATTCCGACTTCCACGGCCACCGTCCTGCTGTCTATATCAACCAACACCTTTTCTGGGGTCTGATGAGCGTCGGCATCGGGCGCCTAACCCGGCGTTCGGTTCATCCCGCAGCGCCAGTTCTCTACCAAAAGTGGCCCGCTTCGGCACTCGCATTCCACGCCCGGCTCCACGCCAGCGAGCCGGGCTTCTTACCCATTTAAAGTTTGAGAATAGGTTGAGATCCCCATGTTGAAGACGGCGCACGAAAACAACAACGACTACAACAGAAATGAGTCAC
+
))**'#$%%&%%$#$%(''''''&''&$####$%&%%&**&''$%%%&'()'(++..166?ABBB><<=@GIEGFGQIFKKIQHHKJC1011/,5@18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d runid=ca30319f25b536efe90b79a721af63532e657a6c read=4884 ch=56 start_time=2023-09-22T09:48:04.386162+00:00 flow_cell_id=FAW72829 protocol_group_id=Flying_Sharks_Metagenomics_20230922 sample_id=Flying_Sharks_Metagenomics_20230922 barcode=barcode02 barcode_alias=barcode02 parent_read_id=18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_@45a218bd-8cb7-4368-b5d9-a4574d98ab7e runid=ca30319f25b536efe90b79a721af63532e657a6c read=867 ch=237 start_time=2023-09-22T09:14:55.386162+00:00 flow_cell_id=FAW72829 protocol_group_id=Flying_Sharks_Metagenomics_20230922 sample_id=Flying_Sharks_Metagenomics_20230922 barcode=barcode02 barcode_alias=barcode02 parent_read_id=45a218bd-8cb7-4368-b5d9-a4574d98ab7e basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0 barcode=barcode02
ATGTTTTGCATCTTACTTCGTTCAGTTGTGAACGAACCGATATTAGCCGCCTGTCCTACCGTGACTCGATTCCGTTTGTAGTCGTCTGTCGTTTTTCGTGCGCCGCTTCAACATGGGCGCCGTCGGCGGCCGGGCCGAGGTGGGATCCCGAGGCCTCTCCAGTCCGCCGAGGGCGCACCACCGGCCCGTCTCGCCCGCCGCGCCGGGGAGGTGGAGCACGAGCGCACGTGTTAGGACCCGAAAGATGGTGAACTATGCCTGGGCAGGGCGAAGCCAGAGGAAACTCTGGTGGAGGTCCGTAGCGGTCCTGACGTGCAAATCGGTCGTCCGACCTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGCGCTCTCGCACGACCACCCCCGCCGTTTATCCGGTAAAGCGAATGATTAGAGGTCTTGGGGCCGAAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTCGCGTGGAGCCGGGCGTGGAGGTGGATGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGATGAACCGAACGCCAGGGTACGGCGCCGATGCCGACGCTCATAGGACCCCCAGAAAAGGTGTTGTTGATATAGACAGCAGGACGGTGGCCATGGAAGTCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCTGAAAATGGATGGCGCTGGAGCGTCGGGCCCATACCCGGCCGTCGCTGGCAGTCGGTGACGCGTGAGAGGGGGACGGGTATGGGCCCGACGCTCCAGCGCCATCCCCTTTCAGGGCTGATTGATACGGCAGGTGAGTCTACACACTCCTTAGCGGATTCCGACTTCCATGGCCACCGTCCTGCTGTCTATATCAACCAACACCTTTTCTGGTCGTGAGCGTCAGCCATGCGGGCGCCTTAACCCGGCGTTGGTCCGATCCCGCAGCGCCGGTCTGCTTACCGAAAGTGGCCCGCTAGGCACTCGCATTCCCACACCCTCCGCGCCGGCCGAGAGCCGCTCTTACCCATTTAAAGTTTGAGAATAGGTTGAGATCGTTTCGGCCCCAAGACCTCTAATAGTTCGCTTTACCGGATAAAACTGCGGGGGGTGGTCGTGCGAGAGCGCAGCTAGCCTGAGGGAGACCAGAGGGGACCGTACTAGATAGTTCGATTAGTCTTTCGCCCCTACCCAGGTCGGACGACCGAGTTTTAGCATCAGTACGGACCTCCACCAGAGCGTCCTCGCGGCTTCGCCCTGCCCGGGCGTAGCCCAGCCGTCTTTCGGGTCCTAACACGTGCGCTCGTGCTGCCTCCCGGCGCGGCGCGAGACGGCGGGTCGTGCCTCGGCGGACTGGGAGCCGGGACCCCCGCCGCCGACAGCACCATGTTGAAGCGGCGCACGAAAAACGACAGACGACTACAAACGGAATGGAGTCG
(pyfastx) jovyan:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel Xeon Processor (Skylake, IBRS)
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
rmcolq commented 10 months ago

I'm also interested in this fix

lmdu commented 7 months ago

Fixed in v2.1.0

BioWilko commented 7 months ago

Amazing, thanks @lmdu !