marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License

FastqFormatError, while reading in chunks #107

Closed arunhpatil closed 1 year ago

arunhpatil commented 1 year ago

Hi @marcelm,

I am revising the miRge3.0 code, which uses Cutadapt for read trimming (for single-end data). Here, I am trying to pass reads in chunks. Below is the code, followed by the error:

buffer_size = 4 * 1024**2
with xopen(infile, "rb") as f:
    tlds = [i for i in dnaio.read_chunks(f, buffer_size)]

chunks_ind = []
for i in tlds:
    z = io.BytesIO(i)
    with dnaio.open(z) as f:
        chunks_ind.append([k for k in f])

I later send chunks_ind to a ProcessPoolExecutor to process the list in parallel.

Traceback (most recent call last):
  File "chunk_digest_mp.py", line 41, in <module>
    chunks_ind.append([k for k in f])
  File "chunk_digest_mp.py", line 41, in <listcomp>
    chunks_ind.append([k for k in f])
  File "src/dnaio/_core.pyx", line 183, in fastq_iter
dnaio.exceptions.FastqFormatError: Error in FASTQ file at line 1341: Line expected to start with '@', but found 'J'

If I give a different buffer size, I get a FastqFormatError at a different line. However, the files themselves don't have issues. I don't know what I am missing here, and the in-memory file object can't be sent through the pool since it can't be pickled. Any suggestions/thoughts?

Thank you, Arun.

rhpvorderman commented 1 year ago

From the docstring of dnaio.read_chunks:

Read chunks of complete FASTA or FASTQ records from a file. If the format is detected to be FASTQ, all chunks except possibly the last contain an even number of records such that interleaved paired-end reads remain in sync. **The yielded memoryview objects are only valid for one iteration because the internal buffer is re-used in the next iteration.**

Arguments:
    f: File with FASTA or FASTQ reads; must have been opened in binary mode
    buffer_size: Largest allowed chunk size

Yields:
    memoryview representing the chunk. This becomes invalid on the next iteration.

Raises:
    ValueError: A FASTQ record was encountered that is larger than buffer_size.
    UnknownFileFormat: The file format could not be detected (the first byte must be "@", ">" or "#")

(Emphasis mine)

The problem is that [i for i in dnaio.read_chunks(f, buffer_size)] stores all the yielded memoryviews in a list. By the time you parse them, all but the last one have become invalid, because they all point into the same re-used buffer.
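To make the failure mode concrete, here is a minimal stdlib-only sketch. The chunk_reader generator below is a made-up stand-in that mimics the relevant behavior of dnaio.read_chunks (refill one internal buffer, yield a memoryview into it); it is not dnaio's actual implementation:

```python
def chunk_reader(data: bytes, size: int):
    """Mimic dnaio.read_chunks: yield memoryviews into ONE re-used buffer."""
    buf = bytearray(size)
    for start in range(0, len(data), size):
        chunk = data[start:start + size]
        buf[:len(chunk)] = chunk          # overwrite the shared buffer
        yield memoryview(buf)[:len(chunk)]

data = b"@r1\nACGT\n+\nJJJJ\n@r2\nTTTT\n+\nJJJJ\n"  # two 16-byte records

# Broken: every stored view now reflects the buffer's *final* contents.
stale = [v for v in chunk_reader(data, 16)]
# Correct: bytes() copies each chunk out before the buffer is re-used.
copies = [bytes(v) for v in chunk_reader(data, 16)]

assert b"".join(copies) == data       # the copies reproduce the input
assert bytes(stale[0]) != data[:16]   # the first stored view went stale
assert bytes(stale[0]) == data[16:]   # it now shows the LAST chunk's data
```

This is why the parser sees garbage mid-record ("Line expected to start with '@', but found 'J'"): the list entries all alias the last buffer refill.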

If you want to do multiprocessing, I think the way to go is

for chunk in dnaio.read_chunks(f, buffer_size):
    # bytes() will copy the data to a buffer that will not be invalidated
    send_to_pool_executor(bytes(chunk))
    # And then do the parsing in the individual workers

Does that help you?

EDIT: As a more high-level comment. Cutadapt already supports multiple threads itself. Wouldn't it be easier to use the --cores flag on cutadapt? I am not familiar with miRge3.0 so I might completely miss the mark here.

arunhpatil commented 1 year ago

Hi @rhpvorderman,

Thank you for the quick response. miRge3.0 is a small RNA-seq (miRNAs in general) alignment tool; I use Cutadapt internally to trim reads and then align those reads to various reference libraries, with a lot of data segregation steps happening internally, thereby minimizing IO operations.

I checked the bytes option, and it is a good solution; however, I will lose the 'SequenceRecord' class objects, as bytes are immutable. How can I get those 'SequenceRecord' instances for each chunk? Is that possible?

Thank you, Arun.

rhpvorderman commented 1 year ago

> I checked the bytes option, and it is a good solution; however, I will lose the 'SequenceRecord' class objects, as bytes are immutable. How can I get those 'SequenceRecord' instances for each chunk? Is that possible?

z = io.BytesIO(i)
with dnaio.open(z) as f:
    chunks_ind.append([k for k in f])

Same as what you had before, but i is now a bytes object instead of a memoryview, and the parsing happens in the separate worker where you sent the bytes.
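Putting the pieces together, the end-to-end pattern looks roughly like the sketch below. To keep it self-contained (no dnaio dependency), parse_chunk uses a tiny hand-rolled FASTQ splitter as a stand-in for dnaio.open(io.BytesIO(chunk)); with dnaio installed, the worker would parse the same bytes and yield real SequenceRecord objects instead of tuples. ThreadPoolExecutor is used here only so the demo runs anywhere; ProcessPoolExecutor works the same way, precisely because bytes objects pickle cleanly while memoryviews do not:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_chunk(chunk: bytes):
    """Stand-in for dnaio parsing: one (name, sequence, qualities) per record."""
    lines = chunk.decode().splitlines()
    return [
        (lines[i][1:], lines[i + 1], lines[i + 3])
        for i in range(0, len(lines), 4)
    ]

# In real code these would be bytes(chunk) copies from dnaio.read_chunks().
chunks = [
    b"@r1\nACGT\n+\nJJJJ\n@r2\nTTTT\n+\nIIII\n",  # chunk 1: two records
    b"@r3\nGGGG\n+\nHHHH\n",                      # chunk 2: one record
]

with ThreadPoolExecutor() as pool:
    parsed = list(pool.map(parse_chunk, chunks))

assert parsed[0] == [("r1", "ACGT", "JJJJ"), ("r2", "TTTT", "IIII")]
assert parsed[1] == [("r3", "GGGG", "HHHH")]
```

The key design point is that the copy (bytes) crosses the process boundary, and all record construction happens on the worker side, so the main process never holds more than one live chunk view at a time.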

arunhpatil commented 1 year ago

Hi @rhpvorderman,

Ah, how did I miss this? Thank you so much, this worked.

Cheers, Arun.

rhpvorderman commented 1 year ago

You are welcome. I am glad you were able to figure it out.