Closed arunhpatil closed 1 year ago
From the docstring
Read chunks of complete FASTA or FASTQ records from a file. If the format is detected to be FASTQ, all chunks except possibly the last contain an even number of records such that interleaved paired-end reads remain in sync. The yielded memoryview objects are only valid for one iteration because the internal buffer is re-used in the next iteration. Arguments: f: File with FASTA or FASTQ reads; must have been opened in binary mode buffer_size: Largest allowed chunk size Yields: memoryview representing the chunk. This becomes invalid on the next iteration. Raises: ValueError: A FASTQ record was encountered that is larger than buffer_size. UnknownFileFormat: The file format could not be detected (the first byte must be "@", ">" or "#")
(Emphasis mine)
The problem is that [i for i in dnaio.read_chunks(f,buffer_size)]
puts all the iterations in a list. All but the last one will have become invalid.
If you want to do multiprocessing, I think the way to go is
for chunk in dnaio.read_chunks(f, buffer_size):
# bytes() will copy the data to a buffer that will not be invalidated
send_to_pool_executor(bytes(chunk))
# And then do the parsing in the individual threads
Does that help you?
EDIT: As a more high-level comment. Cutadapt already supports multiple threads itself. Wouldn't it be easier to use the --cores
flag on cutadapt? I am not familiar with miRge3.0 so I might completely miss the mark here.
Hi @rhpvorderman,
Thank you for quick response. The miRge3.0 is a small RNA-seq (miRNAs in general) specific aligner tool, I use cutadapt internally to trim reads and use those reads to align to various reference libraries, with a lot of data segregation steps happening internally. There by minimizing IO operations.
I checked the bytes option, it is good solution, however, I will loose the class object 'SequenceRecord' as bytes are immutable. How can I get that class instances ('SequenceRecord') for each chunk? Is that possible?
Thank you, Arun.
I checked the bytes option, it is good solution, however, I will loose the class object 'SequenceRecord' as bytes are immutable. How can I get that class instances ('SequenceRecord') for each chunk? Is that possible?
z = io.BytesIO(i)
with dnaio.open(z) as f:
chunks_ind.append([k for k in f])
Same as what you had before, but i
is now a bytes object instead of a memoryview. And the parsing happens in the separate thread where you sent the bytes.
Hi @rhpvorderman,
Ah. How did I miss this?. Thank you so much. This worked.
Cheers, Arun.
You are welcome. I am glad you were able to figure it out.
Hi @marcelm,
I am revising the miRge3.0 code which use Cutadapt for read trimming (for single-end data). Here, I am trying to pass reads in chunks. Below is the code and followed by the error:
I later send chunks_ind in Process.Pool.Executor() function to process the list in parallel.
If I give different buffer size, I get different Error in FASTQ format at differnet line. However, the files in itself don't have issues. I don't know what I am missing here and memory_file can't be sent through pool since it can't be pickled. Any suggestions/thoughts?
Thank you, Arun.