diego-rt opened 2 months ago
Hi, that’s interesting. By default, the largest FASTQ record may take up 4 million bytes. Since this includes the quality values, which take as much space as the sequence itself, the maximum read length is about 2 Mbp. I thought this was enough ...
There is actually a hidden (and, I believe, undocumented) command-line option --buffer-size that you can use to increase the buffer size. Either find out the largest read length, multiply it by two and round up a bit, or try increasingly larger sizes. For example, --buffer-size=16000000 would allow reads of at most approx. 8 Mbp.
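For concreteness, a full invocation could look like this (the adapter sequence and file names here are just placeholders):

```
cutadapt --buffer-size=16000000 -a ADAPTER_SEQUENCE -o trimmed.fastq reads.fastq
```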
Ah fantastic! I had found the corresponding line in your code and was about to edit it, but this is much more convenient.
I would say it is not rare to have reads of a few megabases with the ultra-long protocols, so it might be good to eventually increase the default for this buffer. I think a maximum read size of ~8 megabases should be pretty safe.
Thanks a lot!
I can confirm that --buffer-size=16000000 does the job.
Awesome! Let me re-open this until I’ve found a more permanent solution. Maybe I can make the buffer size dynamic or something along those lines.
You could try the following pattern:
```python
import logging

import dnaio

while True:
    try:
        for chunk in dnaio.read_chunks(files[0], self.buffer_size):
            pass
    except OverflowError:  # raised when a record does not fit the buffer
        self.buffer_size *= 2
        logging.warning("Keep some RAM sticks at the ready!")
        continue
    else:
        break  # or return to escape the loop
```
The strategy is good, but just ignoring the exception and retrying will lose the contents of the buffer (the bytes already consumed from the file are gone, so restarting the loop skips them). This would have to be done within read_chunks directly.
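For illustration, here is a rough sketch of what a growing buffer inside the reader could look like. This is not dnaio’s actual code; _last_record_end is a simplified stand-in for the real record-boundary detection:

```python
import logging


def _last_record_end(buf: bytearray, filled: int) -> int:
    # Simplified FASTQ boundary detection: assume every record is exactly
    # four lines, so a chunk may end after any newline whose running count
    # is a multiple of four. (The real logic needs to be more careful.)
    end = 0
    newlines = 0
    pos = buf.find(b"\n", 0, filled)
    while pos != -1:
        newlines += 1
        if newlines % 4 == 0:
            end = pos + 1
        pos = buf.find(b"\n", pos + 1, filled)
    return end


def read_chunks_growing(f, buffer_size=4_000_000):
    """Yield chunks of complete FASTQ records from binary file f."""
    buf = bytearray(buffer_size)
    filled = 0  # number of valid bytes currently in buf
    while True:
        n = f.readinto(memoryview(buf)[filled:])
        if n == 0:  # EOF: flush whatever remains
            if filled:
                yield bytes(buf[:filled])
            return
        filled += n
        end = _last_record_end(buf, filled)
        if end == 0:
            if filled == len(buf):
                # The current record is larger than the whole buffer:
                # double it, keeping the bytes we have already read.
                logging.warning("Growing buffer to %d bytes", 2 * len(buf))
                buf.extend(bytes(len(buf)))
            continue
        # Yield a copy so the chunk stays valid when buf is reused.
        yield bytes(buf[:end])
        # Move the trailing partial record to the front of the buffer.
        buf[:filled - end] = buf[end:filled]
        filled -= end
```

Used as `for chunk in read_chunks_growing(open(path, "rb")): ...`, this never raises OverflowError; it just doubles the buffer and keeps the bytes it has already read.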
Whoops, you are right. I incorrectly assumed blocks were passed rather than files.
Hi @marcelm
I'm using cutadapt 4.4 with Python 3.10.12, and I'm stumbling into this error when trimming the ultra-long ULK114 adapters from a specific ONT PromethION flowcell. I'm wondering whether it is related to the file having a few megabase-sized reads.
This is a description of the content of the file:
This is the command:
This is the output:
Many thanks!