marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

"OverflowError: FASTA/FASTQ record does not fit into buffer" when trimming ONT reads #783

Open · diego-rt opened this issue 2 months ago

diego-rt commented 2 months ago

Hi @marcelm

I'm using cutadapt 4.4 with Python 3.10.12 and I'm running into this error when trimming the ultra-long ULK114 adapters from a specific ONT PromethION flowcell. I'm wondering whether it is related to the file containing a few megabase-sized reads.

Here is a summary of the file's contents:

[diego.terrones@clip-login-1 6890b2ec397f656fd26681dc2d5e9b]$ seqkit stat -a reads.filtered.fq.gz 
file                  format  type  num_seqs        sum_len  min_len   avg_len    max_len      Q1      Q2      Q3  sum_gap     N50  Q20(%)  Q30(%)  GC(%)
reads.filtered.fq.gz  FASTQ   DNA    100,077  4,291,610,866    1,032  42,883.1  1,124,436  18,573  32,187  56,211        0  58,783   90.34   82.26   46.2

This is the command:

cutadapt --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 -o trimmed.fq.gz reads.filtered.fq.gz

This is the output:

This is cutadapt 4.4 with Python 3.10.12
Command line parameters: --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 -o trimmed.fq.gz reads.filtered.fq.gz
Processing single-end reads on 4 cores ...
ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 87, in run
    for index, chunks in enumerate(self._read_chunks(*files)):
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 98, in _read_chunks
    for chunk in dnaio.read_chunks(files[0], self.buffer_size):
  File "/usr/local/lib/python3.10/dist-packages/dnaio/chunks.py", line 109, in read_chunks
    raise OverflowError("FASTA/FASTQ record does not fit into buffer")
OverflowError: FASTA/FASTQ record does not fit into buffer

[the same ERROR traceback is printed three more times, once for each remaining worker process]

Traceback (most recent call last):
  File "/usr/local/bin/cutadapt", line 8, in <module>
    sys.exit(main_cli())
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/cli.py", line 1061, in main_cli
    main(sys.argv[1:])
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/cli.py", line 1131, in main
    stats = run_pipeline(
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 469, in run_pipeline
    statistics = runner.run()
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 350, in run
    chunk_index = self._try_receive(connection)
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 386, in _try_receive
    raise e
OverflowError: FASTA/FASTQ record does not fit into buffer

Many thanks!

marcelm commented 2 months ago

Hi, that’s interesting. By default, the largest FASTQ record may be at most 4 million bytes. Since this includes the quality values, the maximum read length is about 2 Mbp. I thought this would be enough ...

There is actually a hidden (and I believe undocumented) command-line option --buffer-size that you can use to increase the buffer size. Either find the length of your longest read, multiply it by two and round up a bit, or try increasingly larger values. For example, --buffer-size=16000000 would allow reads of up to roughly 8 Mbp.
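
If you prefer to derive the value from the data instead of guessing, a small helper along these lines should work (just a sketch: it uses dnaio, which cutadapt already depends on, and the per-record size estimate is approximate):

import dnaio

# Sketch: estimate the smallest --buffer-size that still fits the largest
# FASTQ record (header + sequence + '+' line + qualities), then double it
# for headroom.
max_record = 0
with dnaio.open("reads.filtered.fq.gz") as reader:
    for record in reader:
        size = len(record.name) + len(record.sequence) + len(record.qualities) + 6
        max_record = max(max_record, size)

print(f"--buffer-size={2 * max_record}")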

diego-rt commented 2 months ago

Ah fantastic! I had found the corresponding line in your code and was about to edit it, but this is much more convenient.

I would say it is not rare to have reads of a few megabases with the ultra-long protocols, so it might be good to eventually increase the default for this buffer. I think a maximum read size of ~8 megabases should be pretty safe.

Thanks a lot!

diego-rt commented 2 months ago

I can confirm that --buffer-size=16000000 does the job.

marcelm commented 2 months ago

Awesome! Let me re-open this until I’ve found a more permanent solution. Maybe I can make the buffer size dynamic.

rhpvorderman commented 2 months ago

You could try the following pattern:

import logging

# sketch; assumes it runs inside the runner, where `files` and
# `self.buffer_size` are available
while True:
    try:
        for chunk in dnaio.read_chunks(files[0], self.buffer_size):
            pass  # process the chunk here
    except OverflowError:
        # a record did not fit: double the buffer and retry
        self.buffer_size *= 2
        logging.warning("Keep some RAM sticks at the ready!")
        continue
    else:
        break  # or return to escape the loop

marcelm commented 2 months ago

The strategy is good, but catching the exception and simply retrying loses the contents of the buffer that were already read. This would have to be done within read_chunks itself.
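
For illustration, a chunk reader that grows its buffer in place could look roughly like the sketch below. This is not dnaio's actual implementation, and the record-boundary search is deliberately simplified (a real version needs a FASTQ-aware scan, since '@' can also occur in quality strings):

def read_chunks_growing(raw, buffer_size=4 * 1024 * 1024):
    """Yield byte chunks that end on a record boundary, doubling the
    buffer in place (keeping its contents) when one record does not fit."""
    buf = bytearray(buffer_size)
    filled = 0
    while True:
        n = raw.readinto(memoryview(buf)[filled:])
        filled += n
        if n == 0:
            # end of input: flush whatever is left and stop
            if filled:
                yield bytes(buf[:filled])
            return
        # simplified search for the start of the last record in the buffer
        boundary = buf.rfind(b"\n@", 0, filled)
        if boundary == -1:
            if filled == len(buf):
                # a record is larger than the buffer: grow it instead of
                # raising OverflowError, keeping the bytes already read
                buf.extend(bytes(len(buf)))
            continue
        yield bytes(buf[:boundary + 1])
        # keep the trailing partial record for the next iteration
        tail = filled - (boundary + 1)
        buf[:tail] = buf[boundary + 1:filled]
        filled = tail

The essential point is the buf.extend(...) branch: instead of discarding the partially read record and raising, the buffer doubles and the next readinto continues filling where the previous one stopped.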

rhpvorderman commented 2 months ago

Whoops, you are right. I incorrectly assumed blocks were passed rather than files.