marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Improving cutadapt speed with functional programming tools map and filter #652

Open rhpvorderman opened 1 year ago

rhpvorderman commented 1 year ago

Hi,

I did some performance profiling of cutadapt:

python -m cProfile -s tottime -m cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA ~/test/big2.fastq -o /dev/null --quiet | head -n 50
         70943657 function calls (70942386 primitive calls) in 42.604 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5000000   22.428    0.000   22.428    0.000 {method 'locate' of 'cutadapt._align.Aligner' objects}
        1    5.872    5.872   42.566   42.566 pipeline.py:454(process_reads)
  5000000    2.228    0.000    6.290    0.000 steps.py:278(__call__)
  5000000    1.964    0.000   29.084    0.000 modifiers.py:219(__call__)
  5000000    1.517    0.000   26.655    0.000 modifiers.py:270(_match_and_trim_once_action_trim)
  5000000    1.489    0.000    2.643    0.000 writers.py:149(_write)
  5000000    1.328    0.000   25.082    0.000 adapters.py:1096(match_to)
  5000000    1.269    0.000   23.754    0.000 adapters.py:749(match_to)
  5000000    1.111    0.000    1.419    0.000 statistics.py:16(update)
  5000000    1.040    0.000    1.040    0.000 modifiers.py:42(__init__)
  5000000    0.698    0.000    0.698    0.000 {method 'write' of '_io.BufferedWriter' objects}
10151293/10151082    0.599    0.000    0.599    0.000 {built-in method builtins.len}
  5000000    0.456    0.000    0.456    0.000 {method 'fastq_bytes' of 'dnaio._core.SequenceRecord' objects}
5000104/5000103    0.282    0.000    0.282    0.000 {method 'extend' of 'list' objects}
   148033    0.104    0.000    0.182    0.000 adapters.py:177(add_match)

What stands out to me is the time spent in the process_reads function:

    def process_reads(
        self, progress: Optional[Progress] = None
    ) -> Tuple[int, int, Optional[int]]:
        """Run the pipeline. Return statistics"""
        n = 0  # no. of processed reads
        total_bp = 0
        for read in self._reader:
            n += 1
            if n % 10000 == 0 and progress is not None:
                progress.update(10000)
            total_bp += len(read)
            info = ModificationInfo(read)
            for modifier in self._modifiers:
                read = modifier(read, info)
            for filter_ in self._steps:
                if filter_(read, info):
                    break
        if progress is not None:
            progress.update(n % 10000)
        return (n, total_bp, None)

So what happens here is that:

- every read is pulled through a Python-level for loop;
- a fresh ModificationInfo object is created for each read;
- every modifier and every step is invoked as a Python call, and the read counting and progress bookkeeping also run per read in the interpreter.

This Python code is expensive. An alternative option is to express the pipeline with the built-in map and filter functions, so that the per-read iteration happens inside the C loops of those built-ins rather than in Python bytecode. A sketch of the idea follows.
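To make this concrete, here is a minimal, self-contained sketch (not cutadapt code: make_pipeline and the toy Read alias are illustrative). It glosses over ModificationInfo and over the fact that in cutadapt a step returning True means the read was consumed, which is the opposite of filter's keep-on-True convention:

    from typing import Callable, Iterable, List

    Read = str  # stand-in for dnaio.SequenceRecord in this sketch

    def make_pipeline(
        modifiers: List[Callable[[Read], Read]],
        steps: List[Callable[[Read], bool]],
    ) -> Callable[[Iterable[Read]], Iterable[Read]]:
        def run(reads: Iterable[Read]) -> Iterable[Read]:
            it: Iterable[Read] = reads
            for modifier in modifiers:
                it = map(modifier, it)  # map() iterates in C, not in bytecode
            for step in steps:
                it = filter(step, it)  # filter() also loops in C
            return it
        return run

    if __name__ == "__main__":
        reads = ["ACGTAGATCGG", "TTTTAGATCGG", "ACG"]
        trim = lambda r: r.removesuffix("AGATCGG")  # toy "adapter trimming"
        long_enough = lambda r: len(r) >= 4  # keep reads of length >= 4
        pipeline = make_pipeline([trim], [long_enough])
        print(list(pipeline(reads)))  # ['ACGT', 'TTTT']; 'ACG' is dropped

The per-read bookkeeping (n, total_bp, the progress updates) would still need a home, for example a counting wrapper generator, so the actual speedup is hard to predict without measuring.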

marcelm commented 1 year ago

Interesting idea. I’ll have to see whether this can be applied. I did some experiments converting all of pipeline.py to Cython a while ago, but that didn’t really help, so I don’t know how much of an improvement this would bring.

However, what I’ve often seen when profiling is that creating a ModificationInfo instance is relatively slow, so I took this opportunity to move that class over to Cython. It’s not a huge improvement, but it helps a bit. See #655.
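For context on why per-read instance creation is measurable at all, here is a pure-Python micro-benchmark comparing a plain class to a __slots__ variant (an approximation only; the actual change in #655 uses Cython, which can avoid even more of the Python object machinery):

    import timeit

    class PlainInfo:
        def __init__(self, read):
            self.matches = []
            self.read = read

    class SlottedInfo:
        __slots__ = ("matches", "read")  # no per-instance __dict__

        def __init__(self, read):
            self.matches = []
            self.read = read

    # Create 5 million instances of each, matching the read count in the profile.
    for cls in (PlainInfo, SlottedInfo):
        seconds = timeit.timeit(lambda: cls("ACGT"), number=5_000_000)
        print(f"{cls.__name__}: {seconds:.2f}s")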