marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Improving cutadapt speed with functional programming tools map and filter #652

Open rhpvorderman opened 1 year ago

rhpvorderman commented 1 year ago

Hi,

I did some performance profiling of cutadapt:

python -m cProfile -s tottime -m cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA ~/test/big2.fastq -o /dev/null --quiet | head -n 50
         70943657 function calls (70942386 primitive calls) in 42.604 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5000000   22.428    0.000   22.428    0.000 {method 'locate' of 'cutadapt._align.Aligner' objects}
        1    5.872    5.872   42.566   42.566 pipeline.py:454(process_reads)
  5000000    2.228    0.000    6.290    0.000 steps.py:278(__call__)
  5000000    1.964    0.000   29.084    0.000 modifiers.py:219(__call__)
  5000000    1.517    0.000   26.655    0.000 modifiers.py:270(_match_and_trim_once_action_trim)
  5000000    1.489    0.000    2.643    0.000 writers.py:149(_write)
  5000000    1.328    0.000   25.082    0.000 adapters.py:1096(match_to)
  5000000    1.269    0.000   23.754    0.000 adapters.py:749(match_to)
  5000000    1.111    0.000    1.419    0.000 statistics.py:16(update)
  5000000    1.040    0.000    1.040    0.000 modifiers.py:42(__init__)
  5000000    0.698    0.000    0.698    0.000 {method 'write' of '_io.BufferedWriter' objects}
10151293/10151082    0.599    0.000    0.599    0.000 {built-in method builtins.len}
  5000000    0.456    0.000    0.456    0.000 {method 'fastq_bytes' of 'dnaio._core.SequenceRecord' objects}
5000104/5000103    0.282    0.000    0.282    0.000 {method 'extend' of 'list' objects}
   148033    0.104    0.000    0.182    0.000 adapters.py:177(add_match)

What stands out to me is the time spent in the process_reads function:

    def process_reads(
        self, progress: Optional[Progress] = None
    ) -> Tuple[int, int, Optional[int]]:
        """Run the pipeline. Return statistics"""
        n = 0  # no. of processed reads
        total_bp = 0
        for read in self._reader:
            n += 1
            if n % 10000 == 0 and progress is not None:
                progress.update(10000)
            total_bp += len(read)
            info = ModificationInfo(read)
            for modifier in self._modifiers:
                read = modifier(read, info)
            for filter_ in self._steps:
                if filter_(read, info):
                    break
        if progress is not None:
            progress.update(n % 10000)
        return (n, total_bp, None)

So what happens here is that:

- every read is pulled through a Python-level for loop;
- a fresh ModificationInfo object is created for each read;
- every modifier and every step is invoked as a Python call, and the read counting and progress bookkeeping also run per read in the interpreter.

This Python code is expensive. An alternative option is to express the pipeline with the built-in map and filter functions, so that the per-read iteration happens inside the C loops of those built-ins rather than in Python bytecode. A sketch of the idea follows.
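To make this concrete, here is a minimal, self-contained sketch (not cutadapt code: make_pipeline and the toy Read alias are illustrative). It glosses over ModificationInfo and over the fact that in cutadapt a step returning True means the read was consumed, which is the opposite of filter's keep-on-True convention:

    from typing import Callable, Iterable, List

    Read = str  # stand-in for dnaio.SequenceRecord in this sketch

    def make_pipeline(
        modifiers: List[Callable[[Read], Read]],
        steps: List[Callable[[Read], bool]],
    ) -> Callable[[Iterable[Read]], Iterable[Read]]:
        def run(reads: Iterable[Read]) -> Iterable[Read]:
            it: Iterable[Read] = reads
            for modifier in modifiers:
                it = map(modifier, it)  # map() iterates in C, not in bytecode
            for step in steps:
                it = filter(step, it)  # filter() also loops in C
            return it
        return run

    if __name__ == "__main__":
        reads = ["ACGTAGATCGG", "TTTTAGATCGG", "ACG"]
        trim = lambda r: r.removesuffix("AGATCGG")  # toy "adapter trimming"
        long_enough = lambda r: len(r) >= 4  # keep reads of length >= 4
        pipeline = make_pipeline([trim], [long_enough])
        print(list(pipeline(reads)))  # ['ACGT', 'TTTT']; 'ACG' is dropped

The per-read bookkeeping (n, total_bp, the progress updates) would still need a home, for example a counting wrapper generator, so the actual speedup is hard to predict without measuring.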

marcelm commented 1 year ago

Interesting idea. I’ll have to see whether this can be applied. I did some experiments converting all of pipeline.py to Cython a while ago, but that didn’t really help, so I don’t know how much of an improvement this would bring.

However, what I’ve often seen when profiling is that creating a ModificationInfo instance is relatively slow, so I took this opportunity to move that class over to Cython. It’s not a huge improvement, but it helps a bit. See #655.
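For context on why per-read instance creation is measurable at all, here is a pure-Python micro-benchmark comparing a plain class to a __slots__ variant (an approximation only; the actual change in #655 uses Cython, which can avoid even more of the Python object machinery):

    import timeit

    class PlainInfo:
        def __init__(self, read):
            self.matches = []
            self.read = read

    class SlottedInfo:
        __slots__ = ("matches", "read")  # no per-instance __dict__

        def __init__(self, read):
            self.matches = []
            self.read = read

    # Create 5 million instances of each, matching the read count in the profile.
    for cls in (PlainInfo, SlottedInfo):
        seconds = timeit.timeit(lambda: cls("ACGT"), number=5_000_000)
        print(f"{cls.__name__}: {seconds:.2f}s")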