marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

The performance for the long read, and comparison with porechop #705

Open Flower9618 opened 1 year ago

Flower9618 commented 1 year ago

Hi, thank you very much for sharing this tool.

I have two questions:

  1. What is the difference between cutadapt and porechop, another adapter trimming tool?
  2. How well does cutadapt perform on long reads, such as Nanopore or PacBio sequencing data?

Thank you so much!

marcelm commented 1 year ago

Hi, sorry I won’t be able to help much with these questions.

what is the difference between cutadapt and porechop

I don’t know as I’m not familiar with porechop.

How about the performance of cutadapt for long reads, like nanopore or pacbio sequencing data.

Cutadapt can definitely be used for long reads, but I would assume there are other tools that are faster. I don’t know if porechop is one of those tools. It will probably also depend on what one does exactly.

rhpvorderman commented 10 months ago

@marcelm I am currently working on a request from our clinical genetics department to align a Nanopore file. While working on my sequence quality tool, I found that Nanopore reads actually contain a lot of adapter content. Porechop is a tool for removing common adapter sequences from Nanopore reads, but it is currently unmaintained.

I think cutadapt can do this as well, and it should work with the existing code. I'll try to make a PR to the cutadapt documentation once I have worked out how this should be done properly. In the meantime I will be working on my sequencing quality tool so that it detects these sequences properly.

@Flower9618 I checked porechop's code to see how it performs. In short, it suffers from a very common mistake that stems from a misconception about the hardware. Porechop reads all the reads into memory, and each processing step then iterates over all of those reads before anything is written to the output file. This will be incredibly slow, and no amount of C++ code is going to remedy that. Cutadapt, by contrast, reads one read, performs all operations on that single read, and then immediately writes it to the output file. (With multithreading it is slightly different, but the basic principle holds.) This means that cutadapt needs much less working memory, and as a result it will be much faster. Main memory access is slow, which is why virtually every CPU has a small amount of fast on-die memory that caches recently used data. Because this cache is small, programs that touch little memory at a time run faster. This is a bit oversimplified, but hopefully a good enough explanation.
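
To make the read-one, process-one, write-one pattern described above concrete, here is a minimal Python sketch. It is not cutadapt's actual code: the names `Read`, `iter_fastq`, `trim_prefix` and `stream_trim` are hypothetical, and the "trimming" step is an exact prefix match rather than the error-tolerant alignment a real trimmer performs. The point is only that each record is fully handled and written before the next one is read, so memory use stays roughly constant regardless of file size.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: str


def iter_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield one record at a time from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qualities = handle.readline().rstrip("\n")
        yield Read(header.rstrip("\n")[1:], sequence, qualities)


def trim_prefix(read: Read, adapter: str) -> Read:
    """Toy trimming step (assumption): drop an exact adapter prefix."""
    if read.sequence.startswith(adapter):
        n = len(adapter)
        return Read(read.name, read.sequence[n:], read.qualities[n:])
    return read


def stream_trim(in_handle: TextIO, out_handle: TextIO,
                adapter: str, min_length: int) -> int:
    """Trim, filter and write each record before reading the next one."""
    written = 0
    for read in iter_fastq(in_handle):
        read = trim_prefix(read, adapter)
        if len(read.sequence) < min_length:
            continue  # too short after trimming: discard
        out_handle.write(
            f"@{read.name}\n{read.sequence}\n+\n{read.qualities}\n"
        )
        written += 1
    return written


if __name__ == "__main__":
    source = io.StringIO("@r1\nACGTTTGG\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n")
    sink = io.StringIO()
    print(stream_trim(source, sink, adapter="ACG", min_length=4))
    print(sink.getvalue(), end="")
```

Only one `Read` is alive at any time here. The load-everything alternative would build a list of all records first and then pass over that list once per processing step.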

@Flower9618 If you have a link to some documentation on how to properly preprocess nanopore sequencing data that would be much appreciated. Thanks!

Flower9618 commented 10 months ago

@rhpvorderman Thank you so much for the information; it is very helpful for studying these two tools. I am still looking for a tool that performs well at trimming adapters from Nanopore DNA sequences.

rhpvorderman commented 10 months ago

@Flower9618 I just used cutadapt to trim a file with 1.5 million reads. It took 20 minutes. Here is the command used:

# Cut out ligation kit adapters (-g: 5' adapter, -a: 3' adapter)
# Allow an error rate of 1 in 8 (-e 0.125)
# Keep only reads of 300 bp or longer (-m 300)
# -Z for very fast compression (level 1)
cutadapt -o "${FASTQ}.cutadapt.fastq.gz" \
    -Z \
    -g TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT \
    -a GCAATACGTAACTGAACGAAGTACAGG \
    -e 0.125 \
    -m 300 \
    "$FASTQ"

By contrast, porechop took around 6 hours, so it was 18 times slower, while also using 40+ gigabytes of memory. I could not have run it on any of the three machines available to me and had to resort to the compute cluster. (Not a problem for me, but not being able to run it on a laptop is a huge drawback.)
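
As a rough illustration of what `-e 0.125` in the command above means in practice: as far as I understand the cutadapt documentation, the number of tolerated errors is the matched adapter length multiplied by the maximum error rate, rounded down. The helper name `allowed_errors` below is hypothetical and only reproduces that arithmetic. It does not call cutadapt.

```python
import math


def allowed_errors(matched_length: int, max_error_rate: float) -> int:
    """Errors tolerated for a match of the given length (assumed: rounded down)."""
    return math.floor(matched_length * max_error_rate)


# The two adapter sequences from the command above.
adapters = {
    "-g (5' adapter)": "TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT",
    "-a (3' adapter)": "GCAATACGTAACTGAACGAAGTACAGG",
}

for option, sequence in adapters.items():
    length = len(sequence)
    print(f"{option}: length {length}, "
          f"up to {allowed_errors(length, 0.125)} errors for a full-length match")
```

Under the same assumption, a partial match covering only 8 adapter bases would tolerate one error, and anything shorter none, so short partial matches have to be exact.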

I heard there is also a tool called chopper (https://github.com/wdecoster/chopper) by the NanoPlot author, but that looks like a poor man's version of cutadapt to me. It essentially does the same thing, without the 10+ years of refinement and optimization that cutadapt has. Frankly, the only thing cutadapt is missing is an --average-error-rate filter as an alternative to --max-expected-errors, and that should be trivial to add.
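
To show why an average-error-rate filter would matter for long reads, here is a small sketch of the two quantities. It assumes Phred+33 quality encoding and the usual definition of expected errors as the sum of per-base error probabilities. `--average-error-rate` is the option proposed in this comment, not an existing cutadapt flag, and the function names are hypothetical.

```python
def expected_errors(qualities: str, phred_offset: int = 33) -> float:
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(char) - phred_offset) / 10) for char in qualities)


def average_error_rate(qualities: str, phred_offset: int = 33) -> float:
    """Expected errors divided by read length (the proposed filter criterion)."""
    if not qualities:
        return 0.0
    return expected_errors(qualities, phred_offset) / len(qualities)


# '+' encodes Q10 and '5' encodes Q20 in Phred+33.
short_low_quality = "+" * 100       # 100 bases at Q10
long_high_quality = "5" * 10_000    # 10,000 bases at Q20

for label, quals in [("short, Q10", short_low_quality),
                     ("long, Q20", long_high_quality)]:
    print(f"{label}: expected errors = {expected_errors(quals):.1f}, "
          f"average error rate = {average_error_rate(quals):.3f}")
```

An absolute cap on expected errors grows harder to meet as reads get longer, so it can discard a long, accurate read while keeping a short, noisy one. Dividing by read length removes that length dependence.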

So I'd go with cutadapt for Nanopore adapter trimming. (Keep in mind that I may be slightly biased, as I correspond frequently with the author.) If you know of any other tools that can do the same job, please bring them to my attention. I am quite curious.

Flower9618 commented 10 months ago

@rhpvorderman Thank you so much for sharing this information. 😊