jgaetel / cutadapt

Automatically exported from code.google.com/p/cutadapt

Normal runtime for cutadapt #93

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I am trimming paired-end reads (the forward and reverse files are 2.7 GB) with 3
adapter sequences provided in a FASTA file. It has been running overnight and
still hasn't finished trimming. Currently, it is still creating the tmp files
shown in the paired-end trimming instructions in the manual. I was wondering how
long the application normally takes and whether there are any known Python
issues that might slow it down considerably. Thank you.
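
For reference, the two-pass paired-end recipe from the manual looks roughly like
this (the file names here are placeholders, not my exact paths):

    # first pass: trim the forward reads, write the reverse reads to a tmp file to keep the pairs in sync
    cutadapt -a file:adapters.fasta -o tmp.1.fastq -p tmp.2.fastq reads_1.fastq.gz reads_2.fastq.gz
    # second pass: trim the reverse reads, keep the forward reads in sync
    cutadapt -a file:adapters.fasta -o trimmed_2.fastq.gz -p trimmed_1.fastq.gz tmp.2.fastq tmp.1.fastq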

Original issue reported on code.google.com by jma1...@icloud.com on 10 Dec 2014 at 1:58

GoogleCodeExporter commented 9 years ago
Sounds strange, that shouldn’t take so long. If I trim a single adapter 
(length 33 bp) from 1 million 100 bp reads stored in a gzip-compressed FASTQ 
file, it takes 10 seconds (Intel Core i7). With three adapter sequences, it 
takes about twice as long. Extrapolating that to 2.7 GB (compressed) gives me a 
runtime of about 10 minutes for your dataset, so something is clearly different.
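
Rough arithmetic behind that estimate (the read count is only a guess from the
file size): 10 seconds per 1 million reads with one adapter is about 0.010 ms
per read, so roughly 0.020 ms per read with three adapters; a 2.7 GB
gzip-compressed FASTQ of ~100 bp reads holds on the order of 30 million reads,
and 30 million reads × 0.020 ms ≈ 600 s, i.e. about 10 minutes.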

How long are the reads? How many are there? Do you use compressed input and/or 
output files? How long are the adapter sequences? Are there wildcard characters 
in the adapter sequences? Are you sure you specified the correct FASTA file 
with adapters? Cutadapt gets very slow if you give it many adapters to trim.

Original comment by marcel.m...@tu-dortmund.de on 10 Dec 2014 at 2:47

GoogleCodeExporter commented 9 years ago
1. The reads are 125bp
2. There are ~33M in each file
3. I use .fastq.gz files
4. Adapter sequences are between 10 and 63 bp long
5. No wildcards
6. Yes, I think so
7. I've tried between 3 and 8 adapter sequences

When I run cutadapt on the server I use, it is considerably slower than when I
run it on my desktop machine. I imagine that's because cutadapt doesn't support
multi-threading, so the single fast CPU core on my desktop performs better than
the single, slower core cutadapt gets on the server.

Original comment by jma1...@icloud.com on 10 Dec 2014 at 5:00

GoogleCodeExporter commented 9 years ago
These parameters look pretty standard to me. I forgot to ask which Python and 
cutadapt version you are using. The problem could be due to gzip decompression 
(and compression, in case you also write to a .fastq.gz). Python's gzip 
implementation is very slow in some versions. I try to work around that in 
cutadapt, but it sometimes does not work and then the slow implementation is 
used. I still don't think that would explain a slowdown of the magnitude you 
are observing.
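
If you want to rule Python's gzip module out as the bottleneck, one thing to try
(just a sketch, and not how cutadapt handles it internally) is to decompress
with the external gzip program and feed cutadapt uncompressed data, also writing
uncompressed output. Shown single-end for simplicity, with placeholder file
names; it does not map directly onto the paired two-pass recipe:

    # decompress outside of Python and pipe into cutadapt; "-" reads from stdin,
    # and the input format is given explicitly since it cannot be guessed from a file name
    gzip -cd reads_1.fastq.gz | cutadapt -f fastq -a file:adapters.fasta -o out_1.fastq -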

Sorry about the missing multithreading in cutadapt. I've tried twice to
implement it, but both times the multithreaded version turned out to be slower
than the single-threaded one, possibly due to the overhead of communication
between threads. Maybe I should have another stab at it.

One other thought: How did you install cutadapt? Do you know whether the Python 
extension modules (that implement the alignment algorithm) were re-compiled? 
Perhaps no optimization was used ...
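
One quick way to check (a rough diagnostic only): look for compiled extension
modules (.so files) next to the Python sources in the installed package
directory. If none are there, the slower pure-Python fallbacks are likely being
used.

    # list compiled extensions in the installed cutadapt package directory
    # (no matches means no extensions were built)
    ls "$(python -c 'import os, cutadapt; print(os.path.dirname(cutadapt.__file__))')"/*.so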

Have any of your runs finished? If so, what does it say in the report under 
"Time per read"? (I see 0.010ms in my single-adapter test case.) If none have 
finished and you feel like it, you could prepare a smaller test file with only 
100000 reads or so and then tell me how long cutadapt takes to run on that with 
the same set of adapters.
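
In case it helps, such a subset can be made by taking the first 400,000 lines of
each file (100,000 reads × 4 lines per read) and then timing the same command on
it; the file names below are placeholders:

    zcat reads_1.fastq.gz | head -n 400000 | gzip > subset_1.fastq.gz
    zcat reads_2.fastq.gz | head -n 400000 | gzip > subset_2.fastq.gz
    time cutadapt -a file:adapters.fasta -o tmp.1.fastq -p tmp.2.fastq subset_1.fastq.gz subset_2.fastq.gz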

Original comment by marcel.m...@tu-dortmund.de on 10 Dec 2014 at 5:47

GoogleCodeExporter commented 9 years ago
Hi, even if you don't have time to reply in detail, could you please tell me 
which Python version you are using?

Original comment by marcel.m...@tu-dortmund.de on 15 Dec 2014 at 4:59

GoogleCodeExporter commented 9 years ago
Closing since there has been no reply. If this issue still applies, please add 
a comment.

Original comment by marcel.m...@tu-dortmund.de on 4 Mar 2015 at 9:38