Sounds strange; that shouldn't take so long. If I trim a single adapter
(length 33 bp) from 1 million 100 bp reads stored in a gzip-compressed FASTQ
file, it takes 10 seconds (Intel Core i7). With three adapter sequences, it
takes about twice as long. Extrapolating that to 2.7 GB (compressed) gives me a
runtime of about 10 minutes for your dataset, so something is clearly different.
How long are the reads? How many are there? Do you use compressed input and/or
output files? How long are the adapter sequences? Are there wildcard characters
in the adapter sequences? Are you sure you specified the correct FASTA file
with adapters? Cutadapt gets very slow if you give it many adapters to trim.
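For what it's worth, the extrapolation above can be sketched in a few lines. The 10 s per million reads comes from the benchmark I mentioned; the ~45 MB gzip-compressed size per million 100 bp reads is an assumed ballpark figure, not a measurement:

```python
# Back-of-the-envelope runtime estimate for a 2.7 GB compressed dataset.
# secs_per_million_reads: from the single-adapter benchmark above.
# mb_per_million_reads: ASSUMED compressed size of 1M 100 bp reads.
secs_per_million_reads = 10.0
mb_per_million_reads = 45.0
dataset_mb = 2.7 * 1024

est_million_reads = dataset_mb / mb_per_million_reads
est_minutes = est_million_reads * secs_per_million_reads / 60
print(f"~{est_minutes:.0f} minutes for a single adapter")
```

With three adapters, roughly double that, so on the order of 20 minutes rather than hours.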
Original comment by marcel.m...@tu-dortmund.de
on 10 Dec 2014 at 2:47
1. The reads are 125 bp
2. There are ~33M in each file
3. I use .fastq.gz files
4. Adapter sequences are between 10 and 63 bp long
5. No wildcards
6. Yes, I think so
7. I've tried between 3 and 8 adapter sequences
When I run cutadapt on the server I use, it is considerably slower than on my
desktop machine. I imagine that's because cutadapt doesn't support
multi-threading, so the single fast CPU in my desktop outperforms whichever
one of the slower CPUs it gets on the server.
Original comment by jma1...@icloud.com
on 10 Dec 2014 at 5:00
These parameters look pretty standard to me. I forgot to ask which Python and
cutadapt version you are using. The problem could be due to gzip decompression
(and compression, in case you also write to a .fastq.gz). Python's gzip
implementation is very slow in some versions. I try to work around that in
cutadapt, but the workaround sometimes fails, in which case the slow
implementation is used. Still, I don't think that would explain a slowdown of
the magnitude you are observing.
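The workaround is roughly of this shape: decompress through the external gzip binary instead of Python's gzip module. This is only a sketch under that assumption, not cutadapt's actual code (the real version also handles writing and falls back to the gzip module when the binary is missing):

```python
import io
import subprocess

def open_gz_for_reading(path):
    """Read a .gz file through the external 'gzip' binary rather than
    Python's (sometimes slow) gzip module. Sketch only: no fallback,
    no write support, minimal error handling."""
    proc = subprocess.Popen(["gzip", "-cd", path], stdout=subprocess.PIPE)
    # Wrap the binary pipe so callers get ordinary text lines.
    return io.TextIOWrapper(proc.stdout)
```

Whether the fast path is taken depends on the Python version, which is why I asked about it.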
Sorry that cutadapt lacks multithreading support. I've tried twice to
implement it, but both times the multithreaded version turned out to be slower
than the single-threaded one, possibly due to the overhead of communication
between threads. Maybe I should have another stab at it.
One other thought: How did you install cutadapt? Do you know whether the Python
extension modules (that implement the alignment algorithm) were re-compiled?
Perhaps no optimization was used ...
Have any of your runs finished? If so, what does it say in the report under
"Time per read"? (I see 0.010 ms in my single-adapter test case.) If none have
finished and you feel like it, you could prepare a smaller test file with only
100,000 reads or so and then tell me how long cutadapt takes to run on that with
the same set of adapters.
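Subsetting can be done in a few lines, for example like this (filenames and the helper name are just placeholders; a FASTQ record is four lines):

```python
import gzip
from itertools import islice

def subset_fastq_gz(src, dst, n_reads=100_000):
    """Copy the first n_reads records (4 lines each) of a gzipped
    FASTQ file into a new gzipped file."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(islice(fin, 4 * n_reads))
```

Then run cutadapt on the subset with the same adapter set and time it, e.g. with `time` on the command line.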
Original comment by marcel.m...@tu-dortmund.de
on 10 Dec 2014 at 5:47
Hi, even if you don't have time to reply in detail, could you please tell me
which Python version you are using?
Original comment by marcel.m...@tu-dortmund.de
on 15 Dec 2014 at 4:59
Closing since there has been no reply. If this issue still applies, please add
a comment.
Original comment by marcel.m...@tu-dortmund.de
on 4 Mar 2015 at 9:38
Original issue reported on code.google.com by
jma1...@icloud.com
on 10 Dec 2014 at 1:58