marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Disable k-mer heuristic in some cases #685

Open marcelm opened 1 year ago

marcelm commented 1 year ago

Hi - I was excited to see this upgrade as we use cutadapt extensively - it works great and can do things no other software of its kind can do, but is one of the slower components of our pipeline, so we're always interested in speed improvements.

However as it turns out we actually observed a mild decrease in speed (~10%) relative to earlier versions. Our use case is mainly trimming 5' adapters from "amplicon-like" data, whereby we load files of thousands of anchored adapter sequences (-g ^file:adapters1.fa -G ^file:adapters2.fa), and every read is expected to find a match to at least one sequence in the file (and possibly non-anchored matches to several). I'm not a software engineer but I wonder if for this use case the cost overhead of the new heuristic isn't worth it (even though most adapters won't be present in any particular read).

Would it be possible to either make this functionality optional via a switch, or else turn it on automatically and selectively depending on the alignment mode and/or adapter characteristics? E.g. maybe it makes more sense for non-anchored adapters than for anchored ones (since adapters are more likely to be present in applications that involve anchored adapters, like primer sequences). Or perhaps when the adapters are too short relative to the error rate, the rate of coincidental matches in reads is too high; for a 20 nt adapter sequence with an error tolerance of 10% (2 errors), the k-mers are so short that coincidental matches may not be rare.

The speed decrease isn't huge, and the new software version includes other useful features; it's just a bit of a bummer that at least for us the new heuristic is a cost not a benefit.

Originally posted by @eboyden in https://github.com/marcelm/cutadapt/issues/663#issuecomment-1483998923

marcelm commented 1 year ago

Thanks for reporting! First, can you confirm that Cutadapt creates an index of the 5' primer sequences? You should get a message that says Building index of *n* adapters .... That is, does the situation in Speeding up demultiplexing apply to you? (If not, please say why, perhaps we can relax the criteria there to make it work in your case.)

How fast is Cutadapt in your case (time per read)?

It appears we just need to disable the k-mer heuristic when an index of anchored adapters is used. I’ll look into this next week. We also considered dynamically disabling the k-mer heuristic when we find that it is not worth it (when most reads contain the adapter), but that is one step further and could be done later.

By the way, try to run Cutadapt with Python 3.10 or later if you haven’t; this speeds it up by >10% in my tests.

eboyden commented 1 year ago

Using Python 3.7. Good to know about Python 3.10+, I'll try to get that updated.

I knew we were building an index but didn't initially see the message because I was running --report minimal. I turned that off and now see Building index of 43280 adapters .... Some of those summary messages might be useful to retain even when running --report minimal.

Sample 1, 4.3:

Finished in 380.342 s (1247.153 µs/read; 0.05 M reads/minute).

Sample 1, 3.8dev1:

Finished in 374.584 s (1228.274 µs/read; 0.05 M reads/minute).

Sample 2, 4.3:

Finished in 1105.832 s (101.758 µs/read; 0.59 M reads/minute).

Sample 2, 3.8dev1:

Finished in 1090.602 s (100.356 µs/read; 0.60 M reads/minute).

So it's maybe not quite as bad as I first thought, but it's not a great benefit either.

BTW, it's great that indels are allowed in indexes now, but when loading a large file of adapter sequences even just building an indel-tolerant index can be prohibitively slow, and hence we still don't allow indels. Maybe worth a disclaimer in the manual.

rhpvorderman commented 1 year ago

When you run a benchmark, the first run will be slower because the file is not yet cached by the Linux kernel; all runs after that will be slightly faster. It is better to take a subset of the FASTQ file using cat and do a few repeated runs until the time stabilizes. As you have presented the data now, it looks like cutadapt 4.3 is at a disadvantage because it was measured in the first run. Regardless, the slowdown in this data is:

380.342 s vs. 374.584 s: 1.54% slower; 1105.832 s vs. 1090.602 s: 1.36% slower
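
For reference, a minimal sketch of that warm-cache benchmarking procedure in Python (the adapter files, subset file, thread count, and repeat count are placeholders):

    import subprocess
    import time

    # Time repeated identical runs; after the first run the input sits in the
    # page cache, so the minimum is the fairest number to compare.
    cmd = [
        "cutadapt", "-j", "16", "--no-indels", "--action=none",
        "-g", "^file:adapters1.fa", "-G", "^file:adapters2.fa",
        "--interleaved", "-o", "/dev/null", "subset.fq",
    ]
    times = []
    for i in range(5):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
        print(f"run {i + 1}: {times[-1]:.2f} s")
    print(f"best of {len(times)}: {min(times):.2f} s")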

I'm not a software engineer but I wonder if for this use case the cost overhead of the new heuristic isn't worth it (even though most adapters won't be present in any particular read)

At this tiny regression of 1.54% I wonder if implementing a switch is worth it given that the switch itself also comes with a runtime and code complexity cost. For this reason a switch was left out in the initial PR.

It appears we just need to disable the k-mer heuristic when an index of anchored adapters is used. I’ll look into this next week. We also considered dynamically disabling the k-mer heuristic when we find that it is not worth it (when most reads contain the adapter), but that is one step further and could be done later.

If you add a switch in Python, this may have considerable runtime overhead, and it would add extra logic in disparate places. Simply switching off the heuristic dynamically keeps the logic in the same file and should perform reasonably fast. I will have a look at this.
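
Schematically, such a dynamic switch-off could look like the sketch below (an illustration of the idea only, not the implementation in #686; all names are made up):

    class KmerPrefilter:
        """Skip the full alignment when no adapter k-mer occurs in the read."""

        def __init__(self, kmers, sample_size=10000, max_hit_rate=0.5):
            self.kmers = kmers
            self.sample_size = sample_size    # reads to observe before deciding
            self.max_hit_rate = max_hit_rate  # above this, the scan is pure overhead
            self.reads = 0
            self.hits = 0
            self.enabled = True

        def might_contain_adapter(self, sequence):
            if not self.enabled:
                return True  # heuristic off: always run the full alignment
            hit = any(kmer in sequence for kmer in self.kmers)
            self.reads += 1
            self.hits += hit
            if self.reads >= self.sample_size and self.hits > self.max_hit_rate * self.reads:
                # Most reads contain an adapter anyway, so the pre-scan only
                # costs time: disable it for the rest of the run.
                self.enabled = False
            return hit

This keeps the decision logic in one place, which is the point made above.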

EDIT: Done, see #686. As shown in the post below, I am not yet convinced that adding it is worth it, though.

rhpvorderman commented 1 year ago

@eboyden I just ran cutadapt twice in a row and the speedup between the runs was considerable (5% better) due to the Linux caching kicking in.

Can you rerun the numbers on a subset of the data, repeating each speed test at least a few times until the time stabilizes? It would be a pity to add extra code (and therefore runtime cost) when this could be a benchmark issue.

marcelm commented 1 year ago

@eboyden Just briefly, would it be possible to share some of that data? Privately via e-mail if you want. I could create a simulated dataset if not.

eboyden commented 1 year ago

Unfortunately I don't think I can share any data. I would if I could.

I'm rerunning some of the simulations now, and honestly I'm having trouble seeing any consistent pattern in the run times - they differ by +/-2%. So the k-mer heuristic definitely doesn't speed things up for my applications, but I can't definitively say it slows things down either. To @rhpvorderman's point, this may be something to keep an eye on but not do a bunch of work to address quite yet - apologies if I jumped the gun.

One question I have - how does the heuristic work in cases where an index is built? Should you still expect to see a dramatic speedup if most reads wouldn't find a match? Or is the heuristic expected to not do much with an index because you're already just pattern matching (not sure how possible errors are accounted for in the index)? Just wondering, if the heuristic is not theoretically expected to add much when matching against an index, whether it might make sense to shut it off in that case. Again, we're building an index and I'm not seeing a consistent slowdown, but nor am I seeing a consistent speedup either.

eboyden commented 1 year ago

Not sure why I initially saw such a dramatic (10%) slowdown - but rerunning the exact command with v3.8 that previously ran faster still shows the 10% slowdown, so there may be something else going on with our server.

marcelm commented 1 year ago

Unfortunately I don't think I can share any data.

No problem. Can you let me know which command-line options you used? (Of course feel free to omit sequences or replace them with made-up ones if you need, but it would be good to know their lengths.)

To create a benchmark dataset and reproduce your results, can you let me know whether it is correct that your anchored 5' adapters have a length of 20nt?

Finished in 380.342 s (1247.153 µs/read; 0.05 M reads/minute).

This is super slow and I would expect it to be faster when using an index. I wonder what is going on. If you use both -g ^file:... and -G ^file:..., I think you should actually get two messages telling you that an index was generated. Do you get that? Are some of the adapters in fact not anchored?

BTW, it's great that indels are allowed in indexes now, but when loading a large file of adapter sequences even just building an indel-tolerant index can be prohibitively slow, and hence we still don't allow indels. Maybe worth a disclaimer in the manual.

The code that generates the index is not as optimized as it could be as I expected the number of adapter sequences to be only in the dozens or maybe hundreds, not in the thousands. I’d be happy to look into this as soon as I have a way to reproduce the slowness.

eboyden commented 1 year ago

Yes, I get two index messages; all adapters are anchored. The adapter set is ~44000 pairs of sequences, so the index is pretty big. Most adapters are between 15 and 25 nt long, and we use the default error tolerance of 0.1, which allows 1-2 errors. A small number of adapters may be longer than 25 nt, and a very few (dozens?) may be 30 nt or longer, which under this error mode would allow 3 errors; accordingly, I assume these are excluded from the index.
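
A quick check of that arithmetic, assuming allowed errors = floor(length x error rate) and that adapters allowing more than 2 errors are excluded by the k > 2 cutoff discussed later in this thread:

    # Which adapter lengths stay in the index at the default 10% error rate?
    for length in (15, 20, 25, 29, 30, 40):
        errors = int(length * 0.1)
        print(f"{length} nt -> {errors} error(s) allowed, indexed: {errors <= 2}")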

The "super slow" sample is actually a no-DNA control, and many of the reads probably don't find a match, whereas "Sample 2" is a proper DNA sample and the overwhelming majority of reads should find a match.

Reads are Illumina NextSeq PE151; most are full length, a few are trimmed (we pass the reads through NGmerge first to remove dovetails if they exist). The input and output are piped, and here is the command:

uncompressed stdin from NGmerge | cutadapt-venv/bin/cutadapt -j 16 -u 5 -U 5 -g file:$a -G file:$b --action=none --rename='{id} RX:Z:{r1.cut_prefix}{r2.cut_prefix}\tXP:Z:{rn}~{adapter_name}' --interleaved --no-indels - 2>> $y.log | uncompressed stdout

In case the piping created problems with the timing (NGmerge is much faster than cutadapt but still), I repeated without it, reading compressed fastqs with cutadapt and writing bams:

v3.8 Sample 1: Finished in 376.267 s (1233.792 µs/read; 0.05 M reads/minute).
v4.3 Sample 1: Finished in 371.084 s (1216.795 µs/read; 0.05 M reads/minute).
v3.8 Sample 2: Finished in 782.946 s (72.046 µs/read; 0.83 M reads/minute).
v4.3 Sample 2: Finished in 789.359 s (72.636 µs/read; 0.83 M reads/minute).

So Sample 2 got a bit faster with both versions of Cutadapt, but the speed of Sample 1 didn't really change.

BTW FWIW, in the future we may be looking at situations where the number of adapter pairs is in the vicinity of 1-2M. I don't know how Cutadapt is going to handle that, but if there are obvious tweaks you can make to the indexing to make Cutadapt handle it better they would be much appreciated! (Or our infx team might be able to do a PR, but I don't want to speak on their behalf and it's not something I could do myself.)

marcelm commented 1 year ago

Thanks. I’m busy with other things, but want to look into this as soon as possible. In the meantime, maybe you could do an experiment: You could increase the maximum number of allowed errors for an adapter to be included in the index by editing the file cutadapt-venv/lib/python3.9/site-packages/cutadapt/adapters.py. Search for k > 2 (around line 1278) and change it to k > 3. Creating the index would be slower (not sure by how much), but trimming itself should run faster.

Also, the command-line that you wrote says -g file:$a. This is different from what you wrote in your first message. Just because the anchoring is so important here: Is it correct that the individual sequences in the file are preceded by a ^ character?

Looking up an adapter in the index should only take a couple of microseconds at most, so I still don’t know what to make of the >1000 µs/read. I can explain this only if a good chunk of adapters are not included in the index. (These would then be looked up one by one until a match has been found.)

eboyden commented 1 year ago

Sorry for the confusion: v3.8 doesn't support the ^file nomenclature, so in cases where I didn't use it, I used a file of anchored sequences instead; I assume it should be functionally equivalent.
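
For concreteness, a file of anchored sequences as described would look like this (sequence names and bases are made up):

    >primerA
    ^ACGTACGTACGTACGTACGT
    >primerB
    ^TTGCAGGACTTGCATTGCAG

which should indeed be functionally equivalent to passing the unanchored file as -g ^file:adapters.fa.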

The adapter set is the same for Sample1 and Sample2, so the lookup time should presumably be the same; but the nature of the reads themselves may be substantially different. What if Sample1's reads contain many more Ns, or are less likely to match an adapter? Or could there be another reason why the reads themselves, rather than the adapter sequences, would slow the process down?

Or is it possible that the total time includes some overhead (like building the index), which is then averaged out over the per-read processing time? Sample2 contains many more reads than Sample1, so the effect of this overhead would be relatively minimal.

Thanks for the k > 3 suggestion, I'll give that a try.

eboyden commented 1 year ago

input fq.gz -> Cutadapt 4.3 (k > 2) -> output fq.gz

Sample1

real    7m5.516s
user    8m41.962s
sys     1m47.557s

Finished in 368.304 s (1207.680 µs/read; 0.05 M reads/minute).

Sample2

real    13m40.834s
user    97m9.834s
sys     22m37.783s

Finished in 762.833 s (70.195 µs/read; 0.85 M reads/minute).

input fq.gz -> Cutadapt 4.3 (k > 3) -> output fq.gz

Sample1

real    7m19.414s
user    8m48.898s
sys 1m46.593s

Finished in 378.683 s (1241.714 µs/read; 0.05 M reads/minute).

Sample2

real    15m16.205s
user    102m30.183s
sys 24m9.249s

Finished in 833.198 s (76.670 µs/read; 0.78 M reads/minute).

So maybe a bit slower with k > 3, both in terms of real time and processing time per read. Maybe consistent with overhead getting rolled into the read processing time?

For massive indexes that might take a long time to build: one option might be to let the user write an index to a file if they expect to run the same set of adapters over and over again - similar to a genome index for aligners.
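
A minimal sketch of that idea, assuming a hypothetical build_index function (Cutadapt has no such feature today; every name here is made up):

    import hashlib
    import pickle
    from pathlib import Path

    def cached_index(adapter_fasta, build_index, cache_dir=".cutadapt-index"):
        """Build the adapter index once, then reuse it across runs."""
        data = Path(adapter_fasta).read_bytes()
        key = hashlib.sha256(data).hexdigest()[:16]  # invalidated when the file changes
        cache = Path(cache_dir) / (key + ".pickle")
        if cache.exists():
            return pickle.loads(cache.read_bytes())
        index = build_index(adapter_fasta)
        cache.parent.mkdir(parents=True, exist_ok=True)
        cache.write_bytes(pickle.dumps(index))
        return index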

I'm adding one of our software engineers to this conversation (which at this point probably merits its own issue) - he might be willing to take a look and help out with a PR if he has any ideas. @JoeVieira

marcelm commented 1 year ago

Sorry for the confusion: v3.8 doesn't support the ^file nomenclature, so in cases where I didn't use it, I used a file of anchored sequences instead; I assume it should be functionally equivalent.

Thanks, of course you’re right. I forgot I added this notation only recently.

Or is it possible that the total time includes some overhead (like building the index), which is then averaged out over the per-read processing time? Sample2 contains many more reads than Sample1, so the effect of this overhead would be relatively minimal.

I had to check and you are right: The "Finished in x s (y µs/read ...)" message is the wall-clock time from the start of the program to the end (excluding the time it takes to print the report, but that should be fast).

For massive indexes that might take a long time to build: one option might be to let the user write an index to a file if they expect to run the same set of adapters over and over again - similar to a genome index for aligners.

Yes, but my preferred option would be to just make the indexing a lot faster. This should be quite doable. I’m helping out with a new read-aligner strobealign, which can index a human-sized genome in under three minutes, so maybe I’m a bit optimistic, but I think indexing 100'000 adapter sequences should be doable within a couple of seconds. The code is not very efficient at the moment as it was good enough for the numbers of adapters it was originally intended for.

marcelm commented 1 year ago

I’ve done a couple of optimizations and was able to make indexing faster by a factor of 1.75 when --no-indels is used (17 s instead of 30 s for a set of 10'000 5' adapters). Additional changes require more effort and will need to wait a bit until I have the time.

A note: I incorrectly said that I saw a speedup of >10% using Python 3.10. I actually meant Python 3.11.

The indexing speedup doesn’t solve the issue that your overall runtime is unexpectedly high. While working on the code, I noticed that it falls back to the slow way of trying each adapter one by one when a read contains an N base at the beginning (where the adapter would be). Do your reads possibly contain some Ns?

marcelm commented 1 year ago

And BTW I did not note any significant contribution to runtime from the k-mer heuristic, so I don’t think we need to make any changes there.

eboyden commented 1 year ago

A few reads in the NTC contain Ns at the beginning, but it's <0.1% (584 / 609936). And the DNA sample reads include Ns at a similar rate (20991 / 21734558).

If the time per read is based on the wall-clock time, isn't it possible that the index build time, however modest, still has an appreciable effect on the reported per-read time when there are relatively few reads and the index is unusually large? Perhaps the index build time merits its own mention in the summary statement, and the read-processing clock shouldn't start until the index is built and read processing begins - not sure how complicated that would be to change, though.

marcelm commented 1 year ago

A few reads in the NTC contain Ns at the beginning, but it's <0.1% (584 / 609936).

Still, the slowdown is so extreme that this may be responsible for a sizeable chunk of the runtime.

When an index is used, no matter the size, looking up the correct adapter takes roughly 5 µs. It takes about the same time to search for a single adapter. Processing a read with an N in the beginning therefore takes up to 0.2 s (!) because all 43280 adapters are searched one by one. With 584 reads that have an N, that alone would take 126 CPU seconds.
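
The arithmetic, spelled out:

    lookup_us = 5          # ~5 µs per single-adapter search
    n_adapters = 43280
    n_reads_with_n = 584
    per_read = n_adapters * lookup_us / 1e6   # ~0.22 s per N-containing read
    print(f"{per_read:.2f} s/read; {per_read * n_reads_with_n:.0f} CPU seconds total")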

As a test, you could preprocess your reads with cutadapt --max-n 0 (this would need to be a separate cutadapt invocation since this type of filtering is done after adapter trimming) and see whether that makes things faster.

I will also need to make the index work with reads that contain N.

If the time per read is based on the wall-clock time, isn't it possible that the index build time, however modest, still has an appreciable effect on the reported per-read time when there are relatively few reads and the index is unusually large? Perhaps the index build time merits its own mention in the summary statement, and the read-processing clock shouldn't start until the index is built and read processing begins - not sure how complicated that would be to change, though.

Yes, it may make sense to split this. As a first step, I included the index build time in the message shown after the index is built:

Built an index containing 1222483 strings in 0.9 s.

Feel free to test the development version if you want, which would also give you the --no-indels indexing speedup: https://cutadapt.readthedocs.io/en/stable/installation.html#installing-the-development-version (But no worries if you cannot.)

marcelm commented 1 year ago

I’ve managed to speed up index generation when indels are allowed by about a factor of three. It’s still about 10 times slower than the (sped up) --no-indels version, so maybe not usable in your case, yet, but given that the index is also about six times bigger, it takes only slightly longer per generated string.

eboyden commented 1 year ago

Thanks for working on this...

First, I apologize but I made a mistake earlier when I was conducting the speed tests - I thought I was comparing 4.3 to 3.8 but I was actually running 4.3 multiple times.

New speed test:

4.3, k > 3

Sample1

real    8m34.246s
user    8m47.473s
sys 1m41.816s

Finished in 454.200 s (1489.337 µs/read; 0.04 M reads/minute).

Sample2

real    12m15.937s
user    51m6.234s
sys 9m58.945s

Finished in 663.698 s (61.073 µs/read; 0.98 M reads/minute).

4.3, k > 2

Sample1

real    7m53.122s
user    9m10.521s
sys 2m13.548s

Finished in 412.851 s (1353.751 µs/read; 0.04 M reads/minute).

Sample2

real    13m41.391s
user    95m50.884s
sys 21m43.652s

Finished in 761.676 s (70.089 µs/read; 0.86 M reads/minute).

3.8, k > 2

Sample1

real    6m25.823s
user    7m29.919s
sys 1m24.187s

Finished in 330.20 s (1083 us/read; 0.06 M reads/minute).

Sample2

real    10m48.436s
user    72m57.797s
sys 13m46.164s

Finished in 598.29 s (55 us/read; 1.09 M reads/minute).

So this suggests that k > 3 does indeed speed things up overall when processing enough reads (Sample2), but it's not worth the tradeoff in index building time when processing relatively few reads (Sample1). It also suggests that the k-mer heuristic (or some other difference between 3.8 and 4.3) is indeed slowing things down, at least for my use case. Note this was just cutadapt reading and writing fastqs, nothing else in my pipeline.

eboyden commented 1 year ago

4.4.dev21+gf9397b7, k > 2

Sample1

real    4m13.141s
user    5m42.511s
sys 1m47.787s

Built an index containing 68243443 strings in 74.7 s.
Built an index containing 51346514 strings in 52.2 s.
Finished in 210.504 s (690.251 µs/read; 0.09 M reads/minute).

Sample2

real    10m28.457s
user    99m7.967s
sys 18m22.125s

Built an index containing 68243443 strings in 74.4 s.
Built an index containing 51346514 strings in 52.6 s.
Finished in 588.381 s (54.142 µs/read; 1.11 M reads/minute).

So definitely considerably faster. Looks like it's taking roughly 2 minutes to build both indices - which in the case of Sample1 is approximately half the wall clock time.

eboyden commented 1 year ago

4.4.dev21+gf9397b7, k > 3

Sample1

real    4m39.216s
user    5m2.144s
sys 1m27.477s

Built an index containing 92559407 strings in 105.5 s.
Built an index containing 52255522 strings in 57.3 s.
Finished in 237.129 s (777.555 µs/read; 0.08 M reads/minute).

Sample2

real    8m9.275s
user    47m18.398s
sys 9m50.037s

Built an index containing 92559407 strings in 106.5 s.
Built an index containing 52255522 strings in 57.9 s.
Finished in 437.790 s (40.285 µs/read; 1.49 M reads/minute).

So using k > 3 added about 40 seconds in total to the index creation time - which resulted in a mild slowdown for Sample1, but greatly sped up Sample2, due to how many more reads there are in Sample2.

4.4.dev21+gf9397b7, k > 2, no Ns in reads

Sample1

real    4m11.301s
user    5m32.277s
sys 1m37.363s

Built an index containing 68243443 strings in 74.3 s.
Built an index containing 51346514 strings in 52.0 s.
Finished in 196.659 s (647.487 µs/read; 0.09 M reads/minute).

Sample2

real    9m4.767s
user    82m56.624s
sys 15m47.041s

Built an index containing 68243443 strings in 73.5 s.
Built an index containing 51346514 strings in 52.0 s.
Finished in 509.594 s (47.080 µs/read; 1.27 M reads/minute).

So interestingly, disallowing Ns in the reads had almost no effect on Sample1, but did seem to speed up Sample2 a bit. Sample 1 only had 1241 read pairs removed (0.4%), vs. 43307 read pairs removed for Sample2 (also 0.4%).

I should note that all of these trials used --no-indels.

marcelm commented 1 year ago

Thanks for testing! Can you please double check the Cutadapt version you are comparing against? I realized only now that there’s no version 3.8. Version 3.7 was followed by 4.0.

I noticed also that the k-mer heuristic is actually already disabled for 5' adapters when --no-indels is used, so this should not be responsible for any slowdown.

I will systematically test old versions of Cutadapt up to the most recent one to see whether I can determine what made the command you are using slower. Apart from the k-mer heuristic, a couple of other things have changed, such as the way --rename works.

eboyden commented 1 year ago

It's a development version, cutadapt-3.8.dev1+g3852ab0 - presumably what would eventually become 4.0.

Is the k-mer heuristic disabled with 5' (anchored?) adapters and --no-indels only in the newest 4.4.dev version, or has that been true since 4.3?

eboyden commented 1 year ago

Oh, and maybe also worth mentioning: the 3.8.dev1 version was built with Python 3.6.15; all the later versions were built with Python 3.7.16. So I don't know if there was a Python slowdown between 3.6 and 3.7.

marcelm commented 1 year ago

It's a development version, cutadapt-3.8.dev1+g3852ab0 - presumably what would eventually become 4.0.

Ah, that explains it. That particular version is just one commit ahead of 3.7, so it’s essentially 3.7.

Is the k-mer heuristic disabled with 5' (anchored?) adapters and --no-indels only in the newest 4.4.dev version, or has that been true since 4.3?

That has been true since the k-mer heuristic was introduced in 4.3.

Oh, and maybe also worth mentioning: the 3.8.dev1 version was built with Python 3.6.15; all the later versions were built with Python 3.7.16. So I don't know if there was a Python slowdown between 3.6 and 3.7.

Ok, I’ll take that into account when testing. There are lots of variables now ...

marcelm commented 1 year ago

I benchmarked Cutadapt 3.7-4.3 on Python 3.7-3.11 using data similar in structure to yours and nearly the same command line as you provided, but with only 2000 adapters and 250'000 read pairs. I did not test Python 3.6 as it is end-of-life and no longer supported by recent Cutadapt versions. The shown time is the minimum runtime in seconds over nine runs (the variance is quite high, so don’t trust the digits after the decimal point too much).

Python version    v3.7    v4.0    v4.1    v4.2    v4.3    main (f9397b7)
3.7              39.87   39.67   39.67   38.72   44.50   37.47
3.8              31.49   30.69   30.60   30.17   32.91   27.42
3.9              30.89   30.33   30.37   30.04   32.98   28.03
3.10             30.79   30.19   30.14   29.85   31.34   26.22
3.11             27.93   26.39   26.35   26.19   27.60   24.52

The first observation is that there is indeed some consistent slowdown from 4.2 to 4.3 (I plan to investigate this further).

However, the big conclusion to me is: If you care about performance, use a more recent Python version. Just switching from Python 3.7 to 3.8 more than offsets the slowdown from 4.2 to 4.3.

Edit: Added a column for main (commit f9397b7ca9eee5e089f5eadcf9b4295e766c0318). 2nd edit: Nine rounds instead of seven

eboyden commented 1 year ago

Definitely interesting...time to update Python I guess

marcelm commented 1 year ago

I made one last improvement.

Just like your data, the data that I generated for the benchmark includes N bases (within the part of the read that matches the adapter). As I described above, this causes Cutadapt to fall back to the slow one-by-one matching algorithm.

Starting from commit 7adfc5de4c2ea695993bf87dba9b8b0ddeaf6449, the index is now used even if there are any N wildcards.

I simplified the benchmark a bit because I did not want to wait as long this time, and the results look like this:

Python version    v3.7    v4.0    v4.1    v4.2    v4.3    f9397b7   7adfc5d
3.7              19.99   18.89   19.33   19.98   21.86   19.45     11.73
3.8              14.97   14.39   14.36   14.32   15.56   12.65      8.99
3.9              14.54   14.47   14.26   13.99   15.52   12.99      9.00
3.10             14.52   14.43   14.17   14.03   14.70   12.15      8.60
3.11             13.22   12.53   12.49   12.39   13.07   11.29      8.90

(The last two column headers abbreviate the commits f9397b7ca9eee5e089f5eadcf9b4295e766c0318 and 7adfc5de4c2ea695993bf87dba9b8b0ddeaf6449.)

It would be great if you could test this most recent commit. You may or may not see a speedup, depending on your data. If it works, I can release a new Cutadapt version soon.

That’s as much as I am going to work on this for now. Please let me know if the issue can be closed (or close it yourself). I listed the remaining ideas that I have over at #690 so they won’t be forgotten.

eboyden commented 1 year ago

I will - Python 3.11 keeps fighting my efforts to install it, but I can at least test it with 3.7, and I'll post the results here then close the issue if I don't see anything unexpected.

Did you track down the source of the slowdown from 4.2 to 4.3?

eboyden commented 1 year ago

I finally managed to install Python 3.11 and began running simulations, but this is as far as I got:

Cutadapt 4.4.dev30+g9d61c9e, Python 3.7.16

cutadapt-4.4.dev30+g9d61c9e-py3.7.16/bin/cutadapt -j 16 -u 5 -U 5 -g ^file:$a -G ^file:$b --action none --rename '{id} RX:Z:{r1.cut_prefix}{r2.cut_prefix}\tXP:Z:{rn}~{adapter_name}' --interleaved --no-indels $y.fq.gz -o $z.fq.gz &>> $z.log

Sample1

real    17m35.916s
user    52m12.846s
sys 44m56.334s

Built an index containing 68243443 strings in 74.8 s.
Built an index containing 51346514 strings in 52.4 s.
Built an index containing 68243443 strings in 185.8 s.
Built an index containing 68243443 strings in 196.9 s.
Built an index containing 68243443 strings in 196.9 s.
Built an index containing 68243443 strings in 199.8 s.
Built an index containing 68243443 strings in 198.9 s.
Built an index containing 68243443 strings in 203.6 s.
Built an index containing 68243443 strings in 203.4 s.
Built an index containing 68243443 strings in 205.2 s.
Built an index containing 68243443 strings in 217.8 s.
Built an index containing 68243443 strings in 312.4 s.
Built an index containing 68243443 strings in 337.2 s.
Built an index containing 68243443 strings in 546.2 s.
Built an index containing 68243443 strings in 580.4 s.
Built an index containing 68243443 strings in 629.6 s.
Built an index containing 68243443 strings in 632.8 s.
Built an index containing 68243443 strings in 632.6 s.

Sample2

real    13m36.111s
user    47m48.252s
sys 40m33.057s

Built an index containing 68243443 strings in 66.9 s.
Built an index containing 51346514 strings in 46.8 s.
Built an index containing 68243443 strings in 148.2 s.
Built an index containing 68243443 strings in 160.1 s.
Built an index containing 68243443 strings in 160.1 s.
Built an index containing 68243443 strings in 166.1 s.
Built an index containing 68243443 strings in 167.5 s.
Built an index containing 68243443 strings in 169.8 s.
Built an index containing 68243443 strings in 172.1 s.
Built an index containing 68243443 strings in 179.0 s.
Built an index containing 68243443 strings in 180.1 s.
Built an index containing 68243443 strings in 252.6 s.
Built an index containing 68243443 strings in 330.0 s.
Built an index containing 68243443 strings in 335.9 s.
Built an index containing 68243443 strings in 365.6 s.
Built an index containing 68243443 strings in 365.9 s.
Built an index containing 68243443 strings in 365.2 s.
Built an index containing 68243443 strings in 418.4 s.
Built an index containing 51346514 strings in 91.6 s.
Built an index containing 51346514 strings in 96.4 s.
Built an index containing 51346514 strings in 129.1 s.

And both samples eventually crashed due to a Memory Error. Same result with Python 3.11.3 except it crashed faster.

Sample1

real    11m40.559s
user    42m54.737s
sys 41m59.665s

Sample2

real    10m49.839s
user    38m40.697s
sys 44m25.772s

I don't know if building multiple indices per sample is a bug or intended behavior, but I'm guessing the former?

marcelm commented 1 year ago

Thanks, good to know. This is due to some unrelated changes that I made over the last days, which affect multicore mode. Please use commit 7adfc5de4c2ea695993bf87dba9b8b0ddeaf6449 while I try to figure out how to fix this. I’ll revert the changes if I cannot find a solution.

Note to self:

eboyden commented 1 year ago

Thanks for alerting me to that. Here are the latest successful simulations:

Cutadapt 4.4.dev22+g7adfc5d, Python 3.7.16

Sample1

real    3m24.927s
user    5m5.738s
sys     0m53.359s

Built an index containing 68243443 strings in 66.9 s.
Built an index containing 51346514 strings in 48.0 s.
Finished in 172.830 s (566.717 µs/read; 0.11 M reads/minute).

Sample2

real    8m2.702s
user    77m57.118s
sys     6m17.657s

Built an index containing 68243443 strings in 66.9 s.
Built an index containing 51346514 strings in 46.5 s.
Finished in 447.768 s (41.203 µs/read; 1.46 M reads/minute).

Cutadapt 4.4.dev22+g7adfc5d, Python 3.11.3

Sample1

real    3m21.070s
user    4m15.217s
sys     0m48.446s

Built an index containing 68243443 strings in 77.4 s.
Built an index containing 51346514 strings in 61.1 s.
Finished in 171.375 s (561.943 µs/read; 0.11 M reads/minute).

Sample2

real    5m53.097s
user    46m15.771s
sys     6m39.444s

Built an index containing 68243443 strings in 67.2 s.
Built an index containing 51346514 strings in 56.2 s.
Finished in 311.468 s (28.661 µs/read; 2.09 M reads/minute).

So it's certainly faster than it used to be, and Python 3.11 provides an extra boost, but only for Sample2, which is consistent with it helping the read processing but not the index building. Sample1 is still relatively slow per read, but that's because there are few reads so the index building is ~60% of the total run time.

Cutadapt 4.4.dev22+g7adfc5d, Python 3.7.16, k > 3

Sample1

real    3m59.043s
user    4m17.661s
sys     0m52.348s

Built an index containing 92559407 strings in 96.0 s.
Built an index containing 52255522 strings in 52.0 s.
Finished in 203.140 s (666.101 µs/read; 0.09 M reads/minute).

Sample2

real    6m13.398s
user    28m14.623s
sys     2m23.643s

Built an index containing 92559407 strings in 94.0 s.
Built an index containing 52255522 strings in 51.8 s.
Finished in 322.329 s (29.660 µs/read; 2.02 M reads/minute).

Cutadapt 4.4.dev22+g7adfc5d, Python 3.11.3, k > 3

Sample1

real    4m11.745s
user    4m20.787s
sys     1m9.049s

Built an index containing 92559407 strings in 126.1 s.
Built an index containing 52255522 strings in 59.7 s.
Finished in 215.957 s (708.129 µs/read; 0.08 M reads/minute).

Sample2

real    5m35.048s
user    25m4.345s
sys     3m39.311s

Built an index containing 92559407 strings in 103.1 s.
Built an index containing 52255522 strings in 56.2 s.
Finished in 291.057 s (26.783 µs/read; 2.24 M reads/minute).

As previously observed, k > 3 speeds up the DNA sample but slows down the NTC. And the DNA speedup is much more pronounced with Python 3.7 than with Python 3.11.

Cutadapt 4.4.dev22+g7adfc5d, Python 3.11.3, k > 2, no Ns

Sample1

real    3m4.328s
user    4m2.031s
sys     0m47.019s

Built an index containing 68243443 strings in 67.6 s.
Built an index containing 51346514 strings in 56.8 s.
Finished in 155.229 s (511.081 µs/read; 0.12 M reads/minute).

Sample2

real    5m51.682s
user    45m44.952s
sys     6m8.711s

Built an index containing 68243443 strings in 68.8 s.
Built an index containing 51346514 strings in 56.6 s.
Finished in 312.744 s (28.894 µs/read; 2.08 M reads/minute).

Cutadapt 4.4.dev22+g7adfc5d, Python 3.11.3, k > 3, no Ns

Sample1

real    3m48.621s
user    4m11.466s
sys     0m55.785s

Built an index containing 92559407 strings in 104.1 s.
Built an index containing 52255522 strings in 56.7 s.
Finished in 192.624 s (634.201 µs/read; 0.09 M reads/minute).

Sample2

real    5m38.538s
user    25m5.516s
sys     3m46.100s

Built an index containing 92559407 strings in 103.2 s.
Built an index containing 52255522 strings in 56.1 s.
Finished in 291.977 s (26.975 µs/read; 2.22 M reads/minute).

Disallowing Ns does not have an appreciable effect on the DNA sample time, but still speeds up the NTC modestly. For whatever reason, it looks like with Python 3.11 (but not Python 3.7), the read 1 index takes longer to build for the NTC than for the DNA sample, but only when Ns are present in reads - so this is where the speedup is coming from. Weird that it would only affect the NTC since the index building should be identical across samples, no?

In any case, this is a great improvement, I'm impressed. And looking forward to the further refinements you're working on.

marcelm commented 1 year ago

Thanks a lot for the measurements. I think they look consistent with the improvements I made, and they make sense with the knowledge that indexing time is (currently) included in the time per read.

The CPU time per read is still quite high in the unmodified Cutadapt version, which appears to be caused by the adapters that are not included in the index because 3 errors are allowed for them. Since allowing them to be included in the index with the k > 3 modification makes the indexing so much slower, an alternative way to improve the speed would be to reduce the number of allowed errors for these to 2, either by reducing the overall maximum error rate or by adding ;e=2 to just the long sequences in the FASTA file.
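
For example, capping a long primer at two errors directly in the adapter file might look like this (name and bases made up):

    >long_primer_1
    ^ACGTACGTACGTACGTACGTACGTACGTACGT;e=2

At the default rate of 0.1, this 32 nt sequence would be allowed 3 errors and therefore be excluded from the index under the default k > 2 cutoff; the ;e=2 override keeps it indexable.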

Cutadapt’s default maximum error rate of 10% is quite high, and it is that way only because the regular adapter matching algorithm isn’t slowed down by a high error rate that much.

BTW, I have reverted the changes that made index creation crash. They will likely not be part of the next Cutadapt release.