Closed kloetzl closed 1 year ago
Thank you for investigating this!
The templates shouldn't matter too much. The compiler is pretty good at resolving them into normal for loops, but saying that views are not completely cost free.
The real bottleneck is that we internally use look-up tables to convert char
to seqan3::dna4
and the same for reverse complementing a seqan3::dna4
. Look-up tables are inherently bad for SIMD and I have stated several time that arithmetic operations might be better for exactly this use case (I just talked about this today with @rrahn). Since we would benefit more with overall performance if we would use SIMD instead of single core performance.
I never had the time to validate those claims, because we are currently stabilizing the public interfaces.
So it is fascinating that you are validating my gut feeling :)
Our internal rank representation is currently 0: A
, 1: C
, 2: G
, 3: T
, so 3 - dna4
letter is indeed the correct way to calculate the reverse complement. It would be interesting to see if we could find a similar formula for seqan3::dna15 (i.e. IUPAC)
.
It would be nice if you could try this patch out:
diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..9f60668ca 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -82,6 +82,18 @@ public:
}
//!\}
+ constexpr dna4 & assign_char(char_type const c) noexcept
+ {
+ char_type upper_char = c & 0b0101'1111; // to_upper
+ assign_rank((upper_char == 'T') * 3 + (upper_char == 'G') * 2 + (upper_char == 'C'));
+ return *this;
+ }
+
+ constexpr dna4 complement() const noexcept
+ {
+ return dna4{}.assign_rank(to_rank() ^ 0b11);
+ }
+
protected:
//!\privatesection
(This won't do any look-ups and will only use arithmetic operations, which should be auto-vectorisable)
This is your benchmark, right?
auto s = std::string{forward};
auto s_as_dna = s | seqan3::views::char_to<seqan3::dna4>;
for (auto _ : state) {
auto revcomp =
s_as_dna | std::views::reverse | seqan3::views::complement;
// force evaluation of view
for (auto foo : revcomp)
benchmark::DoNotOptimize(foo);
}
Can you try this variant for me (after you applied the patch)?:
auto s = std::string{forward};
for (auto _ : state) {
auto revcomp =
s | std::views::reverse | seqan3::views::char_to<seqan3::dna4> | seqan3::views::complement;
// force evaluation of view
for (auto foo : revcomp)
benchmark::DoNotOptimize(foo);
}
For cache behaviour I would first reverse and when do the char transformation. These are all lazy operations and the order shouldn't matter too much, but I would expect that doing it in this order should be tiny bit faster.
The compiler is pretty good at resolving them into normal for loops.
That is the issue here. Even with my patch applied the code is compiled into a normal, scaler loop, not a (auto-)vectorized one despite full optimizations.
It would be interesting to see if we could find a similar formula for seqan3::dna15 (i.e. IUPAC).
On x86_64 you can use PSHUFB
for parallel table lookups. Here is an example how this can be done. However, that makes it less portable.
It would be nice if you could try this patch out: …
That makes it much slower than my patch and your original version. (Changing the benchmarking code makes no big difference.)
-------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------
dnax_revcomp 1053474 ns 1023856 ns 679
bwa_bench 185624 ns 181061 ns 3902
bench/dna4_revcomp 186789 ns 176443 ns 3970
bench_seqan3_dna4 3610707 ns 3493823 ns 199
Ooops. Turns out my benchmarking code was wrong. Because I DoNotOptimize(foo) I basically force byte-wise generation. Storing the result in a vector is properly vectorized.
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
dnax_revcomp 1066001 ns 1027934 ns 657
bwa_bench 184196 ns 178005 ns 3922
bench/dna4_revcomp 180400 ns 174404 ns 4028
original (master):
bench_seqan3_dna4 734305 ns 713637 ns 976
bench_seqan3_dna4_vector 800214 ns 777704 ns 908
my patch:
bench_seqan3_dna4 679484 ns 663801 ns 1055
bench_seqan3_dna4_vector 847492 ns 832866 ns 844
your patch:
bench_seqan3_dna4 3791564 ns 3716924 ns 195
bench_seqan3_dna4_vector 130958 ns 127241 ns 5311
So your new patch is really really fast. In fact it is much faster than I would expect it to be as it does a lot per loop iteration.
Phew. I thought I'm stupid xD
A lot in a loop does not mean much, if it can pipeline a lot. (Which should be the case here)
Thank you for bringing this to our attention and I think it is extremely valuable to have this kind of constructive feedback.
I was disconnected form internet today:
I wanted to try this patch next
diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..4c224419f 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -47,11 +47,11 @@ class rna4;
* If the special char conversion of IUPAC characters is not your desired behavior, refer to our cookbook for an
* example of \ref cookbook_custom_dna4_alphabet to change the conversion behavior.
*/
-class dna4 : public nucleotide_base<dna4, 4>
+class dna4 : public nucleotide_base<dna4, 256>
{
private:
//!\brief The base class.
- using base_t = nucleotide_base<dna4, 4>;
+ using base_t = nucleotide_base<dna4, 256>;
//!\brief Befriend seqan3::nucleotide_base.
friend base_t;
@@ -82,6 +82,44 @@ public:
}
//!\}
+ static uint8_t constexpr alphabet_size = 4;
+
+ constexpr dna4 & assign_rank(rank_type const c) noexcept
+ {
+ base_t::assign_rank(c == 0 ? 'A' : (c == 1 ? 'C' : (c == 2 ? 'G' : 'T')));
+ return *this;
+ }
+
+ constexpr dna4 & assign_char(char_type const c) noexcept
+ {
+ base_t::assign_rank(c);
+ return *this;
+ }
+
+ constexpr char_type to_char() const noexcept
+ {
+ return base_t::to_rank();
+ }
+
+ constexpr rank_type to_rank() const noexcept
+ {
+ char_type c{to_char()};
+ char_type upper_char = c & 0b0101'1111; // to_upper
+ rank_type rank{};
+ rank = (upper_char == 'T') * 3 + (upper_char == 'G') * 2 + (upper_char == 'C');
+ return rank;
+ }
+
+ constexpr dna4 complement() const noexcept
+ {
+ char_type rank{to_char()};
+ rank ^= rank % 2 ? ('C' ^ 'G') : ('A' ^ 'T');
+
+ dna4 ret{};
+ static_cast<base_t &>(ret).assign_rank(rank);
+ return ret;
+ }
+
protected:
//!\privatesection
which is basically the same as not converting a char when reading it in and doing your pretty nifty trick in the complement with the conditional xor. :)
@kloetzl Can you tell me the tool which you used to visualise the assembler / opcode? Thank you! :)
Those are the Linux perf tools. Just perf record ./command --flags
and it takes a profile. With perf report
you get an overview of all functions and how much time is spent in each. You can then enter each function and inspect individual instructions. If you supply -ggdb
you even get the instructions interleaved with the source.
Btw, next time you tag a release please add the -a
flag. Otherwise git describe uses 3.0.0 as reference.
(EDIT marehr: This will be tracked at https://github.com/seqan/product_backlog/issues/159)
Found a bug in my code. Here are some new numbers using this patch.
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
dnax_revcomp 1060246 ns 1008337 ns 685
bwa_bench 135160 ns 134486 ns 6061
bench/dna4_revcomp 138323 ns 137878 ns 5031
bench/dna4_revcomp_inline 108384 ns 106274 ns 5825
bench_seqan3_dna4 3618904 ns 3523777 ns 198
bench_seqan3_dna4_vector 134825 ns 129857 ns 5182
I recomputed the table above with the latest master on my old laptop.
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
dnax_revcomp 1022730 ns 1020968 ns 572
bwa_bench 99187 ns 98739 ns 7299
bench/dna4_revcomp 92767 ns 92545 ns 7401
bench/twiddle 101125 ns 100873 ns 7023
bench_seqan3_dna4 701982 ns 701122 ns 1013
bench_seqan3_dna4_vector 804092 ns 772922 ns 922
I recomputed the table above with the latest master on my old laptop.
-------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- dnax_revcomp 1022730 ns 1020968 ns 572 bwa_bench 99187 ns 98739 ns 7299 bench/dna4_revcomp 92767 ns 92545 ns 7401 bench/twiddle 101125 ns 100873 ns 7023 bench_seqan3_dna4 701982 ns 701122 ns 1013 bench_seqan3_dna4_vector 804092 ns 772922 ns 922
Thank you for reminding :) I guess those numbers are without the proposed patch.
Maybe I add some progress report:
We started to change our alphabet_base implementation to allow any kind of char
to rank
and rank
to char
conversion method. The same will be done for the complement implementation until the next release. This allows implementing a dna4_simd
alphabet which uses auto-vectorisable arithmetic operations, like the one I proposed in this Patch.
One of the major problems we are facing right now, is how to provide a sane interface that provides the best implementation in both the generic and simd-aware use cases.
Some benchmarks showed that switching from look-up tables to arithmetic operations can have a drastic performance penality with SIMD disabled. Also use cases that only call seqan3::assign_char_to
occasionally, will have slower performance.
It seems like enabling SIMD computation will need some special care, in the past I thought that adding a special view
std::vector<char> sequence{};
auto view = sequence | std::views::reverse | seqan3::views::simdify(seqan3::views::char_to<seqan3::dna4> | seqan3::views::complement);
might be a viable solution, but I'm not completely sure if that is still a good idea.
Right now I think changing the way how the API is called might be a good first step, e.g.
// single conversion, non-SIMDifiable
// always uses lookup-tables
dna4::char_to_rank('A') == 0;
using rank_type = seqan3::alphabet_rank_t<seqan3::dna4>;
std::vector<char> sequence{'A', 'C', 'G', 'T'};
std::vector<rank_type> ranks(4, 0); // same size as sequence
// std::span is contiguous memory, multiple conversions, SIMDifiable
// uses lookup-tables without any optimisations, uses SIMD-operations when using SSE4, ... etc.
dna4::char_to_rank(std::span{ranks} /*out-param*/, std::span{sequence} /*in-param*/);
std::vector<rank_type> complement_ranks(4, 0);
dna4::rank_complement(std::span{complement_ranks} /*out-param*/, std::span{ranks});
std::vector<char> complement_sequence(4, 'A');
dna4::char_complement(std::span{complement_sequence} /*out-param*/, std::span{sequence});
This would allow to provide a more "C" like interface to work directly on contiguous, raw-pointers and I guess would be a good building-block to provide a high-level interface such as seqan3::views::simdify
.
2022 update, numbers still look the same:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
dnax_revcomp 778646 ns 774001 ns 887
bwa_bench 97834 ns 97409 ns 6986
bench/dna4_revcomp 92363 ns 91299 ns 7631
bench_seqan3_dna4 749998 ns 711614 ns 974
bench_seqan3_dna4_vector 810274 ns 782115 ns 889
If the small patch by @marehr up there (https://github.com/seqan/seqan3/issues/1970#issuecomment-654479994) already improves performance so much, can we just apply that, even if we don't have a more general solution for the other alphabets, yet?
edit: ah, I see, it slows things down for the non-vectorised use-case, right? May be just `#ifdef' around it?
Although from @marehr's last comment it doesn't feel like a good solution to just patch it, I would have to look deeper into this, which #ifdef
would you propose? An extra macro that you can enable/disable with Cmake? This would be a lot of infrastructure for such a small thing.
which #ifdef would you propose?
I would just check for AVX2:
#ifdef __AVX2__
But before adding, it would probably be good to have a benchmark included, as well. What was the latest benchmark code you were using, @kloetzl ?
I wouldn't add this as a #ifdef
. Why do we have the alphabet concept?
My patch is just good for a certain use case and without carefully crafting your code that makes use of this patch (mainly auto-vectorizable loops on contiguous memory), you will have a performance degeneration for all other use cases (most user code will NOT be in this form).
I think the more appropriate option would be to add special alphabets for exactly this use case:
seqan3::simd::dna4_rank dna4_1; // (rank based implementation, my patch)
seqan3::simd::dna4_char dna4_2; // (char based implementation, kloetzl implementation)
This would just be one part, the second part would be the question of how to efficiently have sequence operations/algorithms, like e.g. "reverse_complement".
But, lacking an application that uses this, I don't know what the best API would be.
As I wrote in https://github.com/seqan/seqan3/issues/1970#issuecomment-814113239, my gut feeling is that having "algorithmic" functions (like in #include <algorithm>
) would be the best way to encapsulate "simd" functionality. This is a bit contrary to our views-only approach, as it is more low-level.
I agree with @marehr that we should not aim for an easy-patch solution when we do not have a use case at hand. (I don't count a specific benchmark as use case).
@h-2 An #ifdef
checking merly for the availability of SIMD will most probably decrease performance in a lot of existing code.
Regarding the extra alphabet, we could maybe add a cookbook entry for now that basically provides an efficient dna4 simd alphabet with @marehr s patch applied. Or maybe a documentation entry in the alphabt module page or even an how-to if someone has the time. With a note that the user interested in it should report back to us if he/she has a specific use case.
Re-reading this thread, kloetzl proposed a "better" complement implementation for dna4 / rna4.
He stated that
Adding the following function to dna4 makes complementing it about 16% faster. https://github.com/seqan/seqan3/issues/1970#issue-651799129
Currently, we use lookup tables for reverse complement,
and kloetzl's patch
public:
constexpr dna4 complement() const noexcept
{
return dna4{}.assign_rank(3 - to_rank());
}
or my patch
diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..9f60668ca 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -82,6 +82,18 @@ public:
}
//!\}
+ constexpr dna4 complement() const noexcept
+ {
+ return dna4{}.assign_rank(to_rank() ^ 0b11);
+ }
would use arithmetic operations instead. As both are "single" instruction solutions, they should always be at least as good as the lookup table (for all use cases).
See https://godbolt.org/z/TKKoEv1bc for generated instructions.
So I would suggest changing dna4's complement implementation from lookup to xor. (We/I already did something similar for the phred implementation, https://github.com/seqan/seqan3/commit/fdf35e500a9a0c57ba7ff7fab48ba52e82b29b24)
Regarding the extra alphabet, we could maybe add a cookbook entry for now that basically provides an efficient dna4 simd alphabet with @marehr s patch applied. Or maybe a documentation entry in the alphabt module page or even an how-to if someone has the time. With a note that the user interested in it should report back to us if he/she has a specific use case.
Adding a cookbook entry how to write an alphabet that works in auto-vectorized loops would be fine for me with that remark.
But before adding, it would probably be good to have a benchmark included, as well. What was the latest benchmark code you were using, @kloetzl ?
Here you go: https://github.com/kloetzl/libdna/blob/master/bench2/revcomp.cxx
This would just be one part, the second part would be the question of how to efficiently have sequence operations/algorithms, like e.g. "reverse_complement".
But, lacking an application that uses this, I don't know what the best API would be.
I would argue that computing the reverse_complement is more common than just the complement on its own.
I wouldn't add this as a #ifdef.
One more problem with an ifdef is portability. If an application using seqan would be compiled as a debian package on a machine supporting AVX2 it would not run on a users machine without AVX2 support.
As both are "single" instruction solutions, they should always be at least as good as the lookup table (for all use cases).
Yes, xor and subtraction are near-optimal solutions. That should be good enough for most use cases.
Alright, to me it looks like the following tasks would be open then:
What I don't see happening but is part of the discussion:
#ifdef
for SIMD available and apply this patch to dna4.I'll propose this in the next core meeting on Monday :)
I think for dna5 some potential arithmetic versions could be:
uint8_t dna5_complement_subtract(uint8_t rank)
{
return min(3 - rank, 4); // using underflow
}
uint8_t dna5_complement_xor_is_dna5(uint8_t rank)
{
bool is_n = rank > 3;
return is_n ? rank : (rank ^ 0b11); // using conditional assignment
}
uint8_t dna5_complement_xor_min(uint8_t rank)
{
rank = rank ^ 0b11;
return min(rank, 4); // correcting reverse complement of N to be N
}
Assembler suggests that dna5_complement_xor_is_dna5
might be the best variant. See https://godbolt.org/z/qqq78fvb7
Assembler suggests that
dna5_complement_xor_is_dna5
might be the best variant. See https://godbolt.org/z/qqq78fvb7
That is very similar to how bwa does it.
We agreed on the tasks outlined in comment https://github.com/seqan/seqan3/issues/1970#issuecomment-1098788971.
Writing the benchmark has highest priority.
We merged https://github.com/seqan/seqan3/pull/3026 which adds an example for an auto-vectorizable dna4 alphabet and improves the performance of the dna4-complement and also adds a benchmark.
Thanks again a lot for your work, @kloetzl !
You are welcome to open another issue if you have other suggestions, we are excited to hear about them :)
That's great to hear! Pleased I could help and we managed to produce such a nice speed up. :+1:
Platform
Description
Adding the following function to dna4 makes complementing it about 16% faster. Patch at https://github.com/kloetzl/seqan3/commit/c43427f5092b95623eb07477513bb3afc40ff083
Now, you might stop here and think this is ugly and 16% is not worth it. Or you go all in for performance and merge this patch. However, there is more to this.
I am currently looking into fast ways to complement dna strings. Comparing my own implementation with BWA and Seqan3, I noticed that the instructions generated for seqan are suboptimal. (Note that I am not an expert on seqan, I just used what was recommended in the tutorial.) Digging into the code I found that it can be simplified and boils down to a subtraction. Here are the runtimes of computing the reverse complement of a megabase:
Even with the patch seqan is almost nine time slower than BWA which also uses a subtraction. However, the BWA code gets vectorized by the compiler, whereas seqan does not. So there is something in the template magic preventing that optimization. Here is the assembly that GCC generates:
Interestingly even this assembly is kinda weird. Note how the $0x3 is stored in a register %esi, and that is then moved on each iteration of the loop. There might be something fishy going on here because rank_type is unsigned char.
Just wanted to share what I found in the hope you can make sense of it. ☺ Best, Fabian