mpark / variant

C++17 `std::variant` for C++11/14/17
https://mpark.github.io/variant
Boost Software License 1.0
659 stars 88 forks source link

WIP: use switch case to improve codegen #52

Closed quicknir closed 5 years ago

quicknir commented 5 years ago

This PR is just to provide a convenient interface for initial code review.

For now I only implemented for variants of up to 2 types. It seems to compile and pass unit tests. I tried to insert the implementation branch (to switch case) at the lowest point in the call stack just before it heads to the function pointer table. It took me a bit of trial and error to understand what to use to get the reference to the ith type from the variant at that point in the callstack so I'm hopefully using the right thing. Hopefully, access::base::get_alt is both correct and also unchecked, the latter seems to be important in getting optimal assembly (e.g. when I wrote my top level switch case code with get instead of get_if it didn't generate as good code).

Anyhow, let me know what you think, we can finalize the max number of types for the optimization later once we have the details nailed down (it's very easy to change from how it's written).

quicknir commented 5 years ago

Oh, also I'm aware you might not want to use if constexpr in the implementation. I just did it this way for now, can easily switch to tag dispatch.

I didn't do the fdiagonal into switch case yet either, that could be interesting as well. AFAICS, the "visit_at" line of functions seems to only be used for two variants of same type, visiting at the same index. Seemingly for binary operations like ==? Should probably do that too actually, seems like significantly better assembly when switch case is used: https://gcc.godbolt.org/z/To-fsq.

mpark commented 5 years ago

Is the idea to use a switch for only 1 variant upto some fixed-N alternatives? Could we at least support it for arbitrary N alternatives for 1 variant?

quicknir commented 5 years ago

@mpark Well, you can't have a switch with an arbitrary number of cases. If variant has a well defined hard limit to the number of types it can hold, then you could generate up to that point. But if variant doesn't make such an attempt, and it's only limited by e.g. compiler limits on template recursion and such, then there is no way to do the switch case approach for purely arbitrary N types in a variant.

Personally I don't see this as a big issue; we already know this doesn't cover everything and instead is supposed to be an optimization for common cases. Covering up to e.g. 20 types will in practice significantly improve codegen for the overwhelming majority (IMHO) of uses of variant.

Note also that it will also cover built in binary operations (like comparison) for up to N types, despite those involving two variants, because the dispatch only occurs if the index for both variants matches, so we can handle that with switch case quite easily as well (not yet present in implementation, wanted to verify basic approach first).

mpark commented 5 years ago

I was thinking of something like this, assuming the fixed-N is 32:

template <std::size_t B, typename T>
decltype(auto) switch_visit(T&& v) {
  switch (v.index()) {
    case B+0: /* ... */ break;
    case B+1:  /* ... */ break;
    // ...
    case B+31: /* ... */ break;
    // this should be tail-recursive-able:
    default: switch_visit<B+32>(std::forward<T>(v));
  }
}
quicknir commented 5 years ago

@mpark Very neat! I hadn't considered an approach like that. However, there's a rather significant issue, which I couldn't solve on the user side but could very well be solvable when we are working inside the implementation. The code inside the /* .. */ is basically going to ask for a reference to the type inside the variant at index I (where I is just a constant matching the case). If you use an index that isn't within the size of the variant, you immediately get a hard compilation error (at least, this is the behavior both if you use the public api with get_if, and I think it's what you see with get_alt as well). For that reason, there has to be a separate switch case function for every size of variant. As well, for performance reasons the compiler may be able to generate better code when it has the unreachable pragma.

In order to use this approach, I'd: a) need some way to get a reference to the Ith type such that a compilation error is not triggered when I is out of bounds, b) we'd need to convince ourselves that this doesn't make the assembly worse in common case. I'm not against covering these two as long as it's reasonably time efficient, but I still feel like there is little disadvantage in only covering up to fixed N.

What do you think?

mpark commented 5 years ago

@quicknir: We have the exact same issue for a fixed size N though, right? As in, if our N is bigger than the variant_size, then we're gonna have to not hard-error on get. So it seems to me like we need to solve that problem regardless.

quicknir commented 5 years ago

@mpark Well, I actually generate a separate switch case function for each size of variant. That's what the integer template parameter in the function is for. So in the current approach we don't need to change get_alt and such.

SaadAttieh commented 5 years ago

Sorry if unclear, writing on my phone. I haven't seen the code but I'm inferring from what I am reading here. I have two questions. 1. If @quicknir has a method of generating any length of switch up to some fixed n, can't this be combined with @mpark 's method of recursing in batches accept that instead of always assuming a fixed size batch, i.e. batches of 32, the next batch size can be calculated exactly based on the number of cases left to be handled. If it's more than n, (e.g. 32 as above), you pick the largest switch function, the one with n cases. But if what we have left is less than n, then we forward to a switch of length exactly equal to the amount left. Again that is assuming I read it correctly that you can generate switches of any length up to n.

Alternatively, to prevent taking a reference to the type inside the variant when the switch case is out of bounds, could you not wrap the code for each case inside a templated function, invoke_if_in_bounds? Can't write this on my phone atm but it would be the basic enable_if, if it's out of bounds, you forward to a function that does not run the code which would cause a compile error.

Regards, Saad Sent from my iPhone (excuse typos)

On 13 Oct 2018, at 4:44 am, quicknir notifications@github.com wrote:

Well, I actually generate a separate switch case function for each size of variant. That's what the integer template parameter in the function is for. So in the current approach we don't need to change get_alt and such.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

quicknir commented 5 years ago

I like your first idea a lot, I'll try to show a POC godbolt example this weekend.

mpark commented 5 years ago

Here's a version that handles arbitrary # of alternatives, with N=4, for one variant with 6 alternatives: https://godbolt.org/z/TBd9rr

mpark commented 5 years ago

Okay, I've compared the following options to visit a single variant with 7 alternatives.

Clang 7 GCC 8.2
switch HS-C HS-G
switch_visit, N=4 SV4-C SV4-G
switch_visit, N=8 SV8-C SV8-G
mpark::visit MV-C MV-G

Results (roughly.. in terms of code size/quality):

GCC doesn't seem to know how to unroll the recursive call in case it's needed, but I think that's fine as it's still better than the current manual jump table approach.

mpark commented 5 years ago

@quicknir: have you had time to look into this stuff?

quicknir commented 5 years ago

@mpark Sorry, got a bit busy, will make time this coming weekend.

mpark commented 5 years ago

@quicknir: Cool, no problem! I wanted to share my preference in direction so that we can align our expectations.

The following are some experiments with double visitation:

The fact that the templated nested switch generates the same code as the hand-rolled code makes me optimistic about even N-ary visitation performing better than the manual jump table approach.

Anyway, I'm not sure exactly what stage you're in at the moment, but let me know if you disagree so that we can figure out what we can agree to do here!

quicknir commented 5 years ago

@mpark I am fine with the recursive switch solution you suggested for single visitation. The only caveat is that I think that N is reasonable (say 10, as opposed to like 2) so that in common cases we still get optimal codegen, from gcc as well. N-ary visitation, makes me nervous; I tend to agree the codegen will probably be better, it's just that implementing it for arbitrary ariness seems like it will be significantly more complicated, and will delay things going through, and for significantly less benefit. Neither of the links obviously scales to arbitrary ariness AFAICS. It's possible that some kind of recursive approach would work (peel off the first variant each time, create a new visitor that hardcodes the position of the first variant and visits the rest normally and pass it down recursively).

Ideally (IMHO), we would start with single visitation + diagonal visitation (which IIUC is only used internally for binary operators like ==), get that to a place we're happy with and merge it, and only then start looking at N-ary visitation. That way, in parallel we can look at pushing this approach into the standard libraries. Ultimately getting single + diagonal visitation into standard libraries ASAP should be the main priority here, I think, because this is going to deliver the vast majority of the value to users. That said, I'm open if we can find a good implementation strategy. At the start I was doubtful about arbitrary N for single visitation and all the smart people on this thread changed my mind rather quickly :-).

quicknir commented 5 years ago

Okay, updated PR. It should now be doing single variant visitation with arbitrary number of types with recursive switch case. For now I only switch-case on 2 values at a time before recursing, but if you're ok with this approach I can extend to something more reasonable like 10, and also proceed to extend it to binary diagonal visitation. I think we might be pretty close!

mpark commented 5 years ago

@quicknir:

N-ary visitation, makes me nervous; I tend to agree the codegen will probably be better, it's just that implementing it for arbitrary ariness seems like it will be significantly more complicated, and will delay things going through, and for significantly less benefit.

I've put together the following snippet to show that the N-ary visitation code is achievable without too much complexity over the previous snippet. I would strongly prefer to replace the existing solution entirely, so that we get strictly better code-gen in general and also to solve problems like #41 for free. I promise it will not delay our progress.. personally I'd be much more excited/engaged if we were to pursue the full solution.

Neither of the links obviously scales to arbitrary ariness AFAICS.

I was only trying to hint that the generated code for the N-ary approach would be optimizable by the compiler with these links.

Anyway, the following is comparison of the options:

Single Visitation (7 alternatives):

Clang 7 GCC 8.2
switch HS1-C HS1-G
switch_visit SV1-C SV1-G
mpark::visit MV1-C MV1-G

Results:

Double Visitation (5 alternatives):

Clang 7 GCC 8.2
switch HS2-C HS2-G
switch_visit SV2-C SV2-G
mpark::visit MV2-C MV2-G

Results:

quicknir commented 5 years ago

@mpark Sorry, you say "I put together the following snippet" but then I'm not sure exactly which snippet you are referring to? My basic idea btw for scaling it to n-ariness is something like this:

template <class Visitor, class FirstVariant, class ... Variants>
decltype(auto) switch_visit(Visitor&& visitor, FirstVariant&& first_variant, Variants&& ... variants) {
   switch (first_variant.index()) {
      case(0): return switch_visit(make_recursing_visitor(visitor, *get_if<0>(first_variant)), std::forward<Variants>(variants)...);
      ...
   }
}

Basically, this function make_recursing_visitor is essentially performing currying for us. There's a lot of details to get right here with perfect forwarding, but I think it's basically doable. Is this what you had in mind for a completely generic solution?

mpark commented 5 years ago

@quicknir: Sorry about that. I was referring to the SV links in the tables above.

The approach you mention of retrieving the alternative per variant and passing it down the recursive calls induced push/pop calls from my experiments. It seems better to actually forward the variants through while building up the indices, and retrieve all of them at the end in the final dispatch. The following is the full solution inlined:

template <bool, typename F, typename... Vs>
struct Switch;

template <typename F, typename... Vs>
struct Switch<false, F, Vs...> {
  template <std::size_t = 0, std::size_t... Is, typename... Vs_>
  static void switch_(std::index_sequence<Is...>, F&&, Vs&&..., Vs_&&...) { __builtin_unreachable(); }
};

template <typename F, typename... Vs>
struct Switch<true, F, Vs...> {
  template <std::size_t B = 0, std::size_t... Is>
  static void switch_(std::index_sequence<Is...>, F&& f, Vs&&... vs) {
    static_assert(B == 0);
    std::invoke(std::forward<F>(f), *mpark::get_if<Is>(&vs)...);
  }

  template <std::size_t B = 0, std::size_t... Is, typename V_, typename... Vs_>
  static void switch_(std::index_sequence<Is...>, F&& f, Vs&&... vs, V_&& v_, Vs_&&... vs_) {
    static constexpr std::size_t size = mpark::variant_size_v<std::decay_t<V_>>;
    switch (v_.index()) {
      case B+0: Switch<B+0 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+0>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+1: Switch<B+1 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+1>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+2: Switch<B+2 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+2>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+3: Switch<B+3 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+3>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+4: Switch<B+4 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+4>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+5: Switch<B+5 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+5>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+6: Switch<B+6 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+6>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+7: Switch<B+7 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+7>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      default: Switch<B+8 < size, F, Vs...>::template switch_<B+8>(std::index_sequence<Is...>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<V_>(v_), std::forward<Vs_>(vs_)...); break;
    }
  }
};

template <typename F, typename... Vs>
void switch_visit(F&& f, Vs&&... vs) {
  Switch<true, F, Vs...>::switch_(std::index_sequence<>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs>(vs)...);
}
quicknir commented 5 years ago

@mpark That seems reasonable. Okay, I will put together arbitrary visitation including diagonal visitation in the PR by the end of next weekend.

mpark commented 5 years ago

I'm not sure if we even need a diagonal special case anymore with this approach actually 🤔 but I haven't tested that yet.

quicknir commented 5 years ago

@mpark Maybe it's not conclusive evidence but this snippet leads me to believe that you still need a diagonal special case to be optimal: https://godbolt.org/z/2Xek12. You can see that when I mark the off-diagonal case no inline, code is still generated for it, even though because of the early exit, it's not possible to reach it. This in turn isn't really surprising because the whole switch operation doesn't get fully inlined into f; this means that the whole visit code does not actually have knowledge of the early exit and therefore still has to generate branches etc concerning itself with the off diagonal case.

Note: I think the early exit is actually necessary; not here necessarily for equality but e.g. for operator< you can't recover the indices from the types in general, so you'll have to have the branch before the visitation, and only do visitation when indices are equal, but then you'll still have branches entering the non-diagonal cases even though they are unreachable, which is sub-optimal.

mpark commented 5 years ago

I'm not understanding what the purpose of the no-inline is... could you explain? I assume we wouldn't mark it inline, and also tell the compiler that it's unreachable: https://godbolt.org/z/N-ezqd

Anyway, I'm totally fine with implementing the diagonal case. It should be pretty simple, as it'll essentially be the single visitation code, taking arbitrary # of variants and retrieving the Ith alternative for all of them.

quicknir commented 5 years ago

@mpark Cool, I think I will just implement diagonal side-by-side with the snippet above and look at the assembly. I also realized some other improvements in your approach compared to my PR :-). Feeding the B+K < size into the template parameter is clever for sure. I'll try to put it all together by the end of the coming weekend like I said and hopefully we can move along.

quicknir commented 5 years ago

@mpark Ok tired for today but made some good progress. I was able to take your snippet and resolve both the return type issue, and get rid of double-passing all the variants. The build and unit tests all passed for me; currently it's using recursive switch-case for all non-diagonal visitation. So we're in good shape! I'll look more at diagonal visitation and make a decision there, and add commits taking care of that and removing the dead code.

For now, the number of cases before recursion is just 2; I lowered it to reduce the size of errors while working :-). We can discuss the correct number at the end; because of the macro I use it's very very easy to change the number of cases.

mpark commented 5 years ago

@quicknir: are you sure the tests pass...? It doesn't seem to compile for me 😕

../include/mpark/variant.hpp:511:38: error: constexpr variable 'size' must be initialized by a constant expression
        constexpr std::size_t size = v_.size();
                                     ^~~~~~~~~
quicknir commented 5 years ago

@mpark I only compiled locally with 17, so perhaps that is the issue? v_.size() is actually calling a static constexpr member of v_'s class, so it should still compile even if v itself is not a constexpr expression. That seems like a compiler bug?

It's not infinite recursion because what happen is that V_ gets added to the list of types that the struct is templated. Basically, the list of types in Switch are the types that have not yet been recursed through; that's why the initial call doesn't give it any types at all. Each recurse a type is added to that list. The termination of the recursion happens when Switch is templated on the full type list, at that point the switch_ overload with no extra arguments gets selected.

quicknir commented 5 years ago

Okay, so I think I've convinced myself that a separate diagonal implementation is not necessary. Here was my quick implementation of diagonal visitation: https://godbolt.org/z/JHkwE7. It doesn't seem to generate markedly better code, though it's hard to say. The current approach actually tends to encourage more inlining because there's more complex templating; with the diagonal visitor it didn't tend to inline so much.

At any rate I think it's close enough we can justify doing the simple thing and just not having a diagonal implementation.

mpark commented 5 years ago

Closed by #56