microsoft / knossos-ksc

Compiler with automatic differentiation

relu3 upper bounds #877

Closed toelli-msft closed 3 years ago

toelli-msft commented 3 years ago

~Not for merge (probably) but for information only.~

I have added relu3 examples that "go as fast as they could possibly go".

  1. "Knossos embedded checkpointed_map_handwritten_inlined_relu3" is already about as fast as a correct ksc function could be whilst using map.
  2. "Knossos embedded upper_bound_via_map-torch" defines relu3 and [sufrev relu3] to just return a constant or an argument. vrelu3 and [sufrev vrelu3] map them across a vector. They as fast as any ksc function could possibly be whilst using map.
  3. "Knossos embedded upper_bound" just defines vrelu3 and [sufrev vrelu3] to return an argument and are as fast as any ksc function could possibly be.

(Of course 2 and 3 calculate incorrect values, but the point of them is not to be correct but just to give an upper bound on speed.) On test_backwards we have the following mean numbers:

Benchmark | Time (ms)
--- | ---
Knossos embedded upper_bound | 0.1216
Knossos embedded upper_bound_via_map-torch | 0.3004
PyTorch | 0.3617
Knossos embedded checkpointed_map_handwritten_inlined_relu3 | 0.5546
Knossos embedded checkpointed_map_handwritten_relu3 | 0.6410
Knossos embedded checkpointed_map | 1.3629

The (presumably) fastest possible correct version that uses map ("Knossos embedded checkpointed_map_handwritten_inlined_relu3") is already almost 50% slower than PyTorch. Even the "as fast as possible" version ("Knossos embedded upper_bound_via_map-torch") is only about 20% faster than PyTorch.

I have disabled accumulation into the free variable gradients in sufrevpass_map since there are no free variables in these benchmarks.

[screenshot: benchmark results]

C++

The C++ for "Knossos embedded checkpointed_map_handwritten_inlined_relu3" shows that it will be hard to make it any faster with our current C++ implementation of map.

typedef tensor<1, double> ty$vrelu3_inlined$aT1f;
ty$vrelu3_inlined$aT1f vrelu3_inlined$aT1f(ks::allocator * $alloc, tensor<1, double> t) {
  /* Lam */auto c$0 = [=](ks::allocator * $alloc, double x) {
    bool c$2 = lt$aff($alloc, x, 0.0);
    double c$1;
    if (c$2) {
      c$1 = (0.0);
    } else {
      bool c$4 = lt$aff($alloc, x, 1.0);
      double c$3;
      if (c$4) {
        double c$5 = mul$aff($alloc, x, x);
        double c$6 = mul$aff($alloc, x, c$5);
        double c$7 = div$aff($alloc, c$6, 3.0);
        c$3 = (c$7);
      } else {
        double c$8 = div$aff($alloc, 2.0, 3.0);
        double c$9 = sub$aff($alloc, x, c$8);
        c$3 = (c$9);
      }
      c$1 = (c$3);
    }
    return (c$1);
  };
  tensor<1, double> c$10 = map($alloc, c$0, t);
  return (c$10);
}

typedef tensor<1, double> ty$sufrev$vrelu3_inlined$aT1f;
ty$sufrev$vrelu3_inlined$aT1f sufrev$vrelu3_inlined$aT1f(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
  /* Lam */auto c$0 = [=](ks::allocator * $alloc, double x) {
    bool c$2 = lt$aff($alloc, x, 0.0);
    double c$1;
    if (c$2) {
      c$1 = (0.0);
    } else {
      bool c$4 = lt$aff($alloc, x, 1.0);
      double c$3;
      if (c$4) {
        double c$5 = mul$aff($alloc, x, x);
        c$3 = (c$5);
      } else {
        c$3 = (1.0);
      }
      c$1 = (c$3);
    }
    return (c$1);
  };
  tensor<1, double> c$6 = map($alloc, c$0, t);
  return (c$6);
}
awf commented 3 years ago

Super useful, thanks. I read two things, please correct if wrong:

  1. We are paying ~100us just to call Ksc. We should certainly look into that.
  2. Our map implementation is not being well optimized. We should look into that too.

Would love to see

(a) the same numbers on 16M values
(b) a non-map optimal C++ implementation. This will be paying the 121us for the call, but will check the "map is slow" hypothesis.

toelli-msft commented 3 years ago

https://github.com/microsoft/knossos-ksc/pull/877/commits/135a2c1dfccb61dca6bc192d9072e7a202632cab has an embedded C++ version of vrelu3 and sufrev_vrelu3 with the map turned into a direct loop, i.e. no lambda. It's not any faster than the fastest correct ks version.

Any tips on how to speed this C++ up? Do we need to go to SIMD?

And where should we discuss this? Yordan suggested OneNote. The benefit of keeping it here is that we can directly link commits and make comments on specific parts of them.

        tensor<1, double> vrelu3(ks::allocator * $alloc, tensor<1, double> t) {
            auto tdata = t.data();
            auto ret = tensor<1, double>::create($alloc, t.size());
            auto retdata = ret.data();
            for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
                double c$1;
                double x = tdata[i];
                if (x < 0.0) {
                    c$1 = 0.0;
                } else {
                    if (x < 1.0) {
                        c$1 = x * x * x / 3.0;
                    } else {
                        c$1 = x - 2.0 / 3.0;
                    }
                }
                retdata[i] = c$1;
            }
            return ret;
        }
        tensor<1, double> sufrev_vrelu3(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
            auto tdata = t.data();
            auto dretdata = dret.data();
            auto ret = tensor<1, double>::create($alloc, t.size());
            auto retdata = ret.data();
            for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
                double c$1;
                double x = tdata[i];
                double dreti = dretdata[i];
                if (x < 0.0) {
                    c$1 = 0.0;
                } else {
                    if (x < 1.0) {
                        c$1 = x * x;
                    } else {
                        c$1 = 1.0;
                    }
                }
                retdata[i] = c$1 * dreti;
            }
            return ret;
        }
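
On the SIMD question: if we did go that route, a hand-vectorized version of the sufrev inner loop might look roughly like this. This is an AVX2 sketch only, not part of this PR; the free function, name and raw-pointer signature are illustrative, and it assumes the length is a multiple of 4.

#include <immintrin.h>

// Branchless AVX2 sketch of the sufrev inner loop, 4 doubles per iteration.
// grad(x) = 0 for x < 0, x*x for 0 <= x < 1, 1 for x >= 1; result = grad(x) * dret.
void sufrev_vrelu3_avx2(const double* tdata, const double* dretdata,
                        double* retdata, int n) {
    const __m256d zero = _mm256_set1_pd(0.0);
    const __m256d one  = _mm256_set1_pd(1.0);
    for (int i = 0; i < n; i += 4) {
        __m256d x    = _mm256_loadu_pd(tdata + i);
        __m256d dret = _mm256_loadu_pd(dretdata + i);
        __m256d xx   = _mm256_mul_pd(x, x);
        __m256d lt1  = _mm256_cmp_pd(x, one, _CMP_LT_OQ);   // lanes where x < 1
        __m256d mid  = _mm256_blendv_pd(one, xx, lt1);      // x < 1 ? x*x : 1
        __m256d ge0  = _mm256_cmp_pd(x, zero, _CMP_GE_OQ);  // lanes where x >= 0
        __m256d grad = _mm256_and_pd(mid, ge0);             // zero out the x < 0 lanes
        _mm256_storeu_pd(retdata + i, _mm256_mul_pd(grad, dret));
    }
}

(It would be worth checking the generated .s first; gcc at -O3 may already auto-vectorize the scalar loop, in which case the conditional is the main thing standing in its way.)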

[screenshot: benchmark results]

awf commented 3 years ago

Excellent, thanks. Even if not faster, it's great to have it in one function. Next steps:

  1. Ask @dcrc2 :)
  2. What are the numbers on 16M floats?
  3. Remove the two "if"s in the inner loop, so it's just c$1 = x * x; and see what that gets us
toelli-msft commented 3 years ago

324cd83d has problem size 1M and a benchmark with no if. Without the if it is more than twice as fast, and faster than PyTorch! In fact it roughly matches the upper bound via map. I'm amazed that a conditional can make so much difference.

(I haven't managed to get it to work on problem size 16M yet. Some of the benchmarks are very slow and don't terminate in a reasonable time. Unfortunately I can't see which ones are slow without waiting for them to terminate! I'm just selectively disabling them based on educated guesses.)

def vrelu3_embedded_cpp_inlined_map_no_if():
    return cpp_string_to_autograd_function(
        """
        namespace ks{
        tensor<1, double> vrelu3(ks::allocator * $alloc, tensor<1, double> t) {
            auto tdata = t.data();
            auto ret = tensor<1, double>::create($alloc, t.size());
            auto retdata = ret.data();
            for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
                double x = tdata[i];
                retdata[i] = x * x * x / 3.0;
            }
            return ret;
        }

        tensor<1, double> sufrev_vrelu3(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
            auto tdata = t.data();
            auto dretdata = dret.data();
            auto ret = tensor<1, double>::create($alloc, t.size());
            auto retdata = ret.data();
            for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
                double x = tdata[i];
                double dreti = dretdata[i];
                retdata[i] = x * x * dreti;
            }
            return ret;
        }
        }
        """,
        "vrelu3",
        generate_lm=False,
    )

[screenshot: benchmark results]

toelli-msft commented 3 years ago

Similar results on f583a968 at problem size 16M. It is now clear that C++ with inlined map and no if is faster even than ks upper bound via map.

How can omitting the if make such a difference and what can we do about it?

[screenshot: benchmark results]

toelli-msft commented 3 years ago

Tangentially, @cgravill, at large problem sizes the pytest benchmarks are very slow to run. As you can see from the table below, the forward and inference benchmarks adapt the number of rounds so that they finish within about a second each. The backwards benchmarks don't adapt: they run 1000 rounds regardless. It seems to be due to pedantic mode. Is there anything we can do to improve that? At large problem sizes it's painful to wait so long.

[screenshot: benchmark rounds table]

cgravill commented 3 years ago

@toelli-msft yes, determining how many rounds to run is the subject of AB#19173, planned for this week.

Yes, pedantic mode is problematic. As you've linked to, we resorted to using it so we could provide a setup function. Unfortunately I haven't found an API within pytest-benchmark that auto-determines the rounds while still letting us supply such a setup function.

The intention is to replace the rounds logic with our own, then use pedantic throughout.

Short term, you can reduce the rounds... maybe in proportion to tensor size, but that's perhaps a little too clever for something that needs replacing. The 1000 was arrived at very roughly by experiment.

awf commented 3 years ago

How can omitting the if make such a difference and what can we do about it?

Branch prediction failures...

Can you try

  1. Correct computation using mask and multiply for both ifs: (x>0)*((x<=1)*val0to1 + (x>1)*val1up)
  2. Variant of above with a single 'if', either the outer or the inner.

And most of these might require manual CSE / ANFing depending on the C++ compiler.

Try fast-math too.

cgravill commented 3 years ago

And where should we discuss this? Yordan suggested OneNote. The benefit of keeping it here is that we can directly link commits and make comments on specific parts of them.

Process tangent: there is the GitHub Discussions tab, which we could use. It's less ephemeral than PR comments, which I think was the main concern about discussion on PRs, but it would add yet another location, making search/discovery more awkward.

cgravill commented 3 years ago

@toelli-msft #890 is merged. If you merge that to your branch you should get dynamically calculated rounds for the backward test as well, making this easier to investigate.

toelli-msft commented 3 years ago

1c69cf74

Masking makes KS and C++ faster than PyTorch!

[screenshot: benchmark results]

awf commented 3 years ago

Suggestions to try:

  1. if (x>0) then ...masked mid and upper..
  2. ternary operator, with/without masks
  3. zero tensor before loop, and store only for x>0 case (probably requires 1.)
  4. compiler flags: fast-math, funroll-loops, -O7, -march=??
  5. check for auto-vectorization - grep '[xyz]mm' vrelu3.s
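
For concreteness, 1 and 2 might look roughly like this inside the sufrev inner loop (a sketch only; x, dreti and retdata[i] refer to the handwritten loop posted earlier, and val0to1/val1up reuse the naming from the mask expression above):

// 1. single outer if, masked mid and upper:
double val0to1 = x * x;
double val1up = 1.0;
if (x > 0.0) {
    retdata[i] = ((x <= 1.0) * val0to1 + (x > 1.0) * val1up) * dreti;
} else {
    retdata[i] = 0.0;
}

// 2. ternary operator, no masks:
retdata[i] = (x < 0.0 ? 0.0 : (x < 1.0 ? x * x : 1.0)) * dreti;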
awf commented 3 years ago

And possibly implement #891

toelli-msft commented 3 years ago

99114eda

Sadly I forgot to add the multiplication by ddr to the masked examples. After adding it, C++ is slower than PyTorch again. (KS mask is still faster than PyTorch but technically incorrect, as it doesn't have the multiplication by ddr. It is not detected as incorrect because ddr is 1 in the benchmarks.) Furthermore, PyTorch is still supreme at problem size 1M. :(

[screenshots: benchmark results]

awf commented 3 years ago

Still several things to explore.

Why is ksc map faster than cpp mask here?

[screenshot: benchmark results]

toelli-msft commented 3 years ago

Why is ksc map faster than cpp mask here?

KS mask is still faster than PyTorch (and cpp mask) but technically incorrect, as it doesn't have the multiplication by ddr

toelli-msft commented 3 years ago

9d21172ecfa65969fbdb86a8693a5a00b72c6c34

Having said that, using ddr in KS mask doesn't seem to make as big a difference as it does in C++ mask. KS mask is within touching distance of PyTorch. I don't know why KS mask would be faster than C++ mask! I will investigate.

[screenshots: benchmark results]

toelli-msft commented 3 years ago

This is the C++ generated from KS mask

typedef tensor<1, double> ty$sufrev$vrelu3_mask$aT1f;
ty$sufrev$vrelu3_mask$aT1f sufrev$vrelu3_mask$aT1f(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
  /* Lam */auto c$0 = [=](ks::allocator * $alloc, double x_ddriarg1, double x_ddriarg2) {
    tuple<double,double> x_ddri = ks::make_tuple(x_ddriarg1,x_ddriarg2);
    auto [x, ddri] = x_ddri;
    bool c$1 = gt$aff($alloc, x, 0.0);
    double c$2 = bool_to_float$ab($alloc, c$1);
    bool c$3 = lte$aff($alloc, x, 1.0);
    double c$4 = bool_to_float$ab($alloc, c$3);
    double c$5 = mul$aff($alloc, x, x);
    double c$6 = mul$aff($alloc, c$4, c$5);
    bool c$7 = gt$aff($alloc, x, 1.0);
    double c$8 = bool_to_float$ab($alloc, c$7);
    double c$9 = mul$aff($alloc, c$8, 1.0);
    double c$10 = add$aff($alloc, c$6, c$9);
    double c$11 = mul$aff($alloc, c$2, c$10);
    double c$12 = mul$aff($alloc, c$11, ddri);
    return (c$12);
  };
  tensor<1, double> c$13 = map2($alloc, c$0, t, dret);
  return (c$13);
}

I don't know why it should be faster than the embedded C++ mask:

        tensor<1, double> sufrev_vrelu3(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
            auto tdata = t.data();
            auto dretdata = dret.data();
            auto ret = tensor<1, double>::create($alloc, t.size());
            auto retdata = ret.data();
            for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
                double c$1;
                double x = tdata[i];
                double dreti = dretdata[i];
                auto val0to1 = x * x;
                auto val1up = 1.0;

                c$1 = (x>0)*((x<=1)*val0to1 + (x>1)*val1up) * dreti;

                retdata[i] = c$1;
            }
            return ret;
        }
toelli-msft commented 3 years ago

Suggestions to try:

  1. if (x>0) then ...masked mid and upper..
  2. ternary operator, with/without masks
  3. zero tensor before loop, and store only for x>0 case (probably requires 1.)
  4. compiler flags: fast-math, funroll-loops, -O7, -march=??
  5. check for auto-vectorization - grep '[xyz]mm' vrelu3.s

@dcrc2 and I have tried these suggestions and nothing makes a difference.

awf commented 3 years ago

Suggestions to try:

  1. if (x>0) then ...masked mid and upper..
  2. ternary operator, with/without masks
  3. zero tensor before loop, and store only for x>0 case (probably requires 1.)
  4. compiler flags: fast-math, funroll-loops, -O7, -march=??
  5. check for auto-vectorization - grep '[xyz]mm' vrelu3.s

@dcrc2 and I have tried these suggestions and nothing makes a difference.

What -march= flags did you try? Did you find xmm in the .s?

toelli-msft commented 3 years ago

daf61784

It would be good to merge this version because it contains implementations of most of the things we care about, and no vestigial implementations. However, the results are completely baffling, showing ks_checkpointed_map_handwritten_relu3 beating PyTorch and Aten even at large problem sizes. That seems impossible. In fact it's as fast as or faster than Embedded INCORRECT_ks_upper_bound_via_map, which really ought to be impossible.

I haven't yet investigated how this could happen. Perhaps I've made an error somewhere.

python3 -m pytest src/bench/ --benchmark-min-rounds=1000 --benchmark-columns=median,mean,iqr,rounds,iterations --benchmark-sort=name --benchmark-group-by=group,func --modulepath=examples/dl-activations/relu3 --benchmarkname=vrelu3 --benchmark-autosave

[screenshots: benchmark results]

toelli-msft commented 3 years ago

I've just run 9d21172e and obtained the same results that I obtained before. I can think of a few explanations

Any other possibilities?

[screenshots: benchmark results]

toelli-msft commented 3 years ago

Things that have changed in between that could plausibly make a difference, in order of most likely to have made a difference to least likely:

Do PyTorch experts suspect that with torch.no_grad() would make all of "backwards", "forward" and "inference" run faster, or is that implausible?

cgravill commented 3 years ago

with torch.no_grad() should only be affecting inference, if anything: #895

toelli-msft commented 3 years ago

Oh woe, a Heisenbug. On eb66ff9a7b492240de13e906d3234ec1009d6149, Embedded ks_checkpointed_map_handwritten_inlined_relu3 is slower than PyTorch. However, on 797fe1158f4caf6b94e6653b8601b8728f9151b4, which does nothing but enable more benchmarks, it is faster.

Can someone please try to replicate these results? Something weird is going on.

eb66ff9a7b492240de13e906d3234ec1009d6149

[screenshot: benchmark results for eb66ff9a]

797fe1158f4caf6b94e6653b8601b8728f9151b4

[screenshots: benchmark results for 797fe11]

dcrc2 commented 3 years ago

On eb66ff9, Embedded ks_checkpointed_map_handwritten_inlined_relu3 is slower than PyTorch. However, on 797fe11, which does nothing but enable more benchmarks, it is faster.

Can someone please try to replicate these results? Something weird is going on.

Yes, I'm seeing the same thing. The absolute timings for PyTorch are fairly consistent. But for Embedded ks_checkpointed_map_handwritten_inlined_relu3, 797fe11 is much faster than eb66ff9.

eb66ff9:

[screenshot: vrelu3 benchmark results for eb66ff9]

797fe11:

[screenshot: vrelu3 benchmark results for 797fe11]
toelli-msft commented 3 years ago

Can someone please try to replicate these results? Something weird is going on.

@dcrc2 and I resolved the weirdness by fixing a C++ bug. The reason that the bug wasn't picked up by correctness checking is an interesting story which we will relate later.

Anyway, this PR is now back on track.

toelli-msft commented 3 years ago

This is ready for proper review. Questions to resolve

Surprising observations

[screenshots: benchmark results]

bool_to_float versus implicit conversion assembly difference

bool_to_float left, implicit conversion right

                retdata[i] = bool_to_float$ab($alloc, x > 0)
                    * (bool_to_float$ab($alloc, in0to1) * val0to1
                       + bool_to_float$ab($alloc, !in0to1) * val1up);

versus

                retdata[i] = (x>0)*(in0to1*val0to1 + (!in0to1)*val1up);
dcrc2 commented 3 years ago

This is ready for proper review. Questions to resolve

I'm not really sure what the purpose of this allocation limit is. I don't see a need to prohibit any allocation size less than the size of the buffer.

One useful thing that this check does is to protect against a negative-size allocation being requested. If that's its real purpose, then I'd suggest we replace the limit with SIZE_MAX/2 and avoid this issue coming up again.
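
i.e. something along these lines (a sketch only; the real allocator's check and interface may differ):

#include <cstddef>   // size_t
#include <cstdint>   // SIZE_MAX
#include <new>       // std::bad_alloc

// Illustrative guard: reject sizes that can only arise from a negative value
// wrapping around to a huge size_t, instead of capping at the buffer size.
inline void check_allocation_size(std::size_t bytes) {
    if (bytes > SIZE_MAX / 2) {
        throw std::bad_alloc();
    }
}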

toelli-msft commented 3 years ago

Apologies for my misleading post of assembly output earlier. It did not have -O3 enabled. For the avoidance of doubt, here is a corrected version with all the details.

EDIT: Oh, and for further avoidance of doubt, the numbers in the benchmark results above were produced with -O3 enabled, as can be verified from https://github.com/microsoft/knossos-ksc/blob/6f8b0549719484c0663245ba87bc95bc39933cef/src/python/ksc/compile.py#L266

The command line

gcc -std=c++17 -O3 -I$HOME/knossos-ksc/src/runtime -S /tmp/dots.cpp

Using -g filled the diff full of debugging symbol noise, so I removed it.

The source file

/tmp/dots.cpp

#include "knossos.h"

namespace ks{
tensor<1, double> vrelu3(ks::allocator * $alloc, tensor<1, double> t) {
    auto tdata = t.data();
    auto ret = tensor<1, double>::create($alloc, t.size());
    auto retdata = ret.data();
    for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
        double x = tdata[i];
        auto val0to1 = x * x * x / 3.0;
        auto val1up = x - 2.0 / 3.0;
        auto in0to1 = x <= 1;

        retdata[i] = (x>0)*(in0to1*val0to1 + (!in0to1)*val1up);
    }
    return ret;
}

tensor<1, double> sufrev_vrelu3(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
    auto tdata = t.data();
    auto dretdata = dret.data();
    auto ret = tensor<1, double>::create($alloc, t.size());
    auto retdata = ret.data();
    for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
        double x = tdata[i];
        double dreti = dretdata[i];
        auto val0to1 = x * x;
        auto val1up = 1.0;

        auto in0to1 = x <= 1;

        retdata[i] = (x>0)*(in0to1*val0to1 + (!in0to1)*val1up)*dreti;
    }
    return ret;
}
}

namespace ks{
tensor<1, double> vrelu3_bool_to_float(ks::allocator * $alloc, tensor<1, double> t) {
    auto tdata = t.data();
    auto ret = tensor<1, double>::create($alloc, t.size());
    auto retdata = ret.data();
    for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
        double x = tdata[i];
        auto val0to1 = x * x * x / 3.0;
        auto val1up = x - 2.0 / 3.0;
        auto in0to1 = x <= 1;

        retdata[i] = bool_to_float$ab($alloc, x > 0)
            * (bool_to_float$ab($alloc, in0to1) * val0to1
               + bool_to_float$ab($alloc, !in0to1) * val1up);
    }
    return ret;
}

tensor<1, double> sufrev_vrelu3_bool_to_float(ks::allocator * $alloc, tensor<1, double> t, tensor<1, double> dret) {
    auto tdata = t.data();
    auto dretdata = dret.data();
    auto ret = tensor<1, double>::create($alloc, t.size());
    auto retdata = ret.data();
    for (int i = 0, ne = t.num_elements(); i != ne; ++i) {
        double x = tdata[i];
        double dreti = dretdata[i];
        auto val0to1 = x * x;
        auto val1up = 1.0;

        auto in0to1 = x <= 1;

        retdata[i] = bool_to_float$ab($alloc, x > 0)
                     * (bool_to_float$ab($alloc, in0to1) * val0to1
                        + bool_to_float$ab($alloc, !in0to1) * val1up)
                     * dreti;
    }
    return ret;
}
}

Assembly output

Left vrelu3, right vrelu3_bool_to_float

[screenshot: assembly diff, vrelu3 left vs vrelu3_bool_to_float right]

dcrc2 commented 3 years ago

2x difference is amazing. The version on the right uses seta where the one on the left uses ja; could that be it? That is, there's an extra branch inside the loop? But if so, I'm not sure what this tells us, except that we are extremely sensitive to compiler optimizations.

toelli-msft commented 3 years ago

@cgravill Can you comment on the impact of merging this as-is? I believe it means that all of these benchmarks will run nightly (and possibly appear on graphs).

If we don't want that to happen, can you suggest an alternative approach? It's important that this code is in the repo and under CI in some form. Future work depends on it (e.g. https://github.com/microsoft/knossos-ksc/pull/925).

What's the right thing to do here?

cgravill commented 3 years ago

I think it would mean:

We have a get-out though, as vrelu isn't configured to run:

https://github.com/microsoft/knossos-ksc/blob/b5f6f0bca3d579651432e88d074d82bd45416b7e/src/bench/run-all-pytest-bench.sh#L4-L6

So, working on the assumption that you're not planning to adjust that as part of this work, I'd say go ahead as soon as you're ready.

The wider issue is that perhaps not everyone would want to be running all these benchmarks, all the time. Pytest has pretty good filtering which we might want to make use of.

cgravill commented 3 years ago

This is perhaps a conversation for a different time, but I think of the bits in /examples as being for showing to external folks. Since one of our proposed benefits is that we'll take care of the complexity / C++ for you, we might want to think about progressive reveal, exposing the details only to the folks who want to dig.

toelli-msft commented 3 years ago

Given @cgravill's comments above, I think the best thing to do is to merge this now. Subsequently we can fix up the benchmarks to configure them more to our needs.

toelli-msft commented 3 years ago

(@awf This now includes a fix for https://github.com/microsoft/knossos-ksc/issues/937, which I'm guessing is the right thing to do)

toelli-msft commented 3 years ago

More mysteries. On a013805b24b6cfd0d591fdc64c82b7e41a2b2763, which is just supposed to be a rebased version of the earlier version ce5b106fd27d9abfefab7e91ef51a4149bac67ab, the explicit bool_to_float version is now quite a lot slower.

I'm feeling baffled. Could someone, perhaps @dcrc2, try to reproduce these numbers?

[screenshots: benchmark results]

toelli-msft commented 3 years ago

For the avoidance of doubt, I have rerun the old version, and I still see the old behaviour: bool_to_float is faster.

A small correction to my previous comment: the commit mentioned above (ce5b106f) needed a small tweak to work, so the commit these results are generated from is 7ae6c13f.

[screenshots: benchmark results]

toelli-msft commented 3 years ago

Oh, hang on, the obvious smoking gun here is that we're using float32 instead of float64. That would explain why we get a huge speed boost! OK, I don't think I'm baffled anymore. That change is sufficient to explain what we've seen.

toelli-msft commented 3 years ago

So, I am still content that this is ready for merge.

dcrc2 commented 3 years ago

Oh, hang on, the obvious smoking gun here is that we're using float32 instead of float64. That would explain why we get a huge speed boost! OK, I don't think I'm baffled anymore. That change is sufficient to explain what we've seen.

Yes, any conclusions we drew for the float64 case aren't going to apply any more. I'm still a bit confused as to why cpp_mask_bool_to_float is so different from cpp_mask, with the bool_to_float version now being significantly slower. (I've reproduced this.) But yes please let's merge this anyway!