google / wuffs

Wrangling Untrusted File Formats Safely

Slow f64 parsing #113

Closed atrieu closed 1 year ago

atrieu commented 1 year ago

I modified Lemire's simple_fastfloat_benchmark to include testing wuffs' f64 parsing implementation. On some benchmarks extracted from Tao's parse-number-fxx-test-data, it is the slowest parser, but it does fine on some others.

My tests can be found here. I'm not a C/C++ expert, so hopefully it's not because of a stupid mistake I made.

> grep model.name /proc/cpuinfo | uniq
model name  : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
> ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt
# read 599458 lines
volume = 14.4666 MB
netlib                                  :   121.78 MB/s (+/- 2.9 %)     5.05 Mfloat/s      86.69 i/B  2193.64 i/f (+/- 0.0 %)     29.13 c/B   737.23 c/f (+/- 0.7 %)      2.98 i/c      3.72 GHz
doubleconversion                        :   296.08 MB/s (+/- 1.9 %)    12.27 Mfloat/s      42.37 i/B  1072.09 i/f (+/- 0.0 %)     11.52 c/B   291.58 c/f (+/- 0.9 %)      3.68 i/c      3.58 GHz
strtod                                  :   175.87 MB/s (+/- 1.1 %)     7.29 Mfloat/s      53.96 i/B  1365.46 i/f (+/- 0.0 %)     19.36 c/B   489.94 c/f (+/- 0.7 %)      2.79 i/c      3.57 GHz
abseil                                  :   542.26 MB/s (+/- 1.5 %)    22.47 Mfloat/s      25.27 i/B   639.51 i/f (+/- 0.0 %)      6.29 c/B   159.15 c/f (+/- 0.6 %)      4.02 i/c      3.58 GHz
fastfloat                               :   736.51 MB/s (+/- 2.6 %)    30.52 Mfloat/s      18.42 i/B   466.17 i/f (+/- 0.0 %)      4.65 c/B   117.58 c/f (+/- 1.2 %)      3.96 i/c      3.59 GHz
wuffs                                   :    21.44 MB/s (+/- 0.7 %)     0.89 Mfloat/s     490.44 i/B 12410.66 i/f (+/- 0.0 %)    160.58 c/B  4063.58 c/f (+/- 0.3 %)      3.05 i/c      3.61 GHz
> ./build/benchmarks/benchmark -f data/google-double-conversion-extracted.txt
# read 564745 lines
volume = 12.7842 MB
netlib                                  :   110.79 MB/s (+/- 1.8 %)     4.89 Mfloat/s      95.72 i/B  2272.09 i/f (+/- 0.0 %)     32.16 c/B   763.49 c/f (+/- 0.9 %)      2.98 i/c      3.74 GHz
doubleconversion                        :   296.65 MB/s (+/- 1.3 %)    13.10 Mfloat/s      43.91 i/B  1042.18 i/f (+/- 0.0 %)     11.87 c/B   281.84 c/f (+/- 0.6 %)      3.70 i/c      3.69 GHz
strtod                                  :   173.15 MB/s (+/- 1.4 %)     7.65 Mfloat/s      56.60 i/B  1343.55 i/f (+/- 0.0 %)     20.37 c/B   483.47 c/f (+/- 0.5 %)      2.78 i/c      3.70 GHz
abseil                                  :   544.64 MB/s (+/- 1.5 %)    24.06 Mfloat/s      26.22 i/B   622.35 i/f (+/- 0.0 %)      6.44 c/B   152.95 c/f (+/- 0.6 %)      4.07 i/c      3.68 GHz
fastfloat                               :   746.40 MB/s (+/- 1.1 %)    32.97 Mfloat/s      18.68 i/B   443.46 i/f (+/- 0.0 %)      4.71 c/B   111.87 c/f (+/- 0.5 %)      3.96 i/c      3.69 GHz
wuffs                                   :    20.41 MB/s (+/- 1.9 %)     0.90 Mfloat/s     528.80 i/B 12551.93 i/f (+/- 0.0 %)    173.53 c/B  4119.02 c/f (+/- 0.3 %)      3.05 i/c      3.71 GHz
> ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines
volume = 1.93374 MB
netlib                                  :   283.61 MB/s (+/- 3.8 %)    16.30 Mfloat/s      31.95 i/B   582.90 i/f (+/- 0.0 %)     12.40 c/B   226.34 c/f (+/- 0.5 %)      2.58 i/c      3.69 GHz
doubleconversion                        :   265.53 MB/s (+/- 3.1 %)    15.26 Mfloat/s      52.52 i/B   958.32 i/f (+/- 0.0 %)     13.61 c/B   248.31 c/f (+/- 0.8 %)      3.86 i/c      3.79 GHz
strtod                                  :   138.31 MB/s (+/- 2.9 %)     7.95 Mfloat/s      70.11 i/B  1279.24 i/f (+/- 0.0 %)     26.08 c/B   475.89 c/f (+/- 0.9 %)      2.69 i/c      3.78 GHz
abseil                                  :   421.03 MB/s (+/- 2.8 %)    24.20 Mfloat/s      31.15 i/B   568.47 i/f (+/- 0.0 %)      8.58 c/B   156.61 c/f (+/- 0.6 %)      3.63 i/c      3.79 GHz
fastfloat                               :  1073.26 MB/s (+/- 3.3 %)    61.68 Mfloat/s      14.14 i/B   257.92 i/f (+/- 0.0 %)      3.37 c/B    61.47 c/f (+/- 0.8 %)      4.20 i/c      3.79 GHz
wuffs                                   :   865.79 MB/s (+/- 2.8 %)    49.75 Mfloat/s      16.96 i/B   309.50 i/f (+/- 0.0 %)      4.17 c/B    76.17 c/f (+/- 0.7 %)      4.06 i/c      3.79 GHz
ddevienne commented 1 year ago

How does C++17's https://en.cppreference.com/w/cpp/header/charconv compare to those?

nigeltao commented 1 year ago

https://git.sr.ht/~atrieu/simple_fastfloat_benchmark/tree/master/item/README.md

says

cmake -B build .
cmake --build build
./build/benchmarks/benchmark

This builds a debug (non-optimized) binary instead of a release (optimized) binary. Wuffs differs from the other libraries you're measuring (netlib, doubleconversion, etc) in that you're building Wuffs from source. The other libraries are pre-built.


I'm not very familiar with cmake but try replacing

cmake -B build .

with

cmake -B build -DCMAKE_BUILD_TYPE=Release .

Basically, running grep -r ^CXX_FLAGS your_build_directory should show CXX_FLAGS = -O3 -DNDEBUG. In your benchmark.cpp file, you can also add these lines:

#ifndef NDEBUG
#error "Benchmarks should only be compiled with full optimization."
#endif
nigeltao commented 1 year ago

CC @lemire in case there's anything worth backporting to the upstream simple_fastfloat_benchmark repo. I'll repeat that I'm not very familiar with cmake. I also don't have a Windows system readily available, so I can't test out its "Build on Windows" instructions.

lemire commented 1 year ago

The expected file format, in simple_fastfloat_benchmark, is a list of ASCII numbers, one per line, like so...

-65.613616999999977
43.420273000000009
-65.619720000000029
43.418052999999986
-65.625
43.421379000000059
-65.636123999999882
43.449714999999969
-65.633056999999951
43.474709000000132
-65.611389000000031
43.513054000000068
-65.605835000000013
43.516105999999979
-65.598343
43.515830999999935
-65.566101000000003
43.508331000000055
...

I am not sure what happens if you have some other format. I would consider the benchmark results invalid.

"The other libraries are pre-built."

The fast_float library is header-only. Except for strtod, everything is built by cmake.

"This builds a debug (non-optimized) binary instead of a release (optimized) binary"

You'd be correct if everything was "by default", but the CMakeLists.txt file overrides the default... You can check that in simple_fastfloat_benchmark, the default build is Release:

[Screenshot, 2023-04-05 21:22:15]

This does not work under Windows, but the README file provides the instructions:

[Screenshot, 2023-04-05 21:18:37]

I'd like to suggest (though I did not investigate) that the atrocious results regarding wuffs are not meaningful. They are likely caused by pointing the software at unexpected file formats.

lemire commented 1 year ago

Though I am not very explicit in the README, it does say that we try to parse each line as a number:

[Screenshot, 2023-04-05 21:33:16]

Please make sure that this is what you are pointing the software at.
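
As a rough way to check an input file against this expectation, each line can be tested for being consumed entirely by a number parser. The sketch below uses std::strtod as the checker; the helper name line_is_one_number is hypothetical and is not part of simple_fastfloat_benchmark. It is a loose check, since strtod also accepts leading whitespace, hexadecimal floats, "inf" and "nan".

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from simple_fastfloat_benchmark): returns true
// when the whole line is consumed by strtod, i.e. the line holds exactly one
// number with nothing trailing. Assumes the default "C" locale.
bool line_is_one_number(const std::string& line) {
  if (line.empty()) {
    return false;
  }
  const char* begin = line.c_str();
  char* end = nullptr;
  std::strtod(begin, &end);
  // strtod leaves `end` at the first character it could not use.
  return end == begin + line.size();
}
```

Lines such as "-65.613616999999977" or "0E11328" pass this check, while a line with trailing text after the number does not.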

atrieu commented 1 year ago

I only followed the instructions from the original README, so as I understand it, it should be compiled in Release mode. I think I am running the benchmarks on correctly formatted files, e.g.,

.0
.0000
0
0.0
0.00
0.000
0.00000
0.00000000000000000000000000000000000000000000000000000000000
0.000000000000000000000000000000000000000000000000000000000000
0.00000e+0
0.00e+0
0.0e+0
0E1
0E11328
0E3
0E4
0E6
0E8
0E86
0e+0
0e+59
1e-325
2.5e-324
40000e-328
4e-324
5e-324
3.6118391954597633033930746e-310
4.8980019078642963771876647e-310
1.0263847514109591822395668e-309
1.2191449367984545884039792e-309
1.3586641673566107715029324e-309
1.5947055924688004253007524e-309
1.6597654066767748210494446e-309
1.8594943567529094619479745e-309
1.9852357578498988925763729e-309
2.1233738662160330012710983e-309

These were obtained by extracting the fourth column from Nigel's data set, i.e., awk '{print $4}' ../../parse-number-fxx-test-data/data/google-double-conversion.txt > google-double-conversion-extracted.txt

The diff on benchmark.cpp is as follows.

diff --git a/benchmarks/benchmark.cpp b/benchmarks/benchmark.cpp
index 0246082..3380eba 100644
--- a/benchmarks/benchmark.cpp
+++ b/benchmarks/benchmark.cpp
@@ -3,6 +3,12 @@
 #include "absl/strings/numbers.h"
 #include "fast_float/fast_float.h"

+#define WUFFS_IMPLEMENTATION
+#define WUFFS_CONFIG__STATIC_FUNCTIONS
+#define WUFFS_CONFIG__MODULES
+#define WUFFS_CONFIG__MODULE__BASE
+#include "wuffs-unsupported-snapshot.c"
+
 #ifdef ENABLE_RYU
 #include "ryu_parse.h"
 #endif
@@ -128,6 +134,28 @@ double findmax_strtod(std::vector<std::string> &s) {
   }
   return answer;
 }
+
+uint32_t wuffs_options = WUFFS_BASE__PARSE_NUMBER_XXX__ALLOW_UNDERSCORES \
+  | WUFFS_BASE__PARSE_NUMBER_XXX__ALLOW_MULTIPLE_LEADING_ZEROES;
+
+double findmax_wuffs(std::vector<std::string> &s) {
+  double answer = 0;
+  double x = 0;
+  for (std::string &st : s) {
+    wuffs_base__slice_u8 sx = {
+      .ptr = (uint8_t*)st.data(),
+      .len = st.size()
+    };
+    wuffs_base__result_f64 res = wuffs_base__parse_number_f64(sx, wuffs_options);
+    if (res.status.repr) {
+      throw std::runtime_error("bug in findmax_wuffs");
+    }
+    x = res.value;
+    answer = answer > x ? answer : x;
+  }
+  return answer;
+}
+
 // Why not `|| __cplusplus > 201703L`? Because GNU libstdc++ does not have
 // float parsing for std::from_chars.
 #if defined(_MSC_VER)
@@ -292,6 +320,7 @@ void process(std::vector<std::string> &lines, size_t volume) {
 #endif
   pretty_print(volume, lines.size(), "abseil", time_it_ns(lines, findmax_absl_from_chars, repeat));
   pretty_print(volume, lines.size(), "fastfloat", time_it_ns(lines, findmax_fastfloat, repeat));
+  pretty_print(volume, lines.size(), "wuffs", time_it_ns(lines, findmax_wuffs, repeat));
 #ifdef FROM_CHARS_AVAILABLE_MAYBE
   pretty_print(volume, lines.size(), "from_chars", time_it_ns(lines, findmax_from_chars, repeat));
 #endif
@@ -308,6 +337,7 @@ void fileload(const char *filename) {
   lines.reserve(10000); // let us reserve plenty of memory.
   size_t volume = 0;
   while (getline(inputfile, line)) {
+    if (0 < line.size() && line[0] == '.') { line = "0" + line; }
     volume += line.size();
     lines.push_back(line);
   }

The defines are the same as in manual-test-parse-number-f64.cc, so hopefully they are correct. I think I'm calling wuffs_base__parse_number_f64 correctly; it should throw an error otherwise. The last modification prepends a "0" to lines starting with a dot, as wuffs' implementation doesn't support a leading dot. This happens while loading the file, outside the timed section, so it shouldn't affect the numbers.

lemire commented 1 year ago

"I only followed the instructions from the original README, so it should be compiled in Release mode as I understand."

It should yes.

20 MB/s is very slow, however. Have you done some profiling? Just disable all kernels, and run the benchmark with perf record <command line> followed by perf report. You should be able to see something glaring at you.

atrieu commented 1 year ago

No, I had not done any profiling as I don't really know much about it. But here are the results from using perf: [image: ps_20230406162209]

For comparison, here are the results on data/canada.txt, where wuffs performs as would be expected: [image: ps_20230406163103]

I'm not sure I understand what this all means. Are the small_xshift functions too slow?

nigeltao commented 1 year ago

"Debug vs Release" compiler flags was an incorrect guess. Sorry.

Digging deeper into it, test data like data/ulfjack-ryu-extracted.txt contains many lines like 1.0168286519992372611942638e+100. Crucially, this has more than 19 significant digits, so the Eisel-Lemire algorithm, the fastest algorithm, does not apply. Both Wuffs and fast_float fail over to a slower fallback algorithm.
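
To illustrate the 19-digit threshold: 10^19 is less than 2^64, so any mantissa of up to 19 decimal digits fits exactly in a uint64_t. The sketch below counts significant digits in a well-formed decimal string. The function names are hypothetical, and the counting rule is a simplification (it counts trailing zeros, which a real parser may treat specially); it is not how Wuffs or fast_float decide internally.

```cpp
#include <string>

// Hypothetical helper: counts mantissa digits after skipping the sign, the
// decimal point and any leading zeros. Assumes well-formed input such as
// "-1.25e+3"; trailing zeros are counted (a deliberate simplification).
int count_significant_digits(const std::string& s) {
  int count = 0;
  for (char c : s) {
    if (c == 'e' || c == 'E') {
      break;  // the exponent part does not contribute mantissa digits
    }
    if (c < '0' || c > '9') {
      continue;  // sign or decimal point
    }
    if (count == 0 && c == '0') {
      continue;  // leading zero
    }
    ++count;
  }
  return count;
}

// 10^19 < 2^64, so up to 19 digits always fit in a 64-bit mantissa.
bool fits_in_u64_mantissa(const std::string& s) {
  return count_significant_digits(s) <= 19;
}
```

For "1.0168286519992372611942638e+100" this counts 26 digits, so the mantissa does not fit. For a canada.txt style line such as "-65.613616999999977" it counts 17, which does fit.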

Wuffs' fallback algorithm is Simple Decimal Conversion. fast_float also used to fall back to SDC, but more recently uses the big-integer arithmetic algorithm. I'm not familiar with this newer algorithm but it's presumably faster than SDC.

(The parse-number-fxx-test-data test suite is more about testing correctness than testing performance. It isn't about collecting a representative sample of real world numbers. For example, when printing float64 numbers, 19 significant digits is often enough.)
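
As a small illustration of that last point, 17 significant digits are already enough to round-trip any float64 through text, so typical printed output stays under the 19-digit limit. The sketch below is an assumption-labeled example with hypothetical helper names; it relies on the C library's snprintf and strtod being correctly rounded.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: formats x with at most 17 significant digits.
std::string format_17_digits(double x) {
  char buf[64];
  std::snprintf(buf, sizeof buf, "%.17g", x);
  return buf;
}

// Returns true when printing x and parsing the text back yields x again.
bool round_trips(double x) {
  return std::strtod(format_17_digits(x).c_str(), nullptr) == x;
}
```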

atrieu commented 1 year ago

I do understand that data/ulfjack-ryu-extracted.txt is not really representative of real world numbers since Wuffs performs similarly to fast_float on canada.txt for instance. But, for the record, here are the results when using tag v2.0.0 (last release still using the SDC algorithm) of fast_float on data/ulfjack-ryu-extracted.txt.

# read 599458 lines
volume = 14.4666 MB
netlib                                  :   122.65 MB/s (+/- 5.0 %)     5.08 Mfloat/s      86.69 i/B  2193.64 i/f (+/- 0.0 %)     28.83 c/B   729.60 c/f (+/- 2.7 %)      3.01 i/c      3.71 GHz
doubleconversion                        :   285.83 MB/s (+/- 2.0 %)    11.84 Mfloat/s      42.37 i/B  1072.09 i/f (+/- 0.0 %)     11.94 c/B   302.11 c/f (+/- 1.0 %)      3.55 i/c      3.58 GHz
strtod                                  :   177.26 MB/s (+/- 1.6 %)     7.35 Mfloat/s      53.96 i/B  1365.46 i/f (+/- 0.0 %)     19.38 c/B   490.34 c/f (+/- 1.1 %)      2.78 i/c      3.60 GHz
abseil                                  :   544.78 MB/s (+/- 4.0 %)    22.57 Mfloat/s      25.27 i/B   639.51 i/f (+/- 0.0 %)      6.27 c/B   158.59 c/f (+/- 2.7 %)      4.03 i/c      3.58 GHz
fastfloat                               :   629.08 MB/s (+/- 2.0 %)    26.07 Mfloat/s      21.08 i/B   533.39 i/f (+/- 0.0 %)      5.49 c/B   138.99 c/f (+/- 1.0 %)      3.84 i/c      3.62 GHz
wuffs                                   :    21.53 MB/s (+/- 1.5 %)     0.89 Mfloat/s     490.44 i/B 12410.66 i/f (+/- 0.0 %)    160.77 c/B  4068.41 c/f (+/- 0.7 %)      3.05 i/c      3.63 GHz

So, while changing the fallback algorithm to big-integer arithmetic did improve performance by ~100MB/s, I'm not sure it explains why Wuffs is so slow comparatively.

nigeltao commented 1 year ago

Running the bisect further back than fast_float v2.0.0 leads to https://github.com/fastfloat/fast_float/commit/05ad45dfb5a041106f6e4c385f3ab2a0ea418fbd "Let us try the long path" being the difference. The key part of that patch is:

diff --git a/include/fast_float/simple_decimal_conversion.h b/include/fast_float/simple_decimal_conversion.h
index 410ba05..ef1f0ad 100644
--- a/include/fast_float/simple_decimal_conversion.h
+++ b/include/fast_float/simple_decimal_conversion.h
@@ -368,8 +354,20 @@ adjusted_mantissa compute_float(decimal &d) {
 template <typename binary>
 adjusted_mantissa parse_long_mantissa(const char *first, const char* last) {
     decimal d = parse_decimal(first, last);
+    const uint64_t mantissa = d.to_truncated_mantissa();
+    const int64_t exponent =  d.to_truncated_exponent();
+    adjusted_mantissa am1 = compute_float<binary>(exponent, mantissa);
+    adjusted_mantissa am2 = compute_float<binary>(exponent, mantissa+1);
+    if( am1 == am2 ) { return am1; }
     return compute_float<binary>(d);
 }

Even if we're presented with more than 19 digits, we try Eisel-Lemire twice, once for a lower bound and once for an upper bound. If the two bounds are equal, return that value.
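
The bounding idea can be sketched on its own, independent of fast_float's internals. In the sketch below, all names are hypothetical, and std::strtod stands in for both the Eisel-Lemire fast path (the scale function) and the slow fallback, so it shows only the control flow, not the performance. It assumes well-formed input and a correctly rounded strtod.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// A decimal number reduced to w * 10^q, keeping at most 19 digits in w.
struct Truncated {
  bool negative = false;
  uint64_t w = 0;                // first (up to) 19 significant digits
  int q = 0;                     // decimal exponent applied to w
  bool dropped_nonzero = false;  // true if a discarded digit was nonzero
};

// Hypothetical helper; assumes well-formed input such as "-1.25e+3".
Truncated truncate_to_19_digits(const std::string& s) {
  Truncated t;
  size_t i = 0;
  if (i < s.size() && (s[i] == '-' || s[i] == '+')) {
    t.negative = (s[i] == '-');
    ++i;
  }
  int kept = 0;
  bool after_point = false;
  for (; i < s.size(); ++i) {
    char c = s[i];
    if (c == '.') { after_point = true; continue; }
    if (c < '0' || c > '9') break;
    if (kept == 0 && c == '0') {   // leading zero: only shifts the exponent
      if (after_point) --t.q;
      continue;
    }
    if (kept < 19) {
      t.w = t.w * 10 + uint64_t(c - '0');
      ++kept;
      if (after_point) --t.q;
    } else {
      if (c != '0') t.dropped_nonzero = true;
      if (!after_point) ++t.q;     // a dropped integer digit scales w up
    }
  }
  if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
    t.q += std::atoi(s.c_str() + i + 1);
  }
  return t;
}

// Stand-in for the fast path: converts w * 10^q to the nearest double.
double scale(uint64_t w, int q) {
  char buf[64];
  std::snprintf(buf, sizeof buf, "%llue%d", (unsigned long long)w, q);
  return std::strtod(buf, nullptr);
}

double parse_with_bounds(const std::string& s) {
  Truncated t = truncate_to_19_digits(s);
  double lo = scale(t.w, t.q);
  if (t.dropped_nonzero) {
    // The exact value lies strictly between w*10^q and (w+1)*10^q.
    double hi = scale(t.w + 1, t.q);
    if (lo != hi) {
      return std::strtod(s.c_str(), nullptr);  // slow-path stand-in
    }
  }
  return t.negative ? -lo : lo;
}
```

Because rounding is monotonic, when both bounds round to the same double, the discarded digits cannot change the result, and the slow path is skipped.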

That code was removed in https://github.com/fastfloat/fast_float/commit/192b271c128eac0e48186e6dbb7901d42423b30f "Removing dead code" but the optimization presumably lives on somewhere near https://github.com/fastfloat/fast_float/blob/24374ece716db48f974f49da4aa5851aa371cfa9/include/fast_float/parse_number.h#L208-L212 if you follow the trail through adjusted_mantissa, compute_float, compute_error etc.

nigeltao commented 1 year ago

Thanks @atrieu for the bug report and the informative follow-ups, even when I was confidently wrong. :-)

Your ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt Wuffs times should be faster now. Let me know if they're not.

atrieu commented 1 year ago

Thanks! Wuffs is definitely faster, here are new numbers. tencent-rapidjson-extracted seems to still be an outlier though if you feel like investigating.

> ./build/benchmarks/benchmark -f data/google-double-conversion-extracted.txt
# read 564745 lines
volume = 12.7842 MB
netlib                                  :   117.11 MB/s (+/- 6.1 %)     5.17 Mfloat/s      95.72 i/B  2272.09 i/f (+/- 0.0 %)     31.37 c/B   744.59 c/f (+/- 3.9 %)      3.05 i/c      3.85 GHz
doubleconversion                        :   298.31 MB/s (+/- 2.3 %)    13.18 Mfloat/s      43.91 i/B  1042.18 i/f (+/- 0.0 %)     12.07 c/B   286.39 c/f (+/- 0.9 %)      3.64 i/c      3.77 GHz
strtod                                  :   174.28 MB/s (+/- 4.0 %)     7.70 Mfloat/s      56.60 i/B  1343.55 i/f (+/- 0.0 %)     20.39 c/B   483.88 c/f (+/- 1.8 %)      2.78 i/c      3.73 GHz
abseil                                  :   560.13 MB/s (+/- 3.3 %)    24.74 Mfloat/s      26.22 i/B   622.35 i/f (+/- 0.0 %)      6.43 c/B   152.66 c/f (+/- 1.6 %)      4.08 i/c      3.78 GHz
fastfloat                               :   748.82 MB/s (+/- 2.3 %)    33.08 Mfloat/s      18.68 i/B   443.46 i/f (+/- 0.0 %)      4.78 c/B   113.55 c/f (+/- 0.8 %)      3.91 i/c      3.76 GHz
wuffs                                   :   160.94 MB/s (+/- 3.0 %)     7.11 Mfloat/s      67.34 i/B  1598.37 i/f (+/- 0.0 %)     23.11 c/B   548.58 c/f (+/- 1.5 %)      2.91 i/c      3.90 GHz
> ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt
# read 599458 lines
volume = 14.4666 MB
netlib                                  :   131.34 MB/s (+/- 3.5 %)     5.44 Mfloat/s      86.69 i/B  2193.64 i/f (+/- 0.0 %)     28.49 c/B   720.89 c/f (+/- 0.3 %)      3.04 i/c      3.92 GHz
doubleconversion                        :   316.52 MB/s (+/- 3.0 %)    13.12 Mfloat/s      42.37 i/B  1072.09 i/f (+/- 0.0 %)     11.75 c/B   297.43 c/f (+/- 0.8 %)      3.60 i/c      3.90 GHz
strtod                                  :   186.76 MB/s (+/- 4.4 %)     7.74 Mfloat/s      53.96 i/B  1365.46 i/f (+/- 0.0 %)     19.40 c/B   490.92 c/f (+/- 1.7 %)      2.78 i/c      3.80 GHz
abseil                                  :   596.31 MB/s (+/- 3.3 %)    24.71 Mfloat/s      25.27 i/B   639.51 i/f (+/- 0.0 %)      6.27 c/B   158.61 c/f (+/- 0.8 %)      4.03 i/c      3.92 GHz
fastfloat                               :   759.54 MB/s (+/- 1.8 %)    31.47 Mfloat/s      18.42 i/B   466.17 i/f (+/- 0.0 %)      4.70 c/B   118.91 c/f (+/- 0.6 %)      3.92 i/c      3.74 GHz
wuffs                                   :   172.83 MB/s (+/- 2.0 %)     7.16 Mfloat/s      63.21 i/B  1599.60 i/f (+/- 0.0 %)     21.52 c/B   544.48 c/f (+/- 0.4 %)      2.94 i/c      3.90 GHz
> ./build/benchmarks/benchmark -f data/tencent-rapidjson-extracted.txt
# read 3563 lines
volume = 0.0328283 MB
netlib                                  :   176.31 MB/s (+/- 53.5 %)    19.14 Mfloat/s      35.00 i/B   338.14 i/f (+/- 0.0 %)     12.88 c/B   124.45 c/f (+/- 5.2 %)      2.72 i/c      2.38 GHz
doubleconversion                        :   194.22 MB/s (+/- 17.1 %)    21.08 Mfloat/s      58.26 i/B   562.90 i/f (+/- 0.0 %)     18.12 c/B   175.09 c/f (+/- 2.9 %)      3.21 i/c      3.69 GHz
strtod                                  :   138.89 MB/s (+/- 4.7 %)    15.07 Mfloat/s      74.26 i/B   717.41 i/f (+/- 0.0 %)     25.35 c/B   244.94 c/f (+/- 2.5 %)      2.93 i/c      3.69 GHz
abseil                                  :   265.08 MB/s (+/- 5.6 %)    28.77 Mfloat/s      47.16 i/B   455.59 i/f (+/- 0.0 %)     13.29 c/B   128.38 c/f (+/- 3.5 %)      3.55 i/c      3.69 GHz
fastfloat                               :   622.92 MB/s (+/- 9.2 %)    67.61 Mfloat/s      20.78 i/B   200.80 i/f (+/- 0.0 %)      5.66 c/B    54.65 c/f (+/- 6.7 %)      3.67 i/c      3.69 GHz
wuffs                                   :    73.49 MB/s (+/- 4.3 %)     7.98 Mfloat/s     154.97 i/B  1497.23 i/f (+/- 0.0 %)     49.19 c/B   475.23 c/f (+/- 1.1 %)      3.15 i/c      3.79 GHz
> ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines
volume = 1.93374 MB
netlib                                  :   287.27 MB/s (+/- 6.7 %)    16.51 Mfloat/s      31.95 i/B   582.90 i/f (+/- 0.0 %)     12.43 c/B   226.85 c/f (+/- 1.0 %)      2.57 i/c      3.74 GHz
doubleconversion                        :   260.15 MB/s (+/- 3.4 %)    14.95 Mfloat/s      52.52 i/B   958.32 i/f (+/- 0.0 %)     13.76 c/B   251.00 c/f (+/- 1.0 %)      3.82 i/c      3.75 GHz
strtod                                  :   138.97 MB/s (+/- 5.0 %)     7.99 Mfloat/s      70.11 i/B  1279.24 i/f (+/- 0.0 %)     26.11 c/B   476.46 c/f (+/- 1.9 %)      2.68 i/c      3.81 GHz
abseil                                  :   420.22 MB/s (+/- 4.4 %)    24.15 Mfloat/s      31.15 i/B   568.47 i/f (+/- 0.0 %)      8.53 c/B   155.66 c/f (+/- 1.8 %)      3.65 i/c      3.76 GHz
fastfloat                               :  1075.84 MB/s (+/- 5.9 %)    61.82 Mfloat/s      14.14 i/B   257.92 i/f (+/- 0.0 %)      3.36 c/B    61.32 c/f (+/- 2.4 %)      4.21 i/c      3.79 GHz
wuffs                                   :   865.96 MB/s (+/- 5.9 %)    49.76 Mfloat/s      16.96 i/B   309.50 i/f (+/- 0.0 %)      4.18 c/B    76.18 c/f (+/- 2.6 %)      4.06 i/c      3.79 GHz