Closed. atrieu closed this issue 1 year ago.
How does C++17's https://en.cppreference.com/w/cpp/header/charconv compare to those?
https://git.sr.ht/~atrieu/simple_fastfloat_benchmark/tree/master/item/README.md
says
cmake -B build .
cmake --build build
./build/benchmarks/benchmark
This builds a debug (non-optimized) binary instead of a release (optimized) binary. Wuffs differs from the other libraries you're measuring (netlib, doubleconversion, etc) in that you're building Wuffs from source. The other libraries are pre-built.
I'm not very familiar with cmake, but try replacing
cmake -B build .
with
cmake -B build -DCMAKE_BUILD_TYPE=Release .
Basically, running grep -r ^CXX_FLAGS your_build_directory
should show CXX_FLAGS = -O3 -DNDEBUG. In your benchmark.cpp
file, you can also add these lines:
#ifndef NDEBUG
#error "Benchmarks should only be compiled with full optimization."
#endif
CC @lemire in case there's anything worth backporting to the upstream simple_fastfloat_benchmark
repo. I'll repeat that I'm not very familiar with cmake. I also don't have a Windows system readily available, so I can't test out its "Build on Windows" instructions.
The expected file format, in simple_fastfloat_benchmark, is a list of ASCII numbers, one per line, like so...
-65.613616999999977
43.420273000000009
-65.619720000000029
43.418052999999986
-65.625
43.421379000000059
-65.636123999999882
43.449714999999969
-65.633056999999951
43.474709000000132
-65.611389000000031
43.513054000000068
-65.605835000000013
43.516105999999979
-65.598343
43.515830999999935
-65.566101000000003
43.508331000000055
...
I am not sure what happens if you feed it some other format, but I would consider the benchmark results invalid.
The other libraries are pre-built.
The fast_float library is header-only. Except for strtod, everything is built by cmake.
This builds a debug (non-optimized) binary instead of a release (optimized) binary
You'd be correct if everything were left at its defaults, but the CMakeLists.txt file overrides the default. You can check that in simple_fastfloat_benchmark the default build type is Release:
This does not work under Windows, but the README file provides the instructions:
I'd like to suggest (though I did not investigate) that the atrocious results regarding wuffs are not meaningful. They are likely caused by pointing the software at an unexpected file format.
Though the README is not very explicit, it does say that we try to parse each line as a number:
Please make sure that this is what you are pointing the software at.
I only followed the instructions from the original README, so it should be compiled in Release mode as I understand it. I do use the benchmarks on correctly formatted files, I think, e.g.,
.0
.0000
0
0.0
0.00
0.000
0.00000
0.00000000000000000000000000000000000000000000000000000000000
0.000000000000000000000000000000000000000000000000000000000000
0.00000e+0
0.00e+0
0.0e+0
0E1
0E11328
0E3
0E4
0E6
0E8
0E86
0e+0
0e+59
1e-325
2.5e-324
40000e-328
4e-324
5e-324
3.6118391954597633033930746e-310
4.8980019078642963771876647e-310
1.0263847514109591822395668e-309
1.2191449367984545884039792e-309
1.3586641673566107715029324e-309
1.5947055924688004253007524e-309
1.6597654066767748210494446e-309
1.8594943567529094619479745e-309
1.9852357578498988925763729e-309
2.1233738662160330012710983e-309
These were obtained by extracting the fourth column from Nigel's data set, i.e., awk '{print $4}' ../../parse-number-fxx-test-data/data/google-double-conversion.txt > google-double-conversion-extracted.txt
The diff on benchmark.cpp
is as follows.
diff --git a/benchmarks/benchmark.cpp b/benchmarks/benchmark.cpp
index 0246082..3380eba 100644
--- a/benchmarks/benchmark.cpp
+++ b/benchmarks/benchmark.cpp
@@ -3,6 +3,12 @@
#include "absl/strings/numbers.h"
#include "fast_float/fast_float.h"
+#define WUFFS_IMPLEMENTATION
+#define WUFFS_CONFIG__STATIC_FUNCTIONS
+#define WUFFS_CONFIG__MODULES
+#define WUFFS_CONFIG__MODULE__BASE
+#include "wuffs-unsupported-snapshot.c"
+
#ifdef ENABLE_RYU
#include "ryu_parse.h"
#endif
@@ -128,6 +134,28 @@ double findmax_strtod(std::vector<std::string> &s) {
}
return answer;
}
+
+uint32_t wuffs_options = WUFFS_BASE__PARSE_NUMBER_XXX__ALLOW_UNDERSCORES \
+ | WUFFS_BASE__PARSE_NUMBER_XXX__ALLOW_MULTIPLE_LEADING_ZEROES;
+
+double findmax_wuffs(std::vector<std::string> &s) {
+ double answer = 0;
+ double x = 0;
+ for (std::string &st : s) {
+ wuffs_base__slice_u8 sx = {
+ .ptr = (uint8_t*)st.data(),
+ .len = st.size()
+ };
+ wuffs_base__result_f64 res = wuffs_base__parse_number_f64(sx, wuffs_options);
+ if (res.status.repr) {
+ throw std::runtime_error("bug in findmax_wuffs");
+ }
+ x = res.value;
+ answer = answer > x ? answer : x;
+ }
+ return answer;
+}
+
// Why not `|| __cplusplus > 201703L`? Because GNU libstdc++ does not have
// float parsing for std::from_chars.
#if defined(_MSC_VER)
@@ -292,6 +320,7 @@ void process(std::vector<std::string> &lines, size_t volume) {
#endif
pretty_print(volume, lines.size(), "abseil", time_it_ns(lines, findmax_absl_from_chars, repeat));
pretty_print(volume, lines.size(), "fastfloat", time_it_ns(lines, findmax_fastfloat, repeat));
+ pretty_print(volume, lines.size(), "wuffs", time_it_ns(lines, findmax_wuffs, repeat));
#ifdef FROM_CHARS_AVAILABLE_MAYBE
pretty_print(volume, lines.size(), "from_chars", time_it_ns(lines, findmax_from_chars, repeat));
#endif
@@ -308,6 +337,7 @@ void fileload(const char *filename) {
lines.reserve(10000); // let us reserve plenty of memory.
size_t volume = 0;
while (getline(inputfile, line)) {
+ if (0 < line.size() && line[0] == '.') { line = "0" + line; }
volume += line.size();
lines.push_back(line);
}
The defines are the same as in manual-test-parse-number-f64.cc, so hopefully they are correct. I think I'm correctly calling wuffs_base__parse_number_f64 (?); it should throw an error otherwise. The last modification avoids feeding Wuffs a leading dot, which its implementation doesn't support, but this happens outside the timed section anyway, so it shouldn't affect the numbers.
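As a standalone sketch of that last pre-processing step (assuming, as above, that Wuffs rejects a number with a bare leading dot):

```cpp
#include <string>

// Mirror of the fileload() tweak in the diff above: prepend "0" to lines
// that start with a decimal point, e.g. ".0000" becomes "0.0000", so that
// a parser requiring an integer part accepts them.
std::string normalize_leading_dot(std::string line) {
    if (!line.empty() && line[0] == '.') {
        line.insert(line.begin(), '0');
    }
    return line;
}
```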
I only followed the instructions from the original README, so it should be compiled in Release mode as I understand.
It should, yes.
20 MB/s is very slow, however. Have you done some profiling? Just disable all the other kernels, and run the benchmark with perf record <command line>
followed by perf report
. You should see something glaring at you.
No, I hadn't done any profiling, as I don't really know much about it. But here are the results from using perf:
For comparison, here are the results on data/canada.txt,
where wuffs performs more as would be expected.
I'm not sure I understand what this all means. Are the small_xshift
functions too slow?
"Debug vs Release" compiler flags was an incorrect guess. Sorry.
Digging deeper into it, test data like data/ulfjack-ryu-extracted.txt
contains many lines like 1.0168286519992372611942638e+100
. Crucially, this has more than 19 significant digits, so the Eisel-Lemire algorithm, the fastest algorithm, does not apply. Both Wuffs and fast_float fail over to a slower fallback algorithm.
Wuffs' fallback algorithm is Simple Decimal Conversion. fast_float also used to fall back to SDC, but more recently uses the big-integer arithmetic algorithm. I'm not familiar with this newer algorithm but it's presumably faster than SDC.
(The parse-number-fxx-test-data test suite is more about testing correctness than testing performance. It isn't about collecting a representative sample of real world numbers. For example, when printing float64 numbers, 19 significant digits is often enough.)
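A quick way to see which lines exceed the 19-significant-digit fast path is a small counter like this (my own sketch, not code from either library):

```cpp
#include <cctype>
#include <string>

// Count significant decimal digits in a number literal: skip the sign and
// the decimal point, ignore leading zeroes, and stop at the exponent.
int significant_digits(const std::string& s) {
    int count = 0;
    bool seen_nonzero = false;
    for (char c : s) {
        if (c == 'e' || c == 'E') break;                 // exponent: done
        if (!std::isdigit((unsigned char)c)) continue;   // '.', '-', '+'
        if (c == '0' && !seen_nonzero) continue;         // leading zeroes
        seen_nonzero = true;
        ++count;
    }
    return count;
}
```

For `1.0168286519992372611942638e+100` this gives 26, well past Eisel-Lemire's limit, whereas the canada.txt-style coordinates stay under 19.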
I do understand that data/ulfjack-ryu-extracted.txt
is not really representative of real world numbers since Wuffs performs similarly to fast_float on canada.txt
for instance. But, for the record, here are the results when using tag v2.0.0
(last release still using the SDC algorithm) of fast_float on data/ulfjack-ryu-extracted.txt
.
# read 599458 lines
volume = 14.4666 MB
netlib : 122.65 MB/s (+/- 5.0 %) 5.08 Mfloat/s 86.69 i/B 2193.64 i/f (+/- 0.0 %) 28.83 c/B 729.60 c/f (+/- 2.7 %) 3.01 i/c 3.71 GHz
doubleconversion : 285.83 MB/s (+/- 2.0 %) 11.84 Mfloat/s 42.37 i/B 1072.09 i/f (+/- 0.0 %) 11.94 c/B 302.11 c/f (+/- 1.0 %) 3.55 i/c 3.58 GHz
strtod : 177.26 MB/s (+/- 1.6 %) 7.35 Mfloat/s 53.96 i/B 1365.46 i/f (+/- 0.0 %) 19.38 c/B 490.34 c/f (+/- 1.1 %) 2.78 i/c 3.60 GHz
abseil : 544.78 MB/s (+/- 4.0 %) 22.57 Mfloat/s 25.27 i/B 639.51 i/f (+/- 0.0 %) 6.27 c/B 158.59 c/f (+/- 2.7 %) 4.03 i/c 3.58 GHz
fastfloat : 629.08 MB/s (+/- 2.0 %) 26.07 Mfloat/s 21.08 i/B 533.39 i/f (+/- 0.0 %) 5.49 c/B 138.99 c/f (+/- 1.0 %) 3.84 i/c 3.62 GHz
wuffs : 21.53 MB/s (+/- 1.5 %) 0.89 Mfloat/s 490.44 i/B 12410.66 i/f (+/- 0.0 %) 160.77 c/B 4068.41 c/f (+/- 0.7 %) 3.05 i/c 3.63 GHz
So, while changing the fallback algorithm to big-integer arithmetic did improve performance by ~100 MB/s, I'm not sure it explains why Wuffs is comparatively so slow.
Running the bisect further back than fast_float v2.0.0
leads to https://github.com/fastfloat/fast_float/commit/05ad45dfb5a041106f6e4c385f3ab2a0ea418fbd "Let us try the long path" being the difference. The key part of that patch is:
diff --git a/include/fast_float/simple_decimal_conversion.h b/include/fast_float/simple_decimal_conversion.h
index 410ba05..ef1f0ad 100644
--- a/include/fast_float/simple_decimal_conversion.h
+++ b/include/fast_float/simple_decimal_conversion.h
@@ -368,8 +354,20 @@ adjusted_mantissa compute_float(decimal &d) {
template <typename binary>
adjusted_mantissa parse_long_mantissa(const char *first, const char* last) {
decimal d = parse_decimal(first, last);
+ const uint64_t mantissa = d.to_truncated_mantissa();
+ const int64_t exponent = d.to_truncated_exponent();
+ adjusted_mantissa am1 = compute_float<binary>(exponent, mantissa);
+ adjusted_mantissa am2 = compute_float<binary>(exponent, mantissa+1);
+ if( am1 == am2 ) { return am1; }
return compute_float<binary>(d);
}
Even if we're presented with more than 19 digits, we try Eisel-Lemire twice to get a lower and an upper bound. If the two bounds are equal, return that.
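The idea can be sketched with plain strtod standing in for Eisel-Lemire (my own illustration; fast_float's real code works on adjusted_mantissa values, not strings): truncate to at most 19 significant digits, parse both the truncated mantissa and the truncated mantissa plus one, and if both round to the same double, the dropped tail cannot matter.

```cpp
#include <cstdlib>
#include <string>

// Sketch of the lower/upper-bound trick, with strtod standing in for the
// Eisel-Lemire fast path. `digits` is the mantissa truncated to at most 19
// significant digits, `exp10` the matching decimal exponent. If mantissa m
// and m+1 parse to the same double, the truncated digits cannot change the
// rounding, so we can skip the slow fallback.
bool truncation_resolves(const std::string& digits, int exp10, double* out) {
    unsigned long long m = std::strtoull(digits.c_str(), nullptr, 10);
    std::string lo = std::to_string(m) + "e" + std::to_string(exp10);
    std::string hi = std::to_string(m + 1) + "e" + std::to_string(exp10);
    double a = std::strtod(lo.c_str(), nullptr);
    double b = std::strtod(hi.c_str(), nullptr);
    if (a == b) {
        *out = a;
        return true;
    }
    return false;  // bounds disagree: fall back to the slow path
}
```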
That code was removed in
https://github.com/fastfloat/fast_float/commit/192b271c128eac0e48186e6dbb7901d42423b30f "Removing dead code", but the optimization presumably lives on somewhere near https://github.com/fastfloat/fast_float/blob/24374ece716db48f974f49da4aa5851aa371cfa9/include/fast_float/parse_number.h#L208-L212 if you follow the trail through adjusted_mantissa, compute_float, compute_error, etc.
Thanks @atrieu for the bug report and the informative follow-ups, even when I was confidently wrong. :-)
Your ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt
Wuffs times should be faster now. Let me know if they're not.
Thanks! Wuffs is definitely faster; here are the new numbers.
tencent-rapidjson-extracted
still seems to be an outlier, though, if you feel like investigating.
> ./build/benchmarks/benchmark -f data/google-double-conversion-extracted.txt
# read 564745 lines
volume = 12.7842 MB
netlib : 117.11 MB/s (+/- 6.1 %) 5.17 Mfloat/s 95.72 i/B 2272.09 i/f (+/- 0.0 %) 31.37 c/B 744.59 c/f (+/- 3.9 %) 3.05 i/c 3.85 GHz
doubleconversion : 298.31 MB/s (+/- 2.3 %) 13.18 Mfloat/s 43.91 i/B 1042.18 i/f (+/- 0.0 %) 12.07 c/B 286.39 c/f (+/- 0.9 %) 3.64 i/c 3.77 GHz
strtod : 174.28 MB/s (+/- 4.0 %) 7.70 Mfloat/s 56.60 i/B 1343.55 i/f (+/- 0.0 %) 20.39 c/B 483.88 c/f (+/- 1.8 %) 2.78 i/c 3.73 GHz
abseil : 560.13 MB/s (+/- 3.3 %) 24.74 Mfloat/s 26.22 i/B 622.35 i/f (+/- 0.0 %) 6.43 c/B 152.66 c/f (+/- 1.6 %) 4.08 i/c 3.78 GHz
fastfloat : 748.82 MB/s (+/- 2.3 %) 33.08 Mfloat/s 18.68 i/B 443.46 i/f (+/- 0.0 %) 4.78 c/B 113.55 c/f (+/- 0.8 %) 3.91 i/c 3.76 GHz
wuffs : 160.94 MB/s (+/- 3.0 %) 7.11 Mfloat/s 67.34 i/B 1598.37 i/f (+/- 0.0 %) 23.11 c/B 548.58 c/f (+/- 1.5 %) 2.91 i/c 3.90 GHz
> ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt
# read 599458 lines
volume = 14.4666 MB
netlib : 131.34 MB/s (+/- 3.5 %) 5.44 Mfloat/s 86.69 i/B 2193.64 i/f (+/- 0.0 %) 28.49 c/B 720.89 c/f (+/- 0.3 %) 3.04 i/c 3.92 GHz
doubleconversion : 316.52 MB/s (+/- 3.0 %) 13.12 Mfloat/s 42.37 i/B 1072.09 i/f (+/- 0.0 %) 11.75 c/B 297.43 c/f (+/- 0.8 %) 3.60 i/c 3.90 GHz
strtod : 186.76 MB/s (+/- 4.4 %) 7.74 Mfloat/s 53.96 i/B 1365.46 i/f (+/- 0.0 %) 19.40 c/B 490.92 c/f (+/- 1.7 %) 2.78 i/c 3.80 GHz
abseil : 596.31 MB/s (+/- 3.3 %) 24.71 Mfloat/s 25.27 i/B 639.51 i/f (+/- 0.0 %) 6.27 c/B 158.61 c/f (+/- 0.8 %) 4.03 i/c 3.92 GHz
fastfloat : 759.54 MB/s (+/- 1.8 %) 31.47 Mfloat/s 18.42 i/B 466.17 i/f (+/- 0.0 %) 4.70 c/B 118.91 c/f (+/- 0.6 %) 3.92 i/c 3.74 GHz
wuffs : 172.83 MB/s (+/- 2.0 %) 7.16 Mfloat/s 63.21 i/B 1599.60 i/f (+/- 0.0 %) 21.52 c/B 544.48 c/f (+/- 0.4 %) 2.94 i/c 3.90 GHz
> ./build/benchmarks/benchmark -f data/tencent-rapidjson-extracted.txt
# read 3563 lines
volume = 0.0328283 MB
netlib : 176.31 MB/s (+/- 53.5 %) 19.14 Mfloat/s 35.00 i/B 338.14 i/f (+/- 0.0 %) 12.88 c/B 124.45 c/f (+/- 5.2 %) 2.72 i/c 2.38 GHz
doubleconversion : 194.22 MB/s (+/- 17.1 %) 21.08 Mfloat/s 58.26 i/B 562.90 i/f (+/- 0.0 %) 18.12 c/B 175.09 c/f (+/- 2.9 %) 3.21 i/c 3.69 GHz
strtod : 138.89 MB/s (+/- 4.7 %) 15.07 Mfloat/s 74.26 i/B 717.41 i/f (+/- 0.0 %) 25.35 c/B 244.94 c/f (+/- 2.5 %) 2.93 i/c 3.69 GHz
abseil : 265.08 MB/s (+/- 5.6 %) 28.77 Mfloat/s 47.16 i/B 455.59 i/f (+/- 0.0 %) 13.29 c/B 128.38 c/f (+/- 3.5 %) 3.55 i/c 3.69 GHz
fastfloat : 622.92 MB/s (+/- 9.2 %) 67.61 Mfloat/s 20.78 i/B 200.80 i/f (+/- 0.0 %) 5.66 c/B 54.65 c/f (+/- 6.7 %) 3.67 i/c 3.69 GHz
wuffs : 73.49 MB/s (+/- 4.3 %) 7.98 Mfloat/s 154.97 i/B 1497.23 i/f (+/- 0.0 %) 49.19 c/B 475.23 c/f (+/- 1.1 %) 3.15 i/c 3.79 GHz
> ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines
volume = 1.93374 MB
netlib : 287.27 MB/s (+/- 6.7 %) 16.51 Mfloat/s 31.95 i/B 582.90 i/f (+/- 0.0 %) 12.43 c/B 226.85 c/f (+/- 1.0 %) 2.57 i/c 3.74 GHz
doubleconversion : 260.15 MB/s (+/- 3.4 %) 14.95 Mfloat/s 52.52 i/B 958.32 i/f (+/- 0.0 %) 13.76 c/B 251.00 c/f (+/- 1.0 %) 3.82 i/c 3.75 GHz
strtod : 138.97 MB/s (+/- 5.0 %) 7.99 Mfloat/s 70.11 i/B 1279.24 i/f (+/- 0.0 %) 26.11 c/B 476.46 c/f (+/- 1.9 %) 2.68 i/c 3.81 GHz
abseil : 420.22 MB/s (+/- 4.4 %) 24.15 Mfloat/s 31.15 i/B 568.47 i/f (+/- 0.0 %) 8.53 c/B 155.66 c/f (+/- 1.8 %) 3.65 i/c 3.76 GHz
fastfloat : 1075.84 MB/s (+/- 5.9 %) 61.82 Mfloat/s 14.14 i/B 257.92 i/f (+/- 0.0 %) 3.36 c/B 61.32 c/f (+/- 2.4 %) 4.21 i/c 3.79 GHz
wuffs : 865.96 MB/s (+/- 5.9 %) 49.76 Mfloat/s 16.96 i/B 309.50 i/f (+/- 0.0 %) 4.18 c/B 76.18 c/f (+/- 2.6 %) 4.06 i/c 3.79 GHz
I modified Lemire's
simple_fastfloat_benchmark
to include testing wuffs' f64 parsing implementation. On some benchmarks extracted from Tao's parse-number-fxx-test-data,
it is the slowest parser, but it does fine on some others. My tests can be found here. I'm not a C/C++ expert, so hopefully it's not because of a stupid mistake I made.