Closed by dvyukov 3 years ago
320 times slower WAT?1!1 100 microseconds. Does it contact FFS (Float-Formatting Service)?
This is a corner case of a corner case, extremely unlikely to matter in real-world apps. We have large-exponent values (unlikely), and for BenchmarkAppendFloatLarge2-48 we hit a slow path:
In genericFtoa, we end up calling bigFtoa, which uses slow bignum arithmetic. In bigFtoa, the call to roundShortest fails to return quickly, and it proceeds with the slow (but precise) binary-to-decimal mantissa conversion (ftoa.go:257ff).
Probably not worth spending much time on fixing this.
Looks like a nice DoS attack vector on any Go software that ever formats floats.
Also, strconv.AppendFloat(dst[:0], 622666234635.3213e-320, 'f', -1, 64)
outputs just zeros, but still takes 100+ microseconds.
cc: @agl for input re: uses of float formatting for DoS attacks. I suspect this might not be unique to Go, and if so, what do other systems do?
It's an interesting algorithmic complexity attack. (There have been similar issues in the past.)
It's a bit of a sharp edge, but some values will probably always take a little longer to format. We could put effort into optimising this, but it probably makes the code complex in an already complex area. Perhaps it's worth a comment, but I'm not sure about more than that.
How many of these corner case numbers exist? If their number is small, maybe a static lookup table could work....
@nightlyone Hard to say without very careful (and difficult!) mathematical analysis.
Well, the fact that Dmitry found one (in a space of 2**64 elements) by fuzzing (I believe?) suggests that the number of slow values is probably not that small.
Basically all floating point printing algorithms use a fast path for most numbers and fall back to a slow path. Go's fast path is the Grisu3 algorithm, which fails for 0.5% of inputs. The fallback is to some slow decimal code that I wrote a long time ago. In practice this is fine. Grisu3 was basically state of the art when it was added here, I believe by @remyoudompheng.
Very recently a new paper came out that gets the fast path down to "all but 45 float64 values", which you can then handle with a table. See Andrysco, Jhala, and Lerner, "Printing Floating-Point Numbers: A Faster, Always Correct Method", POPL 2016.
If someone wants to implement that, great, but I don't think it's a particularly high priority.
@rsc it seems that by the authors' own benchmarks, Errol is 2.4x slower than Grisu3. Does that seem like a reasonable tradeoff for strconv? (I ask because I'm thinking of working on this but I anticipate pushback if the change makes the 95.5% case more than twice as slow. Another alternative would be to use Errol as the Grisu3 fallback but that seems like a lot of code for printing floats.)
FWIW I have a one-line change that makes the slow path about 35% faster on Dmitry's number:
name old time/op new time/op delta
FormatFloat/Slowpath64-4 68.6µs ± 0% 44.1µs ± 2% -35.71% (p=0.000 n=13+15)
It doesn't fix the problem but it's better than nothing I guess:
@cespare Having good performance for the majority of cases can be critical for some users; in my applications I could not tolerate a 2.4x slowdown (though with Go's code generation the ratio could turn out different). If I understand correctly, Errol can remove the need for big-number arithmetic entirely, so it could serve as the fallback to Grisu3 without so much code-size inflation. That would be quite useful: if 1 number out of 200 is 300 times slower to process, the impact is far from negligible.
@remyoudompheng Errol does not eliminate the need for big-number arithmetic entirely. For floating-point numbers in the range [2^54, 2^131) it falls back to exact integer arithmetic. The algorithm is actually much the same as previous conversion algorithms, except that outside that range it calculates with double-double (2x float64) arithmetic. The double-double has enough precision except for a few enumerated cases that are looked up in a table.

The compute_digits algorithm in the paper for the shortest output produces the largest correctly rounded decimal in the rounding range, so it would need to be modified to produce the closest correctly rounded decimal. Also, note that the paper does not contain any error analysis for float32. I think these issues indicate that Errol is not ready for inclusion in the standard library.
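For context on the double-double (2x float64) arithmetic mentioned above: its basic building block is an exact sum of two float64 values (Knuth's TwoSum), where the rounding error of the sum is recovered exactly in a second float64. A minimal sketch, not taken from Errol's code:

```go
package main

import "fmt"

// twoSum returns (s, e) with s = fl(a+b) and s+e == a+b exactly
// (Knuth's TwoSum). A pair (s, e) is one "double-double" value,
// carrying roughly twice the precision of a single float64.
func twoSum(a, b float64) (s, e float64) {
	s = a + b
	bv := s - a
	av := s - bv
	e = (a - av) + (b - bv)
	return s, e
}

func main() {
	// 1e16 + 1 is not representable in float64; the rounded sum
	// absorbs the big term and e captures the lost 1.0 exactly.
	s, e := twoSum(1e16, 1.0)
	fmt.Println(s, e)
}
```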
Figure 13(c) in the paper shows the fallback performance they are benchmarking Errol against, and it shows roughly linear dependence on the floating-point exponent. Benchmarking the Go fallback, which I think uses the same Dragon4 algorithm, shows roughly quadratic dependence on the exponent. So the current slow path should be able to speed up quite a bit.
Ryū: fast float-to-string conversion, presentation, github
Just came out and is simpler and faster than Grisu3 (~3x in the paper's benchmarks). It requires only integer operations, although for float64 it uses 128-bit integers, which should be made more efficient by #24813. It also requires a few small lookup tables:
| Type    | B0  | B1  | #Entries | Total Memory |
|---------|-----|-----|----------|--------------|
| Float32 | 60  | 63  | 78       | 624 Byte     |
| Float64 | 124 | 124 | 617      | 9,872 Byte   |
Since it is both simpler and faster, I would recommend investigating the ryu algorithm as the replacement for grisu3 instead of errol.
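Regarding the 128-bit integer operations: Go's math/bits package (the target of the #24813 optimizations) provides the 64x64->128 multiply that a Ryū implementation's table-multiply step builds on. A hedged sketch; `mulShift64` is an illustrative helper, not Ryū's actual code:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulShift64 computes (m * mul) >> shift using the full 128-bit
// product, for 0 < shift < 64. This is the kind of operation Ryū
// performs against its precomputed power-of-five tables.
func mulShift64(m, mul uint64, shift uint) uint64 {
	hi, lo := bits.Mul64(m, mul)
	return hi<<(64-shift) | lo>>shift
}

func main() {
	// (3 * 5) >> 2 == 15 >> 2 == 3, computed via the 128-bit path.
	fmt.Println(mulShift64(3, 5, 2))
}
```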
I have implemented Ryu in Go at https://github.com/cespare/ryu.
On my machine it is significantly faster than strconv for all the inputs I've tried.
name old time/op new time/op delta
FormatFloat32-12 128ns ± 1% 50ns ± 2% -60.82% (p=0.000 n=7+8)
FormatFloat64-12 129ns ± 4% 65ns ± 5% -49.54% (p=0.000 n=7+8)
AppendFloat32/0e+00-12 24.4ns ± 1% 3.0ns ± 1% -87.88% (p=0.000 n=8+8)
AppendFloat32/1e+00-12 26.5ns ± 1% 13.2ns ± 3% -49.98% (p=0.000 n=8+8)
AppendFloat32/3e-01-12 52.2ns ± 1% 32.5ns ± 2% -37.73% (p=0.000 n=8+8)
AppendFloat32/1e+06-12 41.2ns ± 1% 17.9ns ± 1% -56.45% (p=0.000 n=8+7)
AppendFloat32/-1.2345e+02-12 83.3ns ± 2% 34.2ns ± 1% -58.90% (p=0.000 n=8+8)
AppendFloat64/0e+00-12 24.5ns ± 2% 3.3ns ± 2% -86.50% (p=0.000 n=8+8)
AppendFloat64/1e+00-12 26.9ns ± 1% 14.5ns ± 1% -46.06% (p=0.001 n=8+6)
AppendFloat64/3e-01-12 53.0ns ± 1% 42.5ns ± 0% -19.75% (p=0.001 n=8+6)
AppendFloat64/1e+06-12 41.4ns ± 1% 21.1ns ± 1% -49.05% (p=0.000 n=8+8)
AppendFloat64/-1.2345e+02-12 83.8ns ± 1% 43.3ns ± 1% -48.32% (p=0.000 n=8+8)
AppendFloat64/6.226662346353213e-309-12 25.5µs ± 1% 0.0µs ± 1% -99.84% (p=0.000 n=8+8)
(That last one is @dvyukov's pathological case.)
While some inputs are faster than others, there aren't any drastically slower fallback paths.
ryu_test.go:279: after sampling 50000 float64s:
ryu: min = 2ns max = 90ns median = 41ns mean = 41ns
strconv (stdlib): min = 8ns max = 25845ns median = 106ns mean = 154ns
It's true that Ryu requires some lookup tables. However, in his C implementation @ulfjack (the Ryu author) has a size-optimized version that uses much smaller lookup tables in exchange for a little more CPU cost. I've implemented that as well and it's still faster than the current strconv implementation in all the cases I've looked at:
name old time/op new time/op delta
FormatFloat32-12 129ns ± 2% 49ns ± 1% -61.72% (p=0.000 n=8+8)
FormatFloat64-12 130ns ± 3% 72ns ± 5% -44.32% (p=0.000 n=7+8)
AppendFloat32/0e+00-12 24.5ns ± 2% 3.0ns ± 1% -87.83% (p=0.000 n=8+8)
AppendFloat32/1e+00-12 26.4ns ± 1% 13.1ns ± 1% -50.26% (p=0.000 n=7+8)
AppendFloat32/3e-01-12 52.6ns ± 2% 32.4ns ± 1% -38.43% (p=0.000 n=8+8)
AppendFloat32/1e+06-12 41.3ns ± 2% 17.6ns ± 1% -57.51% (p=0.000 n=8+8)
AppendFloat32/-1.2345e+02-12 83.5ns ± 1% 34.4ns ± 1% -58.82% (p=0.000 n=8+8)
AppendFloat64/0e+00-12 24.6ns ± 2% 3.3ns ± 1% -86.63% (p=0.000 n=8+8)
AppendFloat64/1e+00-12 26.7ns ± 1% 14.6ns ± 4% -45.51% (p=0.000 n=8+8)
AppendFloat64/3e-01-12 52.7ns ± 1% 50.0ns ± 1% -5.17% (p=0.000 n=8+8)
AppendFloat64/1e+06-12 41.2ns ± 1% 21.1ns ± 2% -48.61% (p=0.000 n=7+8)
AppendFloat64/-1.2345e+02-12 83.7ns ± 1% 50.9ns ± 1% -39.17% (p=0.000 n=8+8)
AppendFloat64/6.226662346353213e-309-12 25.8µs ± 2% 0.0µs ± 1% -99.81% (p=0.000 n=8+8)
Based on these promising results and the comments by @bmkessler above, it seems like switching strconv to use Ryu is the best option available to us. I've taken the liberty of retitling the issue accordingly. I intend to work on this during the Go 1.13 cycle.
I'm also in favour of switching. I have a version of my own in the works, but I'm taking a different path by making it look like the old Grisu code. I'll try to publish it somewhere.
I have pushed my own things in the following branch https://github.com/remyoudompheng/go/tree/ryu
A few notes:
I have also prepared a few denormals that are very hard to parse when printed in their shortest form (they are also hard for AppendFloat). The last one is "kind of" pathological, but since it has few digits the existing code handles it fine.
BenchmarkAtof/1.68514038588815e-309-4                 50000    18384 ns/op
BenchmarkAtof/9.11691642378e-312-4                    50000    19854 ns/op
BenchmarkAtof/1.62420278e-315-4                    10000000     71.4 ns/op
BenchmarkAppendFloatHard/341076211242912p-1074-4      50000    32519 ns/op
BenchmarkAppendFloatHard/1845284427387p-1074-4        50000    32516 ns/op
BenchmarkAppendFloatHard/328742302p-1074-4         20000000      103 ns/op
The corresponding numbers are:
9.11691642378e-312 = 0x1ada385d67b.7fffffff5d9...p-1074
1.62420278e-315    = 0x1398359e.7fffe022p-1074
Hello, I have pushed to my branch preliminary support for ParseFloat (commit https://github.com/golang/go/commit/03509643cb1a9d207fb790e30006a1d8d03d7efb). When feeding in the shortest representations of float64s, only 93% of them can be processed, because I still have to add support for (and prove the correctness of) 17-digit strings.
Benchmarks of the computational part (converting a uint64 mantissa and power-of-10 exponent to float64). "Old" is the old code, which tries a simple float64 multiply, then Grisu, then big integers. "New" tries only Ryu; notice the extremely consistent running time.
benchmark                                old ns/op  new ns/op     delta
BenchmarkAtof/33909e0-4                       5.64       14.9   +164.18%
BenchmarkAtof/3397784e-4-4                    6.58       20.3   +208.51%
BenchmarkAtof/509e73-4                        36.9       18.0    -51.22%
BenchmarkAtof/6226662346353213e-324-4         39.8       18.5    -53.52%
BenchmarkAtof/6808957268280643e116-4          5684       18.0    -99.68%
BenchmarkAtof/4334126125515466e-225-4         9829       18.0    -99.82%
BenchmarkAtof/168514038588815e-323-4         18157       18.6    -99.90%
BenchmarkAtof/911691642378e-323-4            19622       18.6    -99.91%
BenchmarkAtof/162420278e-323-4                40.0       18.5    -53.75%
BenchmarkAtof/22250738585072011e-324-4        40.0       18.6    -53.50%
Benchmark of ParseFloat: in this benchmark the Grisu parsing is replaced by the Ryu routine (keeping the "float64 multiply" fast path). This implementation does not try long mantissas (so it falls back to multi-precision arithmetic). You can notice the high cost of boilerplate compared to the pure computation.
benchmark                                      old ns/op  new ns/op     delta
BenchmarkAtof/33909-4                               24.1       24.0     -0.41%
BenchmarkAtof/339.7784-4                            29.1       29.1     +0.00%
BenchmarkAtof/-5.09e75-4                            63.4       46.2    -27.13%
BenchmarkAtof/123456789123456789123456789-4         98.8        999   +911.13%
BenchmarkAtof/622666234635.3213e-320-4              82.8       64.1    -22.58%
BenchmarkAtof/33909#01-4                            23.5       23.6     +0.43%
BenchmarkAtof/339.778-4                             26.5       26.4     -0.38%
BenchmarkAtof/12.3456e32-4                          66.7       66.2     -0.75%
BenchmarkAtof/100000000000000016777215-4             844        786     -6.87%
BenchmarkAtof/100000000000000016777216-4             693        643     -7.22%
BenchmarkAtof/6808957268280643e116-4                5836       65.5    -98.88%
BenchmarkAtof/4.334126125515466e-210-4             10045       72.3    -99.28%
BenchmarkAtof/1.68514038588815e-309-4              18462       64.8    -99.65%
BenchmarkAtof/9.11691642378e-312-4                 19818       58.8    -99.70%
BenchmarkAtof/1.62420278e-315-4                     71.1       54.4    -23.49%
BenchmarkAtof/2.2250738585072011e-308-4             84.2       67.1    -20.31%
Nice. Great observation, @remyoudompheng. I haven't published my parsing code yet; really need to finish that. I believe it can be made fast regardless of input string length, i.e., scale linearly.
As of commit https://github.com/remyoudompheng/go/commit/ed351765ae, the Atof implementation is now extended to work whenever the decimal digits form a number that fits in a uint64. It can also handle unambiguous, very long inputs by simply converting the rounded-up mantissa when that yields the same result. This allows it to handle more inputs, including a few exceptional corner cases that I have added to the tests (all rejected by the "reverse Grisu3"):
// Halfway is 500016682268521616.00000000000001e229
{"500016682268521616e229", "5.000166822685216e+246", nil}, // 18 digits necessary
// Halfway is 1873795671212201760.9999999999999998e108
{"1873795671212201761e108", "1.873795671212202e+126", nil}, // 19 digits (61 bits) necessary
// Halfway is 10027399025072458413.99999999999998e140
{"10027399025072458414e140", "1.002739902507246e+159", nil}, // 20 digits (64 bits) necessary
@ulfjack I am interested in knowing whether the "TestRyuNoCarry" is exactly the same proof as in your paper or not. The implementation is now feature complete compared to what Grisu3 was used for (shortest formatting, fixed formatting, parsing).
Final benchmarks:
BenchmarkAtof/33909-4                               23.2       23.0     -0.86%
BenchmarkAtof/339.7784-4                            29.0       28.7     -1.03%
BenchmarkAtof/-5.09e75-4                            63.8       45.2    -29.15%
BenchmarkAtof/18446744073709551608-4                98.0       68.8    -29.80%
BenchmarkAtof/123456789123456789123456789-4         90.7       82.1     -9.48%
BenchmarkAtof/622666234635.3213e-320-4              80.0       59.1    -26.12%
BenchmarkAtof/33909#01-4                            22.9       22.2     -3.06%
BenchmarkAtof/339.778-4                             26.7       26.8     +0.37%
BenchmarkAtof/12.3456e32-4                          66.2       61.4     -7.25%
BenchmarkAtof/2.3399415873862403e69-4               2836       59.0    -97.92%
BenchmarkAtof/500016682268521616e229-4             19610       67.7    -99.65%
BenchmarkAtof/1873795671212201761e108-4             6383       76.2    -98.81%
BenchmarkAtof/10027399025072458414e140-4            7728       79.2    -98.98%
BenchmarkAtof/100000000000000016777215-4             889        833     -6.30%
BenchmarkAtof/100000000000000016777216-4             768        721     -6.12%
BenchmarkAtof/6808957268280643e116-4                5828       57.7    -99.01%
BenchmarkAtof/4.334126125515466e-210-4              9924       59.6    -99.40%
BenchmarkAtof/1.68514038588815e-309-4              18428       58.8    -99.68%
BenchmarkAtof/9.11691642378e-312-4                 19888       55.0    -99.72%
BenchmarkAtof/1.62420278e-315-4                     73.7       54.3    -26.32%
BenchmarkAtof/2.2250738585072011e-308-4             82.7       61.1    -26.12%
@remyoudompheng I'm afraid I don't understand your question. The proof in the paper only covers binary to decimal conversion, not the other way round. I have since written down an extended proof that shows that the same concepts apply to all source and target bases with minor changes for certain base pairs. I believe that my implementation closely follows the proof.
@ulfjack I meant that this unit test (https://github.com/remyoudompheng/go/blob/ed351765ae8a8307e4df08a24511ec058b7f7ccc/src/strconv/extfloat2_test.go#L478) aims at embedding a proof of what is paragraph 3.2.3 of your paper, to make the code self-contained. I believe it is essentially equivalent.
@remyoudompheng and I discussed this and he's going to work on creating the CLs for his implementation in strconv. I may help review the changes. I'm changing the assignment accordingly.
Status update:
Shortest formatting is not yet cleaned up.
Shortest formatting is now included and I added a fix to avoid all long divisions on 32-bit platforms (especially arm where performance was awful). ARM now gets performance gains as well.
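For the curious: the standard way to avoid long divisions is to replace division by a known constant such as 10 with a multiply-and-shift against a precomputed reciprocal, which on 32-bit targets avoids calling the slow 64-bit division runtime routine. A sketch of the general trick (`div10` is an illustrative helper, not the code in the CL):

```go
package main

import "fmt"

// div10 computes x/10 for any uint32 without a divide instruction,
// using the classic magic-number reciprocal 0xCCCCCCCD (about
// 2^35/10): (x * 0xCCCCCCCD) >> 35 == x/10 for all uint32 x.
func div10(x uint32) uint32 {
	return uint32(uint64(x) * 0xCCCCCCCD >> 35)
}

func main() {
	fmt.Println(div10(4294967295), div10(99), div10(7))
}
```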
Change https://golang.org/cl/170078 mentions this issue: strconv: implement Ryū-like algorithm for atof
Change https://golang.org/cl/170079 mentions this issue: strconv: implement Ryū-like algorithm for fixed precision ftoa
Change https://golang.org/cl/170080 mentions this issue: strconv: implement Ryū algorithm for ftoa shortest mode
Reading @rsc's comment on CL 170078, it looks like these changes were intended to be reviewed early-in-cycle for 1.15. @remyoudompheng is this still the case?
go version devel +15f2d0e Fri May 13 01:19:05 2016 +0000 linux/amd64