Closed by dvyukov 3 years ago
320 times slower WAT?1!1 100 microseconds. Does it contact FFS (Float-Formatting Service)?
This is a corner case of a corner case, extremely unlikely to matter in real-world apps. We have large-exponent values (unlikely), and for BenchmarkAppendFloatLarge2-48 we hit a slow path:
In genericFtoa, we end up calling bigFtoa, which uses slow bignum arithmetic. In bigFtoa, the call to roundShortest fails to return quickly, and it proceeds with the slow (but precise) binary-to-decimal mantissa conversion (ftoa.go:257ff).
Probably not worth spending much time on fixing this.
Looks like a nice DoS attack vector on any Go software that ever formats floats.
Also, strconv.AppendFloat(dst[:0], 622666234635.3213e-320, 'f', -1, 64)
outputs just zeros, but still takes 100+ microseconds.
cc: @agl for input re: uses of float formatting for DoS attacks. I suspect this might not be unique to Go, and if so, what do other systems do?
It's an interesting algorithmic complexity attack. (There have been similar issues in the past.)
It's a bit of a sharp edge, but some values will probably always take a little longer to format. We could put effort into optimising this, but it probably makes the code complex in an already complex area. Perhaps it's worth a comment, but I'm not sure about more than that.
How many of these corner case numbers exist? If their number is small, maybe a static lookup table could work....
@nightlyone Hard to say without very careful (and difficult!) mathematical analysis.
Well, the fact that Dmitry found one (in a space of 2**64 elements) by fuzzing (I believe?) suggests that the number of slow values is probably not that small.
Basically all floating point printing algorithms use a fast path for most numbers and fall back to a slow path. Go's fast path is the Grisu3 algorithm, which fails for 0.5% of inputs. The fallback is to some slow decimal code that I wrote a long time ago. In practice this is fine. Grisu3 was basically state of the art when it was added here, I believe by @remyoudompheng.
Very recently a new paper came out that gets the fast path down to "all but 45 float64 values", which you can then handle with a table. See Andrysco, Jhala, and Lerner, "Printing Floating-Point Numbers: A Faster, Always Correct Method", POPL 2016.
If someone wants to implement that, great, but I don't think it's a particularly high priority.
@rsc it seems that by the authors' own benchmarks, Errol is 2.4x slower than Grisu3. Does that seem like a reasonable tradeoff for strconv? (I ask because I'm thinking of working on this but I anticipate pushback if the change makes the 95.5% case more than twice as slow. Another alternative would be to use Errol as the Grisu3 fallback but that seems like a lot of code for printing floats.)
FWIW I have a one-line change that makes the slow path about 35% faster on Dmitry's number:
name old time/op new time/op delta
FormatFloat/Slowpath64-4 68.6µs ± 0% 44.1µs ± 2% -35.71% (p=0.000 n=13+15)
It doesn't fix the problem but it's better than nothing I guess:
@cespare Having good performance for the majority of cases can be critical for some users; in my applications I could not tolerate a 2.4x slowdown (though with Go's code generation the ratio could turn out different). If I understand correctly, Errol can remove the need for big-number arithmetic entirely, so it could serve as the fallback to Grisu3 without so much code-size inflation. That would be quite useful: if 1 number out of 200 is 300 times slower to process, the impact is far from negligible.
@remyoudompheng Errol does not eliminate the need for big-number arithmetic entirely. For floating-point numbers in the range [2^54, 2^131) it falls back to exact integer arithmetic. The algorithm is actually much the same as previous conversion algorithms, except that outside that range it calculates with double-double (2x float64) arithmetic. The double-double has enough precision except for a few enumerated cases that are looked up in a table.

The compute_digits algorithm in the paper for the shortest output produces the largest correctly rounded decimal in the rounding range, so it would need to be modified to produce the closest correctly rounded decimal. Also, note that the paper does not contain any error analysis for float32. I think these issues indicate that Errol is not ready for inclusion in the standard library.
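For context on the double-double (2x float64) arithmetic mentioned above: its basic building block is an exact sum of two float64 values (Knuth's TwoSum), where the rounding error of the sum is recovered exactly in a second float64. A minimal sketch, not taken from Errol's code:

```go
package main

import "fmt"

// twoSum returns (s, e) with s = fl(a+b) and s+e == a+b exactly
// (Knuth's TwoSum). A pair (s, e) is one "double-double" value,
// carrying roughly twice the precision of a single float64.
func twoSum(a, b float64) (s, e float64) {
	s = a + b
	bv := s - a
	av := s - bv
	e = (a - av) + (b - bv)
	return s, e
}

func main() {
	// 1e16 + 1 is not representable in float64; the rounded sum
	// absorbs the big term and e captures the lost 1.0 exactly.
	s, e := twoSum(1e16, 1.0)
	fmt.Println(s, e)
}
```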
Figure 13(c) in the paper shows the fallback performance they are benchmarking Errol against, and it shows roughly linear dependence on the floating-point exponent. Benchmarking the Go fallback, which I think uses the same Dragon4 algorithm, shows roughly quadratic dependence on the exponent. So the current slow path should be able to speed up quite a bit.
Ryū: fast float-to-string conversion, presentation, github
Just came out and is simpler and faster than Grisu3 (~3x in the paper's benchmarks). It requires only integer operations, although for float64 it uses 128-bit integers, which should be made more efficient by #24813. It also requires a few small lookup tables:
| Type    | B0  | B1  | #Entries | Total Memory |
|---------|-----|-----|----------|--------------|
| Float32 | 60  | 63  | 78       | 624 Byte     |
| Float64 | 124 | 124 | 617      | 9,872 Byte   |
Since it is both simpler and faster, I would recommend investigating the ryu algorithm as the replacement for grisu3 instead of errol.
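Regarding the 128-bit integer operations: Go's math/bits package (the target of the #24813 optimizations) provides the 64x64->128 multiply that a Ryū implementation's table-multiply step builds on. A hedged sketch; `mulShift64` is an illustrative helper, not Ryū's actual code:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulShift64 computes (m * mul) >> shift using the full 128-bit
// product, for 0 < shift < 64. This is the kind of operation Ryū
// performs against its precomputed power-of-five tables.
func mulShift64(m, mul uint64, shift uint) uint64 {
	hi, lo := bits.Mul64(m, mul)
	return hi<<(64-shift) | lo>>shift
}

func main() {
	// (3 * 5) >> 2 == 15 >> 2 == 3, computed via the 128-bit path.
	fmt.Println(mulShift64(3, 5, 2))
}
```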
I have implemented Ryu in Go at https://github.com/cespare/ryu.
On my machine it is significantly faster than strconv for all the inputs I've tried.
name old time/op new time/op delta
FormatFloat32-12 128ns ± 1% 50ns ± 2% -60.82% (p=0.000 n=7+8)
FormatFloat64-12 129ns ± 4% 65ns ± 5% -49.54% (p=0.000 n=7+8)
AppendFloat32/0e+00-12 24.4ns ± 1% 3.0ns ± 1% -87.88% (p=0.000 n=8+8)
AppendFloat32/1e+00-12 26.5ns ± 1% 13.2ns ± 3% -49.98% (p=0.000 n=8+8)
AppendFloat32/3e-01-12 52.2ns ± 1% 32.5ns ± 2% -37.73% (p=0.000 n=8+8)
AppendFloat32/1e+06-12 41.2ns ± 1% 17.9ns ± 1% -56.45% (p=0.000 n=8+7)
AppendFloat32/-1.2345e+02-12 83.3ns ± 2% 34.2ns ± 1% -58.90% (p=0.000 n=8+8)
AppendFloat64/0e+00-12 24.5ns ± 2% 3.3ns ± 2% -86.50% (p=0.000 n=8+8)
AppendFloat64/1e+00-12 26.9ns ± 1% 14.5ns ± 1% -46.06% (p=0.001 n=8+6)
AppendFloat64/3e-01-12 53.0ns ± 1% 42.5ns ± 0% -19.75% (p=0.001 n=8+6)
AppendFloat64/1e+06-12 41.4ns ± 1% 21.1ns ± 1% -49.05% (p=0.000 n=8+8)
AppendFloat64/-1.2345e+02-12 83.8ns ± 1% 43.3ns ± 1% -48.32% (p=0.000 n=8+8)
AppendFloat64/6.226662346353213e-309-12 25.5µs ± 1% 0.0µs ± 1% -99.84% (p=0.000 n=8+8)
(That last one is @dvyukov's pathological case.)
While some inputs are faster than others, there aren't any drastically slower fallback paths.
ryu_test.go:279: after sampling 50000 float64s:
ryu: min = 2ns max = 90ns median = 41ns mean = 41ns
strconv (stdlib): min = 8ns max = 25845ns median = 106ns mean = 154ns
It's true that Ryu requires some lookup tables. However, in his C implementation @ulfjack (the Ryu author) has a size-optimized version that uses much smaller lookup tables in exchange for a little more CPU cost. I've implemented that as well and it's still faster than the current strconv implementation in all the cases I've looked at:
name old time/op new time/op delta
FormatFloat32-12 129ns ± 2% 49ns ± 1% -61.72% (p=0.000 n=8+8)
FormatFloat64-12 130ns ± 3% 72ns ± 5% -44.32% (p=0.000 n=7+8)
AppendFloat32/0e+00-12 24.5ns ± 2% 3.0ns ± 1% -87.83% (p=0.000 n=8+8)
AppendFloat32/1e+00-12 26.4ns ± 1% 13.1ns ± 1% -50.26% (p=0.000 n=7+8)
AppendFloat32/3e-01-12 52.6ns ± 2% 32.4ns ± 1% -38.43% (p=0.000 n=8+8)
AppendFloat32/1e+06-12 41.3ns ± 2% 17.6ns ± 1% -57.51% (p=0.000 n=8+8)
AppendFloat32/-1.2345e+02-12 83.5ns ± 1% 34.4ns ± 1% -58.82% (p=0.000 n=8+8)
AppendFloat64/0e+00-12 24.6ns ± 2% 3.3ns ± 1% -86.63% (p=0.000 n=8+8)
AppendFloat64/1e+00-12 26.7ns ± 1% 14.6ns ± 4% -45.51% (p=0.000 n=8+8)
AppendFloat64/3e-01-12 52.7ns ± 1% 50.0ns ± 1% -5.17% (p=0.000 n=8+8)
AppendFloat64/1e+06-12 41.2ns ± 1% 21.1ns ± 2% -48.61% (p=0.000 n=7+8)
AppendFloat64/-1.2345e+02-12 83.7ns ± 1% 50.9ns ± 1% -39.17% (p=0.000 n=8+8)
AppendFloat64/6.226662346353213e-309-12 25.8µs ± 2% 0.0µs ± 1% -99.81% (p=0.000 n=8+8)
Based on these promising results and the comments by @bmkessler above, it seems like switching strconv to use Ryu is the best option available to us. I've taken the liberty of retitling the issue accordingly. I intend to work on this during the Go 1.13 cycle.
I'm also in favour of switching. I have a version of my own in the works, but I'm taking a different path by making it look like the old Grisu code. I'll try to publish it somewhere.
I have pushed my own things in the following branch https://github.com/remyoudompheng/go/tree/ryu
A few notes:
I have also prepared a few denormals that are very hard to parse when printed in their shortest form (they are also hard for AppendFloat). The last one is "kind of" pathological, but since it has few digits the existing code handles it fine.
BenchmarkAtof/1.68514038588815e-309-4                 50000    18384 ns/op
BenchmarkAtof/9.11691642378e-312-4                    50000    19854 ns/op
BenchmarkAtof/1.62420278e-315-4                    10000000     71.4 ns/op
BenchmarkAppendFloatHard/341076211242912p-1074-4      50000    32519 ns/op
BenchmarkAppendFloatHard/1845284427387p-1074-4        50000    32516 ns/op
BenchmarkAppendFloatHard/328742302p-1074-4         20000000      103 ns/op
The corresponding numbers are:
9.11691642378e-312 = 0x1ada385d67b.7fffffff5d9...p-1074
1.62420278e-315    = 0x1398359e.7fffe022p-1074
Hello, I have pushed to my branch preliminary support for ParseFloat (commit https://github.com/golang/go/commit/03509643cb1a9d207fb790e30006a1d8d03d7efb). When feeding in the shortest representations of float64s, only 93% of them can be processed, because I still have to add support for (and prove the correctness of) 17-digit strings.
Benchmarks of the computational part (converting a uint64 mantissa and power-of-10 exponent to float64). "Old" is the old code, which tries a simple float64 multiply, then Grisu, then big integers. "New" tries only Ryu; notice the extremely consistent running time.
benchmark                                old ns/op  new ns/op     delta
BenchmarkAtof/33909e0-4                       5.64       14.9   +164.18%
BenchmarkAtof/3397784e-4-4                    6.58       20.3   +208.51%
BenchmarkAtof/509e73-4                        36.9       18.0    -51.22%
BenchmarkAtof/6226662346353213e-324-4         39.8       18.5    -53.52%
BenchmarkAtof/6808957268280643e116-4          5684       18.0    -99.68%
BenchmarkAtof/4334126125515466e-225-4         9829       18.0    -99.82%
BenchmarkAtof/168514038588815e-323-4         18157       18.6    -99.90%
BenchmarkAtof/911691642378e-323-4            19622       18.6    -99.91%
BenchmarkAtof/162420278e-323-4                40.0       18.5    -53.75%
BenchmarkAtof/22250738585072011e-324-4        40.0       18.6    -53.50%
Benchmark of ParseFloat: in this benchmark the Grisu parsing is replaced by the Ryu routine (keeping the "float64 multiply" fast path). This implementation does not try long mantissas (so it falls back to multi-precision arithmetic). You can notice the high cost of boilerplate compared to the pure computation.
benchmark                                      old ns/op  new ns/op     delta
BenchmarkAtof/33909-4                               24.1       24.0     -0.41%
BenchmarkAtof/339.7784-4                            29.1       29.1     +0.00%
BenchmarkAtof/-5.09e75-4                            63.4       46.2    -27.13%
BenchmarkAtof/123456789123456789123456789-4         98.8        999   +911.13%
BenchmarkAtof/622666234635.3213e-320-4              82.8       64.1    -22.58%
BenchmarkAtof/33909#01-4                            23.5       23.6     +0.43%
BenchmarkAtof/339.778-4                             26.5       26.4     -0.38%
BenchmarkAtof/12.3456e32-4                          66.7       66.2     -0.75%
BenchmarkAtof/100000000000000016777215-4             844        786     -6.87%
BenchmarkAtof/100000000000000016777216-4             693        643     -7.22%
BenchmarkAtof/6808957268280643e116-4                5836       65.5    -98.88%
BenchmarkAtof/4.334126125515466e-210-4             10045       72.3    -99.28%
BenchmarkAtof/1.68514038588815e-309-4              18462       64.8    -99.65%
BenchmarkAtof/9.11691642378e-312-4                 19818       58.8    -99.70%
BenchmarkAtof/1.62420278e-315-4                     71.1       54.4    -23.49%
BenchmarkAtof/2.2250738585072011e-308-4             84.2       67.1    -20.31%
Nice. Great observation, @remyoudompheng. I haven't published my parsing code yet; really need to finish that. I believe it can be made fast regardless of input string length, i.e., scale linearly.
As of commit https://github.com/remyoudompheng/go/commit/ed351765ae, the Atof implementation is now extended to work whenever the decimal digits form a number that fits in a uint64. It can also handle unambiguous, very long inputs by simply converting the rounded-up mantissa when that yields the same result. This allows it to handle more inputs, including a few exceptional corner cases that I have added to the tests (all rejected by the "reverse Grisu3"):
// Halfway is 500016682268521616.00000000000001e229
{"500016682268521616e229", "5.000166822685216e+246", nil}, // 18 digits necessary
// Halfway is 1873795671212201760.9999999999999998e108
{"1873795671212201761e108", "1.873795671212202e+126", nil}, // 19 digits (61 bits) necessary
// Halfway is 10027399025072458413.99999999999998e140
{"10027399025072458414e140", "1.002739902507246e+159", nil}, // 20 digits (64 bits) necessary
@ulfjack I am interested in knowing whether the "TestRyuNoCarry" is exactly the same proof as in your paper or not. The implementation is now feature complete compared to what Grisu3 was used for (shortest formatting, fixed formatting, parsing).
Final benchmarks:
BenchmarkAtof/33909-4                               23.2       23.0     -0.86%
BenchmarkAtof/339.7784-4                            29.0       28.7     -1.03%
BenchmarkAtof/-5.09e75-4                            63.8       45.2    -29.15%
BenchmarkAtof/18446744073709551608-4                98.0       68.8    -29.80%
BenchmarkAtof/123456789123456789123456789-4         90.7       82.1     -9.48%
BenchmarkAtof/622666234635.3213e-320-4              80.0       59.1    -26.12%
BenchmarkAtof/33909#01-4                            22.9       22.2     -3.06%
BenchmarkAtof/339.778-4                             26.7       26.8     +0.37%
BenchmarkAtof/12.3456e32-4                          66.2       61.4     -7.25%
BenchmarkAtof/2.3399415873862403e69-4               2836       59.0    -97.92%
BenchmarkAtof/500016682268521616e229-4             19610       67.7    -99.65%
BenchmarkAtof/1873795671212201761e108-4             6383       76.2    -98.81%
BenchmarkAtof/10027399025072458414e140-4            7728       79.2    -98.98%
BenchmarkAtof/100000000000000016777215-4             889        833     -6.30%
BenchmarkAtof/100000000000000016777216-4             768        721     -6.12%
BenchmarkAtof/6808957268280643e116-4                5828       57.7    -99.01%
BenchmarkAtof/4.334126125515466e-210-4              9924       59.6    -99.40%
BenchmarkAtof/1.68514038588815e-309-4              18428       58.8    -99.68%
BenchmarkAtof/9.11691642378e-312-4                 19888       55.0    -99.72%
BenchmarkAtof/1.62420278e-315-4                     73.7       54.3    -26.32%
BenchmarkAtof/2.2250738585072011e-308-4             82.7       61.1    -26.12%
@remyoudompheng I'm afraid I don't understand your question. The proof in the paper only covers binary to decimal conversion, not the other way round. I have since written down an extended proof that shows that the same concepts apply to all source and target bases with minor changes for certain base pairs. I believe that my implementation closely follows the proof.
@ulfjack I meant that this unit test (https://github.com/remyoudompheng/go/blob/ed351765ae8a8307e4df08a24511ec058b7f7ccc/src/strconv/extfloat2_test.go#L478) aims at embedding a proof of what is paragraph 3.2.3 of your paper, to make the code self-contained. I believe it is essentially equivalent.
@remyoudompheng and I discussed this and he's going to work on creating the CLs for his implementation in strconv. I may help review the changes. I'm changing the assignment accordingly.
Status update:
Shortest formatting is not yet cleaned up.
Shortest formatting is now included and I added a fix to avoid all long divisions on 32-bit platforms (especially arm where performance was awful). ARM now gets performance gains as well.
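For the curious: the standard way to avoid long divisions is to replace division by a known constant such as 10 with a multiply-and-shift against a precomputed reciprocal, which on 32-bit targets avoids calling the slow 64-bit division runtime routine. A sketch of the general trick (`div10` is an illustrative helper, not the code in the CL):

```go
package main

import "fmt"

// div10 computes x/10 for any uint32 without a divide instruction,
// using the classic magic-number reciprocal 0xCCCCCCCD (about
// 2^35/10): (x * 0xCCCCCCCD) >> 35 == x/10 for all uint32 x.
func div10(x uint32) uint32 {
	return uint32(uint64(x) * 0xCCCCCCCD >> 35)
}

func main() {
	fmt.Println(div10(4294967295), div10(99), div10(7))
}
```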
Change https://golang.org/cl/170078 mentions this issue: strconv: implement Ryū-like algorithm for atof
Change https://golang.org/cl/170079 mentions this issue: strconv: implement Ryū-like algorithm for fixed precision ftoa
Change https://golang.org/cl/170080 mentions this issue: strconv: implement Ryū algorithm for ftoa shortest mode
Reading @rsc's comment on CL 170078, it looks like these changes were intended to be reviewed early-in-cycle for 1.15. @remyoudompheng is this still the case?
go version devel +15f2d0e Fri May 13 01:19:05 2016 +0000 linux/amd64