Improved `readInt` implementation

vdukhovni commented 3 years ago

Removes arbitrary 18 byte limit on decimal input.
- The former limit meant that a long decimal string could, on repeated calls, be incorrectly parsed as multiple ints.
- Also 18 bytes is not even enough for 64-bit minBound or maxBound.
Significantly speeds up readInt by avoiding "toStrict", and handling chunk boundaries efficiently.
Adds a readInt' function that supports transformation of the stream tail to prepare for reading more input, by e.g. dropping leading whitespace while we are still holding the raw chunk, avoiding the cost of burying it deep in a closure, only to later find it again and trim.
Reading ~100M space separated ints now takes ~5.5s vs. over 25s with the original implementation.
This does however mean that on an unbounded stream of digits the new readInt is not guaranteed to ever return. The documentation notes that users can elect to split the stream and only run readInt on some bounded initial segment.
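
To make the point about the 18 byte limit concrete, here is a minimal sketch using only `base`. The names `decimalWidth` and `widths` are made up for this illustration and are not part of the PR.

```haskell
import Data.Int (Int64)

-- Number of bytes needed to spell out a value in decimal, sign included.
-- (Hypothetical helper, for illustration only.)
decimalWidth :: Int64 -> Int
decimalWidth = length . show

-- Widths of the two extreme 64-bit values:
-- minBound is -9223372036854775808, maxBound is 9223372036854775807.
widths :: (Int, Int)
widths = (decimalWidth minBound, decimalWidth maxBound)
```

Both widths exceed 18, so neither bound could be read in full under the old limit.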

REVIEW topics:

If someone wants to bikeshed a better name for the Chunker type, I am not particularly fixed on the current name, though it seems OK to me. Or we could drop the type alias and use the full expansion in the signature of readInt', but that could make the haddock difficult to fit horizontally on the screen.

Note that finiteBitSize requires base 4.7, i.e. at least GHC 7.8. The CI does not test anything older than 7.10, so I assume that 7.6 is no longer supported. Otherwise a work-around is needed for older versions.

The LambdaCase and MultiWayIf pragmas are also used; these require at least GHC 7.6.

Cc: @bodigrim, @chessai

fosskers commented 3 years ago

The lowest GHC we support is 7.10. Thanks also for adding all those test cases!

vdukhovni commented 3 years ago

Please don't merge this, it does not take effects into account correctly. I'll rework it to address the issue.

[ Actually, perhaps not a problem, I was thrown off by thinking about how overflow handling might work, but since we're not doing overflow detection, I think this is OK, but just in case, I'll review again... ]

vdukhovni commented 3 years ago

I pushed a new commit that handles cases where the leading + or - is at a chunk boundary, and what follows is not an unsigned integer. This needs to return Nothing not 0, but also return a leftover string with the + or - sign as a short chunk in front of the chunk with the non-integer payload. Since it is not too difficult (but quite wrong) to accidentally return just the original input stream (one of whose effects is already performed), I enhanced the tests to verify that we're not performing too few or too many effects by the time the entire stream has been processed.

Also added tests to check some more "+"/"-" at chunk boundary cases.
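
The sign-at-chunk-boundary case can be sketched with a pure model. This is not the PR's code: it assumes a stream is just a list of `String` chunks (so there are no effects to get wrong), does no overflow detection, and the names `Chunks`, `unsigned` and `readIntModel` are hypothetical.

```haskell
import Data.Char (digitToInt, isDigit)

-- Hypothetical model: a chunked stream is a plain list of String chunks.
type Chunks = [String]

-- Consume leading ASCII digits across chunk boundaries.
-- Returns the accumulated value, the number of digits seen, and the leftover chunks.
unsigned :: Chunks -> (Int, Int, Chunks)
unsigned = go 0 0
  where
    go acc n [] = (acc, n, [])
    go acc n ([] : rest) = go acc n rest          -- skip exhausted chunks
    go acc n (chunk@(c : cs) : rest)
      | isDigit c = go (acc * 10 + digitToInt c) (n + 1) (cs : rest)
      | otherwise = (acc, n, chunk : rest)

-- Read an optionally signed decimal Int from the front of the chunks.
-- If a sign is not followed by any digit, return Nothing and put the sign
-- back as a short chunk in front of whatever follows it.
readIntModel :: Chunks -> (Maybe Int, Chunks)
readIntModel chunks = case dropWhile null chunks of
  (c : cs) : rest
    | c == '+' || c == '-' ->
        case unsigned (cs : rest) of
          (_, 0, leftover) -> (Nothing, [c] : leftover)
          (v, _, leftover) -> (Just (if c == '-' then negate v else v), leftover)
  other -> case unsigned other of
    (_, 0, leftover) -> (Nothing, leftover)
    (v, _, leftover) -> (Just v, leftover)
```

In this model, `readIntModel ["-", "abc"]` yields `Nothing` together with the chunks `["-", "abc"]`, rather than `Just 0` or a leftover that has lost the sign.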

vdukhovni commented 3 years ago

I am prototyping (no PR yet) a new implementation that restores (more correct) overflow protection:

- All valid inputs are correctly converted, even those with 19 digits (or sign + 19 digits), supporting the full range from minBound to maxBound.
- All inputs that would overflow return Nothing and the invalid input as the tail of the string. Internally, that input is reconstructed from the accumulated partial number (the last value prior to the would-be overflow) and the number of digits consumed, by generating the corresponding decimal string with enough leading zeros and the original explicit sign (if any). This frees us from having to copy the input.
- To avoid being stuck forever reading infinite streams of zeros, a safety mechanism gives up after reading ~32k leading zeros; at some point the input is no longer valid, but rather an attack. The logic is to not go on to the next chunk if we've already accumulated that many bytes. However, if the chunk we already have in memory is for some reason already large, then I'm willing to parse it through to the end.
- Accompanying this, I will add a variant that automatically skips trailing whitespace after parsing each number, which makes reading streams of numbers separated by whitespace much more efficient. This too has a safety limit: after ~32k of trailing whitespace, no more whitespace is skipped, again to avoid being stuck forever on hostile inputs when reading from a network or other unbounded source.
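
The reconstruction of the consumed input described above can be sketched roughly as follows. This is an illustration under assumptions, not the prototype's code; `reconstruct` and its argument order are hypothetical.

```haskell
-- Rebuild the decimal text that was consumed before the would-be overflow:
-- the explicit sign (if any), then enough leading zeros, then the accumulated
-- value, so that the total number of digits matches what was read.
-- (Hypothetical helper, for illustration only.)
reconstruct :: Maybe Char -> Word -> Int -> String
reconstruct sign acc nDigits =
  maybe "" (: []) sign ++ replicate (nDigits - length body) '0' ++ body
  where
    -- With no digits consumed there is nothing to print, not even a "0".
    body = if nDigits == 0 then "" else show acc
```

For example, an accumulator of 42 reached after five digits behind a minus sign gives back `"-00042"`. This only works because each decimal digit has exactly one spelling, so the digit count and the value determine the original bytes.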

The performance is only slightly worse than this PR: for 100M space-separated Ints that are 1..100M in random order, the time to read and add them all up changes from ~4.84s to ~4.88s on my (somewhat dated) Intel CPU.

By introducing the "tail trimming" variant of readInt, I'm inclined to drop the generic "Chunker" interface in this PR as being too fancy. In practice users will likely want just the verbatim remainder, or the trimmed version. The only thing they won't get that way is the ability to control the safety mechanism, by e.g. choosing a different limit on the number of spaces consumed, or setting no limit at all. [ Edit: I'll look into whether good performance is preserved if the whitespace skip is instead performed before reading each integer, rather than after. This is more friendly for users who intermix reading integers with other consumers of stream data, but if this is done, the whitespace will be dropped unconditionally, even when ultimately returning Nothing. ]

So before I open a "competing" PR, I'd like some feedback on what folks think about this version versus the description above?

I should also mention that while the code could in principle handle other bounded integral types, the approach can only work for decimal (or octal) inputs. With hex, the input can use mixed case, and I can no longer reconstruct the input from the partly accumulated digits, which may have originated from separate chunks, making the code more complex and likely much slower.

Internally the accumulator is a Word value, and all the digits are combined unsigned into the word, what's variable is the bounds at which I detect overflow. So handling Int32 or Word32, ... would just be a matter of tweaking the bounds, but I am not inclined to address that at this time. Support for the other types was not previously available. We can look into that later.
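
As a rough illustration of what "tweaking the bounds" could look like, the sketch below computes, for a bounded integral type, the largest magnitude an unsigned `Word` accumulator may reach for a non-negative and for a negative result. The function `magnitudeBounds` is hypothetical and assumes the target type is no wider than `Word`.

```haskell
import Data.Int (Int32)
import Data.Word (Word32)

-- For the type of the proxy argument, return
-- (largest magnitude allowed without a minus sign,
--  largest magnitude allowed with a minus sign).
-- Assumes both magnitudes fit in a Word. (Hypothetical helper.)
magnitudeBounds :: (Bounded a, Integral a) => a -> (Word, Word)
magnitudeBounds proxy =
  ( fromInteger (toInteger (maxBound `asTypeOf` proxy))
  , fromInteger (negate (toInteger (minBound `asTypeOf` proxy)))
  )

int32Bounds, word32Bounds :: (Word, Word)
int32Bounds = magnitudeBounds (0 :: Int32)
word32Bounds = magnitudeBounds (0 :: Word32)
```

For a signed type the negative bound is one larger than the positive one, and for an unsigned type the negative bound is 0, so only "-0" style inputs would be accepted with a minus sign.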

One of the complications of handling the various (Bounded) integral types is that, unlike C, Haskell has no uintmax_t equivalent. The closest we've got is the Word64 and Int64 types, and there's some complex logic around how these have been bolted on in 32-bit vs. 64-bit systems. It seems that's starting to change; IIRC a recent PR was introducing Word64# primitives unconditionally.

Anyway, more could be done, but this is what I have now... Time for feedback...

CC: @chessai, @hsyl20, @bodigrim, @sjakobi, @cartazio

Bodigrim commented 3 years ago

> Internally the accumulator is a Word value, and all the digits are combined unsigned into the word, what's variable is the bounds at which I detect overflow.

BTW there is timesWord2, which may be handy to detect overflows.

vdukhovni commented 3 years ago

> > Internally the accumulator is a Word value, and all the digits are combined unsigned into the word, what's variable is the bounds at which I detect overflow.
>
> BTW there is timesWord2, which may be handy to detect overflows.

Yes, I was aware of it, but since I have to do a comparison either way (before multiplication or after), I went with the simpler "before" option. If the accumulator is below 1/10 of the maximum representable value, I keep going; if above, I fail. And if equal, I check the input digit to see whether it is below or above the residue of the maximum value mod 10. This runs quite efficiently, and for most numbers only one test is needed, which I think is no more expensive than using and testing the result of timesWord2#.
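
A minimal sketch of that "compare before multiplying" check, written from the description above rather than taken from the actual code. The names `step` and `parseBounded` are hypothetical, and `limit` stands for whichever bound applies to the target type and sign.

```haskell
import Data.Char (digitToInt, isDigit)

-- One accumulation step. The overflow test is done before multiplying,
-- by comparing the accumulator against limit `quot` 10.
step :: Word -> Word -> Word -> Maybe Word
step limit acc d
  | acc < q   = Just next   -- common case: a single comparison suffices
  | acc > q   = Nothing     -- acc * 10 alone already exceeds the limit
  | d <= r    = Just next   -- acc == q: the last digit decides
  | otherwise = Nothing
  where
    (q, r) = limit `quotRem` 10
    next = acc * 10 + d

-- Fold a string of ASCII digits into a Word no larger than the limit.
-- Any non-digit character or any overflow yields Nothing.
-- (In this sketch the empty string yields Just 0.)
parseBounded :: Word -> String -> Maybe Word
parseBounded limit = go 0
  where
    go acc [] = Just acc
    go acc (c : cs)
      | isDigit c = step limit acc (fromIntegral (digitToInt c)) >>= \acc' -> go acc' cs
      | otherwise = Nothing
```

With a limit of 255, `parseBounded 255 "255"` succeeds while `parseBounded 255 "256"` fails on the final digit; only in that last, boundary step is the second comparison needed.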

My experience with using numeric primops is that it is easy (and somewhat counter-intuitive) to actually get worse performance with them, either because GHC seems to do fewer optimisations on code with primops, or because it is able to figure out a better set of primops to apply than I can. Many times when I thought I could write better code directly (avoiding allocations, ...) I found that GHC already used unlifted values wherever possible most of the time, and seems to have produced better code in most cases. :-)

There are a few cases in which I was able to get better code with primops, but without knowing more about internals than I do, I struggle to intuit which are the cases where it is likely to yield good results.

But while I have your attention, direct feedback on the substance of this PR and the proposed alternative would be very much appreciated. The tangential comments are of course also welcome! :-)

haskell-streaming / streaming-bytestring

Improved `readInt` implementation #29