haskell-streaming / streaming-bytestring

effectful sequences of bytes; an alternative no-lazy-io implementation of Data.ByteString.Lazy
BSD 3-Clause "New" or "Revised" License
16 stars 11 forks source link

Improved `readInt` implementation #29

Closed vdukhovni closed 3 years ago

vdukhovni commented 3 years ago

REVIEW topics:

If someone wants to bikeshed a better name for the Chunker type, I am not particularly fixed on the current name, though it seems OK to me. Or we could drop the type alias and use the full expansion in the signature of readInt', but that could make the haddock difficult to fit horizontally on the screen.

Note that finiteBitSize requires base 4.7, i.e. at least GHC 7.8. The CI does not test anything older than 7.10, so I assume that 7.6 is no longer supported. Otherwise a work-around is needed for older versions.

The LambdaCase and MultiWayIf pragmas are also used, these require at least GHC 7.6.

Cc: @bodigrim, @chessai

fosskers commented 3 years ago

The lowest GHC we support is 7.10. Thanks also for adding all those test cases!

vdukhovni commented 3 years ago

Please don't merge this, it does not take effects into account correctly. I'll rework it to address the issue.

[ Actually, perhaps not a problem, I was thrown off by thinking about how overflow handling might work, but since we're not doing overflow detection, I think this is OK, but just in case, I'll review again... ]

vdukhovni commented 3 years ago

I pushed a new commit that handles cases where the leading + or - is at a chunk boundary, and what follows is not an unsigned integer. This needs to return Nothing not 0, but also return a leftover string with the + or - sign as a short chunk in front of the chunk with the non-integer payload. Since it is not too difficult (but quite wrong) to accidentally return just the original input stream (one of whose effects is already performed), I enhanced the tests to verify that we're not performing too few or too many effects by the time the entire stream has been processed.

Also added tests to check some more "+"/"-" at chunk boundary cases.

vdukhovni commented 3 years ago

I am prototyping (no PR yet), a new implementation that restores (more correct) overflow protection:

The performance is only slightly worse than this PR, for 100M space-separated Ints that are 1..100M in random order, the time to read and add them all up changes from ~4.84s to 4.88s on my (somewhat dated) Intel CPU.

By introducing the "tail trimming" variant of readInt, I'm inclined to drop the generic "Chunker" interface in this PR as being too fancy. In practice users will likely want just the verbatim remainder, or the trimmed version. The only thing they won't get that way is the ability to control the safety mechanism, by e.g. choosing a different limit on the number of spaces consumed, or setting no limit at all. [ Edit: I'll look into whether good performance is preserved if the whitespace skip is instead performed before reading each integer, rather than after, this is more friendly for users who intermix reading integers with other consumers of stream data, but if this is done, the whitespace will be dropped unconditionally, even when ultimately returning Nothing. ]

So before I open a "competing" PR, I'd like some feedback on what folks think about this version versus the description above?

I should also mention that while the code could In principle handle other bounded integral types, the approach can only work for decimal (or octal) inputs, with hex the input can use mixed case, and I can no longer reconstruct the input from the partly accumulated digits, which may have originated from separate chunks, making the code more complex and likely much slower.

Internally the accumulator is a Word value, and all the digits are combined unsigned into the word, what's variable is the bounds at which I detect overflow. So handling Int32 or Word32, ... would just be a matter of tweaking the bounds, but I am not inclined to address that at this time. Support for the other types was not previously available. We can look into that later.

One of the complications of handling the various (Bounded) integral types is that, unlike C, Haskell has no uintmax_t equivalent. The closest we've got is the Word64 and Int64 types, and there's some complex logic around how these have been bolted on in 32-bit vs. 64-bit systems. It seems that's starting to change, IIRC a recent PRs was introducing Word64# primitives unconditionally.

Anyway, more could be done, but this is what I have now... Time for feedback...

CC: @chessai, @hsyl20, @bodigrim, @sjakobi, @cartazio

Bodigrim commented 3 years ago

Internally the accumulator is a Word value, and all the digits are combined unsigned into the word, what's variable is the bounds at which I detect overflow.

BTW there is timesWord2, which may be handy to detect overflows.

vdukhovni commented 3 years ago

Internally the accumulator is a Word value, and all the digits are combined unsigned into the word, what's variable is the bounds at which I detect overflow.

BTW there is timesWord2, which may be handy to detect overflows.

Yes, I was aware of it, but since I have to a comparison either way (before multiplication or after), I went with the simpler "before" option. If the accumulator is below 1/10 of the maximum representable value, I keep going, if above I fail. And if equal, I check the input digit to see whether it is below or above the residue of the maximum value mod 10. This runs quite efficiently, and for most numbers only one test is needed, which I think is no more expensive than using and testing the result of timesWord2#.

My experience with using numeric primops is that it is easy (and somewhat counter-intuitive) to actually get worse performance with them, because GHC seems to do fewer optimisations on code with primops, or because it is able to figure out a better set of primops to apply than I can, but many times when I thought I could write better code directly (avoiding allocations, ...) I found that GHC typically already used unlifted values wherever possible most of the time, and seems to have produced better code in most cases. :-)

There are a few cases in which I was able to get better code with priomops, but without knowing more about internals than I do, I struggle to intuit which are the cases where it is likely to yield good results.

But while, I have your attention, direct feedback on the substance of this PR, and proposed alternative would be very much appreciated. The tangential comments are of course also welcome! :-)