haskell / attoparsec

A fast Haskell library for parsing ByteStrings
http://hackage.haskell.org/package/attoparsec

takeTill acting weird #80

Closed (banacorn closed this issue 9 years ago)

banacorn commented 10 years ago
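{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, parseTest, takeTill)
import Data.Text (Text)

parser :: Parser Text
parser = takeTill (== 'a')

main :: IO ()
main = parseTest parser "𝟘a"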

The code should result in Done "a" "\120792", a clean cut. Instead, I get Done "\57304a" "\120792".

With the predicate negated, takeWhile also presents the same issue.
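For reference, the negated-predicate variant can be sketched as below. This is only an illustration, not the code from the gist: the helper names splitWith, viaTakeTill and viaTakeWhile are made up here, and parseOnly with takeText is used (instead of parseTest) so that the consumed prefix and the leftover input come back as a pair that can be compared directly.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Attoparsec.Text (Parser, parseOnly, takeText, takeTill, takeWhile)
import Data.Text (Text)
import Prelude hiding (takeWhile)

-- | Hypothetical helper: run a prefix parser, then capture whatever
-- input is left, so both halves of the split are visible.
splitWith :: Parser Text -> Text -> Either String (Text, Text)
splitWith p = parseOnly ((,) <$> p <*> takeText)

-- | The parser from the report: stop at the first 'a'.
viaTakeTill :: Text -> Either String (Text, Text)
viaTakeTill = splitWith (takeTill (== 'a'))

-- | The same split written with takeWhile and the negated predicate.
viaTakeWhile :: Text -> Either String (Text, Text)
viaTakeWhile = splitWith (takeWhile (/= 'a'))

main :: IO ()
main = do
  print (viaTakeTill "𝟘a")
  print (viaTakeWhile "𝟘a")
```

On a release that handles wide characters correctly, both calls are expected to split the input into "𝟘" and "a"; on the affected versions the first component should not be trusted.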

The issue can be reproduced with this gist. I'm using attoparsec-0.12.1.2 with text-1.2.0.0.

Thanks!

SeanRBurton commented 10 years ago

Note that '\57304' is the second element of the surrogate pair of '𝟘', which suggests that this bug is caused by advancing by 16 bits irrespective of the width of any particular character. I can reproduce this bug (and similar bugs in scan, peekChar, takeText, and takeLazyText) using any character that needs two UTF-16 code units, i.e. 32 bits, to represent (ord c >= 2^16).
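The surrogate arithmetic behind that observation can be checked with base alone. The sketch below is illustrative: utf16Units is a made-up name, not a function from text or attoparsec, and it only mirrors the UTF-16 encoding that text-1.2 used internally.

```haskell
module Main (main) where

import Data.Bits (shiftR, (.&.))
import Data.Char (ord)

-- | Hypothetical helper: the UTF-16 code units for a character.
-- Code points below 2^16 take one unit; anything above is split
-- into a (high, low) surrogate pair.
utf16Units :: Char -> [Int]
utf16Units c
  | n < 0x10000 = [n]
  | otherwise   = [0xD800 + (m `shiftR` 10), 0xDC00 + (m .&. 0x3FF)]
  where
    n = ord c
    m = n - 0x10000

main :: IO ()
main = do
  print (utf16Units 'a')        -- [97]
  print (utf16Units '\120792')  -- [55349,57304]
```

The second unit for '\120792' is 57304, which matches the stray '\57304' in the reported output: a parser that advances one 16-bit unit for this character stops between the two surrogates.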

basvandijk commented 9 years ago

> ...this bug is caused by advancing by 16 bits irrespective of the width of any particular character.

It looks like you're correct.

bos commented 9 years ago

Thanks for the helpful repro. I'll take a look at this as soon as I can.

hesselink commented 9 years ago

Any chance of a release with this fix? I just ran into this with an even simpler reproduction: takeText "💋".

banacorn commented 9 years ago

ha, is that a pair of lips?

bos commented 9 years ago

Released as 0.12.1.3.

hesselink commented 9 years ago

Thanks!