Open erantapaa opened 10 years ago
Yeah. I have a complete rewrite of the bytestring IO code that's mostly done. The new code respects the newline mode. But since it is slightly different behaviour I'll probably include the IO actions in a new module (and eventually deprecate the existing ones).
What's the status on this?
hGetLine
from System.IO
handles CR (http://hackage.haskell.org/package/base-4.10.0.0/docs/src/GHC.IO.Handle.Text.html#hGetLine); this discrepancy seems bad. Would it be acceptable to change the behaviour?
Any updates?
Apparently there's quite a bit of demand for a fix.
@dcoutts What happened to the IO code rewrite you mentioned? It would be interesting to see how you solved this issue!
since it is slightly different behaviour I'll probably include the IO actions in a new module (and eventually deprecate the existing ones).
Yeah. I suspect that simply changing the behaviour of getLine
and hGetLine
could possibly cause some subtle bugs in existing code.
So how about adding variants like getLineCrossPlatform
and hGetLineCrossPlatform
that respect the handle's newline settings?
For reference, here's the relevant code:
I think that we should bring Data.ByteString.hGetLine
in sync with System.IO.hGetLine
. Ignoring haInputNL
is clearly a grave bug, and I cannot imagine anyone making sensible use of the current behaviour. So I'm in favor of bug fixing hGetLine
instead of introducing a new function, but anyways a PR is very welcome.
I'd like to give this a shot, although it'll be my first time writing with low-level Haskell IO.
I intend to make Data.ByteString.hGetLine
's handling of newlines consistent with System.IO.hGetLine
.
So that
BS.getLine = fmap BS.pack getLine
holds.
Referencing issue #249, I'm thinking that it makes sense to put the fixed version of hGetLine
into Data.ByteString.Char8.hGetLine
.
Then, legacy code won't break (i.e. in case they were compensating for the extra \r
s on their own, we don't want them to start chopping off the last meaningful character on Windows) and we can have the nice behaving Data.ByteString.Char8.hGetLine
that we wanted in the first-place.
The depreciation of ByteString.hGetLine
can then be for both reasons, the strange Windows behavior and the ASCII encoding assumption maybe with a message like.
{-# DEPRECATED getLine
"Use Data.ByteString.Char8.getLine instead. (Data.ByteString.getLine does not correctly handle \r\n on Windows; Functions that rely on ASCII encodings belong in Data.ByteString.Char8)"
#-}
I intend to make
Data.ByteString.hGetLine
's handling of newlines consistent withSystem.IO.hGetLine
.
Agreed, feel free to go ahead.
Then, legacy code won't break...
The thing is that existing Data.ByteString.hGetLine
is re-exported from Data.ByteString.Char8
as well. So one cannot confidently guarantee that legacy code, compenstating for \r
on its own, does not break.
We can bikeshed whether this constitutes a bug fix or a breaking change later on.
Ah, that's unfortunate. I overlooked that re-export.
I've implemented this along with some tests to verify it works but I still have 2 concerns with this change.
I haven't think of a good way to test the property that
BS.hGetLine = BS.pack . System.IO.hGetLine -- when all the data is ascii
because there is an IO
in the way and I haven't thought of an efficient way to make handles, other than making files and verifying that they are the same by reading it both ways.
As of now, I just test the property over the same file, ending in 4 different ways, with all 4 choices of NewlineMode
.
I currently have no benchmarks for BS.hGetLine
I'm concerned that this implementation may cause performance regressions but I don't want to start optimizing until I have some measurements to work off of.
Also, an interesting quirk to notice is that System.IO.hGetLine
treats both, \n
and \r\n
as line endings when haInputNL=CRLF
.
I'm not sure if that is "correct windows behavior" or not, specifically I recall old versions of notepad mushing all lines onto one line when separated by \n
s.
IO
in QuickCheck properties, see Test.QuickCheck.ioProperty
Also, an interesting quirk to notice is that
System.IO.hGetLine
treats both,\n
and\r\n
as line endings whenhaInputNL=CRLF
.
Interesting. I'm in favor of mirroring this behaviour.
Also, an interesting quirk to notice is that
System.IO.hGetLine
treats both,\n
and\r\n
as line endings whenhaInputNL=CRLF
.Interesting. I'm in favor of mirroring this behaviour.
:+1:
I have written the property test along with a newtype LinedASCII
that generates ASCII with a lot of newlines.
This was needed because one of the sequences that trips up badly behaved implementations, "\r\r\n"
only has a 1/2^(21) ~ 1 in 2 million chance of occurring in a 3-char substring.
This meant that one had to run a few million tests to reliably catch bugs which is too slow, but the newtype generates a newline 50% of the time, so "\r\r\n"
occurs 1 out of 8 3-char substrings, making the test fairly quick and reliable.
Of course, the chance of getting an ~80char long line is microscopic, so there's still room for improvement.
@dbramucci nice!
@dbramucci how is it going? Do not hesitate to create a draft PR, it could facilitate further discussions.
I just did a quick experiment and it turns out that System.IO.hGetLine
, (the old) ByteString.hGetLine
and Data.Text.IO.hGetLine
all have different behavior in CRLF
mode.
Data.Text.IO.hGetLine
takes the string "\ra"
and interprets this as the line "\na"
System.IO.hGetLine
and the old version of ByteString.hGetLine
both leave the \r
alone and give the line "\ra"
.
Notably, I believe this means that Data.Text.IO.hGetLine
is the only hGetLine
that can return a string containing \n
.
System.IO.hGetLine
takes the string "foo\r\nbar"
and interprets it as providing two lines "foo"
and "bar"
.
The old ByteString.hGetLine
gives "foo\r"
as the first line and "bar"
as the second line.
And while I don't know of any place that relies on this, I wouldn't be surprised to see people caught off-guard by the fact that "foo\nbar"
is two lines, even in CRLF
mode.
And on the note of inconsistencies: lines
and unlines
presume \n
is the line feed, so the law we might expect to hold
head . lines <$> hGetContents = hGetLine
cannot hold if the handle is in CRLF
input mode.
(At least this is somewhat obvious from the type signature of lines
).
Encounter it in codeforces
It would be nice if BS.getLine would use the handle's newline setting so that in
CRLF
mode strings are returned with both the CR (if present) and LF removed.An example of code that demonstrates the behavior:
http://stackoverflow.com/questions/22417171/bs-getline-and-crlf-endings