Closed: Gabriella439 closed this issue 10 years ago.
Did you push this repair to github? I finally started trying to figure out what the bug was yesterday, with tests like these: https://gist.github.com/michaelt/7207991 Rather than returning immediately when it comes upon a newline at the end of the file, it makes an empty producer that returns. There is a different problem with words.
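For concreteness, a minimal sketch of the symptom (not the gist above), assuming the FreeT-returning lines of that era and the PP.runFreeT/next conventions used elsewhere in this thread; countLines is a made-up helper that counts line groups, draining each one:

import Pipes
import qualified Pipes.Parse as PP
import qualified Pipes.ByteString as PB
import Data.ByteString (ByteString)

-- countLines (yield "one\ntwo\n") should give 2; with the bug described
-- above it gives 3, because a trailing empty producer is emitted.
countLines :: Monad m => Producer ByteString m r -> m Int
countLines p0 = go 0 (PB.lines p0)
  where
    go n f = do
        x <- PP.runFreeT f
        case x of
            PP.Pure _ -> return n
            PP.Free p -> do
                f' <- runEffect (for p (\_ -> return ()))  -- drain this line's chunks
                go (n + 1) f'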
Oops! I forgot to upload it. I will do that soon.
Also, what is the problem with words?
unwords . words acts more like map (\c -> if isSpace c then ' ' else c):
["x x"]
"x x" -- Prelude unwords.words
"x x" -- BC8 unwords.words
"x x" -- Pipes.Bytestring unwords.words
"x x" -- Text unwords.words
"x x" -- Pipes.Text unwords.words
["x x"]
"x x"
"x x"
"x x"
"x x"
"x x"
["x "," x"]
"x x"
"x x"
"x x"
"x x"
"x x"
["x ","x "]
"x x"
"x x"
"x x "
"x x"
"x x "
This gives the right results for unwords . words using my little tests. It could be simplified a little. It may be that starting from splitWith is not ideal?
words
    :: (Monad m) => Producer ByteString m r -> FreeT (Producer ByteString m) m r
words p0 = removeEmpty (splitWith isSpaceWord8 p0)
  where
    removeEmpty f = PP.FreeT $ do
        x <- PP.runFreeT f
        case x of
            PP.Pure r -> return (PP.Pure r)
            PP.Free p -> loop p
    loop p = do
        y <- next p
        case y of
            Left  f'        -> PP.runFreeT (removeEmpty f')
            Right (txt, p') ->
                if BS.null txt
                    then loop p'
                    else return $ PP.Free $ do
                        yield txt
                        f' <- p'
                        return (removeEmpty f')
{-# INLINABLE words #-}
Which yields
[" x"]
"x"
"x"
"x"
"x"
"x"
[" x"," "]
"x"
"x"
"x"
"x"
"x"
["x x"]
"x x"
"x x"
"x x"
"x x"
"x x"
["x x"]
"x x"
"x x"
"x x"
"x x"
"x x"
["x "," x"]
"x x"
"x x"
"x x"
"x x"
"x x"
["x ","x "]
"x x"
"x x"
"x x"
"x x"
"x x"
Yeah, I like your idea, because this very closely matches how words is implemented in Data.ByteString.Char8:
words = P.filter (not . B.null) . B.splitWith isSpaceWord8
I will go ahead and fix that, too (and upload both lines and words).
By the way, do you think that what is desired under a type like

decodeUtf8 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

is something that forwards as Text just as much of the stream as is intelligible as text, returning a Producer of ByteStrings that starts where the decoding left off? It occurs to me that this might not be as useful as it seems: if I know that a stream contains a mixture of text and binary information, I should have some sort of parser telling me where the text leaves off and the binary goo begins, since binary goo can look like anything you please.
In any event, the fancy continuations supplied by the streamDecodeUtf8 function in Data.Text.Encoding don't seem to be aimed at this. The idea, I think, is just to make things like this possible:
pipeDecodeUtf8 :: Monad m => Pipe ByteString Text m r
pipeDecodeUtf8 = go streamDecodeUtf8 where
    go decoder = do
        bs <- await
        case decoder bs of
            Some txt unread newdecoder -> do
                yield txt
                go newdecoder
which I just added, with a variant that takes a replacement function.
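That replacement-function variant could look roughly like this sketch, written against Data.Text.Encoding.streamDecodeUtf8With and lenientDecode (so malformed bytes become U+FFFD rather than raising); it is not necessarily the variant actually added:

import Pipes
import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (streamDecodeUtf8With, Decoding(..))
import Data.Text.Encoding.Error (lenientDecode)

-- Same shape as pipeDecodeUtf8 above, but invalid bytes are replaced by the
-- lenient handler instead of raising an exception from strict decoding.
pipeDecodeUtf8Lenient :: Monad m => Pipe ByteString Text m r
pipeDecodeUtf8Lenient = go (streamDecodeUtf8With lenientDecode)
  where
    go decoder = do
        bs <- await
        case decoder bs of
            Some txt _unread decoder' -> do
                yield txt
                go decoder'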
As it is implemented here so far, decodeUtf8 does not give the result described above. It stops forwarding text after the first chunk that ends in the middle of a character, which defeats the purpose of streamDecodeUtf8, whose point is to carry character fragments over. Otherwise it fails, since if streamDecodeUtf8 meets bytes that aren't the beginning or end of a character, it raises an exception. One could try to catch these in IO, which might be done comparatively sensibly; if one tries to do it purely, one needs to avoid the error with a replacement function, then somehow recover the information that a substitution was made, and so on -- i.e. each step would require some sort of comparison of the resulting text with the bytestring it came from, or something like that.
In any case, there would still be the problem of dividing a failed chunk into the 'good part', to be textified, and the rest, to be returned as the first yield of the ByteString producer -- i.e. the same problem in the small. So either way one would end up in the kind of complexities we see with byteSplits and splitSlowly here: http://hackage.haskell.org/package/conduit-1.0.8/docs/src/Data-Conduit-Text.html#decode
Or perhaps bos' streamDecodeUtf8 could do with some further complication.
So if you have preexisting information about where the text ends and the bytes begin, you would just use something like Pipes.ByteString.splitAt to partition the byte stream into those two halves:
bytestream
    :: Producer ByteString m r

splitAt 1024 bytestream
    :: Producer ByteString m (Producer ByteString m r)

decodeUtf8 (splitAt 1024 bytestream)
    :: Producer Text m (Producer ByteString m (Producer ByteString m r))
... where the first Producer ByteString would correspond to undecodable bytes before reaching the true byte stream, and the second Producer ByteString would correspond to what we originally intended to be the byte payload.
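To make the nested type concrete, a hedged sketch of consuming it in IO (splitAt as above; decodeUtf8 is the pipes-text function under discussion, and consumeSplit is a made-up name):

import Pipes
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.Text.IO as T
import qualified Pipes.ByteString as PB
import Pipes.Text (decodeUtf8)  -- module/export name assumed for the function under discussion

-- Print the decoded text from the first 1024 bytes, then dump whatever part
-- of that window failed to decode, then dump the remaining byte payload.
consumeSplit :: Producer ByteString IO () -> IO ()
consumeSplit bytestream = do
    afterText   <- runEffect $ for (decodeUtf8 (PB.splitAt 1024 bytestream)) (lift . T.putStr)
    afterWindow <- runEffect $ for afterText (lift . B.putStr)
    runEffect $ for afterWindow (lift . B.putStr)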
I want to briefly comment that perhaps the type of decodeUtf8 should be:

decodeUtf8 :: (Monad m) => Producer ByteString m r -> Producer Text m (Either (Producer ByteString m r) r)
... where it returns a Left value if it encountered bad bytes and a Right value if it successfully got to the end. This way the user would not need to test the byte residue to see whether there were bad bytes or not.
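Under that proposed type, a caller could branch on the return value along these lines (a sketch; printDecoded is a made-up name and decodeUtf8 is assumed to have the proposed type):

import Pipes
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.Text.IO as T

-- decodeUtf8 is assumed here to have the Either-returning type proposed above.
printDecoded :: Producer ByteString IO () -> IO ()
printDecoded p = do
    outcome <- runEffect $ for (decodeUtf8 p) (lift . T.putStr)
    case outcome of
        Right ()       -> putStrLn "decoded cleanly to the end of the stream"
        Left leftovers -> do
            putStrLn "hit undecodable bytes; the raw residue follows:"
            runEffect $ for leftovers (lift . B.putStr)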
streamDecodeUtf8 is actually correct and has all the machinery necessary to keep going if a chunk ends in the middle of a character. You can see the original discussion behind it here, but I'll briefly summarize it.

streamDecodeUtf8's continuation takes into account the incomplete character from the previous chunk (i.e. it remembers that it is in the middle of decoding a character). The reason it also returns a ByteString field is to handle the case where there are no more chunks. If you still have chunks left, you are supposed to ignore that field and just keep feeding those chunks to the continuation.
So in other words, if you have two ByteString chunks, bs1 and bs2, where bs1 contains a byte at the end that does not completely decode, then the Decoding will return the Text that it successfully decoded up until that last byte, and the continuation will already know that it is in the middle of parsing a character, so you can just pass bs2 directly to the continuation. The Decoding returned by that continuation will return a Text chunk whose byte representation includes the byte from the previous bs1 ByteString chunk.
Now, if bs2 ends with an incomplete character and you have no more chunks left to feed the Decoding, then you just take the ByteString it returns and make that the final undecoded residue.
However, the residue may be greater than a single ByteString chunk. This will happen if you have a decoding error (not an incomplete byte sequence, but rather an invalid UTF8 byte sequence, and these do exist). This can happen in the middle of any chunk (not just the last one), in which case you need to return all subsequent chunks as undecoded bytes.
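For concreteness, a minimal sketch of the two-chunk behaviour just described, using Data.Text.Encoding.streamDecodeUtf8 directly (the bytes are the UTF-8 encoding of '你', as in the test further down):

import qualified Data.ByteString as B
import Data.Text (Text)
import Data.Text.Encoding (streamDecodeUtf8, Decoding(..))

-- '你' is U+4F60, encoded as the three bytes E4 BD A0. Split 2+1 across
-- chunks: the first Decoding yields no Text yet, and its continuation
-- completes the character when fed the final byte.
twoChunkDemo :: (Text, Text)
twoChunkDemo =
    let (bs1, bs2)  = B.splitAt 2 (B.pack [0xE4, 0xBD, 0xA0])
        Some t1 _ k = streamDecodeUtf8 bs1   -- t1 == ""; the two pending bytes are carried by k
        Some t2 _ _ = k bs2                  -- t2 == "你"
    in  (t1, t2)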
hi,
having written the implementation of decodeUtf8: are you sure it is wrong? i am fairly sure text's streamDecodeUtf8 includes the remaining byte(s) only when there aren't any chunks anymore. my testing did not show any of the cases you describe, but i might have missed something (or plenty).
i guess i am asking: did you come to your conclusion by testing?
pipeDecodeUtf8 :: Monad m => Pipe ByteString Text m r
pipeDecodeUtf8 = go streamDecodeUtf8 where
    go decoder = do
        bs <- await
        case decoder bs of
            Some txt unread newdecoder -> do
                yield txt
                go newdecoder
i am sure you know, but i want to mention it: you can use
decodeUtf8 Pipes.ByteString.stdin >>= const (return ())
which (nearly) gives you the function above. maybe. i'm not sure whether i am (as usual) missing something.
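Spelled out as a sketch (with the module providing decodeUtf8 assumed), that is roughly:

import Pipes
import Data.Text (Text)
import qualified Pipes.ByteString as PB
import Pipes.Text (decodeUtf8)  -- module/export name assumed for the function under discussion

-- Discard the leftover-bytes return value so the composite is a plain
-- Producer of Text drawn from standard input.
textStdin :: Producer Text IO ()
textStdin = decodeUtf8 PB.stdin >>= const (return ())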
greetings, tobi
Yes, I understand this, I think. The ByteString field is not needed for the Pipe I defined; the continuation carries over the incomplete character information. However, the incomplete character information is contained in the bytestring field, so if we test for its being null we take too much interest in it, so to speak. (One could try to test for its being less than 3 bytes or whatever.) The trouble is the case of failure in the middle. Intuitively this should yield, in the most conservative case: any text recovered from an initial segment; the remaining undigested bytestring; and then, so to speak, Nothing for the continuation. Then our decodeUtf8 would forward the text, and return the bytestring with the remaining producer. But maybe there is a way of evading the errors streamDecodeUtf8With gives for strictDecode on bytestrings with non-character information.
What's wrong with the solution I proposed, to use a splitter? Pipes.ByteString.splitAt or Pipes.ByteString.span (or any other custom splitter) can partition the Producer ByteString into two halves: the first half is what decodeUtf8 should assume is all valid Text; the second half is non-character information. This requires no modification to streamDecodeUtf8 or decodeUtf8, and it's a nice separation of concerns.
Moreover, it allows decodeUtf8 to make stronger statements when parses fail: if the parse failed, it is because the input was definitely an invalid or incomplete byte sequence and not a non-character payload. If you change these functions to assume the possibility of non-textual data, then there is no way to opt out of that assumption when you know that the data is in fact supposed to be textual.
Yes, I thought I understood this before, but here is the result of breaking a' = encodeUtf8 $ T.pack $ "你好", i.e. "\228\189\160\229\165\189", into two bytestrings at each possible position and feeding the fragments successively to streamDecodeUtf8: https://gist.github.com/michaelt/7377887#file-output
I don't see the problem. It correctly decoded every character and didn't add or remove any spurious bytes.
Sorry, I'm still not making this clear. The important point is the 'leftover bytestring' element. Though this is a perfectly good pure text stream, decodeUtf8 will stop it after the second chunk of the twelve that are sent, since it inspects the bytestring field of Some txt bs f.
Oh, I see. decodeUtf8 has no way to tell whether there was a decoding error or not (i.e. an invalid UTF8 byte sequence). At least, it could, but it would require sophisticated logic that really belongs in streamDecodeUtf8.
One solution to this is to modify the Decoding type to include a constructor to indicate decoding failure, so that decodeUtf8 can know when it needs to continue, like this:
data Decoding = Some ... | FailedParse ByteString
However, this makes it impossible to resume from invalid byte sequences.
Another possibility is that in case of invalid UTF8 byte sequences you simply give the user the failed bytes and let them decide whether or not to continue:
data Decoding = Some ... | FailedParse Word32 (ByteString -> Decoding)
Oops, I forgot that one character might not be exactly 32 bits in UTF8, but you get the idea. Just replace Word32 with ByteString or some other representation for an invalid byte sequence.
decodeUtf8 has no way to tell whether there was a decoding error or not (i.e. an invalid UTF8 byte sequence). At least, it could, but it would require sophisticated logic that really belongs in streamDecodeUtf8.
Right, this is what I was trying to say, in a maybe too opaque way, at the outset. To recover what should (intuitively) be in streamDecodeUtf8, as things stand, would require something like the conduit-text business. That's why I said that, as it is, it is really only suited to making a naive Pipe ByteString Text m r, or so it was seeming.
My initial remark was somewhat complicated by my suspicion that bos was aware of this feature of the function. streamDecodeUtf8 is the stream-y equivalent of Data.Text.decodeUtf8 :: ByteString -> Text, which isn't matched by a decodeDecodableUtf8 :: ByteString -> (Text, ByteString); but the latter is what we are looking for the stream-y equivalent of.
I honestly think we should just propose extending Decoding to something like

data Decoding = Some Text ByteString (ByteString -> Decoding) | FailedParse BadBytes (ByteString -> Decoding)
data BadBytes = ???
That would allow us to both detect and recover from failed parses.
Yes, I was just thinking of

data Decoding = Some Text ByteString (ByteString -> Decoding) | Other Text ByteString

or the like, since I thought this would not be too hard to implement given what he has. I was looking, for example, at the three possibilities in https://github.com/bos/text/blob/master/cbits/cbits.c#L115 The one we are interested in is the second. When it gets to this point, it ought to have constructed the text array that it could construct, and it has now backed up to the point in the bytestring where it went wrong. So why shouldn't it present us with both?
There is the question of the character information it took into the parse. If it fails immediately, so that the Text field of my imagined Other is empty, we can still recover the missing bytes from the previous step in any recursive definition:

decodeUtf8 = go B.empty streamDecodeUtf8 where
    go old dec p = ...

but I don't know.
Note that if Other does not return a continuation, then there is no way to resume from a failed parse.
I don't have enough time at the moment to review the C code, but I will check it out later to study what we have to work with.
Other doesn't contain the missing information, but I was thinking any recursive use of an associated streamDecodeUtf8 can recover it from the previous step (which is either the base case or a successful step) as the remaining chunk. But yes, I have no clue what the best idea is; that was just the picture I had formed in my head before mentioning this.
I implemented the simple idea I mentioned https://github.com/michaelt/text/commit/942e4c894fbb6cafd8aafdc5c5f148e85ac66dbf
Even if the type is defensible, some of the logic could do with more thought. In the type

streamDecodeUtf8With :: Maybe OnDecodeError -> ByteString -> Decoding

the Maybe OnDecodeError bit is maybe a little simple-minded: it basically lets you decide whether you will go about things in the way we seem to be imagining (Nothing) or use the exceptions and/or replacement scheme bos has in place.
The revised version of ibotty's function would look something like this:
decode = go B.empty (streamDecodeUtf8With Nothing)
  where
    go old dec p = do
        x <- lift (next p)
        case x of
            Left r -> return (return r)
            Right (chunk, p') ->
                case dec chunk of
                    Some txt new dec' -> do
                        yield txt
                        go (accum txt old new) dec' p'
                    Other txt bs -> do
                        yield txt
                        return $ do
                            yield (accum txt old bs)
                            p'
    accum txt old new
        | T.null txt = old <> new
        | otherwise  = new
I think that's correct; it's based on a [ByteString] function I put in the quickcheck Properties file. The unpleasant necessity of constantly checking the Text field to see if it is null becomes clearer when you imagine that, as you near the transition from Text to nonsense, you are being given one-byte bytestrings, or empty bytestrings. The developing continuation might consume any number of such bytestrings before it commits a Text, but then at the last moment realize it can't. So one won't be able to recover the information it contains without keeping track. Maybe this could somehow be put into streamDecodeUtf8With with a better type.
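For what it's worth, a hypothetical usage of the decode sketch above (assuming it has the type Producer ByteString m r -> Producer Text m (Producer ByteString m r)): drain the decoded Text and recover the producer of bytes that could not be decoded.

import Pipes
import Data.ByteString (ByteString)

-- 'decode' is the sketch above (built on the forked text's Other constructor);
-- here we simply discard the Text and keep the producer of undecoded bytes.
residue :: Monad m => Producer ByteString m r -> m (Producer ByteString m r)
residue p = runEffect $ for (decode p) (\_txt -> return ())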
Ok, so back to the original topic (lines and words): I finally pushed the fix to lines and simplified the implementation of words for pipes-bytestring. You can consult those for the Text equivalents.
This words is a nice demonstration of pipes style, or pipes-parse style. I ran my little tests and made the corresponding changes. I will look into the RULES business.
Esteemed comrades Gabriel and Tobias, I opened an issue corresponding to this business, https://github.com/bos/text/issues/60, when it emerged that the new version of text is already on hackage. I guess this was the right thing to do? I also mentioned it to bgamari, wondering if he might have a thought.
I was about to study the lenses branch of pipes-parse etc., when it occurred to me to see what would happen if one dealt with the problem above by just recycling the cbits from text and ringing the changes on them. The result is the branch https://github.com/michaelt/text-pipes/tree/stream I am pretty sure the material taken from text and revised is sound (unless there is some subtlety to do with the character replacement scheme text uses). The revised Text.Pipes module (which is crude and has a couple of bugs I've already thought of) is a little more complicated on further reflection, since the type of something like fromHandle is Handle -> Producer Text IO (Producer ByteString IO r), and similarly for basically all things beginning Producer Text .... Of course, that sort of seems right.

I don't know what to make of such an idea. One question it obviously raises is whether to do away with the character replacement business, which would in one way simplify things and in another way complicate them.
Yeah, that's the right type, since that is what you would get if you applied decodeUtf8 to Pipes.ByteString.fromHandle.
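Concretely, that composition is just the following sketch (module name for decodeUtf8 assumed):

import Pipes
import Data.ByteString (ByteString)
import Data.Text (Text)
import System.IO (Handle)
import qualified Pipes.ByteString as PB
import Pipes.Text (decodeUtf8)  -- module/export name assumed for the function under discussion

-- The return value of the Text producer is a producer of any bytes that
-- could not be decoded, picking up where decoding stopped.
textFromHandle :: Handle -> Producer Text IO (Producer ByteString IO ())
textFromHandle h = decodeUtf8 (PB.fromHandle h)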
I also want to note that the pipes-bytestring fromHandle and toHandle now run as fast as conduit in HEAD. I found out what was causing the slowdown: pipes-bytestring was using hIsEOF to test for end of input, whereas conduit was just checking to see if the retrieved ByteString was empty. Now that I have switched to testing the ByteString, it runs as fast as conduit.
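A minimal sketch of the check being described (not the actual pipes-bytestring source):

import Pipes
import qualified Data.ByteString as B
import System.IO (Handle)

-- Read fixed-size chunks and stop at the first empty ByteString, which is
-- how hGetSome signals end of file; no hIsEOF call before every read.
fromHandle' :: Handle -> Producer B.ByteString IO ()
fromHandle' h = go
  where
    go = do
        chunk <- lift (B.hGetSome h 8192)
        if B.null chunk
            then return ()
            else do
                yield chunk
                go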
This means that you don't need to provide a separate fromHandle function for pipes-text: decodeUtf8 suffices, and then people can reuse Pipes.ByteString.fromHandle. The second reason is that you don't have to define all the minor variations on fromHandle, like hGet/hGetN/etc. The third reason is that you don't want to have to define a separate fromHandle function for each possible decoding (once we move beyond UTF8 decoding).
Also, note that there is a lenses branch of pipes-bytestring that has the API upgraded to use all the new pipes-parse idioms. However, I would caution you not to spend the effort to upgrade pipes-text to the new parsing idioms until I merge them into master. I want to put them through their paces more to see if there are any awkward parts of the API that I might have missed so far.
Ah, yes, I had already taken the hIsEOF bit on board, somehow; fromHandle as it was just inlined PB.fromHandle together with decodeUtf8.
In the version of the branch that I just pushed, https://github.com/michaelt/text-pipes/tree/stream, I have scrapped character replacement entirely, which vastly simplifies everything, including the quickcheck tests.
I haven't removed things like fromHandle and readFile, as I think ibotty and I were both thinking the module might well include some material akin to Pipes.Prelude (and, say, Data.Text.Lazy). Maybe this is turning out to be wrong, as the material is seeming a bit more technical.
On the other hand, I wonder if a crude decoder function and IO accompaniments might not be desirable as an emulation of the text package, and maybe to facilitate interoperation with it? That e.g. decodeUtf8 has type ... -> Producer Text m (Producer ByteString m r) rather than ... -> Producer Text m r might seem a somewhat heavy-weight method of error handling. I wonder if the text package would be as usable if something like decodeUtf8 had the type ByteString -> (Text, ByteString), and if e.g. the things in Data.Text.IO like readFile were similarly typed. But I don't see how this could be done sensibly.
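For the single-chunk half of that idea, the existing streamDecodeUtf8 already gives something close; a sketch (invalid byte sequences would still raise, so this only covers an incomplete final character):

import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (streamDecodeUtf8, Decoding(..))

-- Decode as much of a single chunk as forms complete characters; the
-- ByteString field of Some is the undecoded tail when no more input follows.
decodePrefix :: ByteString -> (Text, ByteString)
decodePrefix bs = case streamDecodeUtf8 bs of
    Some txt leftover _ -> (txt, leftover)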
I just fixed the lines function in pipes-bytestring, so you can copy that now for the Text equivalent.