michaelt / text-pipes

Text pipes

Update `lines` to match `pipes-bytestring` #1

Closed. Gabriella439 closed this issue 10 years ago.

Gabriella439 commented 11 years ago

I just fixed the lines function in pipes-bytestring, so you can copy that now for the Text equivalent.

michaelt commented 11 years ago

Did you push this repair to GitHub? I finally started trying to figure out what the bug was yesterday, with tests like these: https://gist.github.com/michaelt/7207991 Rather than returning immediately when it comes upon a newline at the end of the file, lines makes an empty producer that returns. There is a different problem with words.

Gabriella439 commented 11 years ago

Oops! I forgot to upload it. I will do that soon.

Also, what is the problem with words?

michaelt commented 11 years ago

In Pipes.ByteString and Pipes.Text, unwords . words acts more like map (\c -> if isSpace c then ' ' else c). In each group below, the input comes first, followed by the results from Prelude, Data.ByteString.Char8 (BC8), Pipes.ByteString, Data.Text, and Pipes.Text, in that order:

  ["x   x"]
  "x x"   -- Prelude unwords.words
  "x x"   -- BC8  unwords.words
  "x   x"  -- Pipes.ByteString unwords.words
  "x x"    -- Text unwords.words
  "x   x"  -- Pipes.Text unwords.words
  ["x  x"]
  "x x"
  "x x"
  "x  x"
  "x x"
  "x  x"
  ["x "," x"]
  "x x"
  "x x"
  "x  x"
  "x x"
  "x  x"
  ["x ","x "]
  "x x"
  "x x"
  "x x "
  "x x"
  "x x "
michaelt commented 11 years ago

The following gives the right results for unwords . words in my little tests. It could be simplified a little; it may be that starting from splitWith is not ideal?

  words
      :: (Monad m) => Producer ByteString m r -> FreeT (Producer ByteString m) m r
  words p0 = removeEmpty (splitWith isSpaceWord8 p0)
    where
    -- Walk the FreeT of segments produced by splitWith, dropping any segment
    -- that contains no non-empty chunk (the "empty words" produced by runs of
    -- whitespace).
    removeEmpty f = PP.FreeT $ do
      x <- PP.runFreeT f
      case x of
          PP.Pure r -> return (PP.Pure r)
          PP.Free p -> loop p
    -- Drain leading empty chunks from a segment; if the segment turns out to
    -- be entirely empty, move on to the next one, otherwise re-emit it from
    -- the first non-empty chunk onwards.
    loop p = do
      y <- next p
      case y of
          Left  f'        -> PP.runFreeT (removeEmpty f')
          Right (txt, p') ->
            if BS.null txt
               then loop p'
               else return $ PP.Free $ do
                      yield txt
                      f' <- p'
                      return (removeEmpty f')
  {-# INLINABLE words #-}

Which yields

  [" x"]
  "x"
  "x"
  "x"
  "x"
  "x"
  [" x"," "]
  "x"
  "x"
  "x"
  "x"
  "x"
  ["x   x"]
  "x x"
  "x x"
  "x x"
  "x x"
  "x x"
  ["x  x"]
  "x x"
  "x x"
  "x x"
  "x x"
  "x x"
  ["x "," x"]
  "x x"
  "x x"
  "x x"
  "x x"
  "x x"
  ["x ","x "]
  "x x"
  "x x"
  "x x"
  "x x"
  "x x"
Gabriella439 commented 11 years ago

Yeah, I like your idea, because this very closely matches how words is implemented in Data.ByteString.Char8:

words = P.filter (not . B.null) . B.splitWith isSpaceWord8

I will go ahead and fix that, too (and upload both lines and words).

michaelt commented 11 years ago

By the way, do you think that what is wanted under a type like

 decodeUtf8 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

is a function that forwards as Text as much of the stream as is intelligible as text, returning a Producer of ByteStrings that starts where the text left off? It occurs to me that this might not be as useful as it seems: if I know that a stream contains a mixture of text and binary information, I should have some sort of parser telling me where the text leaves off and the binary goo begins, since binary goo can look like anything you please.

In any event, the fancy continuations supplied by the streamDecodeUtf8 function in Data.Text.Encoding do not seem to be aimed at this. The idea, I think, is just to make things like this possible:

pipeDecodeUtf8 :: Monad m => Pipe ByteString Text m r
pipeDecodeUtf8 = go streamDecodeUtf8 where
    go decoder = do
        bs <- await
        case decoder bs of
            Some txt unread newdecoder -> do
                yield txt
                go newdecoder

which I just added, with a variant that takes a replacement function.
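Such a pipe is used in the ordinary way, for instance like this (a minimal sketch assuming the pipeDecodeUtf8 just defined; the sink here is an arbitrary choice, printing each decoded chunk):

import qualified Data.Text.IO as TIO
import Pipes (runEffect, (>->))
import qualified Pipes.ByteString as PB
import qualified Pipes.Prelude as P

-- decode stdin and print the Text chunks as they arrive
main :: IO ()
main = runEffect $ PB.stdin >-> pipeDecodeUtf8 >-> P.mapM_ TIO.putStr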

As decodeUtf8 is implemented here so far, it does not give the result described above. It stops forwarding text after the first chunk that ends in the middle of a character, which defeats the purpose of streamDecodeUtf8, whose continuation exists precisely to carry character fragments over. Otherwise it fails outright, since streamDecodeUtf8 raises an exception when it meets bytes that aren't part of a character. One could try to catch those exceptions in IO, which might be done comparatively sensibly; to do it purely, one would need to avoid the error with a replacement function and then somehow recover the fact that a substitution was made, which would require comparing the resulting text with the bytestring it came from at each step, or something like that.

In any case, there would still be the problem of dividing a failed chunk into the 'good part', to be textified, and the rest, to be returned as the first yield of the ByteString producer: the same problem in the small. So either way one would end up with the kind of complexity we see in conduit's byteSplits and splitSlowly: http://hackage.haskell.org/package/conduit-1.0.8/docs/src/Data-Conduit-Text.html#decode

Or perhaps bos' streamDecodeUtf8 could do with some further complication.

Gabriella439 commented 11 years ago

So if you have preexisting information about where the text ends and the bytes begin, you would just use something like Pipes.ByteString.splitAt to partition the byte stream into those two halves:

bytestream
    :: Producer ByteString m r

splitAt 1024 bytestream
    :: Producer ByteString m (Producer ByteString m r)

decodeUtf8 (splitAt 1024 bytestream)
    :: Producer Text m (Producer ByteString m (Producer ByteString m r))

... where the first Producer ByteString would correspond to undecodable bytes before reaching the true byte stream and the second Producer ByteString would correspond to what we originally intended to be the byte payload.

I want to briefly comment that perhaps the type of decodeUtf8 should be:

decodeUtf8 :: (Monad m) => Producer ByteString m r -> Producer Text m (Either (Producer ByteString m r) r)

... where it returns a Left value if it encountered bad bytes and a Right value if it successfully got to the end. This would make it so that the user would not need to test the byte residue to see if there were bad bytes or not.

streamDecodeUtf8 is actually correct and has all the machinery necessary to keep going if a chunk ends in the middle of a character. You can see the original discussion behind it here, but I'll briefly summarize it.

streamDecodeUtf8's continuation takes into account the incomplete character from the previous chunk (i.e. it remembers that it is in the middle of decoding a character). The reason it also returns a ByteString field is to handle the case where there are no more chunks. If you still have chunks left you are supposed to ignore that field and just keep feeding those chunks to the continuations.

So in other words, if you have two ByteString chunks, bs1 and bs2, where bs1 contains a byte at the end that does not completely decode, then the Decoding will return the Text that it successfully decoded up until that last byte, and the continuation will already know that it is in the middle of parsing a character, so you can just pass bs2 directly to the continuation. The Decoding returned by that continuation will return a Text chunk whose byte representation includes the byte from the previous bs1 chunk.

Now, if bs2 ends with an incomplete character and you have no more chunks left to feed the Decoding, then you just take the ByteString it returns and make that the final undecoded residue.

However, the residue may be larger than a single ByteString chunk. This will happen if you have a decoding error (not an incomplete byte sequence, but an invalid UTF8 byte sequence, and these do exist). Such an error can occur in the middle of any chunk (not just the last one), in which case you need to return all subsequent chunks as undecoded bytes.
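For concreteness, a minimal sketch of the bs1/bs2 hand-off described above, using only Data.Text.Encoding.streamDecodeUtf8 and its Some constructor (the characters and the split point are arbitrary):

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (Decoding (..), streamDecodeUtf8)

-- "你好" is E4 BD A0 E5 A5 BD in UTF-8; split it in the middle of the second
-- character and feed the two halves to the decoder in turn.
demo :: (T.Text, T.Text)
demo =
  let (bs1, bs2)  = B.splitAt 4 (B.pack [0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD])
      Some t1 _ k = streamDecodeUtf8 bs1   -- t1 = "你"; k remembers the dangling 0xE5
      Some t2 _ _ = k bs2                  -- t2 = "好", completed from the carried-over byte
  in (t1, t2)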

ibotty commented 11 years ago

Hi,

having written the implementation of decodeUtf8, I have to ask: are you sure it is wrong? I am fairly sure text's streamDecodeUtf8 includes the remaining byte(s) only when there aren't any chunks any more. My testing did not show the cases you describe, but I might have missed something (or plenty).

I guess I am asking: did you come to your conclusion by testing?

pipeDecodeUtf8 :: Monad m => Pipe ByteString Text m r
pipeDecodeUtf8 = go streamDecodeUtf8 where
    go decoder = do
        bs <- await
        case decoder bs of
            Some txt unread newdecoder -> do
                yield txt
                go newdecoder

I am sure you know this, but I want to mention it: you can use decodeUtf8 Pipes.ByteString.stdin >>= const (return ())

which (nearly) gives you the function above. Maybe. I'm not sure whether I am (as usual) missing something.
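Spelled out at the type level (a rough sketch, abbreviating Pipes.ByteString as PB and assuming the decodeUtf8 type discussed above):

decodeUtf8 PB.stdin                        :: Producer Text IO (Producer ByteString IO ())
decodeUtf8 PB.stdin >>= const (return ())  :: Producer Text IO ()

i.e. roughly what PB.stdin >-> pipeDecodeUtf8 gives you, except that the leftover byte producer returned by decodeUtf8 is simply discarded.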

greetings, tobi

michaelt commented 11 years ago

Yes, I understand this, I think. The ByteString field is not needed for the Pipe I defined; the continuation carries over the incomplete-character information. However, that incomplete-character information is contained in the ByteString field, so if we test the field for being null we take too much interest in it, so to speak. (One could instead test for its being shorter than 3 bytes, or whatever.) The trouble is the case of failure in the middle of the stream. Intuitively that should yield, in the most conservative case: any text recovered from an initial segment; the remaining undigested bytestring; and then, in effect, Nothing for the continuation. Our decodeUtf8 would then forward the text and return the bytestring together with the remaining producer. But maybe there is a way of evading the errors that streamDecodeUtf8With raises with strictDecode on bytestrings containing non-character information.

Gabriella439 commented 11 years ago

What's wrong with the solution I proposed, of using a splitter? Pipes.ByteString.splitAt or Pipes.ByteString.span (or any other custom splitter) can partition the Producer ByteString into two halves: the first half is what decodeUtf8 should assume is all valid Text; the second half is non-character information. This requires no modification to streamDecodeUtf8 or decodeUtf8, and it's a nice separation of concerns.

Moreover, it allows decodeUtf8 to make stronger statements when parses fail: if the parse failed, it is because it was definitely an invalid or incomplete byte sequence and not a non-character payload. If you change these functions to assume the possibility of non-textual data, then there is no way to opt out of that assumption when you know that the data is in fact supposed to be textual.

michaelt commented 11 years ago

Yes, I thought I understood this before, but here is the result of breaking a' = encodeUtf8 $ T.pack "你好", i.e. "\228\189\160\229\165\189", into two bytestrings at each possible position and feeding the fragments successively to streamDecodeUtf8: https://gist.github.com/michaelt/7377887#file-output

Gabriella439 commented 11 years ago

I don't see the problem. It correctly decoded every character and didn't add or remove any spurious bytes.

michaelt commented 11 years ago

Sorry, I'm still not making this clear. The important point is the 'leftover bytestring' element: though this is a perfectly good pure text stream, decodeUtf8 will stop it after the second of the twelve chunks that are sent, since it inspects the bytestring field of Some txt bs f.

Gabriella439 commented 11 years ago

Oh, I see. decodeUtf8 has no way to tell whether there was a decoding error (i.e. an invalid UTF8 byte sequence) or not. At least, it could, but that would require sophisticated logic that really belongs in streamDecodeUtf8.

One solution to this is to modify the Decoding type to include a constructor to indicate decoding failure so that decodeUtf8 can know when it needs to continue, like this:

data Decoding = Some ... | FailedParse ByteString

However, this makes it impossible to resume from invalid byte sequences.

Another possibility is that in case of invalid UTF8 byte sequences you simply give the user the failed bytes and let them decide whether or not to continue:

data Decoding = Some ... | FailedParse Word32 (ByteString -> Decoding)
Gabriella439 commented 11 years ago

Oops, I forgot that one character might not be exactly 32 bits in UTF8, but you get the idea. Just replace Word32 with ByteString or some other representation for an invalid byte sequence.

michaelt commented 11 years ago

> decodeUtf8 has no way to tell whether there was a decoding error (i.e. an invalid UTF8 byte sequence) or not. At least, it could, but that would require sophisticated logic that really belongs in streamDecodeUtf8.

Right, this is what I was trying to say, perhaps too opaquely, at the outset. To recover what should (intuitively) be in streamDecodeUtf8, as things stand, would require something like the conduit text business. That's why I said that, as it is, it is really only suited to making a naive Pipe ByteString Text m r, or so it was seeming.

michaelt commented 11 years ago

My initial remark was somewhat complicated by my suspicion that bos was aware of this feature of the function. streamDecodeUtf8 is the stream-y equivalent of Data.Text.Encoding.decodeUtf8 :: ByteString -> Text, which isn't matched by a decodeDecodableUtf8 :: ByteString -> (Text, ByteString); but it is the stream-y equivalent of the latter that we are looking for.

Gabriella439 commented 11 years ago

I honestly think we should just propose extending Decoding to something like

data Decoding = Some Text ByteString (ByteString -> Decoding) | FailedParse BadBytes (ByteString -> Decoding)

data BadBytes = ???

That would allow us to both detect and recover from failed parses.
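A purely hypothetical sketch of how decodeUtf8 might consume such an extended Decoding, taking BadBytes to be a plain ByteString holding the offending input, and using next and lift as in the other sketches in this thread (none of this exists in the text package):

decodeUtf8 :: Monad m
           => Producer ByteString m r
           -> Producer Text m (Producer ByteString m r)
decodeUtf8 = go streamDecodeUtf8
  where
    go dec p = do
      x <- lift (next p)
      case x of
        Left r            -> return (return r)
        Right (chunk, p') -> case dec chunk of
          Some txt _leftover dec' -> do
            yield txt                      -- keep decoding; ignore the leftover field
            go dec' p'
          FailedParse bad _resume ->
            -- stop at the first invalid sequence; hand the bad bytes back at
            -- the head of the returned byte stream
            return (yield bad >> p')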

michaelt commented 11 years ago

Yes, I was just thinking of

data Decoding = Some Text ByteString (ByteString -> Decoding) | Other Text ByteString

or the like, since I thought this would not be too hard to implement given what he has. I was looking, for example, at the three possibilities in https://github.com/bos/text/blob/master/cbits/cbits.c#L115 and the one we are interested in is the second. When it gets to this point, it ought to have constructed the text array that it could construct, and it has now backed up to the point in the bytestring where it went wrong. So why shouldn't it present us with both?

There is the question of the character information it took into the parse. If it fails immediately, so that the Text field of my imagined Other is empty, we can still recover the missing bytes from the previous step in any recursive definition:

decodeUtf8 = go B.empty streamDecodeUtf8 where
  go old dec p = ...

but I don't know.

Gabriella439 commented 11 years ago

Note that if Other does not return a continuation then there is no way to resume from a failed parse.

I don't have enough time at the moment to review the C code, but I will check it out later to study what we have to work with.

michaelt commented 11 years ago

Other doesn't contain the missing information, but I was thinking any recursive use of an associated streamDecodeUtf8 can recover it from the previous step (which is either the base case or a successful step) as the remaining chunk. But yes, I have no clue what the best idea is; that was just the picture I had formed in my head before mentioning this.

michaelt commented 11 years ago

I implemented the simple idea I mentioned: https://github.com/michaelt/text/commit/942e4c894fbb6cafd8aafdc5c5f148e85ac66dbf

Even if the type is defensible, some of the logic could do with more thought. In the type

streamDecodeUtf8With :: Maybe OnDecodeError -> ByteString -> Decoding

the Maybe OnDecodeError bit is maybe a little simple-minded: it basically lets you decide whether to go about things in the way we seem to be imagining (Nothing), or to use the exceptions and/or replacement scheme bos has in place.

The revised version of ibotty's function would look something like this:

decode = go B.empty (streamDecodeUtf8With Nothing)
  where
    go old dec p = do
      x <- lift (next p)
      case x of
        Left r            -> return (return r)
        Right (chunk, p') ->
          case dec chunk of
            Some txt new dec' -> do
              yield txt
              go (accum txt old new) dec' p'
            Other txt bs -> do
              yield txt
              -- hand back the undecoded bytes, then the rest of the stream
              return $ do
                yield (accum txt old bs)
                p'

    -- accum keeps track of the bytes the decoder has been fed while it has not
    -- yet committed any Text, so that they can be handed back if decoding fails.
    accum txt old new | T.null txt = old <> new
                      | otherwise  = new

I think that's correct; it's based on a [ByteString] function I put in the quickcheck Properties file. The unpleasant necessity of constantly checking the Text field for nullity becomes clearer when you imagine that, as you near the transition from Text to nonsense, you are being given one-byte or even empty bytestrings. The developing continuation might consume any number of such bytestrings before it commits a Text, and then at the last moment realize that it can't; so one won't be able to recover the information they contain without keeping track of them. Maybe this could somehow be pushed into streamDecodeUtf8With with a better type.

Gabriella439 commented 10 years ago

Ok, so back to the original topic (lines and words), I finally pushed the fix to lines and simplified the implementation of words for pipes-bytestring. You can consult those for the Text equivalents.

michaelt commented 10 years ago

This words is a nice demonstration of pipes style, or pipes-parse style. I ran my little tests and made the corresponding changes. I will look into the RULES business.

michaelt commented 10 years ago

Esteemed comrades Gabriel and Tobias, I opened an issue corresponding to this business (https://github.com/bos/text/issues/60) when it emerged that the new version of text is already on Hackage. I guess this was the right thing to do? I also mentioned it to bgamari, wondering if he might have a thought.

michaelt commented 10 years ago

I was about to study the lenses branch of pipes-parse etc. when it occurred to me to see what would happen if one dealt with the problem above by just recycling the cbits from text and ringing the changes on them. The result is the branch https://github.com/michaelt/text-pipes/tree/stream I am pretty sure the material taken from text and revised is sound (unless it is undone by some subtlety to do with the character replacement scheme text uses). The revised Text.Pipes module (which is crude and has a couple of bugs I've already thought of) is a little more complicated on further reflection, since the type of something like fromHandle is Handle -> Producer Text IO (Producer ByteString IO r), and similarly for basically everything beginning Producer Text ... Of course, that sort of seems right.

I don't know what to make of such an idea. One question it obviously raises is whether to do away with the character replacement business, which would in one way simplify things and in another way complicate them.

Gabriella439 commented 10 years ago

Yeah, that's the right type, since that is what you would get if you applied decodeUtf8 to Pipes.ByteString.fromHandle.

I also want to note that the pipes-bytestring fromHandle and toHandle now run as fast as conduit's in HEAD. I found out what was causing the slowdown: pipes-bytestring was using hIsEOF to test for end of input, whereas conduit was just checking whether the retrieved ByteString was empty. Now that I have switched to testing the ByteString, it runs as fast as conduit.
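In sketch form, the change amounts to something like this (only the shape of it, not the actual pipes-bytestring source):

import Control.Monad.IO.Class (MonadIO, liftIO)
import qualified Data.ByteString as B
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import Pipes (Producer', yield)
import System.IO (Handle)

-- read until hGetSome hands back an empty chunk, rather than asking hIsEOF
-- before every read
fromHandle :: MonadIO m => Handle -> Producer' B.ByteString m ()
fromHandle h = go
  where
    go = do
      bs <- liftIO (B.hGetSome h defaultChunkSize)
      if B.null bs
        then return ()
        else yield bs >> go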

This means that you don't need to provide a separate fromHandle function for pipes-text: decodeUtf8 suffices, and people can then reuse Pipes.ByteString.fromHandle. Another reason is that you don't have to define all the minor variations on fromHandle, like hGet/hGetN/etc. The third reason is that you don't want to have to define a separate fromHandle function for each possible decoding (once we move beyond UTF8 decoding).
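In other words, users would just write something like this (textFromHandle is a hypothetical name for illustration; PB abbreviates Pipes.ByteString):

textFromHandle :: Handle -> Producer Text IO (Producer ByteString IO ())
textFromHandle h = decodeUtf8 (PB.fromHandle h)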

Also, note that there is a lenses branch of pipes-bytestring that has the API upgraded to use all the new pipes-parse idioms. However, I would caution you not to spend the effort to upgrade pipes-text to the new parsing idioms until I merge them into master. I want to put them through their paces more to see if there are any awkward parts of the API that I might have missed so far.

michaelt commented 10 years ago

Ah, yes, I had already taken the hIsEOF bit on board somehow; fromHandle, as it was, just inlined PB.fromHandle together with decodeUtf8.

In the version of the branch that I just pushed (https://github.com/michaelt/text-pipes/tree/stream) I have scrapped character replacement entirely, which vastly simplifies everything, including the quickcheck tests.

I haven't removed things like fromHandle and readFile, as I think ibotty and I were both imagining that the module might well include some material akin to Pipes.Prelude (and, say, Data.Text.Lazy). Maybe this is turning out to be wrong, as the material is looking a bit more technical.

On the other hand, I wonder whether a crude decoder function and IO accompaniments might not be desirable, as an emulation of the text package and maybe to facilitate interoperation with it. That decodeUtf8, for example, has a type like ... -> Producer Text m (Producer ByteString m r) rather than ... -> Producer Text m r might seem a somewhat heavyweight method of error handling. I wonder if the text package would be as usable if something like decodeUtf8 had the type ByteString -> (Text, ByteString), and if, e.g., the things in Data.Text.IO like readFile were similarly typed. But I don't see how this could be done sensibly.