svenvc / zodiac

Zodiac is an open-source Smalltalk framework implementing TLS/SSL secure as well as regular socket streams.
MIT License

ZdcSocketStream atEnd may return false while a subsequent next returns nil #3

Open Rinzwind opened 5 years ago

Rinzwind commented 5 years ago

I'm wondering whether there is a bug in the implementation of atEnd on ZdcSocketStream (inherited from ZdcSimpleSocketStream):

https://github.com/svenvc/zodiac/blob/e24882abdd4c7edbd5da8318131465704eb61672/repository/Zodiac-Core.package/ZdcSimpleSocketStream.class/instance/atEnd.st#L2-L13

So as far as I understand: if the buffer is empty, and no data is available, the method will return false if the socket is still connected. But that does not guarantee that the stream is not at the end, as the socket may simply be disconnected without additional data becoming available? Or am I misunderstanding something about how this works?

Case in point: we have some tests that sometimes fail while reading a ZnResponse, which seems to be due to #atEnd returning false, and a subsequent #next returning nil. As shown on the screenshots below: a ZnIncomplete is signaled while reading a ZnEntity because #next returned nil, although this is just after checking whether the stream is #atEnd, which must have returned false. (The stream should actually be at the end, as in this test, the response is expected to contain Generating report 'eventfiche' (size: 30) and index in #readInto:startingAt:count:fromStream: is at 31.) Note that the response does not include a Content-Length header, so the stream has not been wrapped in a ZnLimitedReadStream.
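Reduced to a sketch, the failing pattern looks like this (a hypothetical snippet, not from the Zinc sources; stream is assumed to be a ZdcSocketStream reading a response that is not wrapped in a ZnLimitedReadStream):

stream atEnd
    ifFalse: [
        "The peer may disconnect between the two sends: #atEnd answered
        false because the buffer was empty but the socket still looked
        connected, yet #next then finds no more data and answers nil."
        character := stream next ].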

[Screenshot 2 omitted]

[Screenshot 3 omitted]

svenvc commented 5 years ago

Hi,

As far as I understand it, the existing stream protocol is not perfect since it mixes both returning nil and signalling exceptions to indicate EOF situations (on top of that there is/was actually no EOF exception).

I would define #atEnd to mean that we know for sure that the stream can not ever return any more data, but at the same time it should not block.

In general #next is defined to return an element, and block waiting for one. If the stream is #atEnd, it should return nil. It can throw exceptions.

For a network stream, it is sometimes in a state where it does not know whether it is really #atEnd; the only way to find out is to try reading, but that would block.

So the situation you describe certainly does exist, but is handled correctly, as far as I understand.

I would say that a true answer from #atEnd is a certainty, but a false answer is not (always).
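Under that contract, client code that wants to be safe has to treat both signals as EOF. A defensive read loop could look roughly like this (a sketch, not existing Zinc code; elements and element are hypothetical temporaries):

[ stream atEnd ] whileFalse: [
    element := stream next.
    element
        ifNil: [ ^ elements "the non-blocking #atEnd was optimistic: nil from #next also means EOF" ]
        ifNotNil: [ elements add: element ] ].
^ elements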

Maybe #readInto:startingAt:count:fromStream: should handle ZnIncomplete.

But going one level up, if the response has no content length, that means its entity should be read until EOF, it also means that the server should write that response out and close the connection, which would normally be seen by the client. Are you sure the server did that ?

Sven

Rinzwind commented 5 years ago

But going one level up, if the response has no content length, that means its entity should be read until EOF, it also means that the server should write that response out and close the connection, which would normally be seen by the client. Are you sure the server did that ?

I assume the server closed the connection, as otherwise the send of #next would not have returned nil, but would have signaled a ConnectionTimedOut, no?

I would define #atEnd to mean that we know for sure that the stream can not ever return any more data, but at the same time it should not block.

In general #next is defined to return an element, and block waiting for one. If the stream is #atEnd, it should return nil. It can throw exceptions.

So that means the bug is actually in ZnUTF8Encoder>>#nextCodePointFromStream:; it should not treat a return value of nil from its first send of #next as an error, but as an “end of stream”?

As a workaround, I have patched the #atEnd condition in ZnUTF8Encoder>>#readInto:startingAt:count:fromStream: from stream atEnd ifTrue: ... to stream peek ifNil: .... At first glance, that seems to fix the problem we are having with some of our tests.
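In context, the patched guard reads roughly as follows (a sketch of just the loop head; the full method is quoted later in this thread). The point of the change is that #peek presumably forces an actual read attempt, so a nil answer really means EOF, whereas the non-blocking #atEnd could still answer false:

offset to: offset + requestedCount - 1 do: [ :index | | codePoint |
    "was: stream atEnd ifTrue: [ ^ index - offset ]"
    stream peek ifNil: [ ^ index - offset ].
    codePoint := self nextCodePointFromStream: stream.
    ...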

svenvc commented 5 years ago

OK, an alternative could be

[ codePoint := self nextCodePointFromStream: stream ] on: ZnIncomplete do: [  ^ index - offset  ].

which, if it works, would be more intention revealing.

Rinzwind commented 5 years ago

OK, an alternative could be […]

I don't think that would be correct, as it would “swallow” every ZnIncomplete. So the following would return 'e' rather than signaling a ZnIncomplete (as it should, due to the last byte of the encoding of é being cut off):

example := ByteArray streamContents: [ :s | (ZnResponse ok: (ZnStringEntity text: 'eé')) writeOn: s ].
(ZnResponse readFrom: (example copyFrom: 1 to: example size - 1) readStream) contents.

I'm wondering: are there other Stream-like classes where false from #atEnd means the stream may not be at the end, rather than that it is not at the end? I was not aware of having to take that into account when using streams, and ZnUTF8Encoder also does not seem to take those semantics into account. Shouldn't ZdcSocketStream offer a separate #atEndNonBlocking, with #atEnd being blocking?

svenvc commented 5 years ago

Yes, there is indeed a difference between EOF at a character boundary and an incomplete UTF-8 encoding. I have to think a bit more about this, and I have to try to remember what the reasoning was when I introduced ZnIncomplete.

KazunoriUeda commented 5 years ago

@Rinzwind @svenvc Hi, we faced the same problem recently and we tried the workaround below. This is not fundamental but works well for us.

!ZdcIOBuffer methodsFor: '*zodiac-accessing-override'!
availableForReading
    "How many elements can be read ?"

    ^ (writePointer - readPointer) max: 0! !

!ZdcIOBuffer methodsFor: '*zodiac-accessing-override'!
availableForWriting
    "How many elements can be written at the end ?"

    ^ (buffer size - writePointer) max: 0! !

I hope this workaround helps you.

svenvc commented 5 years ago

Hi,

Very interesting, but could you please elaborate a bit more ? How exactly does this solve the problem ?

In principle, the invariant readPointer <= writePointer should hold for ZdcIOBuffer. If you need to add the max: 0, that would suggest the invariant is broken. And then we have to understand how that is possible.

Thx,

Sven

KazunoriUeda commented 5 years ago

Hello Mr. Sven,

We noticed stream atEnd returns false in ZnUTF8Encoder>>readInto:startingAt:count:fromStream:.

readInto: string startingAt: offset count: requestedCount fromStream: stream
    "Read requestedCount characters into string starting at offset,
    returning the number read, there could be less available when stream is atEnd.
    I signal a ZnByteStringBecameWideString notification if necessary"
    | stringBuffer |
    stringBuffer := string.
    offset to: offset + requestedCount - 1 do: [ :index | | codePoint |
        stream atEnd ifTrue: [ ^ index - offset ].  
        codePoint := self nextCodePointFromStream: stream.
...

And then stream next in ZnUTF8Encoder>>nextCodePointFromStream: raised ZnIncomplete.

nextCodePointFromStream: stream
    "Read and return the next integer code point from stream"
    | code byte next |
    (byte := stream next ifNil: [ ^ self errorIncomplete ]) < 128
...

Our system calls a Web API that returns an XML response without a 'Content-Length' response header, and the response contains some Japanese characters. In ZnStringEntity>>readFrom:, the buffer size is set to ZnUtils streamingBufferSize because the content length is nil. And somehow readPointer ends up beyond the actual stream data size (I think it is caused by a problem around multibyte index calculation, so our workaround is not a fundamental solution). In the end ZdcIOBuffer>>availableForReading returns a non-zero value, ZdcIOBuffer>>isEmpty and ZdcSocketStream>>atEnd return false, but the next moment stream next returns nil.

It does not occur when you use ASCII text, or when the response headers contain 'Content-Length'.
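As background on why multibyte content matters here: for non-ASCII text, character counts and byte counts diverge, so any index arithmetic done in characters can overrun a pointer that counts bytes. A quick check in a Pharo playground:

'eé' size.              "2 characters"
'eé' utf8Encoded size.  "3 bytes: the é takes two bytes in UTF-8"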

In our case, our workaround fixed ZdcSocketStream>>atEnd behavior and solved the problem.

svenvc commented 5 years ago

I appreciate your detailed description.

This is a difficult bug nonetheless. I've put a conditional breakpoint on both methods, with the condition that the returned expression is negative. Then I ran all Zn+Zdc tests and the condition was not met.

I am on macOS, 64-bit pharo 7.0.3

From your description the following conditions are needed:

I am guessing the following are also needed:

Needless to say, we should find an easy-to-reproduce case that triggers the problem reliably.

mumez commented 5 years ago

Hi,

Actually, we have found the issue when calling the Pardot (Salesforce) API with ZnClient. http://developer.pardot.com/kb/api-version-4/visitors/#assigning-and-reassigning-visitors

The API returns a large UTF-8-encoded XML document over HTTPS without a content-length property. Pardot is a commercial service, so it may be difficult to try the API. But I hope it helps a bit.

KazunoriUeda commented 5 years ago

a large response (probably larger than some internal buffer sizes)

I can't upload actual response because it contains private data, but the response size is 7676 bytes.

windows

Sorry, I didn't mention our development environment. We use macOS too, and the Pharo version is 64-bit 6.1. So Windows is not needed.

Rinzwind commented 5 years ago

Above I wrote:

As a workaround, I have patched the #atEnd condition in ZnUTF8Encoder>>#readInto:startingAt:count:fromStream: from stream atEnd ifTrue: ... to stream peek ifNil: .... At first glance, that seems to fix the problem we are having with some of our tests.

We wound up using a different workaround. We added an override for #atEnd on ZdcSocketStream as an extension:

atEnd

    readBuffer isEmpty ifFalse: [ ^ false ].
    self isConnected ifFalse: [ ^ true ].
    self fillReadBuffer.
    ^ readBuffer isEmpty

We thought that was a bit cleaner than patching an existing method. This override changes #atEnd to be blocking instead of non-blocking. This ties in with what I was wondering about above:

I'm wondering: are there other Stream-like classes where false from #atEnd means the stream may not be at the end, rather than that is not at the end? I was not aware of having to take that into account when using streams, and ZnUTF8Encoder also does not seem to take that semantics into account. Shouldn't ZdcSocketStream offer a separate #atEndNonBlocking, with #atEnd being blocking?

svenvc commented 5 years ago

@Rinzwind what if you tried the other workaround as well ? It would be interesting to know if we are talking about the same thing.

It is also very important to have a problematic situation that can be reliably reproduced.

Rinzwind commented 5 years ago

@Rinzwind what if you tried the other workaround as well ?

Do you mean the patch on ZnUTF8Encoder>>#readInto:startingAt:count:fromStream: (changing stream atEnd ifTrue: ... to stream peek ifNil: ...)? That also fixes the problem we were having with our tests. It's just that when patching an existing method, the patch sometimes gets reverted to the non-patched implementation when updating packages. As ZdcSocketStream doesn't have an existing method for #atEnd (it inherits it), we thought adding an override for #atEnd there was a better solution. But either solution is sufficient for us to fix the problem we were having with our tests.

Which of these solutions to integrate into Zinc/Zodiac is up to you to decide. The first one makes ZnUTF8Encoder take into account that #atEnd on ZdcSocketStream is non-blocking; the other makes #atEnd on ZdcSocketStream blocking. I'm still wondering, though, whether there are other Stream-like classes where false from #atEnd means the stream may not be at the end, rather than that it is not at the end? Is there any reason not to make ZdcSocketStream offer a separate #atEndNonBlocking, with #atEnd being blocking?

It is also very important to have a problematic situation that can be reliably reproduced.

I'm afraid I can't easily share our test setup in which the problem occurs. The key to reproducing the problem, though, would seem to be to send a request to a server that gives responses with a UTF-8-encoded entity and no Content-Length header (otherwise, the ZdcSocketStream is wrapped in a ZnLimitedReadStream, for which #atEnd returns true as soon as the limit is reached, regardless of whether #atEnd on the ZdcSocketStream still returns false).

svenvc commented 5 years ago

Thanks again for the feedback and the discussion. There is much to think about.

But first I am focusing on building a reliable reproducible test case.

KazunoriUeda commented 5 years ago

Hi, our workaround was not effective on Linux. But the problem was solved when we used ZnClient>>beBinary before calling ZnClient>>request: and ZnClient>>execute. So the response has probably already been decoded and may be decoded again somewhere.

svenvc commented 5 years ago

OK, noted.

If you load the latest commit (fixed a small issue related to connection closing), you can try the following example, it is a first step to set up a server that gives the response as described.

server := ZnServer startOn: 1701.
"small"
server onRequestRespond: [ :req | | resp |
    resp := ZnResponse ok: (ZnEntity textCRLF: 'Testing 123').
    resp setConnectionClose.
    resp headers clearContentLength.
    resp ].
"larger, non-ascii"
server onRequestRespond: [ :req | | resp |
    resp := ZnResponse ok: (ZnEntity textCRLF: '10 €' , (String loremIpsum: 10000) , ' 10 €').
    resp setConnectionClose.
    resp headers clearContentLength.
    resp ].
"access the response"
ZnClient new get: 'http://localhost:1701'.

However, this still works for me.

Rinzwind commented 5 years ago

I assume this example does not (easily) reproduce the issue because the time between writing the last byte of the response and the closing of the connection is too short. So I added a delay of 100 milliseconds to ZnMultiThreadedServer>>#closeSocketStream:; as shown on the screenshot below, running the example then leads to the ZnIncomplete error:

[Screenshot omitted: Sven's scenario with the added server delay, for Zodiac issue 3]

The change to ZnMessage>>#wantsConnectionClose is the fix done in https://github.com/svenvc/zinc/commit/44b599afee3cd1501e3a66f8dca672dfc17e485d.
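The delay I added can be sketched like this (for testing only; the rest of the original #closeSocketStream: body is elided):

ZnMultiThreadedServer >> closeSocketStream: socketStream
    "Patched for testing only: widen the gap between the last bytes
    of the response being written and the connection being closed."
    100 milliSeconds wait.
    ...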

svenvc commented 5 years ago

Excellent work. Yes, this fails for me too. It illustrates your earlier points well. Thank you.

A non-destructive way to arrive at the same server behaviour is the following: create a ZnResponse subclass that overrides the #useConnection: hook (which is used by the WebSockets implementation).

ZnResponse subclass: #ZnResponseWith100msDelay
    instanceVariableNames: ''
    classVariableNames: ''
    package: '_UnpackagedPackage'
ZnResponseWith100msDelay>>#useConnection: connection
    100 milliSeconds wait

And then use it as follows:

server := ZnServer startOn: 1701.
server onRequestRespond: [ :req | | resp |
    resp := ZnResponseWith100msDelay ok: (ZnEntity textCRLF: 'Testing 123').
    resp setConnectionClose.
    resp headers clearContentLength.
    resp ].

Neither the response size nor it being a WideString seems relevant.

I now have enough to start working on a solution (I think both returning nil from EOF streams when decoding characters and a blocking #atEnd are needed/useful). I hope this won't break anything else.