Which file? Please provide a URL, such as this one:
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js?spec=svn472&r=472#29
Original comment by dhruvb...@gmail.com
on 6 Dec 2011 at 1:11
Oh, sorry, I didn't say which file it was. :) Line 711 in session.js:
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js?spec=svn474&r=474#711
Original comment by satyamsh...@gmail.com
on 6 Dec 2011 at 2:29
IIRC, this is done so that failed responses can be replayed, since the client
has no way of knowing which request led to this response. Besides, the proxy
should replay any response it has received from the server but failed to deliver
to the client.
If we don't do this, then the client will have to explicitly request the
missing RID once it has realized that it's not forthcoming, and it would incur
an extra wait+RTT delay. Does that make sense?
Original comment by dhruvb...@gmail.com
on 6 Dec 2011 at 5:29
Wouldn't the client have to re-request that RID anyway, and process responses
in that order?
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 6:57
More often than not, this is what would happen (I could be wrong):
1. Client sends request with RID=10
2. Client sends request with RID=11
3. Server sends responses to RID=10 and RID=11 (in separate responses, for
whatever reason), but 10 fails for some reason.
4. Client sends request with RID=12 (since the user wants to send a message)
5. Server re-sends the response of RID=10 on RID=12
This is out of order, but it doesn't cause a problem since the client will
reorder the responses anyway before showing them to the user (or processing
them in any way).
OTOH, if the server didn't do this, the client would have to explicitly
re-request RID=10, incurring an extra round-trip overhead.
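The re-ordering the client would do in step 5 can be sketched as follows. This is a hypothetical, minimal client-side buffer; the names (`makeReorderBuffer`, `onResponse`) are illustrative and this is not actual node-xmpp-bosh or strophe.js code:

```javascript
// Minimal sketch: buffer BOSH responses and deliver them to the application
// in RID order, no matter which HTTP response they arrived on.
function makeReorderBuffer(firstRid, process) {
    var nextRid = firstRid;
    var pending = {}; // RID -> payload, for responses that arrived early
    return function onResponse(rid, payload) {
        pending[rid] = payload;
        // Deliver every consecutive RID we now have, in order.
        while (pending.hasOwnProperty(nextRid)) {
            process(nextRid, pending[nextRid]);
            delete pending[nextRid];
            nextRid += 1;
        }
    };
}

// The scenario above: the response to RID=10 is replayed on a later
// request, yet the client still processes 10, 11, 12 in order.
var seen = [];
var onResponse = makeReorderBuffer(10, function (rid) { seen.push(rid); });
onResponse(11, '<body/>'); // RID=11 arrives first
onResponse(10, '<body/>'); // RID=10 replayed on a later request
onResponse(12, '<body/>');
// seen is now [10, 11, 12]
```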
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 11:55
I don't think the spec says that you can send a response for RID 10 on the
request for RID 12 -- meaning the client might not rely on the RID sent
from the server.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 12:22
My bad, I didn't recall what the behaviour would be. This is the scenario:
1. Client sends request with RID=10
2. Client sends request with RID=11
3. Server sends responses to RID=10 and RID=11 (in separate responses, for
whatever reason), but 10 fails. We assume that 11 also fails (this is an
incorrect assumption, but works most of the time in practice). Alternatively,
since hold is almost always 1, there is no held request on which 11 can be
sent, so we are good.
4. Client sends request with RID=12 (since the user wants to send a message)
5. Server re-sends the response of RID=10 on RID=11
6. Request with RID=13 is sent and the server sends back the response for
RID=12. The client need not know the responses are offset by 1, since when the
client re-requests RID=10, it will get back an empty body. As far as the client
is concerned, the stream is consistent.
The reason that this is an optimization is that many clients don't bother
re-requesting a failed request and just move ahead. In such a case, we deliver
all the data from the server to the client with a higher probability than if we
relied on the client re-requesting the failed transaction. Hence, on average,
we hope to be more reliable by detecting network failures at the server end and
taking corrective action there. Of course, there is a non-zero probability of
out-of-order responses for correctly behaving clients.
In case of proxies, we rely on the proxy faithfully (and almost immediately)
reporting back with a failure for this to work.
This can in theory lead to out-of-order responses, but I've not yet seen any,
so this is more of a pragmatic thing than a pedantic thing.
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 1:13
> We assume that 11 also fails (this is an incorrect assumption, but works most
of the time in practice). Alternatively, since hold is almost always 1, there
is no held request on which 11 can be sent, so we are good.
Why is this assumption incorrect?
> The reason that this is an optimization is that many clients don't bother
re-requesting a failed request and just move ahead.
This sounds wrong to me, though I don't have any evidence against it. All the
clients that we have written here don't do this.
Also, I don't think messages are removed from this.unacked_responses -- leading
to duplication when the client re-requests that RID.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 4:00
>> We assume that 11 also fails (this is an incorrect assumption, but works
most of the time in practice). Alternatively, since hold is almost always 1,
there is no held request on which 11 can be sent, so we are good.
> Why is this assumption incorrect?
Because sometimes, hold might be > 1 (though I've never seen this configuration
in the wild).
>> The reason that this is an optimization is that many clients don't bother
re-requesting a failed request and just move ahead.
> This sounds wrong to me though I dont have any evidence against it. All the
clients that we have written here dont do this.
True, though this was done when we didn't have so many clients! Besides, the
behaviour is based more on the union of all the popular clients out there.
If this is causing trouble, then there are serious network issues that need to
be fixed, rather than fixing this, since the probability of such errors
occurring on a relatively decent network is very low. This is because the
failure window is when the connection is lost while waiting for a response from
the xmpp proxy. On average, a response is held for about wait/2, which is
significantly greater than the RTT, so the case that is handled covers a much
larger area of the graph. Only if the network connection is lost after we send
the response, so that the response doesn't get delivered to the client, will
there be a problem.
> Also, I don't think messages are removed from this.unacked_responses --
leading to duplication when the client re-requests that RID.
This shouldn't happen - if it does, it's an error in the implementation.
unacked_responses isn't used for streams that don't ACK. Additionally,
unacked_responses is cleared for streams that do ACK.
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 4:20
> Because sometimes, hold might be > 1 (though I've never seen this
configuration in the wild).
Oh OK, now I get it. I thought you meant 11 couldn't fail. :)
> Besides, the behaviour is based more on the union of all the popular clients
out there.
Is this a standard that every BOSH server follows? If it is not, it can lead to
message loss for these clients. To me it just doesn't sound right.
No, it is not causing any trouble yet. I am just trying to understand the
reasoning behind it a bit better. :)
> This is because the failure window is when the connection is lost while
waiting for a response from the xmpp proxy.
xmpp proxy? I didn't get this at all. How is the wait time related to a
response from the xmpp proxy? From what I understand, the failure window is
whenever there is an error event on the response object -- which I think should
only happen when there is a write error on the underlying socket -- implying a
broken connection. Here I am making the assumption that the xmpp server is
continuously sending data.
Also, the connection is pretty weak on EDGE, so I wouldn't be surprised if it
creates issues.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 8:31
>> Besides, the behaviour is based more on the union of all the popular clients
out there.
> Is this a standard that every bosh server follows? If it is not, it can lead
to message loss for these clients. To me it just doesn't sound right.
I haven't seen the guts of other implementations, so I can't really say. This
sounded reasonable to me then so that's how it turned out. If you have data to
show that it's a bad choice (or worse than some other alternative), feel free
to fix it!!
> No, it is not causing any trouble yet. I am just trying to understand the
reasoning behind it a bit better. :)
>> This is because the failure window is when the connection is lost while
waiting for a response from the xmpp proxy.
> xmpp proxy? I didn't get this at all. How is the wait time related to a
response from the xmpp proxy? From what I understand, the failure window is
whenever there is an error event on the response object -- which I think should
only happen when there is a write error on the underlying socket -- implying a
broken connection. Here I am making the assumption that the xmpp server is
continuously sending data. Also, the connection is pretty weak on EDGE, so I
wouldn't be surprised if it creates issues.
Sorry, I meant "xmpp server". If your comment is relevant even now, I shall
try to parse it ;)
If you measure the mean and median wait times for a response object (the time
between when it was received and when it left), you'll be surprised to find
them closer to the upper bound (the wait value) than to the mean of the wait
value.
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 8:52
Oh, now I understand the point you are trying to make. However, the failure
window is not necessarily that small. I might be totally wrong on this, but the
error event might be emitted late (when trying to write on the socket). So, no
matter when the connection breaks, we'll get to know only when we try writing
out a response on that socket.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 9:59
> the error event might be emitted late (when trying to write on the socket).
So, no matter when the connection breaks, we'll get to know only when we try
writing out a response on that socket.
Irrespective of when the error event is sent, we always run into trouble, since
failed packets are appended to the pending list on failure. This inherently
destroys their relative order and hence the sanity of the stream.
The 100% guaranteed correct way of doing it is what you are suggesting -- send
and put into a sent buffer, then resend on re-request. The only problem is that
clients don't always re-request failed responses.
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 12:18
So you can see that there is a tradeoff between being 100% correct for clients
that are themselves 100% correct while surely failing for the others, OR being
99% correct for 100% of the clients. I don't know which one is better. If we
had data on what weighted (by usage) fraction of clients correctly re-request
failed packets, we could make a more informed choice.
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 12:20
> Irrespective of when the error event is sent, we always run into trouble
since failed packets are appended to the pending list on failure. This
inherently destroys their relative order and hence the sanity of the stream.
Yes, but I thought we were debating how frequent this will be. I am of the
opinion that it doesn't matter when the connection breaks, because we'll only
get to know about it when we try writing on the socket. So, the failure window
is independent of the time when the disconnection occurs (since we get to know
only when we try writing on it) -- making it pretty wide.
> The only problem is that clients don't always re-request failed responses.
This shakes the foundations of the spec for me. :) Which client does this?
This has many caveats. Not only is it vulnerable to out-of-order messages, but
also to message loss: the server can't know for sure whether the response was
actually received by the client. If the client doesn't re-request those RIDs,
it might lose some messages.
> So you can see that there is a tradeoff between being 100% correct for
clients that themselves are 100% correct and surely failing for the others OR
being 99% correct for 100% of the clients.
This sounds a bit exaggerated. :) Firstly, the failure % will be the same for
those -- let's say "optimized" -- clients if we change the implementation.
Secondly, I very highly doubt 100% of the clients do this, that is, everyone
except us. :)
Original comment by satyamsh...@gmail.com
on 8 Dec 2011 at 2:45
>> Irrespective of when the error event is sent, we always run into trouble
since failed packets are appended to the pending list on failure. This
inherently destroys their relative order and hence the sanity of the stream.
> Yes, but I thought we were debating how frequent this will be.
True, for decent networks, this shouldn't happen much. Besides, this case is
triggered ONLY if there are multiple responses waiting to be written.
Generally, the length of the pending responses queue is 1. So to get the
failure probability, you need to compute the probability that the response was
dropped AND the length of the pending responses queue is greater than 1.
> I am of the opinion that it doesn't matter when the connection breaks because
we'll only get to know about it when we try writing on the socket.
Yes, before we write to the socket, we can't say that the write failed. That
apart, in some scenarios, we won't know that a write failed until long after we
actually do the write. In fact, we can't set a reasonable upper bound on the
time to wait, since it is a tradeoff between this and performance
(responsiveness).
> So, the failure window is independent of time when the disconnection occurs
(since we get to know only when we try writing on it) -- making it pretty wide.
Not really. More often than not, if the failure occurs before the send takes
place, the 'error' event is triggered almost instantly. So, the connection can
be broken any time before we do the write. We can compute the probability this
way:
P(error raised instantly) = (time interval between receiving the response
object and the send) / (wait time)
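This back-of-the-envelope estimate can be written out as a trivial helper (illustrative only; the function name and figures are made up for the example, not anything in session.js):

```javascript
// If the connection can break at any instant while we hold the response
// object, the chance that the break happens *before* we write (and is
// therefore reported instantly on the write) is the holding time divided
// by the wait time.
function probErrorRaisedInstantly(holdSeconds, waitSeconds) {
    return holdSeconds / waitSeconds;
}

// e.g. a response held for 17s against a 20s wait:
var p = probErrorRaisedInstantly(17, 20); // 0.85
```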
>> The only problem is that clients don't always re-request failed responses.
> This shakes the foundations of the spec for me. :) .. which client does this?
I don't remember, but I had tested a bunch of clients to use with
node-xmpp-bosh, and strophe.js behaved correctly most of the time -- which is
why it was chosen as the test harness.
> This has many caveats. Not only is it vulnerable to out-of-order messages,
but also to message loss: the server can't know for sure whether the response
was actually received by the client.
BOSH is a transparent proxy. Servers generally don't care about an XMPP proxy.
> If the client doesn't re-request those RIDs, it might lose some messages.
True -- some people are okay with this, since it doesn't control missiles!!
>> So you can see that there is a tradeoff between being 100% correct for
clients that themselves are 100% correct and surely failing for the others OR
being 99% correct for 100% of the clients.
> This sounds a bit exaggerated. :) Firstly, the failure % will be the same
for those -- let's say "optimized" -- clients if we change the implementation.
Secondly, I very highly doubt 100% of the clients do this, that is, everyone
except us. :)
Obviously, the numbers have been invented!!
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 9:40
> True, for decent networks, this shouldn't be much. Besides, this case is
triggered ONLY if there are multiple responses waiting to be written.
Generally, the length of the pending responses queue is 1. So to get the
failure probability, you need to compute the probability that the response was
dropped AND the length of the pending responses is greater than 1.
No, every time a response fails we requeue it, irrespective of the length of
pending responses. (_on_no_client also requeues it).
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js#752
> More often than not, if the failure occurs before the send takes place, the
'error' event is triggered almost instantly.
This I didn't know. I thought socket disconnections are very difficult to
detect -- and you detect them by writing to the socket. If the peer doesn't
send a FIN and closes its end, the server might not know anything about it, and
no error event would be raised.
>> This has many caveats. Not only is it vulnerable to out-of-order messages,
but also to message loss: the server can't know for sure whether the response
was actually received by the client.
> BOSH is a transparent proxy. Servers generally don't care about an XMPP proxy.
Sorry, I meant BOSH server. An HttpServer can't know for sure whether the
client received the response or not. It might not receive any error event even
if the delivery failed (and hence not requeue the response). If the clients are
not re-requesting every RID, that response is lost.
Original comment by satyamsh...@gmail.com
on 8 Dec 2011 at 10:39
> No, every time a response fails we requeue it, irrespective of the length of
pending responses. (_on_no_client also requeues it).
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js#752
Yes, but the failure (to operate correctly, which in this case means delivering
messages in-order) happens when there is > 1 response to send. If there is just
1 response to send, en-queuing at the back of the queue is correct. If there
are 2 responses (say), then you've effectively swapped their positions, which
is wrong (and which is what currently happens)!
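A toy model of this requeue-at-the-back behaviour makes the swap visible. The names here are illustrative, not the real session.js code:

```javascript
// Hypothetical model: a failed response is appended at the back of the
// pending-responses queue. With one pending response this is harmless;
// with two, their relative order is swapped.
function requeueFailed(pendingResponses, failedResponse) {
    pendingResponses.push(failedResponse); // appended at the back
    return pendingResponses;
}

var queue = ['rid-11'];         // one response still waiting to be written
requeueFailed(queue, 'rid-10'); // RID=10 failed and is requeued
// queue is now ['rid-11', 'rid-10'] -- RID=10 will be written *after*
// RID=11, even though it logically precedes it.
```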
> It might not receive any error event even if the delivery failed(and hence
not requeue the response). If the clients are not re requesting every RID that
response is lost.
You are guaranteed to receive an 'error' event if sending failed. This is one
of the basic building blocks that network servers work on! I think what you
mean is that you might receive an error event even if sending succeeded --
right (which can happen -- again, with a very small probability)?
Either way, clients are supposed to re-request a failed RID.
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 11:50
> Yes, but the failure (to operate correctly, which in this case means
delivering messages in-order) happens when there is > 1 response to send. If
there is just 1 response to send, en-queuing at the back of the queue is
correct. If there are 2 responses (say), then you've effectively swapped their
positions, which is wrong (and which is what currently happens)!
What about the case when there is only one response, but the error event is
raised late?
1. response for RID 10 fails.
2. request RID 11 arrives.
3. response for 11 is sent (which will succeed).
4. error on response for RID 10.
>> It might not receive any error event even if the delivery failed (and hence
not requeue the response). If the clients are not re-requesting every RID, that
response is lost.
> You are guaranteed to receive an 'error' event if sending failed. This is one
of the basic building blocks that network servers work on! I think what you
mean is that you might receive an error event even if sending succeeded --
right (which can happen -- again, with a very small probability)?
Pardon the lack of clarity on my part. The error event will always be received.
You are right. But again, as you mentioned, this also exposes it to
duplication.
>> More often than not, if the failure occurs before the send takes place, the
'error' event is triggered almost instantly.
> This I didn't know. I thought socket disconnections are very difficult to
detect -- and you detect them by writing to the socket. If the peer doesn't
send a FIN and closes its end, the server might not know anything about it, and
no error event would be raised.
Can you elaborate on this? I can't seem to reason why/how the 'error' event
will be triggered almost instantly.
> Either ways, clients are supposed to re-request a failed RID.
To summarize, we are optimizing this for a scenario where the network
connection is bad. But if the network connection is bad, we are also exposing
ourselves to duplication and incorrect ordering of stanzas. Isn't a simpler,
more straightforward solution without many edge cases better? If the clients
re-request a failed RID, the level of optimization goes down anyway. Do you
think we should get rid of this?
Original comment by satyamsh...@gmail.com
on 9 Dec 2011 at 8:18
>>> More often than not, if the failure occurs before the send takes place, the
'error' event is triggered almost instantly.
>> This I didn't know. I thought socket disconnections are very difficult to
detect -- and you detect them by writing to the socket. If the peer doesn't
send a FIN and closes its end, the server might not know anything about it, and
no error event would be raised.
> Can you elaborate on this? I can't seem to reason why/how the 'error' event
will be triggered almost instantly.
Oh, you meant instantly when we try to write on the socket? Even if that is the
case, the responses will get reordered if there is a pending response. Now that
we stitch packets together right before sending the response, the probability
of incorrect ordering is reduced, since the stanzas coming from the xmpp server
are queued separately. Before doing this, incorrect ordering would have been
independent of the number of pending responses.
Original comment by satyamsh...@gmail.com
on 9 Dec 2011 at 8:36
Okay, let's try to map this to a well known problem in probability.
You are standing at a bus stop between 1:00pm & 1:45pm, waiting for a bus. The
bus could arrive any time between 1:00pm & 2:00pm. What is the probability
that you catch it?
The answer of course is 3/4 or 0.75, since that is the area of the two
pie-charts (yours and the bus's) that intersect, divided by the area of the
pie-chart of the bus.
Analogously, if we apply the same reasoning to our domain: suppose a response
object comes in at t=10, and suppose that wait=20 sec. If our observations say
that on average, a packet stays with us for T=17 sec, then what is the
probability that the connection was lost while we were holding it? It is
17/(17+RTT). Here, RTT is the round-trip time, i.e. the time required for a
packet to come to us and return to the sender. RTT is generally of the order of
300-500ms on decent networks. If the connection breaks any time before we
actually do the send, we will be notified of the error as soon as we do the
send (depending on the actual implementation, this may be in the same tick or
the very next tick in node.js). (The numerator is NOT (RTT/2 + 17), because the
connection did NOT break on the way to us.)
Suppose we send out a response at T=17 sec and the connection breaks _after_ we
send it out. What is the probability of such an event occurring? It is
(RTT/2)/(17+RTT), since the connection could break at any point on the return
journey.
We will fail if the XMPP server sends us a packet in that window of RTT/2 sec
because if we are notified of the failure after we have buffered the packet
from the XMPP server, then we will be inserting the original packet at the back
of the queue of buffered packets.
Of course, we will never have more than 1 XMPP packet to send to the client at
any time since we are merging them all together before sending them.
This analysis is valid for one stream. We can reorder responses at will across
streams and not affect the correctness. Make sense?
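A quick sketch of the window computation above, using the T=17s / RTT=400ms figures from the comment (the helper name is made up for illustration):

```javascript
// With an average holding time T and round-trip time RTT, a break before
// the write is caught instantly with probability T/(T + RTT); the dangerous
// window -- a break after the write, on the return leg -- has probability
// (RTT/2)/(T + RTT).
function failureWindowProbability(holdSeconds, rttSeconds) {
    return (rttSeconds / 2) / (holdSeconds + rttSeconds);
}

// With T = 17s and RTT = 400ms, the window is tiny:
var p = failureWindowProbability(17, 0.4); // ~0.0115, i.e. roughly 1%
```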
The error event is fired almost instantly since the OS knows that the socket is
disconnected (if the client actually closed it), but will not raise an event
unless you try to send something on it.
Of course, the correct way to implement it would be to buffer the response on
send and resend it only when it is re-requested, and it _should_ be done that
way.
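That buffer-on-send scheme could be sketched like this. It is a hypothetical helper (names like `makeResponseStore` are invented for the example), not the actual node-xmpp-bosh implementation:

```javascript
// Keep a copy of every response, keyed by RID, when it is sent; replay it
// only if the client re-requests that exact RID; drop it once acknowledged.
function makeResponseStore() {
    var sent = {}; // RID -> response body
    return {
        recordSend: function (rid, body) { sent[rid] = body; },
        // Called when a request arrives; replays only on an exact re-request.
        replayIfRerequested: function (rid) {
            return sent.hasOwnProperty(rid) ? sent[rid] : null;
        },
        // Once the client has acknowledged a RID, the copy can be dropped.
        ack: function (rid) { delete sent[rid]; }
    };
}

var store = makeResponseStore();
store.recordSend(10, '<body>hello</body>');
store.replayIfRerequested(10); // '<body>hello</body>' -- client re-requested 10
store.ack(10);
store.replayIfRerequested(10); // null -- already acknowledged, nothing to replay
```

Unlike the requeue-at-the-back approach, nothing here can reorder the stream: a response is only ever re-sent against the same RID it was originally produced for.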
Original comment by dhruvb...@gmail.com
on 9 Dec 2011 at 1:26
Thanks for the detailed analysis. :) Will make the change in the next commit.
Original comment by satyamsh...@gmail.com
on 11 Dec 2011 at 5:21
Original comment by satyamsh...@gmail.com
on 5 Jan 2012 at 8:48
Original issue reported on code.google.com by
satyamsh...@gmail.com
on 6 Dec 2011 at 9:50