Which file? Please provide a URL, such as this one:
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js?spec=svn472&r=472#29
Original comment by dhruvb...@gmail.com
on 6 Dec 2011 at 1:11
Oh, sorry, I didn't say which file it was. :) Line 711 in session.js:
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js?spec=svn474&r=474#711
Original comment by satyamsh...@gmail.com
on 6 Dec 2011 at 2:29
IIRC, this is done so that failed responses can be replayed, since the client
has no way of knowing which request led to this response. Besides, the proxy
should replay any response it has received from the server but failed to deliver
to the client.
If we don't do this, then the client will have to explicitly request the
missing RID once it has realized that it's not forthcoming, and it would incur
an extra wait+RTT delay. Does that make sense?
Original comment by dhruvb...@gmail.com
on 6 Dec 2011 at 5:29
Wouldn't the client have to re-request that RID anyway, and process responses
in that order?
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 6:57
More often than not, this is what would happen (I could be wrong):
1. Client sends request with RID=10
2. Client sends request with RID=11
3. Server sends responses to RID=10 and RID=11 (in separate responses, for
whatever reason), but 10 fails for some reason.
4. Client sends request with RID=12 (since the user wants to send a message)
5. Server re-sends the response of RID=10 on RID=12
This is out of order, but it doesn't cause a problem since the client will
reorder the responses anyway before showing them to the user (or processing
them in any way).
OTOH, if the server didn't do this, the client would have to explicitly
re-request RID=10, incurring an extra round-trip overhead.
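The re-ordering the client would do in step 5 can be sketched as follows. This is a hypothetical, minimal client-side buffer; the names (`makeReorderBuffer`, `onResponse`) are illustrative and this is not actual node-xmpp-bosh or strophe.js code:

```javascript
// Minimal sketch: buffer BOSH responses and deliver them to the application
// in RID order, no matter which HTTP response they arrived on.
function makeReorderBuffer(firstRid, process) {
    var nextRid = firstRid;
    var pending = {}; // RID -> payload, for responses that arrived early
    return function onResponse(rid, payload) {
        pending[rid] = payload;
        // Deliver every consecutive RID we now have, in order.
        while (pending.hasOwnProperty(nextRid)) {
            process(nextRid, pending[nextRid]);
            delete pending[nextRid];
            nextRid += 1;
        }
    };
}

// The scenario above: the response to RID=10 is replayed on a later
// request, yet the client still processes 10, 11, 12 in order.
var seen = [];
var onResponse = makeReorderBuffer(10, function (rid) { seen.push(rid); });
onResponse(11, '<body/>'); // RID=11 arrives first
onResponse(10, '<body/>'); // RID=10 replayed on a later request
onResponse(12, '<body/>');
// seen is now [10, 11, 12]
```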
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 11:55
I don't think the spec says that you can send a response for RID 10 on the
request for RID 12 -- meaning the client might not rely on the RID sent
from the server.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 12:22
My bad, I didn't recall what the behaviour would be. This is the scenario:
1. Client sends request with RID=10
2. Client sends request with RID=11
3. Server sends responses to RID=10 and RID=11 (in separate responses, for
whatever reason), but 10 fails. We assume that 11 also fails (this is an
incorrect assumption, but works most of the time in practice). Alternatively,
since hold is almost always 1, there is no held request on which 11 can be
sent, so we are good.
4. Client sends request with RID=12 (since the user wants to send a message)
5. Server re-sends the response of RID=10 on RID=11
6. Request with RID=13 is sent and the server sends back the response for
RID=12. The client need not know the responses are offset by 1, since when the
client re-requests RID=10, it will get back an empty body. As far as the client
is concerned, the stream is consistent.
The reason that this is an optimization is that many clients don't bother
re-requesting a failed request and just move ahead. In such a case, we deliver
all the data from the server to the client with a higher probability than if we
relied on the client re-requesting the failed transaction. Hence, on average,
we hope to be more reliable by detecting network failures at the server end and
taking corrective action there. Of course, there is a non-zero probability of
out-of-order responses for correctly behaving clients.
In case of proxies, we rely on the proxy faithfully (and almost immediately)
reporting back with a failure for this to work.
This can in theory lead to out-of-order responses, but I've not yet seen any,
so this is more of a pragmatic thing than a pedantic thing.
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 1:13
> We assume that 11 also fails (this is an incorrect assumption, but works most
of the time in practice). Alternatively, since hold is almost always 1, there
is no held request on which 11 can be sent, so we are good.
Why is this assumption incorrect?
> The reason that this is an optimization is that many clients don't bother
re-requesting a failed request and just move ahead.
This sounds wrong to me, though I don't have any evidence against it. All the
clients that we have written here don't do this.
Also, I don't think messages are removed from this.unacked_responses -- leading
to duplication when the client re-requests that RID.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 4:00
>> We assume that 11 also fails (this is an incorrect assumption, but works
most of the time in practice). Alternatively, since hold is almost always 1,
there is no held request on which 11 can be sent, so we are good.
> Why is this assumption incorrect?
Because sometimes, hold might be > 1 (though I've never seen this configuration
in the wild).
>> The reason that this is an optimization is that many clients don't bother
re-requesting a failed request and just move ahead.
> This sounds wrong to me though I dont have any evidence against it. All the
clients that we have written here dont do this.
True, though this was done when we didn't have so many clients! Besides, the
behaviour is based more on the union of all the popular clients out there.
If this is causing trouble, then there are serious network issues that need to
be fixed, rather than fixing this, since the probability of such errors
occurring on a relatively decent network is very low. This is because the
failure window is when the connection is lost while waiting for a response from
the xmpp proxy. On average, a response is held for about wait/2, which is
significantly greater than the RTT, so the case that is handled covers a much
larger area of the graph. Only if the network connection is lost after we send
the response, so that the response doesn't get delivered to the client, will
there be a problem.
> Also, I don't think messages are removed from this.unacked_responses --
leading to duplication when the client re-requests that RID.
This shouldn't happen - if it does, it's an error in the implementation.
unacked_responses isn't used for streams that don't ACK. Additionally,
unacked_responses is cleared for streams that do ACK.
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 4:20
> Because sometimes, hold might be > 1 (though I've never seen this
configuration in the wild).
Oh OK, now I get it. I thought you meant 11 couldn't fail. :)
> Besides, the behaviour is based more on the union of all the popular clients
out there.
Is this a standard that every BOSH server follows? If it is not, it can lead to
message loss for these clients. To me it just doesn't sound right.
No, it is not causing any trouble yet. I am just trying to understand the
reasoning behind it a bit better. :)
> This is because the failure window is when the connection is lost while
waiting for a response from the xmpp proxy.
xmpp proxy? I didn't get this at all. How is the wait time related to a
response from the xmpp proxy? From what I understand, the failure window is
whenever there is an error event on the response object -- which I think should
only happen when there is a write error on the underlying socket -- implying a
broken connection. Here I am making the assumption that the xmpp server is
continuously sending data.
Also, the connection is pretty weak on EDGE, so I wouldn't be surprised if it
creates issues.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 8:31
>> Besides, the behaviour is based more on the union of all the popular clients
out there.
> Is this a standard that every bosh server follows? If it is not, it can lead
to message loss for these clients. To me it just doesn't sound right.
I haven't seen the guts of other implementations, so I can't really say. This
sounded reasonable to me then so that's how it turned out. If you have data to
show that it's a bad choice (or worse than some other alternative), feel free
to fix it!!
> No, it is not causing any trouble yet. I am just trying to understand the
reasoning behind it a bit better. :)
>> This is because the failure window is when the connection is lost while
waiting for a response from the xmpp proxy.
> xmpp proxy? I didn't get this at all. How is the wait time related to a
response from the xmpp proxy? From what I understand, the failure window is
whenever there is an error event on the response object -- which I think should
only happen when there is a write error on the underlying socket -- implying a
broken connection. Here I am making the assumption that the xmpp server is
continuously sending data. Also, the connection is pretty weak on EDGE, so I
wouldn't be surprised if it creates issues.
Sorry, I meant "xmpp server". If your comment is relevant even now, I shall
try to parse it ;)
If you measure the mean and median wait times for a response object (the time
between when it was received and when it left), you'll be surprised to find
them closer to the upper bound (the wait value) than to the mean of the wait
value.
Original comment by dhruvb...@gmail.com
on 7 Dec 2011 at 8:52
Oh, now I understand the point you are trying to make. However, the failure
window is not necessarily that small. I might be totally wrong on this, but the
error event might be emitted late (when trying to write on the socket). So, no
matter when the connection breaks, we'll get to know only when we try writing
out a response on that socket.
Original comment by satyamsh...@gmail.com
on 7 Dec 2011 at 9:59
> the error event might be emitted late (when trying to write on the socket).
So, no matter when the connection breaks, we'll get to know only when we try
writing out a response on that socket.
Irrespective of when the error event is sent, we always run into trouble, since
failed packets are appended to the pending list on failure. This inherently
destroys their relative order and hence the sanity of the stream.
The 100% guaranteed correct way of doing it is what you are suggesting -- send
and put into a sent buffer, then resend on re-request. The only problem is that
clients don't always re-request failed responses.
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 12:18
So you can see that there is a tradeoff between being 100% correct for clients
that are themselves 100% correct while surely failing for the others, OR being
99% correct for 100% of the clients. I don't know which one is better. If we
had data on what weighted (by usage) fraction of clients correctly re-request
failed packets, we could make a more informed choice.
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 12:20
> Irrespective of when the error event is sent, we always run into trouble
since failed packets are appended to the pending list on failure. This
inherently destroys their relative order and hence the sanity of the stream.
Yes, but I thought we were debating how frequent this will be. I am of the
opinion that it doesn't matter when the connection breaks, because we'll only
get to know about it when we try writing on the socket. So, the failure window
is independent of the time when the disconnection occurs (since we get to know
only when we try writing on it) -- making it pretty wide.
> The only problem is that clients don't always re-request failed responses.
This shakes the foundations of the spec for me. :) Which client does this?
This has many caveats. Not only is it vulnerable to out-of-order messages, but
also to message loss: the server can't know for sure whether the response was
actually received by the client. If the client doesn't re-request those RIDs,
it might lose some messages.
> So you can see that there is a tradeoff between being 100% correct for
clients that themselves are 100% correct and surely failing for the others OR
being 99% correct for 100% of the clients.
This sounds a bit exaggerated. :) Firstly, the failure % will be the same for
those -- let's say "optimized" -- clients if we change the implementation.
Secondly, I very highly doubt 100% of the clients do this, that is, everyone
except us. :)
Original comment by satyamsh...@gmail.com
on 8 Dec 2011 at 2:45
>> Irrespective of when the error event is sent, we always run into trouble
since failed packets are appended to the pending list on failure. This
inherently destroys their relative order and hence the sanity of the stream.
> Yes, but I thought we were debating how frequent this will be.
True, for decent networks, this shouldn't happen much. Besides, this case is
triggered ONLY if there are multiple responses waiting to be written.
Generally, the length of the pending responses queue is 1. So to get the
failure probability, you need to compute the probability that the response was
dropped AND the length of the pending responses queue is greater than 1.
> I am of the opinion that it doesn't matter when the connection breaks because
we'll only get to know about it when we try writing on the socket.
Yes, before we write to the socket, we can't say that the write failed. That
apart, in some scenarios, we won't know that a write failed until long after we
actually do the write. In fact, we can't set a reasonable upper bound on the
time to wait, since it is a tradeoff between this and performance
(responsiveness).
> So, the failure window is independent of time when the disconnection occurs
(since we get to know only when we try writing on it) -- making it pretty wide.
Not really. More often than not, if the failure occurs before the send takes
place, the 'error' event is triggered almost instantly. So, the connection can
be broken any time before we do the write. We can compute the probability this
way:
P(error raised instantly) = (time interval between receiving the response
object and the send) / (wait time)
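This back-of-the-envelope estimate can be written out as a trivial helper (illustrative only; the function name and figures are made up for the example, not anything in session.js):

```javascript
// If the connection can break at any instant while we hold the response
// object, the chance that the break happens *before* we write (and is
// therefore reported instantly on the write) is the holding time divided
// by the wait time.
function probErrorRaisedInstantly(holdSeconds, waitSeconds) {
    return holdSeconds / waitSeconds;
}

// e.g. a response held for 17s against a 20s wait:
var p = probErrorRaisedInstantly(17, 20); // 0.85
```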
>> The only problem is that clients don't always re-request failed responses.
> This shakes the foundations of the spec for me. :) .. which client does this?
I don't remember, but I had tested a bunch of clients to use with
node-xmpp-bosh, and strophe.js behaved correctly most of the time -- which is
why it was chosen as the test harness.
> This has many caveats. Not only is it vulnerable to out-of-order messages,
but also to message loss: the server can't know for sure whether the response
was actually received by the client.
BOSH is a transparent proxy. Servers generally don't care about an XMPP proxy.
> If the client doesn't re-request those RIDs, it might lose some messages.
True -- some people are okay with this, since it doesn't control missiles!!
>> So you can see that there is a tradeoff between being 100% correct for
clients that themselves are 100% correct and surely failing for the others OR
being 99% correct for 100% of the clients.
> This sounds a bit exaggerated. :) Firstly, the failure % will be the same
for those -- let's say "optimized" -- clients if we change the implementation.
Secondly, I very highly doubt 100% of the clients do this, that is, everyone
except us. :)
Obviously, the numbers have been invented!!
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 9:40
> True, for decent networks, this shouldn't be much. Besides, this case is
triggered ONLY if there are multiple responses waiting to be written.
Generally, the length of the pending responses queue is 1. So to get the
failure probability, you need to compute the probability that the response was
dropped AND the length of the pending responses is greater than 1.
No, every time a response fails we requeue it, irrespective of the length of
pending responses. (_on_no_client also requeues it).
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js#752
> More often than not, if the failure occurs before the send takes place, the
'error' event is triggered almost instantly.
This I didn't know. I thought socket disconnections are very difficult to
detect -- and you detect them by writing to the socket. If the peer doesn't
send a FIN and closes its end, the server might not know anything about it, and
no error event would be raised.
>> This has many caveats. Not only is it vulnerable to out-of-order messages,
but also to message loss: the server can't know for sure whether the response
was actually received by the client.
> BOSH is a transparent proxy. Servers generally don't care about an XMPP proxy.
Sorry, I meant BOSH server. An HttpServer can't know for sure whether the
client received the response or not. It might not receive any error event even
if the delivery failed (and hence not requeue the response). If the clients are
not re-requesting every RID, that response is lost.
Original comment by satyamsh...@gmail.com
on 8 Dec 2011 at 10:39
> No, every time a response fails we requeue it, irrespective of the length of
pending responses. (_on_no_client also requeues it).
http://code.google.com/p/node-xmpp-bosh/source/browse/trunk/src/session.js#752
Yes, but the failure (to operate correctly, which in this case means delivering
messages in-order) happens when there is > 1 response to send. If there is just
1 response to send, en-queuing at the back of the queue is correct. If there
are 2 responses (say), then you've effectively swapped their positions, which
is wrong (and which is what currently happens)!
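A toy model of this requeue-at-the-back behaviour makes the swap visible. The names here are illustrative, not the real session.js code:

```javascript
// Hypothetical model: a failed response is appended at the back of the
// pending-responses queue. With one pending response this is harmless;
// with two, their relative order is swapped.
function requeueFailed(pendingResponses, failedResponse) {
    pendingResponses.push(failedResponse); // appended at the back
    return pendingResponses;
}

var queue = ['rid-11'];         // one response still waiting to be written
requeueFailed(queue, 'rid-10'); // RID=10 failed and is requeued
// queue is now ['rid-11', 'rid-10'] -- RID=10 will be written *after*
// RID=11, even though it logically precedes it.
```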
> It might not receive any error event even if the delivery failed(and hence
not requeue the response). If the clients are not re requesting every RID that
response is lost.
You are guaranteed to receive an 'error' event if sending failed. This is one
of the basic building blocks that network servers work on! I think what you
mean is that you might receive an error event even if sending succeeded --
right (which can happen -- again, with a very small probability)?
Either way, clients are supposed to re-request a failed RID.
Original comment by dhruvb...@gmail.com
on 8 Dec 2011 at 11:50
> Yes, but the failure (to operate correctly, which in this case means
delivering messages in-order) happens when there is > 1 response to send. If
there is just 1 response to send, en-queuing at the back of the queue is
correct. If there are 2 responses (say), then you've effectively swapped their
positions, which is wrong (and which is what currently happens)!
What about the case when there is only one response, but the error event is
raised late?
1. response for RID 10 fails.
2. request RID 11 arrives.
3. response for 11 is sent (which will succeed).
4. error on response for RID 10.
>> It might not receive any error event even if the delivery failed (and hence
not requeue the response). If the clients are not re-requesting every RID, that
response is lost.
> You are guaranteed to receive an 'error' event if sending failed. This is one
of the basic building blocks that network servers work on! I think what you
mean is that you might receive an error event even if sending succeeded --
right (which can happen -- again, with a very small probability)?
Pardon the lack of clarity on my part. The error event will always be received.
You are right. But again, as you mentioned, this also exposes it to
duplication.
>> More often than not, if the failure occurs before the send takes place, the
'error' event is triggered almost instantly.
> This I didn't know. I thought socket disconnections are very difficult to
detect -- and you detect them by writing to the socket. If the peer doesn't
send a FIN and closes its end, the server might not know anything about it, and
no error event would be raised.
Can you elaborate on this? I can't seem to reason why/how the 'error' event
will be triggered almost instantly.
> Either ways, clients are supposed to re-request a failed RID.
To summarize, we are optimizing this for a scenario where the network
connection is bad. But if the network connection is bad, we are also exposing
ourselves to duplication and incorrect ordering of stanzas. Isn't a simpler,
more straightforward solution without many edge cases better? If the clients
re-request a failed RID, the level of optimization goes down anyway. Do you
think we should get rid of this?
Original comment by satyamsh...@gmail.com
on 9 Dec 2011 at 8:18
>>> More often than not, if the failure occurs before the send takes place, the
'error' event is triggered almost instantly.
>> This I didn't know. I thought socket disconnections are very difficult to
detect -- and you detect them by writing to the socket. If the peer doesn't
send a FIN and closes its end, the server might not know anything about it, and
no error event would be raised.
> Can you elaborate on this? I can't seem to reason why/how the 'error' event
will be triggered almost instantly.
Oh, you meant instantly when we try to write on the socket? Even if that is the
case, the responses will get reordered if there is a pending response. Now that
we stitch packets together right before sending the response, the probability
of incorrect ordering is reduced, since the stanzas coming from the xmpp server
are queued separately. Before doing this, incorrect ordering would have been
independent of the number of pending responses.
Original comment by satyamsh...@gmail.com
on 9 Dec 2011 at 8:36
Okay, let's try to map this to a well known problem in probability.
You are standing at a bus stop between 1:00pm & 1:45pm, waiting for a bus. The
bus could arrive any time between 1:00pm & 2:00pm. What is the probability
that you catch it?
The answer of course is 3/4 or 0.75, since that is the area of the two
pie-charts (yours and the bus's) that intersect, divided by the area of the
pie-chart of the bus.
Analogously, if we apply the same reasoning to our domain: suppose a response
object comes in at t=10, and suppose that wait=20 sec. If our observations say
that on average, a packet stays with us for T=17 sec, then what is the
probability that the connection was lost while we were holding it? It is
17/(17+RTT). Here, RTT is the round-trip time, i.e. the time required for a
packet to come to us and return to the sender. RTT is generally of the order of
300-500ms on decent networks. If the connection breaks any time before we
actually do the send, we will be notified of the error as soon as we do the
send (depending on the actual implementation, this may be in the same tick or
the very next tick in node.js). (The numerator is NOT (RTT/2 + 17), because the
connection did NOT break on the way to us.)
Suppose we send out a response at T=17 sec and the connection breaks _after_ we
send it out. What is the probability of such an event occurring? It is
(RTT/2)/(17+RTT), since the connection could break at any point on the return
journey.
We will fail if the XMPP server sends us a packet in that window of RTT/2 sec
because if we are notified of the failure after we have buffered the packet
from the XMPP server, then we will be inserting the original packet at the back
of the queue of buffered packets.
Of course, we will never have more than 1 XMPP packet to send to the client at
any time since we are merging them all together before sending them.
This analysis is valid for one stream. We can reorder responses at will across
streams and not affect the correctness. Make sense?
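A quick sketch of the window computation above, using the T=17s / RTT=400ms figures from the comment (the helper name is made up for illustration):

```javascript
// With an average holding time T and round-trip time RTT, a break before
// the write is caught instantly with probability T/(T + RTT); the dangerous
// window -- a break after the write, on the return leg -- has probability
// (RTT/2)/(T + RTT).
function failureWindowProbability(holdSeconds, rttSeconds) {
    return (rttSeconds / 2) / (holdSeconds + rttSeconds);
}

// With T = 17s and RTT = 400ms, the window is tiny:
var p = failureWindowProbability(17, 0.4); // ~0.0115, i.e. roughly 1%
```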
The error event is fired almost instantly since the OS knows that the socket is
disconnected (if the client actually closed it), but will not raise an event
unless you try to send something on it.
Of course, the correct way to implement it would be to buffer the response on
send and resend it only when it is re-requested, and it _should_ be done that
way.
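That buffer-on-send scheme could be sketched like this. It is a hypothetical helper (names like `makeResponseStore` are invented for the example), not the actual node-xmpp-bosh implementation:

```javascript
// Keep a copy of every response, keyed by RID, when it is sent; replay it
// only if the client re-requests that exact RID; drop it once acknowledged.
function makeResponseStore() {
    var sent = {}; // RID -> response body
    return {
        recordSend: function (rid, body) { sent[rid] = body; },
        // Called when a request arrives; replays only on an exact re-request.
        replayIfRerequested: function (rid) {
            return sent.hasOwnProperty(rid) ? sent[rid] : null;
        },
        // Once the client has acknowledged a RID, the copy can be dropped.
        ack: function (rid) { delete sent[rid]; }
    };
}

var store = makeResponseStore();
store.recordSend(10, '<body>hello</body>');
store.replayIfRerequested(10); // '<body>hello</body>' -- client re-requested 10
store.ack(10);
store.replayIfRerequested(10); // null -- already acknowledged, nothing to replay
```

Unlike the requeue-at-the-back approach, nothing here can reorder the stream: a response is only ever re-sent against the same RID it was originally produced for.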
Original comment by dhruvb...@gmail.com
on 9 Dec 2011 at 1:26
Thanks for the detailed analysis. :) Will make the change in the next commit.
Original comment by satyamsh...@gmail.com
on 11 Dec 2011 at 5:21
Original comment by satyamsh...@gmail.com
on 5 Jan 2012 at 8:48
Original issue reported on code.google.com by
satyamsh...@gmail.com
on 6 Dec 2011 at 9:50