Open GoogleCodeExporter opened 8 years ago
Attached logs.
Original comment by classi...@floodgap.com
on 24 Apr 2011 at 3:57
Attachments:
Error -3155 appears to be a Open Transport socket connection error (meaning no
likely fix in Bugzilla :( ). There are lots of reports in reference to it with
other software:
OE 5, IE 5: http://forums.macnn.com/82/applications/64749/network-error-3155-a/
Entourage: http://support.microsoft.com/kb/883725
There is an intriguing note about MTUs in it. Unfortunately, I don't think we
can reasonably expect users to fiddle with that.
Per http://support.apple.com/kb/TA35484?viewlocale=en_US it is kOTOutStateErr.
http://serenity.uncc.edu/web/ADC/2005/Developer_DVD_Series/April/ADC%20Reference
%20Library/documentation/Carbon/Reference/Open_Transport/open_transport_ref/cons
tant_190.html :
kOTProviderIsClosed
The provider has closed. The reason for being closed can be found in the OTResult value passed to your notifier. The reasons typically are kOTPortHasDiedErr, kOTPortWasEjectedErr, or kOTPortLostConnectionErr. At this point, any calls other than OTCloseProvider will fail with a kOTOutStateErr error.
This looks like NSPR or even GUSI-level.
Original comment by classi...@floodgap.com
on 24 Apr 2011 at 4:12
Probably not related, but useful for background:
https://bugzilla.mozilla.org/show_bug.cgi?id=2133
Original comment by classi...@floodgap.com
on 24 Apr 2011 at 4:17
More background: https://bugzilla.mozilla.org/show_bug.cgi?id=28906
Original comment by classi...@floodgap.com
on 24 Apr 2011 at 4:38
Meta: https://bugzilla.mozilla.org/show_bug.cgi?id=19119
Since this is timing related, we should fix issue 162 first, and see what
difference that makes.
Original comment by classi...@floodgap.com
on 24 Apr 2011 at 4:48
In the meantime, altered the DNS to point directly to floodgap.com and let
HTTPi sort it out. This wallpapers the problem at least for us. tenfourfox.com
will now become our test case.
Original comment by classi...@floodgap.com
on 30 Apr 2011 at 5:21
Theory, aided in part by
http://www.mactech.com/articles/develop/issue_27/macqa.html
This is a fairly subtle race condition in NSPR. Looking at
nsprpub/pr/src/md/mac/macsockotpt.c ; our OpenTransport notifier is in
NotifierRoutine().
1. We connect, transmit our string.
2. Other side transmits data, immediately disconnects (sends FIN).
3. We get first a T_DATA (and data is available), then a T_ORDREL, in the
notifier.
4. In the state where this works, the T_ORDREL is sufficiently delayed to allow
the asynchronous read to get the data first with OTRcv(). T_ORDREL comes along
and the notifier calls OTRcvOrderlyDisconnect(), closing the connection.
Further reads on that stream will give us the hated -3155, but this is okay
because we have the data.
5. In the state where this fails, T_ORDREL is handled before the calling thread
has a chance to read the data. Since the connection is disconnected at the
notifier level, then T_ORDREL effectively has a higher priority than T_DATA,
and effectively masks the data by closing the stream before the thread gets to
it.
We will only fix this for reads, since writes don't seem to have an issue,
though theoretically they might have a similar phenomenon.
This *should* work:
- When the notifier gets T_ORDREL, it should check OTCountDataBytes(endpoint,
&count); to see if all data was read.
- If it was, then it will disconnect like currently and life goes on.
- If it was not, it will mark the socket as in a closing state (probably
fd->secret->something, which by default will be false) and will not disconnect
it. It then wakes up the read thread.
- The read thread then calls SendReceiveStream() as usual. It calls OTRcv() and
presumably gets all the data that is available (we might have to write a
special tight blocking read for this). At the end of SendReceiveStream(), the
call looks at the state of the socket. If fd->secret->something is true, then
it calls OTRcvOrderlyDisconnect() itself, discarding the error since we don't
care.
Original comment by classi...@floodgap.com
on 28 Jun 2011 at 12:09
I was hoping Copperweb could let me simulate this on my internal network but no
dice.
Original comment by classi...@floodgap.com
on 1 Jul 2011 at 4:34
Well, did some initial work on this. First, there is already an
fd->secret->orderlyDisconnect and we will simply (ab)use that as our signal to
close the socket within SendReceiveStream(). So far this has not broken
anything and there are no unusual increases in leakage. If everything else is
good then I will pop out a test build and try to replicate this on the iBook.
Original comment by classi...@floodgap.com
on 1 Jul 2011 at 5:10
Original comment by classi...@floodgap.com
on 1 Jul 2011 at 5:20
It's not T_ORDREL. It's T_DISCONNECT, with a disconnect error of 54. This is
ECONNRESET = 54; { Connection reset by peer } (which makes sense).
We know IE can deal with this, so there must be a workaround.
Original comment by classi...@floodgap.com
on 3 Jul 2011 at 11:58
So the current approach hasn't worked at all; we still have a serious race
condition.
Realistically all I can think of is to build a queue of network states and have
NSPR access it. This is a big hack but may work.
Original comment by classi...@floodgap.com
on 30 Aug 2011 at 3:39
For the queue, it should consist of the event type (T_ORDREL, T_DISCONNECT,
etc.), any useful info, a bookmark and a pointer to a buffer. The queue should
be circular and probably should include at least 15 entries. There should be a
"thread is at this point" and a "queue is at this point".
In the notifier, an event occurs, the notifier stuffs the state and grabs the
buffer with OTRcv, if any, and the buffer is created with an OTAllocMem so that
we can put it in temporary memory. The bookmark is set to zero. We then wake up
the NSPR read thread like usual. This only needs to be done for read states.
In the read thread, the thread then goes through all of the pending states that
occurred since it was last awakened. If it needs to read data and there is MORE
data available than it wanted, it drops a bookmark saying where it left off in
the buffer, and goes back to sleep after returning the data. Otherwise, it
copies the buffer, OTFreeMems it to give it back, and returns that. If there
are more states, it keeps going, including to our T_DISCONNECT. This *should*
ensure that no data is dropped because the notifier will always get a chance.
If the notifier is still not fast enough, then I have no idea how to solve this.
Original comment by classi...@floodgap.com
on 30 Aug 2011 at 3:48
er, copies from the bookmark to the end of the buffer, but OTFreeMems the whole
buffer
Original comment by classi...@floodgap.com
on 30 Aug 2011 at 3:51
Scheduling for 9.3.0 instead since 9.2.3 is picking up other critical issues
Original comment by classi...@floodgap.com
on 10 Oct 2011 at 9:03
We can create a test that creates just such a queue but doesn't allocate memory
to it (this is the tricky part). All the notifier would do is stuff events. If
the read thread sees that it is missing events, then we know the notifier is
fast enough to queue things. We can then add the alloc and read logic.
The real trick is simulating this internally.
Current algorithm thought:
thread_is_at = 0
queue_is_at = 0
notifier:
get status
stuff status in new queue entry
#if FOR REALSIE
if read, { OTAllocMem and OTRcv into that buffer
save length of buffer
set buffer bookmark to 0 }
#endif
increment queue_is_at
wake up read thread
do again for all statuses
read thread (when it awakens)
if queue_is_at < thread_is_at {
get current queue entry
if read event, {
#if FOR REALSIE
if length-bookmark > bytes desired { bcopy from bookmark to bookmark + bytes
desired, bookmark += bytes desired }
else { bcopy from bookmark to length, OTFreeMem buffer, increment thread_is_at }
#else
notify, do OTRcv ourselves as we do now, increment thread_is_at
#endif
}
if other event, { standard handler, increment thread_is_at }
ONLY DO ONE SUCH QUEUE ENTRY to maintain NSPR calling convention
return
}
Original comment by classi...@floodgap.com
on 3 Nov 2011 at 1:28
all increments would be (x++) & 15
Original comment by classi...@floodgap.com
on 3 Nov 2011 at 1:29
We could make this simpler, maybe. Stuff the last T_DATA event only (instead of
a rotating queue). When T_DISCONNECT comes around, or T_LOOK, have it check the
last T_DATA event and salvage what it can. This might be "good enough."
Original comment by classi...@floodgap.com
on 3 Dec 2011 at 3:28
This didn't work, so we'll have to go back to the old idea (in
macsockotpt.c.9.2.3). However, we don't need the queue, just deferredDi, as
before.
Original comment by classi...@floodgap.com
on 3 Dec 2011 at 4:43
This is holding up 9.3.0 and I can't get it to not be flaky. I need to rethink
this. Restoring the 9.2.3 code and we'll keep this rolling.
Original comment by classi...@floodgap.com
on 31 Dec 2011 at 10:41
ifixoldmacs now works ... !
Original comment by classi...@floodgap.com
on 12 Jan 2012 at 3:00
Well, it *did*. Must have been a blip.
Original comment by classi...@floodgap.com
on 17 Jan 2012 at 3:43
[deleted comment]
I wonder if using OTMP instead of notifiers would be a good idea, though this
would require a bump of the system requirements up to 9.0.
Original comment by yuhongba...@hotmail.com
on 21 Jan 2015 at 9:06
Original issue reported on code.google.com by
classi...@floodgap.com
on 24 Apr 2011 at 3:56