hongbinz / imsdroid

Automatically exported from code.google.com/p/imsdroid
0 stars 0 forks source link

intermittent socket failures #166

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. w/ latest IMSDroid code, I toggle registration off & on
2. intermittently I get socket errors, see debug trace below

What is the expected output? 

No socket errors

What do you see instead?

Just toggling registration, no changes in cellular or WLAN data coverage (WiFi, 
3G or HSPA+). The problem can occur both on WiFi or cellular data. Then 
inexplicably register/deregister will start working again.

What version of the product are you using? On what operating system?

Latest version IMSDroid, latest version of Doubango compiled with NEON=yes (I 
am using my own build of Doubango stack, will go back and try the public 
version checked into SVN)

Please provide any additional information below.

Here are 2 traces, I am happy to provide my tinyWrap.so build, if that would 
help isolate. It seems like the network connectivity state is related.

01-25 12:01:18.676: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_get_ip_n_port()" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): file: "src/tnet_utils.c" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): line: "920" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1
01-25 12:01:18.676: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): MSG: sendto have failed.
01-25 12:01:18.676: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): MSG: (SYSTEM)NETWORK ERROR ==>Socket 
operation on non-socket
01-25 12:01:18.725: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_get_ip_n_port()" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): file: "src/tnet_utils.c" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): line: "920" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1
01-25 12:01:18.725: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): MSG: sendto have failed.
01-25 12:01:18.725: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): MSG: (SYSTEM)NETWORK ERROR ==>Socket 
operation on non-socket

I did add some more trace to get the value of socket file descriptor. Here is 
what the trace outputs now:

02-04 02:09:01.478: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_get_ip_n_port()" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): file: "src/tnet_utils.c" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): line: "924" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1, fd = 45
02-04 02:09:01.478: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): MSG: sendto have failed.
02-04 02:09:01.478: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): MSG: (SYSTEM)NETWORK ERROR ==>Bad 
file number
02-04 02:09:01.521: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_get_ip_n_port()" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): file: "src/tnet_utils.c" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): line: "924" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1, fd = 45
02-04 02:09:01.521: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): MSG: sendto have failed.
02-04 02:09:01.523: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.523: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.523: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.523: ERROR/tinyWRAP(3084): MSG: (SYSTEM)NETWORK ERROR ==>Bad 
file number

Original issue reported on code.google.com by rich.ho...@gmail.com on 4 Feb 2011 at 7:27

GoogleCodeExporter commented 9 years ago
I should point out a couple other testing scenarios I tried prior to reporting:

1. I tested on Samsung Galaxy S (GT-I9000T) running Eclair 2.1 update 1, occurs 
equally on WiFi and 3G

2. I tested on MyTouch 4G (HTC Glacier) running Froyo 2.2, occurs less 
frequently on 3G and difficult to reproduce borken sockets on WiFi

Original comment by rich.ho...@gmail.com on 4 Feb 2011 at 7:35

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Using EITHER the precompiled build or a custom Doubango stack I found the 
problem. It's Doubango's STUN implementation breaking the Android native 
sockets intermittently.

Leave STUN off, sockets are stable...

Turn STUN on, and eventually you will get the trace below...

When will Doubango have a STUN implementation that has a NAT pinhole keepalive 
& not break the Android native sockets like below? (let me know if I am asking 
the wrong forum, be happy to help fix/test corrections...)

02-04 21:29:03.871: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
Stack stoped
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
realm='sip:share.aylus.com', impu='sip:#####@foo.bar', impi='#####@foo.bar'
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
STUN=yes
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
STUN2 - server=numb.viagenie.ca and port=3478
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
pcscf-host='#.#.#.#', pcscf-port='5060', transport='UDP', ipversion='ipv4'
02-04 21:29:04.551: 
DEBUG/org.doubango.imsdroid.Services.Impl.NetworkService(1961): 26.180.59.20
02-04 21:29:04.551: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
Local IP='26.180.59.20'
02-04 21:29:04.551: 
DEBUG/org.doubango.imsdroid.Services.Impl.NetworkService(1961): netType=0 and 
netSubType=8
02-04 21:29:04.561: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
Stack started
02-04 21:29:04.561: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
OnDialogEvent (Dialog connecting)
02-04 21:29:04.561: INFO/org.doubango.imsdroid.Main(1961): onRegistrationEvent
02-04 21:29:04.561: INFO/org.doubango.imsdroid.Screens.ScreenHome(1961): 
onRegistrationEvent
02-04 21:29:04.571: 
INFO/org.doubango.imsdroid.Services.Impl.ServiceManager(1961): 
onRegistrationEvent
02-04 21:29:04.571: 
DEBUG/org.doubango.imsdroid.Services.Impl.ServiceManager(1961): 
showRegistartionNotif
02-04 21:29:04.671: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
OnDialogEvent ((un)REGISTER request successfully sent.)
02-04 21:29:04.741: DEBUG/skia(1961): ===== ashmem pin_region(43) returned -1, 
treating as error 9
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function: 
"tnet_get_ip_n_port()" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_utils.c" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "920" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_transport_poll.c" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "241" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: sendto have failed.
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_transport_poll.c" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "241" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: (SYSTEM)NETWORK ERROR ==>Bad 
file number

Original comment by rich.ho...@gmail.com on 5 Feb 2011 at 2:42

GoogleCodeExporter commented 9 years ago
I found the problem in Doubango 'src/tnet_transport_poll.c, after line 671'. 
The answer to your FIXME question is 'NO', if ioctl for FIONREAD returns 0 or 
greater but the &len is set to '0' bytes, an Android native socket decriptor 
has NOT been closed. It's still a viable socket descriptor and SHOULD NOT be 
closed, even if it needed to be closed, Doubango should fire off an EVENT_ERROR 
for the IMSDroid to decide what to do. EVENT_CLOSE confuses IMSDroid and keeps 
trying to register, creating later socket errors on a dead socket. For an 
interim fix since this error is slowing me down here is what I modified in the 
Doubango tree for you to consider. My tweak sends the warning still, disables 
the event_close and leaves the native Android socket open still. I will let you 
know if I have anymore problems with this approach but it appears to be 
working...

==== src/tnet_transport_poll.c, around line 671 =====
    if(!len){
        TSK_DEBUG_WARN("IOCTLT returned '0' bytes for fd=%d", active_socket->fd);
/*FIXME: On Android/MAC OS X this mean that the socket has been closed?  */
//#if defined(ANDROID) || 1 
//      TSK_RUNNABLE_ENQUEUE(transport, event_error, transport->callback_data, 
active_socket->fd);
//      removeSocket(i, context);
//#else
        recv(active_socket->fd, 0, 0, 0);
//#endif
        continue;

Original comment by rich.ho...@gmail.com on 5 Feb 2011 at 8:13

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
One other concern you may want to consider adding. Since Doubango is polling 
and pausing sockets, a thread optimization in the polling section to sleep 
150ms (or less) while the socket is paused, saves battery, CPU and uneeded 
debug trace lines. I used to see "Socket is paused" for up to 50 cycles waiting 
for STUN, now I see "Socket is paused, sleeping for 150ms" just once:

================
    /* check whether the socket is paused or not */
    if(active_socket->paused){
        TSK_DEBUG_INFO("Socket is paused, sleeping for 150ms");
        tsk_thread_sleep(150);
        goto TNET_POLLIN_DONE;
    }

Original comment by rich.ho...@gmail.com on 6 Feb 2011 at 12:55

GoogleCodeExporter commented 9 years ago
You have provided a great feedback but I think there are many confusions.

We are not polling and pausing the sockets as you may think. The framework uses 
poll() [http://linux.about.com/library/cmd/blcmdl2_poll.htm] function which is 
blocking and works like select(). If you prefer using select() instead of 
poll() then, set "TNET_HAVE_POLL" macro value to zero. Please see: 
http://code.google.com/p/doubango/source/browse/trunk/tinyNET/src/tnet_poll.c

"Pausing" a socket doesn't mean what you think. A socket is only paused when 
STUN is enabled. 
Explication:
a) When a RTP socket is created then, we add it to the map managed by the main 
transport thread (T1) using poll() or select(). 
b) When STUN is enabled then, a new NAT context is created to retrieve the 
public ip:port. This context will create a new thread (T2) to send/receive and 
parse STUN (or TURN) messages.
c) As T1 and T2 use the same socket then, both will receive notifications when 
there is data to read. As you cannot read the socket two times then, we mark T1 
as paused. In your case this means that you have received 50 RTP/STUN packets 
before STUN process complete.
It's really not a good idea to add the sleep(150).

Original comment by boss...@yahoo.fr on 6 Feb 2011 at 3:51

GoogleCodeExporter commented 9 years ago
Agreed I did look into the poll function and see the standard 'select() on 
managed file descriptors' and agree that is the correct approach (didn't mean 
to imply Doubango was using a 'polling' non-yielding thread approach).

More specifically what I am pointing out is that while a socket is 'paused' by 
STUN, Doubango gets about 50 or so 'INFO' messages for select() alerts. That 
was the slight inefficieny I was referring to - understand it only happens 
while waiting for STUN. I was suggesting that a yield or sleep of 150 msec 
(similar to what is done while waiting for DNS) might remove the unhandled 
select() alerts.

I also went ahead and implemented a periodic STUN refresh for Win32 and 
Android. I believe it's reasonably portable to iOS and Linux. I will attach it 
kindly for your review/consideration later today, cheers.

Original comment by rich.ho...@gmail.com on 6 Feb 2011 at 2:55

GoogleCodeExporter commented 9 years ago
Any news about the patch?

Original comment by boss...@yahoo.fr on 10 Feb 2011 at 12:05

GoogleCodeExporter commented 9 years ago
My apologies been on the road for 3 weeks and traveling again next week...

Attached is a proposal to consider for STUN refresh on just the SIP transport. 
For carriers that NAT the mobile devices on 3G, this keeps the NAT UDP pinholes 
open to keep the device reachable by the serving CSCF.

You can ignore the function:
TINYNET_API unsigned long tnet_transport_gettick();

I am resuing the existing tsk_time_epoch() which is already portable across O/S.

Original comment by rich.ho...@gmail.com on 12 Feb 2011 at 2:47

Attachments:

GoogleCodeExporter commented 9 years ago
@rich.ho...@gmail.com
Is it possible to have the changes as patches. It's hard to find the modified 
code.

Original comment by boss...@yahoo.fr on 19 May 2011 at 5:17

GoogleCodeExporter commented 9 years ago

Original comment by boss...@yahoo.fr on 19 May 2011 at 8:58

GoogleCodeExporter commented 9 years ago
I'm seeing this even when STUN is off.

Original comment by kjeremy@gmail.com on 30 May 2013 at 8:17