x86 X11 problem with interrupted system call

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. start inferno with jit enabled (-c1)
2. start a bunch of apps (polyhedra, coffee, bounce)
3. keep doing window ops until it breaks

What is the expected output? What do you see instead?
Expect things to work, what I saw was this:
XIO:  fatal IO error 4 (Interrupted system call) on X server ":0.0"
      after 12099 requests (12097 known processed) with 0 events remaining.

What version of the product are you using? On what operating system?
SVN checkout as of June 2, 10:30am CST

Please provide any additional information below.
Don't think this is related to previous problems.  Doesn't seem to happen
without the JIT on

Original issue reported on code.google.com by eri...@gmail.com on 2 Jun 2008 at 3:54

GoogleCodeExporter commented 9 years ago

I've just seen that without the JIT, so i don't think it's JIT related.
i think it's the use of signals to control scheduling of the vm and kills 
within it for linux.
the most likely thing is that one of those signals catches and interrupts a 
write to the X server, but it only happens when an underlying socket or queue 
fills up, hence the need for a bit of load; and the signal needs to arrive 
during that particular write.
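
The mechanism described here can be reproduced outside X11. The following is a minimal sketch, not Inferno or Xlib code: a pipe stands in for the X server socket, and the function name `errno_from_interrupted_write` and the empty handler are made up for illustration. It assumes a POSIX system and that the SIGUSR1 handler is installed without SA_RESTART.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Empty handler: it only makes SIGUSR1 "caught" instead of fatal. */
static void on_sigusr1(int sig)
{
    (void)sig;
}

/*
 * Fill a pipe, then block in write() until SIGUSR1 arrives.
 * Returns the errno left by the interrupted write (EINTR is expected),
 * 0 if the write unexpectedly succeeded, or -1 on a setup failure.
 */
int errno_from_interrupted_write(void)
{
    int fds[2], flags, result = 0;
    struct sigaction sa, old;
    char chunk[4096];
    pid_t child;

    if (pipe(fds) < 0)
        return -1;

    /* No SA_RESTART: an interrupted write() fails instead of restarting. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Fill the pipe with non-blocking writes: big chunks, then single bytes. */
    flags = fcntl(fds[1], F_GETFL);
    assert(flags != -1);
    fcntl(fds[1], F_SETFL, flags | O_NONBLOCK);
    memset(chunk, 'x', sizeof chunk);
    while (write(fds[1], chunk, sizeof chunk) > 0)
        ;
    while (write(fds[1], chunk, 1) > 0)
        ;
    fcntl(fds[1], F_SETFL, flags);      /* back to blocking mode */

    /* The child plays the role of the scheduler sending the signal. */
    child = fork();
    if (child == 0) {
        struct timespec delay = {0, 100 * 1000 * 1000};   /* 100 ms */
        nanosleep(&delay, NULL);
        kill(getppid(), SIGUSR1);
        _exit(0);
    }
    if (child > 0) {
        /* The pipe is full, so this blocks until the signal arrives. */
        if (write(fds[1], chunk, 1) < 0)
            result = errno;
        waitpid(child, NULL, 0);
    } else {
        result = -1;
    }

    sigaction(SIGUSR1, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Without the fill step the one-byte write would complete immediately and the signal would have nothing to interrupt, which fits the observation that the failure needs some load.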

Original comment by Charles....@gmail.com on 20 Mar 2009 at 4:04

Changed title: x86 X11 problem with interrupted system call
Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I use acme pretty heavily for development, and I experience the X errors 
outlined here, as well as in Issue 188, pretty consistently throughout the 
day... work a bit, end up with a pretty big/busy acme window, and eventually 
this error occurs sooner or later, without fail.

I'm running debian lenny, 2.6.26-2-686 (not 64 bit), with the tgz on 
www.vitanuova.com from 20100120... I've also tried with an up-to-date build (as 
of a few days ago, before I went back to the packaged download assuming it was 
more stable), and same deal...

trying a fresh build now, but it seems like the issue hasn't been completely 
fixed yet, based on the issues/comments in here?

I'd love to help fix this, but unfortunately I've no experience w/ X11 or 
inferno (OS) development... I like to think I'm a pretty good programmer 
though, so despite that if someone (Charles?) wants to point me to a starting 
point I'd be happy to start hacking away at it and see if I can figure out 
what's going on... in the absence of a reply, I'm going to see what I can do 
anyways - I've been meaning to learn the inferno src and internals for a while 
now, this is a good excuse to do so...

Original comment by datawh...@gmail.com on 10 Jun 2010 at 5:04

GoogleCodeExporter commented 9 years ago

i suspect there should be osenter/osleave calls surrounding X11 and MacOSX 
function calls in the various win.c when those can be called from devdraw.c. 
equivalently, and perhaps easier, SIGUSR1 and perhaps other signals should 
simply be blocked during those calls. there's no way of knowing how those calls 
are implemented, and if the SIGUSR1 interrupt hits a read or write system call 
(or some other interruptible system call used by the host library), i think it 
will break.

Original comment by Charles....@gmail.com on 27 Jul 2010 at 11:01

GoogleCodeExporter commented 9 years ago

Original comment by Charles....@gmail.com on 27 Jul 2010 at 11:01

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

that was incorrect: unless osenter/osleave *are* called, the process 
shouldn't be receiving the interrupt signal (SIGUSR1), so that earlier 
suggestion was the wrong way round. even so, i've got the feeling the answer is 
along those lines: the system calls making the updates for flushmemscreen are 
being interrupted (at least on Linux), and the action i'd just taken in wm to 
provoke it is consistent with that. the calls shouldn't be interruptible in any 
case (since it's not documented how the state might change), and so should be 
protected from the various signals used on unix to implement scheduling and 
interrupts (kills).
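
One way to read "protected from the various signals" is to block the signal around the host library call and let it be delivered afterwards. A rough sketch under that assumption follows. The names `call_with_sigusr1_blocked`, `raise_during_call` and `demo_deferred_delivery` are hypothetical and not part of emu; the callback merely stands in for an X11 call. It uses sigprocmask, which is only specified for single-threaded processes, so a threaded program would presumably want pthread_sigmask instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t usr1_seen = 0;

static void note_usr1(int sig)
{
    (void)sig;
    usr1_seen = 1;
}

/*
 * Run fn(arg) with SIGUSR1 blocked, then restore the previous signal mask.
 * A SIGUSR1 sent meanwhile stays pending and is delivered once the mask is
 * restored. Returns fn's result, or -1 if the mask could not be changed.
 */
int call_with_sigusr1_blocked(int (*fn)(void *), void *arg)
{
    sigset_t block, saved;
    int result;

    assert(fn != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &saved) < 0)
        return -1;

    result = fn(arg);       /* stand-in for the host library call */

    sigprocmask(SIG_SETMASK, &saved, NULL);
    return result;
}

/* Demo callback: signals ourselves mid-call and reports if the handler ran. */
static int raise_during_call(void *arg)
{
    (void)arg;
    raise(SIGUSR1);
    return usr1_seen;       /* stays 0 while SIGUSR1 is blocked */
}

/* Returns 1 if the handler ran only after the protected call had finished. */
int demo_deferred_delivery(void)
{
    struct sigaction sa, old;
    int seen_during;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) < 0)
        return 0;

    usr1_seen = 0;
    seen_during = call_with_sigusr1_blocked(raise_during_call, NULL);
    sigaction(SIGUSR1, &old, NULL);
    return seen_during == 0 && usr1_seen == 1;
}
```

The signal is deferred rather than lost, so scheduling and kills would still take effect, just not in the middle of the protected call.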

Original comment by Charles....@gmail.com on 27 Jul 2010 at 1:26

GoogleCodeExporter commented 9 years ago

I've been getting this XIO:  fatal IO error 4  more and more frequently these 
days.  I usually trip it during a mouse drag event.  All other processes 
continue to run and draw to the emu, there's just no way to continue 
interacting with any window other than if you had a shell open at the time when 
the XIO error tripped.  Even then the only interaction is keyboard, no mouse.

For some reason doing `emu -s` doesn't get the processes into the right state 
to actually core or set anything that is debuggable.  With luck I'll find the 
right linux incantation to exorcise the demons.  If not, it's off to Windows or 
Mac emu.

Original comment by jas@corpus-callosum.com on 14 Jan 2011 at 12:23

GoogleCodeExporter commented 9 years ago

that XIO error is announced by the X11 libraries and doesn't produce a core 
dump, so emu -s won't have any effect. i'm not even sure that it's actually 
caused by an "interrupted system call" (EINTR); i wonder if that's left behind 
by something else.
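
On Linux, error 4 is EINTR, and a stale value is plausible because successful calls are not required to reset errno. A tiny sketch of that effect, assuming (as is typical, though not guaranteed by POSIX) that a successful getppid() leaves errno untouched; the function name is made up for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Set errno to `stale` (standing in for an earlier interrupted call), make
 * a call that succeeds, and return errno as a later error report would see it.
 */
int errno_after_successful_call(int stale)
{
    assert(stale != 0);
    errno = stale;
    (void)getppid();    /* cannot fail; typically leaves errno alone */
    return errno;
}
```

If the X11 error path reports whatever errno happens to hold when it runs, an EINTR from an unrelated earlier call could show up in the message even though the failing operation itself was not interrupted.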

Original comment by Charles....@gmail.com on 14 Jan 2011 at 1:04

GoogleCodeExporter commented 9 years ago

Caught it by attaching gdb to the xproc in a separate session.  Finally triggered the 
error w/ a lot of drawing updates going on and getting wm/man to redraw some 
text (everything's slow over remote X11 anyway).

Breakpoint 2, 0xf74d14d4 in exit () from /lib32/libc.so.6
(gdb) bt
#0  0xf74d14d4 in exit () from /lib32/libc.so.6
#1  0xf766cb90 in _XDefaultIOError () from /usr/lib32/libX11.so.6
#2  0xf766cc16 in _XIOError () from /usr/lib32/libX11.so.6
#3  0xf7674a1a in ?? () from /usr/lib32/libX11.so.6
#4  0xf7675356 in _XEventsQueued () from /usr/lib32/libX11.so.6
#5  0xf7647720 in XCheckTypedWindowEvent () from /usr/lib32/libX11.so.6
#6  0x080669b5 in xmouse (arg=0x99d75b0) at ../port/win-x11a.c:1383
#7  xproc (arg=0x99d75b0) at ../port/win-x11a.c:550
#8  0x0804bbab in tramp (arg=0xa4dceb0) at os.c:90
#9  0xf7571b5e in clone () from /lib32/libc.so.6
(gdb) 

The line number is one off since I added getpid() to help find the xproc:

$ hg diff .
diff -r 345359f9f694 emu/port/win-x11a.c
--- a/emu/port/win-x11a.c   Mon Jan 10 21:23:38 2011 +0000
+++ b/emu/port/win-x11a.c   Fri Jan 14 11:16:39 2011 -0600
@@ -524,6 +524,7 @@
    XEvent event;
    XDisplay *xd;

+printf("pid: %d\n", getpid());
    closepgrp(up->env->pgrp);
    closefgrp(up->env->fgrp);
    closeegrp(up->env->egrp);

Original comment by jas@corpus-callosum.com on 14 Jan 2011 at 5:17

GoogleCodeExporter commented 9 years ago

your X11 isn't using libxcb. which one are you using?

Original comment by Charles....@gmail.com on 16 Jan 2011 at 10:09

GoogleCodeExporter commented 9 years ago

on openbsd i can crash inferno by running "wm/bounce 50" and continuously 
moving the mouse over the program.  it crashes within 10 seconds typically.

however, i can't attach to the process to get a stack trace.  when gdb tries 
the process disappears.  an abort signal to the process does the same, no core 
file.

this info probably doesn't help find the solution, but my setup may be useful 
for testing a solution.

Original comment by mechiel@ueber.net on 16 Jan 2011 at 9:47

GoogleCodeExporter commented 9 years ago

I'm using Ubuntu 10.x and will check on the libxcb issue on Monday.

I'm also going to test the following change to win-x11a.c to see if it helps 
filter out the events:

 diff win-x11a.c win-x11ab.c
526a527
> printf("pid: %d\n", getpid());
535c536
<       PointerMotionMask|
---
>       PointerMotionHintMask|
1379,1380d1379
<       me = (XMotionEvent *) e;
< 
1382,1383c1381,1386
<       while(XCheckTypedWindowEvent(xmcon, xdrawable, MotionNotify, &motion) == True)
<           me = (XMotionEvent *) &motion;
---
>       while(XCheckMaskEvent(xmcon, ButtonMotionMask, &motion));
>       if(!XQueryPointer(xmcon, xdrawable, &motion.xbutton.root,
>                         &motion.xbutton.window, &motion.xbutton.x_root,
>                         &motion.xbutton.y_root, &motion.xbutton.x,
>                         &motion.xbutton.y, &motion.xbutton.state))
>          return;
1384a1388
>       me = (XMotionEvent *) &motion;

Original comment by jas@corpus-callosum.com on 16 Jan 2011 at 10:07

GoogleCodeExporter commented 9 years ago

The above use of XQueryPointer got rid of the XIO error, with the side effect 
of not redrawing windows or button motion events until after button release.

The following change makes it a little more difficult to get X11 to error out, 
though it does still happen on occasion, more often as a seg fault or an XIO 
error code other than 4.  There still appear to be X libraries that are missing 
from the X11LIBS (Ubuntu's moved things around a bit more now that it's using 
X11R7).

$ hg diff .
diff -r b8d602ab2984 emu/Linux/mkfile
--- a/emu/Linux/mkfile  Mon Jan 17 17:05:49 2011 +0000
+++ b/emu/Linux/mkfile  Mon Jan 17 17:07:54 2011 -0600
@@ -12,7 +12,7 @@

 #end configurable parameters

-X11LIBS= -lX11 -lXext  # can remove or override using env section in config files
+X11LIBS= -lX11 -lxcb -lXext    # can remove or override using env section in config files

 <$ROOT/mkfiles/mkfile-$SYSTARG-$OBJTYPE    #set vars based on target system

diff -r b8d602ab2984 emu/port/win-x11a.c
--- a/emu/port/win-x11a.c   Mon Jan 17 17:05:49 2011 +0000
+++ b/emu/port/win-x11a.c   Mon Jan 17 17:07:54 2011 -0600
@@ -524,6 +524,7 @@
    XEvent event;
    XDisplay *xd;

+printf("xproc pid: %d\n", getpid());
    closepgrp(up->env->pgrp);
    closefgrp(up->env->fgrp);
    closeegrp(up->env->egrp);
@@ -533,11 +534,7 @@
    mask = ButtonPressMask|
        ButtonReleaseMask|
        PointerMotionMask|
-       Button1MotionMask|
-       Button2MotionMask|
-       Button3MotionMask|
-       Button4MotionMask|
-       Button5MotionMask|
+       ButtonMotionMask|
        ExposureMask|
        StructureNotifyMask;

Original comment by jas@corpus-callosum.com on 18 Jan 2011 at 2:06

GoogleCodeExporter commented 9 years ago

Back to the drawing board.  The above worked great for ~24 hours with multiple 
wm/bounce 50 windows open and my ~25fps graphs running.  Add a little more load 
on the system and the same crash occurs.  All of this indicates that flushing 
out MotionNotify events is just not happening fast enough to keep the xserver 
running.

Maybe it's time to test XSetIOErrorHandler, or just switch over completely 
to PointerMotionHintMask and eat the no-visuals-on-drag-until-release, as 
that's the only way I've found to not generate the error.

Original comment by jas@corpus-callosum.com on 19 Jan 2011 at 3:19

GoogleCodeExporter commented 9 years ago

what do you think is happening?

Original comment by Charles....@gmail.com on 19 Jan 2011 at 3:36

GoogleCodeExporter commented 9 years ago

The event, or large group of events, causes xproc to exit without killing the 
actual window.  Drawing continues and keyboard access will work if you happen 
to have a shell or entry field highlighted at the time of the error.

Is there a reason that xkbdproc gets a silly KPX11 stack size but xproc does 
not?  I'm basically hunting for a needle in a haystack washed away by floods.

Original comment by jas@corpus-callosum.com on 19 Jan 2011 at 3:52

GoogleCodeExporter commented 9 years ago

you could try increasing KSTACK if you think it's a stack overflow in xproc. 
the older X11 code i'm looking at does put a few big buffers on the stack 
(which is only 16k).
xkbdproc got a huge stack because locale code in x11 read a vast number of 
names from /usr/lib onto the stack. as a result, xkbdproc can't use "up", but 
that doesn't matter for it. xproc needs "up", so the same huge stack hack won't 
work for it.

Original comment by Charles....@gmail.com on 19 Jan 2011 at 4:32

GoogleCodeExporter commented 9 years ago

An increased KSTACK didn't help.  I put a few other printfs in place and set 
up another XIOErrorHandler to allow a quick gdb attach before the default Xlib 
exit() from the error.  The XIO error always occurs during a traversal of a 
MotionNotify event, possibly due to some threading issues.

There are plenty of write-ups out there about the difficulty of Xlib, XShm, and 
threading, especially when processing various components in the event loop.  
I've seen suggestions that going fully to XCB _might_ prove beneficial.  If 
nothing else, I'll increment the reading of the "I hate Xlib and so should 
you" post.

Original comment by jas@corpus-callosum.com on 20 Jan 2011 at 6:31

GoogleCodeExporter commented 9 years ago

what happens if you simply delete

        /* remove excess MotionNotify events from queue and keep last one */
        while(XCheckTypedWindowEvent(xmcon, xdrawable, MotionNotify, &motion) == True)
            me = (XMotionEvent *) &motion;

from win-x11a's MotionNotify case? mousetrack does that better itself anyway.
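
For reference, the effect of that loop can be modelled without Xlib. This is only a sketch on a plain array: `struct ev`, `enum ev_type` and `drain_motion` are hypothetical stand-ins, not Xlib types or emu functions. It mirrors the loop in pulling every motion event out of the queue, including ones queued behind other events, and keeping only the last.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an X event; not the Xlib XEvent type. */
enum ev_type { EV_MOTION, EV_BUTTON, EV_EXPOSE };

struct ev {
    enum ev_type type;
    int x, y;
};

/*
 * Remove every EV_MOTION event from q[0..n), storing the last one found in
 * *latest (left unchanged if there is none). Other events keep their
 * relative order. Returns the new queue length.
 */
size_t drain_motion(struct ev *q, size_t n, struct ev *latest)
{
    size_t in, out = 0;

    assert(latest != NULL);
    for (in = 0; in < n; in++) {
        if (q[in].type == EV_MOTION)
            *latest = q[in];        /* later motion events overwrite earlier ones */
        else
            q[out++] = q[in];       /* everything else stays queued, in order */
    }
    return out;
}
```

Deleting the loop would simply mean each motion event is handled as it arrives, leaving any coalescing to mousetrack.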

Original comment by Charles....@gmail.com on 20 Jan 2011 at 9:16

GoogleCodeExporter commented 9 years ago

Same error:

XIO:  fatal IO error 4 (Interrupted system call) on X server ":0.0"
      after 3039920 requests (3039918 known processed) with 0 events remaining.

I'm thinking of trying a new XCB only port of win-x11.  Though the GNU/Linux 
version I'm using does link libxcb in with libX11:

$ ldd o.emu
    linux-gate.so.1 =>  (0x00d95000)
    libX11.so.6 => /usr/lib/libX11.so.6 (0x00d9d000)
    libXext.so.6 => /usr/lib/libXext.so.6 (0x00993000)
    libm.so.6 => /lib/libm.so.6 (0x007bb000)
    libc.so.6 => /lib/libc.so.6 (0x009e3000)
    libxcb.so.1 => /usr/lib/libxcb.so.1 (0x00860000)
    libdl.so.2 => /lib/libdl.so.2 (0x0092d000)
    /lib/ld-linux.so.2 (0x00314000)
    libXau.so.6 => /usr/lib/libXau.so.6 (0x00c83000)
    libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x00d77000)

there needs to be a different approach taken to get past this XIO issue.

Original comment by jas@corpus-callosum.com on 20 Jan 2011 at 3:35

GoogleCodeExporter commented 9 years ago

although that might be true, i think it would still probably be better to work 
out first more precisely how that particular error arises. (i'd remove the 
XCheckTypedWindowEvent code in any case to reduce the number of primitives 
involved.)

Original comment by Charles....@gmail.com on 20 Jan 2011 at 3:40

GoogleCodeExporter commented 9 years ago

More details with a few extra modifications to be able to trap the error.  
First, I use this diff, win-x11a.diff, to help set up the important bits for 
being able to catch this error.  Then start o.emu in gdb and 'handle SIGTRAP' 
so we can start to generate a core file.  After starting wm/wm, use the xproc 
pid and start two more gdb sessions, one with the xproc pid and one with the 
process just before it, which should be the devpointer process.  Both gdb 
sessions should handle SIGTRAP and 'handle SIGUSR2 nostop noprint' (otherwise 
it dumps every mouse move and expose event from the emu wm window).

The addition of the shm_ioehandler() sets up a place to attach to the process 
that causes the XIO error.  Once I get the error pid, a new gdb session 
attaches to the pid printed.  The bt.txt attachment is the backtrace from that 
process.
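
The general shape of such an attach point, as a hedged sketch: this is not the actual shm_ioehandler() from the attached diff, and the names `wait_for_debugger` and `debugger_attached` are made up. The Xlib wiring (presumably via XSetIOErrorHandler) is left out so the example does not depend on X11 headers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Set to 1 from the debugger (e.g. `set var debugger_attached = 1`) to resume. */
volatile sig_atomic_t debugger_attached = 0;

/*
 * Print the pid, then poll until debugger_attached becomes non-zero or
 * max_polls intervals of poll_ms milliseconds have elapsed.
 * Returns 1 if the flag was set, 0 on timeout.
 */
int wait_for_debugger(unsigned max_polls, unsigned poll_ms)
{
    struct timespec delay;
    unsigned i;

    assert(poll_ms < 1000);
    delay.tv_sec = 0;
    delay.tv_nsec = (long)poll_ms * 1000000L;

    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());
    for (i = 0; i < max_polls && !debugger_attached; i++)
        nanosleep(&delay, NULL);
    return debugger_attached != 0;
}
```

Called from an IO error handler with a generous poll count, something like this would keep the failing process alive long enough to attach and take a backtrace before the default handler's exit().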

Additionally the xproc process can be interrupted after the XIO pid has been 
attached and everything else is somewhat halted.  All that produces an 
additional backtrace from the xexpose() call in win-x11a.c:

(gdb) bt
#0  0x0012e416 in __kernel_vsyscall ()
#1  0x002bcd47 in sigsuspend () from /lib/libc.so.6
#2  0x0804b1a0 in osblock () at os.c:242
#3  0x0804bbe8 in qlock (q=0x81d8b00) at ../port/lock.c:59
#4  0x08051d30 in drawqlock () at ../port/devdraw.c:1983
#5  0x08069350 in xexpose (arg=0x8268290) at ../port/win-x11a.c:1147
#6  xproc (arg=0x8268290) at ../port/win-x11a.c:557
#7  0x0804b59b in tramp (arg=0x8a4e1b0) at os.c:90
#8  0x003626ae in clone () from /lib/libc.so.6

Original comment by jas@corpus-callosum.com on 20 Jan 2011 at 5:30

Attachments:

jayduhon / inferno-os

x86 X11 problem with interrupted system call #93