Open GoogleCodeExporter opened 9 years ago
I've just seen that without the JIT, so i don't think it's JIT related.
i think it's the use of signals to control scheduling of the vm and kills
within it for linux.
the most likely thing is that one of those signals catches and interrupts a
write to the X server, but it only happens when an underlying socket or queue
fills up, hence the need for a bit of load; and the signal needs to arrive
during that particular write.
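For readers unfamiliar with the failure mode described above, here is a minimal sketch of how a POSIX write() can fail with EINTR when a signal arrives, and how code that owns the descriptor usually restarts it. The helper name write_all is hypothetical and is not part of emu or Xlib. Xlib does its own writes internally, so this only illustrates the mechanism, not a fix for this issue.

```c
#define _POSIX_C_SOURCE 200809L	/* for ssize_t and POSIX declarations */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from emu or Xlib): write all len bytes to fd,
 * restarting the system call whenever a signal interrupts it.
 * Returns 0 on success, -1 on a genuine error with errno left set.
 */
static int
write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);
	while(len > 0){
		ssize_t n = write(fd, p, len);
		if(n < 0){
			if(errno == EINTR)
				continue;	/* interrupted before any byte was written: retry */
			return -1;
		}
		/* a short write is not an error; advance and keep going */
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A caller that does not retry like this sees the interrupted write as a hard failure, which matches the "Interrupted system call" text reported later in this thread.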
Original comment by Charles....@gmail.com
on 20 Mar 2009 at 4:04
I use acme pretty heavily for development, and I experience the X errors
outlined here, as well as in Issue 188, pretty consistently throughout the
day... work a bit, end up with a pretty big/busy acme window, and eventually
this error occurs sooner or later, without fail.
I'm running debian lenny, 2.6.26-2-686 (not 64 bit), with the tgz on
www.vitanuova.com from 20100120... I've also tried with an up-to-date build (as
of a few days ago, before I went back to the packaged download assuming it was
more stable), and same deal...
trying a fresh build now, but it seems like the issue hasn't been completely
fixed yet, based on the issues/comments in here?
I'd love to help fix this, but unfortunately I've no experience w/ X11 or
inferno (OS) development... I like to think I'm a pretty good programmer
though, so despite that if someone (Charles?) wants to point me to a starting
point I'd be happy to start hacking away at it and see if I can figure out
what's going on... in the absence of a reply, I'm going to see what I can do
anyways - I've been meaning to learn the inferno src and internals for a while
now, this is a good excuse to do so...
Original comment by datawh...@gmail.com
on 10 Jun 2010 at 5:04
i suspect there should be osenter/osleave calls surrounding X11 and MacOSX
function calls in the various win.c when those can be called from devdraw.c.
equivalently, and perhaps easier, SIGUSR1 and perhaps other signals should
simply be blocked during those calls. there's no way of knowing how those calls
are implemented, and if the SIGUSR1 interrupt hits a read or write system call
(or some other interruptible system call used by the host library), i think it
will break.
Original comment by Charles....@gmail.com
on 27 Jul 2010 at 11:01
that was incorrect: unless osenter/osleave _are_ called, the process
shouldn't be receiving the interrupt signal (SIGUSR1), so that earlier
suggestion was the wrong way round. even so, i've got the feeling the answer is
along those lines: the system calls making the updates for flushmemscreen are
being interrupted (at least on Linux), and the action i'd just taken in wm to
provoke it is consistent with that. the calls shouldn't be interruptible in any
case (since it's not documented how the state might change), and so should be
protected from the various signals used on unix to implement scheduling and
interrupts (kills).
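The idea of protecting a call from the scheduling signal can be sketched as follows. This is an illustration only, assuming that blocking SIGUSR1 with sigprocmask in the calling process is enough; the names with_sigusr1_blocked, on_usr1 and fake_x_call are hypothetical, and fake_x_call merely stands in for an Xlib call. It is not taken from the emu source.

```c
#define _POSIX_C_SOURCE 200809L	/* for sigprocmask and sigset_t */
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t usr1_seen;

/* Hypothetical handler: just records that SIGUSR1 was delivered. */
static void
on_usr1(int sig)
{
	(void)sig;
	usr1_seen = 1;
}

/*
 * Hypothetical wrapper: run fn(arg) with SIGUSR1 blocked, then restore the
 * previous mask. A SIGUSR1 sent in the meantime stays pending and is
 * delivered once the old mask is restored.
 */
static int
with_sigusr1_blocked(void (*fn)(void *), void *arg)
{
	sigset_t block, saved;

	assert(fn != NULL);
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if(sigprocmask(SIG_BLOCK, &block, &saved) != 0)
		return -1;
	fn(arg);	/* system calls made in here cannot be interrupted by SIGUSR1 */
	return sigprocmask(SIG_SETMASK, &saved, NULL);
}

/* Stand-in for an X11 call: sends SIGUSR1 to itself and notes whether
 * the handler ran while the signal was blocked. */
static void
fake_x_call(void *arg)
{
	int *seen_during = arg;

	raise(SIGUSR1);
	*seen_during = usr1_seen;
}
```

After installing on_usr1 with signal(SIGUSR1, on_usr1), calling with_sigusr1_blocked(fake_x_call, &during) leaves during at 0, and the handler runs only when the mask is restored. Whether deferring the signal this way is acceptable for emu's scheduler is a separate question that this sketch does not answer.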
Original comment by Charles....@gmail.com
on 27 Jul 2010 at 1:26
I've been getting this XIO: fatal IO error 4 more and more frequently these
days. I usually trip it during a mouse drag event. All other processes
continue to run and draw to the emu, there's just no way to continue
interacting with any window other than if you had a shell open at the time when
the XIO error tripped. Even then the only interaction is keyboard, no mouse.
For some reason doing `emu -s` doesn't get the processes into the right state
to actually core or set anything that is debuggable. With luck I'll find the
right linux incantation to exorcise the demons. If not, it's off to Windows or
Mac emu.
Original comment by jas@corpus-callosum.com
on 14 Jan 2011 at 12:23
that XIO error is announced by the X11 libraries and doesn't produce a core
dump, so emu -s won't have any effect. i'm not even sure that it's actually
caused by an "interrupted system call" (EINTR); i wonder if that's left behind
by something else.
Original comment by Charles....@gmail.com
on 14 Jan 2011 at 1:04
Caught it by attaching the xproc in a separate session. Finally triggered the
error w/ a lot of drawing updates going on and getting wm/man to redraw some
text (everything's slow over remote X11 anyway).
Breakpoint 2, 0xf74d14d4 in exit () from /lib32/libc.so.6
(gdb) bt
#0 0xf74d14d4 in exit () from /lib32/libc.so.6
#1 0xf766cb90 in _XDefaultIOError () from /usr/lib32/libX11.so.6
#2 0xf766cc16 in _XIOError () from /usr/lib32/libX11.so.6
#3 0xf7674a1a in ?? () from /usr/lib32/libX11.so.6
#4 0xf7675356 in _XEventsQueued () from /usr/lib32/libX11.so.6
#5 0xf7647720 in XCheckTypedWindowEvent () from /usr/lib32/libX11.so.6
#6 0x080669b5 in xmouse (arg=0x99d75b0) at ../port/win-x11a.c:1383
#7 xproc (arg=0x99d75b0) at ../port/win-x11a.c:550
#8 0x0804bbab in tramp (arg=0xa4dceb0) at os.c:90
#9 0xf7571b5e in clone () from /lib32/libc.so.6
(gdb)
The line number is one off since I added getpid() to help find the xproc:
$ hg diff .
diff -r 345359f9f694 emu/port/win-x11a.c
--- a/emu/port/win-x11a.c Mon Jan 10 21:23:38 2011 +0000
+++ b/emu/port/win-x11a.c Fri Jan 14 11:16:39 2011 -0600
@@ -524,6 +524,7 @@
XEvent event;
XDisplay *xd;
+printf("pid: %d\n", getpid());
closepgrp(up->env->pgrp);
closefgrp(up->env->fgrp);
closeegrp(up->env->egrp);
Original comment by jas@corpus-callosum.com
on 14 Jan 2011 at 5:17
your X11 isn't using libxcb. which one are you using?
Original comment by Charles....@gmail.com
on 16 Jan 2011 at 10:09
on openbsd i can crash inferno by running "wm/bounce 50" and continuously
moving the mouse over the program. it crashes within 10 seconds typically.
however, i can't attach to the process to get a stack trace. when gdb tries
the process disappears. an abort signal to the process does the same, no core
file.
this info probably doesn't help find the solution, but my setup may be useful
for testing a solution.
Original comment by mechiel@ueber.net
on 16 Jan 2011 at 9:47
I'm using Ubuntu 10.x and will check on the libxcb issue on Monday.
I'm also going to test the following change to win-x11a.c to see if it helps
filter out the events:
diff win-x11a.c win-x11ab.c
526a527
> printf("pid: %d\n", getpid());
535c536
< PointerMotionMask|
---
> PointerMotionHintMask|
1379,1380d1379
< me = (XMotionEvent *) e;
<
1382,1383c1381,1386
< while(XCheckTypedWindowEvent(xmcon, xdrawable, MotionNotify, &motion) == True)
< me = (XMotionEvent *) &motion;
---
> while(XCheckMaskEvent(xmcon, ButtonMotionMask, &motion));
> if(!XQueryPointer(xmcon, xdrawable, &motion.xbutton.root,
> &motion.xbutton.window, &motion.xbutton.x_root,
> &motion.xbutton.y_root, &motion.xbutton.x,
> &motion.xbutton.y, &motion.xbutton.state))
> return;
1384a1388
> me = (XMotionEvent *) &motion;
Original comment by jas@corpus-callosum.com
on 16 Jan 2011 at 10:07
The above use of XQueryPointer got rid of the XIO error, with the side effect
of not redrawing windows or button motion events until after button release.
The following change makes it a little more difficult to get X11 to error out,
though it does still happen on occasion but more likely on a seg fault or XIO
error code other than 4. There still appear to be X libraries that are missing
from the X11LIBS (Ubuntu's moved things around a bit more now that it's using
X11R7).
$ hg diff .
diff -r b8d602ab2984 emu/Linux/mkfile
--- a/emu/Linux/mkfile Mon Jan 17 17:05:49 2011 +0000
+++ b/emu/Linux/mkfile Mon Jan 17 17:07:54 2011 -0600
@@ -12,7 +12,7 @@
#end configurable parameters
-X11LIBS= -lX11 -lXext # can remove or override using env section in config files
+X11LIBS= -lX11 -lxcb -lXext # can remove or override using env section in config files
<$ROOT/mkfiles/mkfile-$SYSTARG-$OBJTYPE #set vars based on target system
diff -r b8d602ab2984 emu/port/win-x11a.c
--- a/emu/port/win-x11a.c Mon Jan 17 17:05:49 2011 +0000
+++ b/emu/port/win-x11a.c Mon Jan 17 17:07:54 2011 -0600
@@ -524,6 +524,7 @@
XEvent event;
XDisplay *xd;
+printf("xproc pid: %d\n", getpid());
closepgrp(up->env->pgrp);
closefgrp(up->env->fgrp);
closeegrp(up->env->egrp);
@@ -533,11 +534,7 @@
mask = ButtonPressMask|
ButtonReleaseMask|
PointerMotionMask|
- Button1MotionMask|
- Button2MotionMask|
- Button3MotionMask|
- Button4MotionMask|
- Button5MotionMask|
+ ButtonMotionMask|
ExposureMask|
StructureNotifyMask;
Original comment by jas@corpus-callosum.com
on 18 Jan 2011 at 2:06
Back to the drawing board. The above worked great for ~24 hours with multiple
wm/bounce 50 windows open and my ~25fps graphs running. Add a little more load
on the system and the same crash occurs. All of this indicates that flushing
out MotionNotify events is just not happening fast enough to keep the xserver
running.
Maybe it's time to test XSetIOErrorHandler or just completely switching over
to PointerMotionHintMask and eating the no-visuals-on-drag-until-release as
that's the only way I've found to not generate the error.
Original comment by jas@corpus-callosum.com
on 19 Jan 2011 at 3:19
what do you think is happening?
Original comment by Charles....@gmail.com
on 19 Jan 2011 at 3:36
The event, or large group of events, causes xproc to exit without killing the
actual window. Drawing continues and keyboard access will work if you happen
to have a shell or entry field highlighted at the time of the error.
Is there a reason that xkbdproc gets a silly KPX11 stack size but xproc does
not? I'm basically hunting for a needle in a haystack washed away by floods.
Original comment by jas@corpus-callosum.com
on 19 Jan 2011 at 3:52
you could try increasing KSTACK if you think it's a stack overflow in xproc.
the older X11 code i'm looking at does put a few big buffers on the stack
(which is only 16k).
xkbdproc got a huge stack because locale code in x11 read a vast number of
names from /usr/lib onto the stack. as a result, xkbdproc can't use "up", but
that doesn't matter for it. xproc needs "up", so the same huge stack hack won't
work for it.
Original comment by Charles....@gmail.com
on 19 Jan 2011 at 4:32
An increased KSTACK didn't help. I put a few other printfs in place and set
up another XIOErrorHandler to allow a quick gdb attach before the default Xlib
exit() from the error. The XIO error always occurs during a traversal of a
MotionNotify event, possibly due to some threading issues.
There are plenty of write-ups out there about the difficulty of Xlib, XShm, and
threading, especially when processing various components in the event loop.
I've seen suggestions that going fully to XCB _might_ prove beneficial. If
nothing else, I'll increment the reading of the "I hate Xlib and so should
you" post.
Original comment by jas@corpus-callosum.com
on 20 Jan 2011 at 6:31
what happens if you simply delete
/* remove excess MotionNotify events from queue and keep last one */
while(XCheckTypedWindowEvent(xmcon, xdrawable, MotionNotify, &motion) == True)
me = (XMotionEvent *) &motion;
from win-x11a's MotionNotify case. mousetrack does that better itself anyway.
Original comment by Charles....@gmail.com
on 20 Jan 2011 at 9:16
Same error:
XIO: fatal IO error 4 (Interrupted system call) on X server ":0.0"
after 3039920 requests (3039918 known processed) with 0 events remaining.
I'm thinking of trying a new XCB only port of win-x11. Though the GNU/Linux
version I'm using does link libxcb in with libX11:
$ ldd o.emu
linux-gate.so.1 => (0x00d95000)
libX11.so.6 => /usr/lib/libX11.so.6 (0x00d9d000)
libXext.so.6 => /usr/lib/libXext.so.6 (0x00993000)
libm.so.6 => /lib/libm.so.6 (0x007bb000)
libc.so.6 => /lib/libc.so.6 (0x009e3000)
libxcb.so.1 => /usr/lib/libxcb.so.1 (0x00860000)
libdl.so.2 => /lib/libdl.so.2 (0x0092d000)
/lib/ld-linux.so.2 (0x00314000)
libXau.so.6 => /usr/lib/libXau.so.6 (0x00c83000)
libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x00d77000)
there needs to be a different approach taken to get past this XIO issue.
Original comment by jas@corpus-callosum.com
on 20 Jan 2011 at 3:35
although that might be true, i think it would still probably be better to work
out first more precisely how that particular error arises. (i'd remove the
XCheckTypedWindowEvent code in any case to reduce the number of primitives
involved.)
Original comment by Charles....@gmail.com
on 20 Jan 2011 at 3:40
More details with a few extra modifications to be able to trap the error.
First, I use this diff of win-x11a.diff to help set up the important bits for
being able to catch this error. Then start o.emu in gdb and 'handle SIGTRAP'
so we can start to generate a core file. After starting wm/wm, use the xproc
pid and start two more gdb sessions, one with the xproc pid and one with the
process just before it, which should be the devpointer process. Both gdb sessions should
handle SIGTRAP and 'handle SIGUSR2 nostop noprint' (otherwise it dumps every
mouse move and expose event from the emu wm window).
The addition of the shm_ioehandler() sets up a place to attach to the process
that causes the XIO error. Once I get the error pid, a new gdb session
attaches to the pid printed. The bt.txt attachment is the backtrace from that
process.
Additionally, the xproc process can be interrupted after the XIO pid has been
attached and everything else is somewhat halted. All that produces an additional
backtrace from the xexpose() call in win-x11a.c:
(gdb) bt
#0 0x0012e416 in __kernel_vsyscall ()
#1 0x002bcd47 in sigsuspend () from /lib/libc.so.6
#2 0x0804b1a0 in osblock () at os.c:242
#3 0x0804bbe8 in qlock (q=0x81d8b00) at ../port/lock.c:59
#4 0x08051d30 in drawqlock () at ../port/devdraw.c:1983
#5 0x08069350 in xexpose (arg=0x8268290) at ../port/win-x11a.c:1147
#6 xproc (arg=0x8268290) at ../port/win-x11a.c:557
#7 0x0804b59b in tramp (arg=0x8a4e1b0) at os.c:90
#8 0x003626ae in clone () from /lib/libc.so.6
Original comment by jas@corpus-callosum.com
on 20 Jan 2011 at 5:30
Original issue reported on code.google.com by
eri...@gmail.com
on 2 Jun 2008 at 3:54