s-u / Cairo

R graphics device using cairo graphics library for creating high-quality output

CairoX11 crash when connection is lost #27

Closed jmw86069 closed 2 years ago

jmw86069 commented 4 years ago

I am using R-3.6.1, Cairo 1.5.12, and working remotely. I was using X11() to view R plots remotely over ssh, but it is painfully slow -- sometimes a few minutes just to run plot(1:10). Using CairoX11() is much faster -- not instantaneous, but certainly very usable. Kudos and thanks for that!

The problem is whenever my connection is lost, it crashes my whole R session. Even if I run dev.off() first, it crashes R.

For example:

packageVersion("Cairo")
# [1] '1.5.12'
library(Cairo)
CairoX11(display="localhost:10.0")
plot(1:10)
dev.off()
# null device
#           1

Then if for some reason my ssh is lost -- bad VPN, bad wifi, bad cable modem, bad luck...

> XIO:  fatal IO error 0 (Success) on X server "localhost:10.0"
      after 99 requests (99 known processed) with 0 events remaining.

(I love that it says "(Success)" like some Jedi mind trick.) ;)

And R crashes. Is there some troubleshooting you can suggest? I was hoping it's something simple I'm doing wrong. :)

My workaround is to save the R session frequently, or to save plots to PDF and sync files to my laptop. But that's no fun, and misses out on using this nice package!

jmw86069 commented 4 years ago

After some digging, my best guess is that the R internal function GEkillDevice() defined in R_ext/GraphicsDevices.h does not make the proper calls to close the X11 device. It seems to leave the connection open instead of closing it.

See: https://stackoverflow.com/questions/10792361/how-do-i-gracefully-exit-an-x11-event-loop

In Cairo, the file xlib-backend.c defines the function handleDisplayEvent().

s-u commented 4 years ago

@jmw86069 Thanks! I'm not sure that this has anything to do with GEkillDevice() because R doesn't actually know that the connection went away. I got as far as figuring out that the event handling in handleDisplayEvent calls XCheckTypedEvent, which then bails due to the closed connection. I was trying to find a way to make the X11 API return an error instead of exiting, but haven't found one yet. The first step is to detect when the X11 connection closes; then we can worry about what to do on the R side ...

s-u commented 4 years ago

I can confirm that this is close to impossible to handle. The issue is that Xlib functions are designed to take down the process if the connection closes - they assume that if you lose X11 it's equivalent to killing the process. They don't provide a way around it (there is XSetIOErrorHandler but it must not return, it is expected to kill the process), it's a design decision. I have tried a hacky work-around where I use setjmp/longjmp with XSetIOErrorHandler and I can detect and handle the loss of a connection inside the R event loop, but there are many other calls to the Xlib API and any of them can kill the process, so I don't see a safe way to guard against it short of replacing all Xlib calls.

AFAICT the X11 device does something very similar - it uses XSetIOErrorHandler globally - you get this:

 Error: X11 fatal IO error: please save work and shut down R

but that means if we use it then the X11 won't work and vice-versa... In fact, if you open the X11() device then that's enough to register the handler and even Cairo() won't crash then. So maybe using the global handler is a way to go, but it's terribly fragile - essentially what you want is by design not supported by X11.
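
For readers unfamiliar with the setjmp/longjmp work-around mentioned above, here is a minimal C sketch of the idea. It deliberately does not use Xlib: `set_io_error_handler` and `fake_x_request` are hypothetical stand-ins for `XSetIOErrorHandler` and for an arbitrary Xlib request, so it only illustrates the control flow and is not the code in Cairo's xlib-backend.c.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for Xlib's single, process-wide I/O error handler. */
typedef int (*io_error_handler_t)(void);
static io_error_handler_t io_error_handler = NULL;

/* Where a guarded call resumes after the connection is reported lost. */
static jmp_buf recover_point;

/* Stand-in for XSetIOErrorHandler(): install a handler, return the old one. */
static io_error_handler_t set_io_error_handler(io_error_handler_t handler) {
    io_error_handler_t previous = io_error_handler;
    io_error_handler = handler;
    return previous;
}

/* Handler that never returns normally: it jumps back to the guarded caller. */
static int jump_on_io_error(void) {
    longjmp(recover_point, 1);
    return 0; /* not reached */
}

/* Stand-in for any Xlib request. On a dead connection it calls the handler;
   real Xlib would terminate the process if the handler returned. */
static void fake_x_request(int connection_alive) {
    if (!connection_alive) {
        assert(io_error_handler != NULL);
        io_error_handler();
    }
}

/* Run one request under the guard. Returns 0 on success, -1 on a lost
   connection, and restores whatever handler was installed before. */
int guarded_request(int connection_alive) {
    io_error_handler_t previous = set_io_error_handler(jump_on_io_error);
    int status;
    if (setjmp(recover_point) == 0) {
        fake_x_request(connection_alive);
        status = 0;
    } else {
        status = -1; /* arrived here via longjmp from the handler */
    }
    set_io_error_handler(previous);
    return status;
}
```

In this sketch `guarded_request(0)` returns -1 instead of taking the process down. The weakness described above still applies: only calls wrapped this way are protected, and any unguarded Xlib call can still kill the process.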

jmw86069 commented 4 years ago

Wow that's pretty brilliant. The error "Error: X11 fatal IO error: please save work and shut down R" does not crash the R session. Not sure why it gets caught and "handled" somehow to avoid a crash, but it does. So this workflow successfully avoids crashing the R session, even when the connection is lost:

X11(display="localhost:10.0", type="Xlib");
library(Cairo);
CairoX11(display="localhost:10.0");
plot(1:10);

When connection is lost, it prints an error but maintains the R session. (Awesome.) Now I can keep a remote R session active over ssh, use Cairo when needed, and recover without needing to save R session every few minutes in fear!

And this Cairo package is great, such a step up from X11() in terms of performance and visual quality. No disrespect to X11(), but I'm glad to have a faster alternative.

s-u commented 4 years ago

I can use the same handling they do (so you don't need to explicitly call X11), the only issue is that Xlib supports only one handler, so if X11 gets started after Cairo it will remove our handler and vice-versa. I'll re-open it since it would be nice to add it.

The thing about the error is that Xlib insists that you kill the process after entering the handler, but R's X11 doesn't do so. Technically, that's breaking the requirement of Xlib - whether that does something bad or not depends entirely on Xlib, but likely it is just leaking memory or something like that.
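
The single-handler limitation can be illustrated with a small C sketch. Again there is no Xlib here: `set_io_error_handler` is a hypothetical stand-in for `XSetIOErrorHandler`, and the two handlers are placeholders for whatever R's X11 device and Cairo would register, so treat it as an illustration rather than either package's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Xlib's one global I/O error handler slot. */
typedef int (*io_error_handler_t)(void);
static io_error_handler_t current_handler = NULL;

/* Like XSetIOErrorHandler, returns the previously installed handler. */
static io_error_handler_t set_io_error_handler(io_error_handler_t handler) {
    io_error_handler_t previous = current_handler;
    current_handler = handler;
    return previous;
}

/* Placeholder handlers; the return value only identifies which one ran. */
static int x11_device_handler(void) { return 1; }
static int cairo_handler(void) { return 2; }

/* Cairo registers first, then the X11 device: X11's handler wins. */
int handler_after_cairo_then_x11(void) {
    set_io_error_handler(cairo_handler);
    set_io_error_handler(x11_device_handler);
    assert(current_handler != NULL);
    return current_handler();
}

/* The X11 device registers first, then Cairo: Cairo's handler wins. */
int handler_after_x11_then_cairo(void) {
    set_io_error_handler(x11_device_handler);
    set_io_error_handler(cairo_handler);
    assert(current_handler != NULL);
    return current_handler();
}
```

Whichever device registers last owns the slot, which is why the start-up order of X11() and Cairo matters in the scenario described above.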

jmw86069 commented 2 years ago

I know this issue is a placeholder for now, I just wanted to report that I still use Cairo as the only X11-friendly remote plotting interface, with the workaround:

X11()
dev.off()
library(Cairo)
CairoX11()

When I detach my R session (via tmux or GNU screen), it does not immediately crash the R session, as it did when using X11 alone - always a good thing. :)

The only small problem: when I detach my R session the cpu usage of the R process rises to 100% and stays there continuously. It's only one cpu, not a huge deal. I tested whether dev.off() before detaching would help, but it did not seem to prevent 100% cpu usage after detaching.

Also, even when re-attaching to the R session, cpu usage stays at 100%. Even when creating a new CairoX11() output device, it stays at 100%. Closing output devices does not resolve the issue.

I imagine CairoX11 (or X11) is constantly trying to re-establish a link to the remote X windows - it seems like a symptom of continuous polling. I don't really understand this process, but if it's feasible to insert a small delay, or some logic to determine if the link is severed, it might fix the 100% cpu usage issue.

s-u commented 2 years ago

@jmw86069 let me know if the latest version (1.5-15) fixes it. It should be used without calling X11 first since it now provides its own error handler that should clean up things properly.

jmw86069 commented 2 years ago

Oh my... I can't believe it's already May when I'm testing the update! It slipped by without me noticing. Sorry for the delay on my part!

I tested the current version of Cairo from Github "1.2.0" and it works! Thank you so much! :) (I'm not sure about the versioning: the DESCRIPTION shows 1.2.0 and my R says 1.2.0, so I'll assume that's correct for now.)

By "it works" here is what works:

It still displays this error message, but there are otherwise no other adverse symptoms: "Error: X11 fatal IO error: please save work and shut down R"

As a bonus, it does not take 100% cpu after detaching the R session.

Thank you again sir! :)