aszlig closed this issue 4 years ago.
I tried it (hopefully I recompiled it correctly), but it does not seem to work :(. I wrote a script to test rsession easily:
#!/bin/bash
I2U="ip2unix -r path=rsession.sock"
LDP=
#I2U=
#LDP=./libsocket_wrapper.so
#mkdir $PWD/rsession.sock
Rscript -e 'cat(paste(R.home("home"),R.home("share"),R.home("include"),R.home("doc"),getRversion(),sep="\n"))'|
{
readarray -t R
R_HOME=${R[0]} R_SHARE_DIR=${R[1]} R_INCLUDE_DIR=${R[2]} R_DOC_DIR=${R[3]} RSTUDIO_DEFAULT_R_VERSION_HOME=${R[0]} RSTUDIO_DEFAULT_R_VERSION=${R[4]} \
SOCKET_WRAPPER_DIR=$PWD/rsession.sock LD_PRELOAD=$LDP $I2U rsession --standalone=1 --program-mode=server --log-stderr=1 --www-address=127.0.0.1 --www-port=12345 &
echo $! >rsession.pid
}
sleep 3
SOCKET_WRAPPER_DIR=$PWD/rsession.sock LD_PRELOAD=$LDP $I2U curl -I http://127.0.0.1:12345
curl -I http://127.0.0.1:12345
kill $(< rsession.pid) && rm rsession.pid
rm -r rsession.sock
Relevant output:
+ ip2unix -r path=rsession.sock curl -I http://127.0.0.1:12345
curl: (7) Couldn't connect to server
+ curl -I http://127.0.0.1:12345
curl: (7) Failed to connect to 127.0.0.1 port 12345: Connection refused
+ kill 495701
+ rm rsession.pid
+ rm -r rsession.sock
rm: cannot remove ‘rsession.sock’: No such file or directory
@riedel: Just added a test which specifically tests rsession:
Unfortunately, the test succeeds. Here's the output with the grep command removed:
building '/nix/store/yn17s3psa0c70c2vi0934f61m36ba167-test-rsession.drv'...
Listener bound to address 127.0.0.1 port 8080
<!DOCTYPE html>
<!--
#
# index.htm
#
# Copyright (C) 2009-19 by RStudio, Inc.
#
# This program is licensed to you under the terms of version 3 of the
# GNU Affero General Public License. This program is distributed WITHOUT
# ANY EXPRESS OR IMPLIED WARRANTY, INCLUDING THOSE OF NON-INFRINGEMENT,
# MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Please refer to the
# AGPL (http://www.gnu.org/licenses/agpl-3.0.txt) for more details.
#
-->
<!-- standards mode -->
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
<meta name="gwt:property" content="compiler.stackMode=native"/>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no" />
<meta name="csrf-token" content="b83430b1-8f56-4d05-82d4-c6b6f98f5607" />
<link rel="shortcut icon" href="images/favicon.ico" />
<title>RStudio</title>
<script>window.program_mode = "server";
</script>
<link type="text/css" rel="stylesheet" href="css/icons.css" />
<script type="text/javascript" language="javascript" src="rstudio/rstudio.nocache.js"></script>
</head>
<body>
</body>
</html>
I also checked whether rsession uses epoll_ctl, just to be sure, and compared the test against ip2unix from master (which fails as expected).
Is there something I'm doing wrong or something I'm forgetting (I'm no R expert)?
Love your test! Nix makes it really easy, impressive! I will test it on more systems. But this is strange! The system where I need it (and where I tested it, because I had the build environment installed) is a cluster node running Red Hat Enterprise Linux 7 with conda as package manager and a GPFS filesystem. Now I really have a case for switching to Nix :) Give me some time and I will get it to run somewhere.
While Nix does make a lot of things easier in this regard (after all, that's the reason I'm using it), I'd actually like to know why it doesn't work on other distros, since "just use Nix" is probably not the best thing to recommend to people who want to use your project :-D
Anyway, if you strace it, is the epoll_ctl call properly replayed (that is, it should be called twice: once for the original socket and once for the replacement socket)?
Everything looks perfectly fine! I am really puzzled, too, about why a bind would not result in a filesystem entry.
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
epoll_ctl(4, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=3395039680, u64=94153373135296}}) = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
socket(AF_UNIX, SOCK_STREAM, 0) = 7
fcntl(6, F_GETFD) = 0
fcntl(7, F_SETFD, 0) = 0
fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR) = 0
fcntl(6, F_GETSIG) = 0
fcntl(7, F_SETSIG, 0) = 0
fcntl(6, F_GETOWN_EX, {type=F_OWNER_PID, pid=0}) = 0
fcntl(7, F_SETOWN_EX, {type=F_OWNER_PID, pid=0}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=3395039680, u64=94153373135296}}) = 0
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
dup2(7, 6) = 6
close(7) = 0
bind(6, {sa_family=AF_UNIX, sun_path="/smartdata/iu5681/rsession.sock"}, 110) = 0
listen(6, 128) = 0
ioctl(6, FIONBIO, [1]) = 0
accept4(6, NULL, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
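The trace shows the replacement sequence: a new AF_UNIX socket (fd 7) gets the original's descriptor and status flags copied via fcntl, is registered with the same epoll event that was used for fd 6, and is then moved onto fd 6 with dup2() before bind() runs. Below is a minimal C sketch of that sequence, assuming Linux and that the caller still has the epoll_event the program originally registered; replace_with_unix_socket() is a hypothetical name, not ip2unix's actual API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: swap the socket behind oldfd for an AF_UNIX socket,
 * keeping the descriptor number, its flags and its epoll registration.
 * Returns 0 on success, -1 on failure. */
static int replace_with_unix_socket(int oldfd, int epfd, struct epoll_event *ev)
{
    int newfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (newfd < 0)
        return -1;

    /* Copy descriptor flags (FD_CLOEXEC) and status flags (O_NONBLOCK, ...),
     * like the F_GETFD/F_SETFD and F_GETFL/F_SETFL pairs in the trace. */
    int fdflags = fcntl(oldfd, F_GETFD);
    int flflags = fcntl(oldfd, F_GETFL);
    int ok = fdflags >= 0 && flflags >= 0
          && fcntl(newfd, F_SETFD, fdflags) == 0
          && fcntl(newfd, F_SETFL, flflags) == 0;

    /* Replay the program's earlier EPOLL_CTL_ADD for the new socket. */
    if (ok && ev != NULL)
        ok = epoll_ctl(epfd, EPOLL_CTL_ADD, newfd, ev) == 0;

    /* dup2() closes the old socket (normally dropping its epoll registration)
     * and makes oldfd refer to the Unix socket from now on. */
    if (ok)
        ok = dup2(newfd, oldfd) == oldfd;

    /* The temporary descriptor is no longer needed; on failure this also
     * removes the registration made above. */
    close(newfd);
    return ok ? 0 : -1;
}
```

One caveat of registering under the temporary descriptor: epoll identifies registrations by descriptor number together with the open file description (see the notes on duplicated descriptors in epoll(7)). The registration therefore survives close(7), because fd 6 still refers to the same description, and keeps reporting the program's original data; a later EPOLL_CTL_MOD or EPOLL_CTL_DEL issued on fd 6 would, however, not find it. Registering after dup2() under the original number would avoid that.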
bind(6, {sa_family=AF_UNIX, sun_path="/smartdata/iu5681/rsession.sock"}, 110) = 0
This would suggest that the socket file should actually be there, but for some reason it gets closed and unlinked. Can you post what happens to file descriptor 6 afterwards?
That is the strange thing: nothing happens and it's not there. I even switched to /var/tmp on an XFS filesystem because I am so confused about this. I grepped for '(6' and unlink: https://gist.github.com/riedel/043936e3c80da0f2607d48d8205d63e0#file-rsession-strace-L743. I even asked on Stack Overflow whether such a thing can happen at all: https://stackoverflow.com/questions/61279813/successfull-call-to-bind2-with-af-unix-does-not-generate-a-socket-file . Don't ask me how confused I am about this.
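On the Stack Overflow question: a successful bind() on an AF_UNIX socket does create the socket file, but that file is only a name. Unlinking it does not close or disturb the listening socket; it only makes the path unreachable for new clients. The following C sketch (Linux, helper names hypothetical, /tmp assumed writable) reproduces that state: the socket is still listening, yet connecting via the path fails with ENOENT, which is presumably what the ip2unix-wrapped curl in the script above runs into.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Fills a sockaddr_un for a pathname (filesystem) socket. */
static void make_addr(struct sockaddr_un *addr, const char *path)
{
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    strncpy(addr->sun_path, path, sizeof addr->sun_path - 1);
}

/* Returns 1 if path exists and is a socket file, 0 otherwise. */
static int is_socket_file(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISSOCK(st.st_mode);
}

/* bind() + listen() on path; returns the listening fd or -1. */
static int listen_unix(const char *path)
{
    struct sockaddr_un addr;
    make_addr(&addr, path);
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* connect() to path; returns the connected fd, or -1 with errno set. */
static int connect_unix(const char *path)
{
    struct sockaddr_un addr;
    make_addr(&addr, path);
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        int saved = errno; /* close() must not clobber connect()'s errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Usage: call listen_unix(path), check is_socket_file(path), unlink(path), and observe that getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, ...) still reports a listening socket while connect_unix(path) fails with ENOENT.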
@riedel: Hm, first of all: Is /smartdata/iu5681 on a different file system, and if so, which file system type?
Also, can you repeat the strace with the -f option? Just after binding to the socket, there is a clone syscall involved.
Doesn't get better: https://gist.github.com/riedel/043936e3c80da0f2607d48d8205d63e0#file-rsession-strace-L759
The filesystem is IBM GPFS. I had that suspicion before, too, so for this run I switched to:
/dev/mapper/vg01-lv_var on /var type xfs (rw,relatime,attr2,inode64,noquota)
Doesn't get better
Right, but at least we now know where the socket file gets unlinked:
https://gist.github.com/riedel/043936e3c80da0f2607d48d8205d63e0#file-rsession-strace-L3960
Okay, so the interesting part here is that it happens prior to executing /bin/sh -c 'git "--version"', where all file descriptors are closed. However, since the file descriptor is still open in the parent after the clone, we should not unlink the socket file.
To fix this, we need to implement some kind of reference counting... oh geesh...
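A minimal sketch of such reference counting, assuming the wrapper keeps one record per socket file it created in bind(): dup-style calls add a reference, close() drops one, and only the last close in the process that performed the bind() may unlink. The PID check covers the case above, where a child created by clone() closes its inherited descriptors before exec. All names here are hypothetical and not taken from ip2unix; PIDs are passed in explicitly so the logic can be exercised without forking (real code would pass getpid()).

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical bookkeeping for one socket file created by a wrapped bind(). */
struct bound_path {
    pid_t owner; /* process that performed bind() and created the file */
    int refs;    /* descriptors in the owner referring to the socket */
};

/* Called after a successful bind(): one descriptor, owned by self. */
static void on_bind(struct bound_path *b, pid_t self)
{
    b->owner = self;
    b->refs = 1;
}

/* dup(), dup2() or fcntl(F_DUPFD) created another descriptor. */
static void on_dup(struct bound_path *b)
{
    b->refs++;
}

/* Called from the close() wrapper; returns true if the socket file
 * should be unlinked now. */
static bool on_close(struct bound_path *b, pid_t self)
{
    /* A forked child closing inherited descriptors must never unlink:
     * the parent may still be listening on the socket. */
    if (self != b->owner)
        return false;
    if (b->refs > 0)
        b->refs--;
    return b->refs == 0;
}
```

A trade-off of the PID check: if the parent exits while a child still holds the socket, nobody unlinks the file, so a stale socket file may be left behind.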
@riedel: Which glibc version do you have? According to distrowatch, RHEL 7 should have 2.17, is this correct?
@riedel: Let's continue in #16, since epoll_ctl should be properly handled here.
Some services, particularly the rsession command from RStudio and possibly a few others, use epoll_ctl() to add the file descriptor of the socket before actually binding the socket.
Since we need the sockaddr information from the bind syscall to decide whether we need to transform the socket into a Unix domain socket or not, we have to handle epoll_ctl alongside the other socket operations: once we replace the socket, the file descriptor added via epoll_ctl is no longer valid.
Right now the way we replay epoll_ctl() calls is somewhat dumb. For example, if the program does EPOLL_CTL_ADD followed by EPOLL_CTL_MOD, we essentially repeat the same chain even though we could just shrink it down to one EPOLL_CTL_ADD. Nevertheless, since epoll_ctl is usually only used once per socket, this shouldn't be an issue.
@riedel: Can you please check whether this fixes your issue with rsession?
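The shrinking mentioned in the description could look roughly like this: instead of replaying every recorded epoll_ctl() call, fold them into the final registration state per epoll instance and socket, then replay at most one EPOLL_CTL_ADD on the new socket. This is a hypothetical C sketch, not the code in this change; error reporting is simplified to -1 where the kernel would return EEXIST or ENOENT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical record of what the program registered for one socket in
 * one epoll instance, kept so it can be replayed after replacement. */
struct epoll_record {
    bool registered;
    struct epoll_event ev;
};

/* Folds one epoll_ctl() call into the record. Returns 0 if the call is
 * valid for the current state, -1 where the kernel would have failed. */
static int record_epoll_ctl(struct epoll_record *r, int op,
                            const struct epoll_event *ev)
{
    switch (op) {
    case EPOLL_CTL_ADD:
        if (r->registered || ev == NULL)
            return -1; /* kernel: EEXIST */
        r->registered = true;
        r->ev = *ev;
        return 0;
    case EPOLL_CTL_MOD:
        if (!r->registered || ev == NULL)
            return -1; /* kernel: ENOENT */
        r->ev = *ev;   /* ADD + MOD collapses into ADD with the new event */
        return 0;
    case EPOLL_CTL_DEL:
        if (!r->registered)
            return -1; /* kernel: ENOENT */
        r->registered = false;
        return 0;
    default:
        return -1;
    }
}

/* Replays the folded state as at most one EPOLL_CTL_ADD on newfd. */
static int replay_epoll(const struct epoll_record *r, int epfd, int newfd)
{
    if (!r->registered)
        return 0; /* nothing to replay, e.g. after ADD + DEL */
    struct epoll_event ev = r->ev;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, newfd, &ev);
}
```

Folding keeps the replay independent of how many calls the program made, at the cost of one small record per socket and epoll instance.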