gramineproject / gramine

A library OS for Linux multi-process applications, with Intel SGX support
GNU Lesser General Public License v3.0
594 stars 196 forks source link

[LibOS] `__process_pending_options()` is buggy in `shim_socket.c` #87

Closed dzygann closed 2 years ago

dzygann commented 3 years ago

Hi,

In want to run the mumble server (https://github.com/mumble-voip/mumble) called murmur in Graphene. My OS is Ubuntu 20.04.

I copied the memcached manifest template and the Makefile and customized it to my needs. When I run graphene in direct mode I got the following error from murmur:

Failed to set initial capabilities

Could be the issue, that some system calls are missing? The log throws a couple of warnings around the error message:

[P255363:T1:murmurd] debug: glibc register library /lib/x86_64-linux-gnu/libgpg-error.so.0 loaded at 0x5599043ca000
[P255363:T1:murmurd] debug: adding a library for gdb: file:/lib/x86_64-linux-gnu/libgpg-error.so.0
[P255363:T1:murmurd] warning: Unsupported system call prctl
[P255363:T1:murmurd] warning: Unsupported system call prctl
[P255363:T1:murmurd] warning: Unsupported system call prctl
[P255363:T1:murmurd] warning: Unsupported system call prctl
[P255363:T1:murmurd] warning: Unsupported system call prctl
[P255363:T1:murmurd] warning: Unsupported system call prctl
[P255363:T1:murmurd] warning: Unsupported system call sysinfo
[P255363:T1:murmurd] debug: Creating pipe: pipe.srv:8e8056f4f079e6d4c14a29f20d09a6d4039c90784f069159aa22a9691f2098b8
[P255363:T1:murmurd] debug: Creating pipe: pipe.srv:a2c4dc9fd2c5151d7fae45d205f377c50f1ab2868222fcba9469e3875db0d86e
[P255363:T1:murmurd] debug: Creating pipe: pipe.srv:2410492390f15c7266c199e70a5f57b1cb967c8ea258aa795b33a60f338ebc22
[P255363:T1:murmurd] debug: Creating pipe: pipe.srv:25d0a8cb62f22e2b187fbf820d0362b0fcf4cbfa05f7a46da3f8ff5e3d0e50a5
[P255363:T1:murmurd] warning: Unsupported system call capget
[P255363:T1:murmurd] warning: Unsupported system call capset
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx
Failed to set initial capabilities                                                             <- murmur error message
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx
[P255363:T1:murmurd] warning: Unsupported system call statx

Before Graphene stops to work it writes the following debug message into the log:

[P255501:T1:murmurd] debug: process 255501 exited with status 0
[P255503:T2:murmurd] debug: ipc send to 255501: IPC_MSG_LEASE
[P255503:T2:murmurd] debug: Sending ipc message to 255501
[P255503:T2:murmurd] debug: Waiting for response (seq = 2)
[P255503:i1:murmurd] debug: IPC leader disconnected
[P255503:i1:murmurd] debug: Unknown process (vmid: 0x3e60d) disconnected

Is my assumption correct or do I miss something else?

boryspoplawski commented 3 years ago

Graphene currently does not support thread capabilities (including capget and capset syscalls) and I'm not aware of any plans to do so. I can think of 3 ways you can go with:

dzygann commented 3 years ago

Hi @boryspoplawski

I would like to implement a dummy syscall implementation. Do you have some documentation for it or could you point to some src files where I can see how to implement it?

----- Edit ----

Unfortunately, it's not enough to comment out the capabilities functions. The application stops again, but this time without a message.

[P6141:T1:murmurd] warning: Unsupported system call prctl
[P6141:T1:murmurd] warning: Unsupported system call prctl
[P6141:T1:murmurd] warning: Unsupported system call prctl
[P6141:T1:murmurd] warning: Unsupported system call prctl
[P6141:T1:murmurd] warning: Unsupported system call prctl
[P6141:T1:murmurd] warning: Unsupported system call prctl
[P6141:T1:murmurd] warning: Unsupported system call sysinfo
[P6141:T1:murmurd] debug: Creating pipe: pipe.srv:3d54044c497df4decce34df2c0fc58e7fc3450a767b5ca28941a36bfe948e101
[P6141:T1:murmurd] debug: Creating pipe: pipe.srv:bd3650359271a90ab9fdb2a454b436257a243445877f07904818c3fce3aedee1
[P6141:T1:murmurd] debug: Creating pipe: pipe.srv:c603d77c9ae8998c1f3b2c068d0fda21e908087f2b4c65acd962c6ba0a6b36c7
[P6141:T1:murmurd] debug: Creating pipe: pipe.srv:615d216a41706ba03f227e4685fba79e67bfe2f9970416a4443a0f6c556dc061
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx
[P6141:T1:murmurd] warning: Unsupported system call statx

And here it last lines before the applications stops

[P6143:murmurd] debug: memory entry [0x7f7c3887cde0]: 0x2769b2bb1000-0x2769b2bd2000
[P6143:murmurd] debug: restored memory from checkpoint
[P6143:murmurd] debug: receiving 8 PAL handles
[P6143:murmurd] debug: restoring checkpoint at 0x7f7c3887b000 rebased from 0x7fb83f863000
[P6143:T2:murmurd] debug: successfully restored checkpoint at 0x7f7c3887b000 - 0x7f7c38891b10
[P6143:T2:murmurd] debug: Creating pipe: pipe.srv:6143
[P6143:T2:murmurd] debug: sending IPC_MSG_CONNBACK message to 6141
[P6143:T2:murmurd] debug: Sending ipc message to 6141
[P6141:i1:murmurd] debug: IPC worker: received IPC message from 6143: code=1 size=128 src=6143 dst=6141 seq=1
[P6141:i1:murmurd] debug: Sending ipc message to 6143
[P6143:T2:murmurd] debug: Waiting for response (seq = 1)
[P6143:i1:murmurd] debug: IPC worker started
[P6143:i1:murmurd] debug: IPC worker: received IPC message from 6141: code=2 size=80 src=6141 dst=6143 seq=1
[P6143:T2:murmurd] debug: Finished waiting for response (seq = 1, ret = 0)
[P6141:T1:murmurd] debug: ---- shim_exit_group (returning 0)
[P6143:T2:murmurd] debug: Shim process initialized
[P6141:i1:murmurd] debug: IPC worker: exiting worker thread
[P6141:T1:murmurd] debug: process 6141 exited with status 0
[P6143:T2:murmurd] debug: ipc send to 6141: IPC_MSG_LEASE
[P6143:T2:murmurd] debug: Sending ipc message to 6141
[P6143:T2:murmurd] debug: Waiting for response (seq = 2)
[P6143:i1:murmurd] debug: IPC leader disconnected
[P6143:i1:murmurd] debug: Unknown process (vmid: 0x17fd) disconnected

I will try to comment out these other functions, too but I think I have to do some implementations.

---- Edit 2 ----

I just tried to run the application in the foreground and a lot of messages came up, which could help to analyse the issue (still commented out capabilities):

error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
<X>2021-06-06 15:01:00.436 SSL: OpenSSL version is 'OpenSSL 1.1.1f  31 Mar 2020'
<W>2021-06-06 15:01:00.437 Initializing settings from /murmur.ini (basepath /)
<C>2021-06-06 15:01:00.437 Could not create AF_NETLINK socket (Function not implemented)      <- Why?
<W>2021-06-06 15:01:00.437 Meta: Unable to acquire list of network interfaces.
<W>2021-06-06 15:01:00.437 Meta: Could not determine IPv4/IPv6 support via network interfaces, assuming support for both.
<W>2021-06-06 15:01:00.753 MetaParams: TLS cipher preference is "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:AES256-SHA:AES128-SHA"
<C>2021-06-06 15:01:00.753 WARNING: You are running murmurd as root, without setting a uname in the ini file. This might be a security risk.
<W>2021-06-06 15:01:00.776 ServerDB: Opened SQLite database /murmur.sqlite
<W>2021-06-06 15:01:00.776 ServerDB: Using SQLite's default rollback journal.
<F>2021-06-06 15:01:00.776 SQL Error [PRAGMA ENCODING;]: disk I/O error Unable to execute statement   <- Why again?

Murmur trace starting from Warning

write(2, "<W>2021-06-06 17:16:19.455 Serve"..., 78<W>2021-06-06 17:16:19.455 ServerDB: Using SQLite's default rollback journal.
) = 78
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/home/developer/workspaces/graphene/Examples/mumble-voip/murmur.sqlite-journal", 0x7fff13f743e0) = -1 ENOENT (No such file or directory)
stat("/home/developer/workspaces/graphene/Examples/mumble-voip/murmur.sqlite-wal", 0x7fff13f743e0) = -1 ENOENT (No such file or directory)
fstat(10, {st_mode=S_IFREG|0640, st_size=110592, ...}) = 0
pread64(10, "SQLite format 3\0\20\0\1\1\0@  \0\0\0?\0\0\0\33"..., 4096, 0) = 4096
pread64(10, "\r\17\370\0\26\3\215\0\17x\17\317\17\25\16\303\16_\16\27\r`\f\260\fG\v\350\v2\n\205"..., 4096, 77824) = 4096
pread64(10, "\r\f\32\0\22\1\4\0\v\21\n\177\n\v\t6\10b\7\337\7\1\6R\5\213\4\250\3\262\3<"..., 4096, 81920) = 4096
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/home/developer/workspaces/graphene/Examples/mumble-voip/murmur.sqlite-journal", 0x7fff13f75020) = -1 ENOENT (No such file or directory)
pread64(10, "\0\0\0?\0\0\0\33\0\0\0\0\0\0\0\0", 16, 24) = 16
stat("/home/developer/workspaces/graphene/Examples/mumble-voip/murmur.sqlite-wal", 0x7fff13f75020) = -1 ENOENT (No such file or directory)
fstat(10, {st_mode=S_IFREG|0640, st_size=110592, ...}) = 0
pread64(10, "\n\0\0\0\2\17\337\0\17\353\17\337\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8192) = 4096
pread64(10, "\r\0\0\0\2\17\330\0\17\345\17\330\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4096) = 4096
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
brk(0x564bea5f4000)                     = 0x564bea5f4000
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f88209a33c0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 11
close(11)                               = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f881bbf5000
mprotect(0x7f881bbf6000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7f881c3f4ef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[6439], tls=0x7f881c3f5700, child_tidptr=0x7f881c3f59d0) = 6439
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f881b3f4000
mprotect(0x7f881b3f5000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7f881bbf3ef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[6440], tls=0x7f881bbf4700, child_tidptr=0x7f881bbf49d0) = 6440

Graphene trace starting from warning

<W>2021-06-06 15:18:20.952 ServerDB: Using SQLite's default rollback journal.
[P6450:T1:murmurd] trace: ---- return from shim_write(...) = 0x4e
[P6450:T1:murmurd] trace: ---- shim_fcntl(11, F_SETLK, 0x3e34ce3c97f0) = -38
[P6450:T1:murmurd] trace: ---- shim_gettimeofday(0x3e34ce3ca830, 0x0) = 0x0
[P6450:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P6450:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P6450:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P6450:T1:murmurd] trace: ---- shim_write(2, 0x3e34ce3c8230, 0x64) ...
<F>2021-06-06 15:18:20.952 SQL Error [PRAGMA ENCODING;]: disk I/O error Unable to execute statement
[P6450:T1:murmurd] trace: ---- return from shim_write(...) = 0x64
[P6450:T1:murmurd] trace: ---- shim_fstat(11, 0x3e34ce3ca510) = 0x0
[P6450:T1:murmurd] trace: ---- shim_stat("//murmur.sqlite", 0x3e34ce3ca5a0) = 0x0
[P6450:T1:murmurd] trace: ---- shim_close(11) = 0x0
[P6450:T1:murmurd] debug: ---- shim_exit_group (returning 1)
[P6450:i1:murmurd] debug: IPC worker: exiting worker thread
[P6450:T1:murmurd] debug: process 6450 exited with status 1

Thanks in advance.

boryspoplawski commented 3 years ago

Hi @boryspoplawski

I would like to implement a dummy syscall implementation. Do you have some documentation for it or could you point to some src files where I can see how to implement it?

The closes we have are process groups, the dummy implementation should be pretty similar, as we only alow for reading and storing them (they are not checked/enforced anywhere): https://github.com/oscarlab/graphene/blob/master/LibOS/shim/src/sys/shim_getuid.c#L65 But I wouldn't go with it, since commenting their usage out did not help.

Unfortunately, it's not enough to comment out the capabilities functions. The application stops again, but this time without a message.

And here it last lines before the applications stops

Unfortunately there's not much here, the process just exits with code 0...

I just tried to run the application in the foreground and a lot of messages came up, which could help to analyse the issue (still commented out capabilities):

AF_NETLINK is not currently supported in Graphene, that's why you see such message. That shouldn't be a problem tho and the application seems to go further anyway.

SQL Error - no idea, that's an error coming from the application itself, but from Graphene log it seems that the application is using file locking - again, this is not supported in Graphene currently, so probably that's why it fails (Graphene returns -38 which means -ENOSYS). The support for file locking is on the way, but it might still take a while.

dzygann commented 3 years ago

But I wouldn't go with it, since commenting their usage out did not help.

Yes, I agree.

SQL Error - no idea, that's an error coming from the application itself, but from Graphene log it seems that the application is using file locking - again, this is not supported in Graphene currently, so probably that's why it fails (Graphene returns -38 which means -ENOSYS). The support for file locking is on the way, but it might still take a while.

Yes, SQLite uses file locking. So this makes complete sense. Unfortunately, I don't know how to disable it, but I could exchange the SQLite DB with a MySQL DB. Maybe it's a little bit too late for me now, but in general, it should be possible to connect to a MySQL DB? Now, I get the following error:

[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_write(2, 0x165f96145320, 0xad) ...
<F>2021-06-06 22:45:17.581 ServerDB: Failed initialization: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) QMYSQL: Unable to connect
[P14301:T1:murmurd] trace: ---- return from shim_write(...) = 0xad
[P14301:T1:murmurd] debug: ---- shim_exit_group (returning 1)
[P14301:i1:murmurd] debug: IPC worker: exiting worker thread
[P14301:T1:murmurd] debug: process 14301 exited with status 1
boryspoplawski commented 3 years ago

This log is not enough, it just prints the error message. Actual cause of the error happened before these lines.

dzygann commented 3 years ago

It was too late, yesterday :)

[P14301:T1:murmurd] debug: adding a library for gdb: file:/usr/local/lib/x86_64-linux-gnu/graphene/runtime/glibc/libnss_db.so.2
[P14301:T1:murmurd] trace: ---- shim_mprotect(0x165f91819000, 0x1000, PROT_READ) ...
[P14301:T1:murmurd] trace: ---- return from shim_mprotect(...) = 0x0
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/var/db/services.db", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/etc/services", O_RDONLY|0x80000, 0000) = 0xb
[P14301:T1:murmurd] trace: ---- shim_newfstatat(11, "", 0x165f961476e0, 4096) = 0x0
[P14301:T1:murmurd] trace: ---- shim_lseek(11, 0x0, SEEK_SET) = 0x0
[P14301:T1:murmurd] trace: ---- shim_read(11, 0x165f92006210, 0x2000) ...
[P14301:T1:murmurd] trace: ---- return from shim_read(...) = 0x2000
[P14301:T1:murmurd] trace: ---- shim_close(11) = 0x0
[P14301:T1:murmurd] trace: ---- shim_rt_sigaction([SIGPIPE], 0x165f96147780, 0x165f96147820, 0x8) = 0x0
[P14301:T1:murmurd] trace: ---- shim_gettimeofday(0x165f96147b40, 0x0) = 0x0
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_getpid() = 0x1
[P14301:T1:murmurd] trace: ---- shim_socket(UNIX, STREAM, 0) = 0xb
[P14301:T1:murmurd] trace: ---- shim_connect(11, {family=UNIX,path=/var/run/mysqld/mysqld.sock}, 110) ...
[P14301:T1:murmurd] trace: ---- return from shim_connect(...) = -111
[P14301:T1:murmurd] trace: ---- shim_shutdown(11, 2) = -107
[P14301:T1:murmurd] trace: ---- shim_close(11) = 0x0
[P14301:T1:murmurd] trace: ---- shim_gettimeofday(0x165f96147920, 0x0) = 0x0
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P14301:T1:murmurd] trace: ---- shim_write(2, 0x165f96145320, 0xad) ...
<F>2021-06-06 22:45:17.581 ServerDB: Failed initialization: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) QMYSQL: Unable to connect
[P14301:T1:murmurd] trace: ---- return from shim_write(...) = 0xad
[P14301:T1:murmurd] debug: ---- shim_exit_group (returning 1)
[P14301:i1:murmurd] debug: IPC worker: exiting worker thread
[P14301:T1:murmurd] debug: process 14301 exited with status 1
dimakuv commented 3 years ago

Here is the problematic system call:

[P14301:T1:murmurd] trace: ---- shim_connect(11, {family=UNIX,path=/var/run/mysqld/mysqld.sock}, 110) ...
[P14301:T1:murmurd] trace: ---- return from shim_connect(...) = -111

The error code is 111 | ECONNREFUSED | Connection refused. (See https://mariadb.com/kb/en/operating-system-error-codes/)

Are you sure that your MySQL service is actually running on your machine? Do you see the file /var/run/mysqld/mysqld.sock? Do you have the right permissions to access it from the Graphene process?

dzygann commented 3 years ago

Hi the service is up and running and the mysqld.sock exists.

Do you have the right permissions to access it from the Graphene process?

What do I have to do to give Graphene the permissions?

dimakuv commented 3 years ago

What do I have to do to give Graphene the permissions?

Nothing, it's just from which user do you start Graphene. If this user has access to the mysqld.sock, then no need to do anything.

I just thought that maybe your murmurd application changes the user to some other user (e.g. to root) to connect to mysqld.sock? And you commented out this part (if I understood correctly), so now your murmurd doesn't have enough permissions to connect.

dzygann commented 3 years ago

I can start the application by root and by my local user. Both works. As root I get a additional warining the logs:

<C>2021-06-07 10:22:12.160 WARNING: You are running murmurd as root, without setting a uname in the ini file. This might be a security risk.

Here is a output from the murmur without Graphene trace:

rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7ff9c839a210}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
getpid()                                = 14926
socket(AF_UNIX, SOCK_STREAM, 0)         = 10
connect(10, {sa_family=AF_UNIX, sun_path="/var/run/mysqld/mysqld.sock"}, 110) = 0
setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = -1 EOPNOTSUPP (Operation not supported)
setsockopt(10, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
recvfrom(10, "[\0\0\0\n8.0.25-0ubuntu0.20.04.1\0\25\0\0"..., 16384, 0, NULL, NULL) = 95
stat("/usr/share/mysql/charsets/Index.xml", {st_mode=S_IFREG|0644, st_size=19474, ...}) = 0
openat(AT_FDCWD, "/usr/share/mysql/charsets/Index.xml", O_RDONLY) = 11
read(11, "<?xml version='1.0' encoding=\"ut"..., 19474) = 19474
close(11)  

I can even define that my local user should be used, but both does not work. Here is a snippet from the configuration:

; If Murmur is started as root, which user should it switch to?
; This option is ignored if Murmur isn't started with root privileges.
;uname=developer

I also think that something is blocking the connection, but what and why?

dimakuv commented 3 years ago

I have a suspicion. You have this trace:

[P14301:T1:murmurd] trace: ---- shim_connect(11, {family=UNIX,path=/var/run/mysqld/mysqld.sock}, 110) ...
[P14301:T1:murmurd] trace: ---- return from shim_connect(...) = -111

But you probably don't have the directory /var/run/mysqld/ mounted inside Graphene. Please add something like the following in your manifest file:

fs.mount.etc.type = "chroot"
fs.mount.etc.path = "/var/run/mysqld/"
fs.mount.etc.uri = "file:/var/run/mysqld/"
dzygann commented 3 years ago

You're right, I don't had the mount definition. But it is not sufficient. The message is still the same:

[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_getpid() = 0x1
[P15108:T1:murmurd] trace: ---- shim_socket(UNIX, STREAM, 0) = 0xb
[P15108:T1:murmurd] trace: ---- shim_connect(11, {family=UNIX,path=/var/run/mysqld/mysqld.sock}, 110) ...
[P15108:T1:murmurd] trace: ---- return from shim_connect(...) = -111
[P15108:T1:murmurd] trace: ---- shim_shutdown(11, 2) = -107
[P15108:T1:murmurd] trace: ---- shim_close(11) = 0x0
[P15108:T1:murmurd] trace: ---- shim_gettimeofday(0x16a40a228920, 0x0) = 0x0
[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_openat(AT_FDCWD, "/home/developer/workspaces/graphene/LibOS/../Runtime/etc/localtime", O_RDONLY|0x80000, 0000) = -2
[P15108:T1:murmurd] trace: ---- shim_write(2, 0x16a40a226320, 0xad) ...
<F>2021-06-07 10:15:32.754 ServerDB: Failed initialization: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) QMYSQL: Unable to connect
[P15108:T1:murmurd] trace: ---- return from shim_write(...) = 0xad
[P15108:T1:murmurd] debug: ---- shim_exit_group (returning 1)
[P15108:i1:murmurd] debug: IPC worker: exiting worker thread
[P15108:T1:murmurd] debug: process 15108 exited with status 1

Do I have to add some other property like one of these:

loader.env.PATH = "{{ r_home }}/bin:/usr/bin:/bin"
loader.env.USERNAME = ""
loader.env.HOME = ""
loader.env.PWD = ""
loader.env.R_ARCH = ""
loader.env.R_HOME = "{{ r_home }}"
boryspoplawski commented 3 years ago

Wait, don't mount /var/run/mysqld inside Graphene, you are not supposed to share it between Graphene and outside (I mean, you can, if that's what you really want to do...) What is creating your local MySQL server? That should probably be run inside Graphene too.

Actaully, if the path wasn't mounted inside Graphene, it would fail with ENOENT, not ECONNREFUSED, so probably there is something off with your setup - it seems you already had "/var/run/mysqld/" mounted previously.

BTW, you can create temporary filesystem with:

fs.mount.tmpfs.type = "tmpfs"
fs.mount.tmpfs.path = "/var/run/mysqld"
fs.mount.tmpfs.uri = "file:dummy-unused-by-tmpfs-uri"
dzygann commented 3 years ago

What is creating your local MySQL server? That should probably be run inside Graphene too.

The server is running on my local machine. What I have to do to run the MySQL server inside Graphene?

it seems you already had "/var/run/mysqld/" mounted previously.

I had mount /var, only. Does it work recursively through the file tree?

dzygann commented 3 years ago

Hi, I could solve this issue by using the host 127.0.0.1 instead of localhost.

Now, the next issue came up with error code -38 again:

[P16873:T1:murmurd] trace: ---- return from shim_read(...) = 0x5
[P16873:T1:murmurd] trace: ---- shim_read(11, 0x279a906ba4e8, 0x29) ...
[P16873:T1:murmurd] trace: ---- return from shim_read(...) = 0x29
[P16873:T1:murmurd] trace: ---- shim_socket(INET6, SOCK_NONBLOCK|SOCK_CLOEXEC|STREAM, 0) = 0x13
[P16873:T1:murmurd] trace: ---- shim_setsockopt(19, 1, 10, 0x279a948178dc, 4) = 0x0
[P16873:T1:murmurd] trace: ---- shim_setsockopt(19, 1, 2, 0x279a9481794c, 4) = 0x0
[P16873:T1:murmurd] trace: ---- shim_setsockopt(19, 41, 26, 0x279a948178b8, 4) = 0x0
[P16873:T1:murmurd] trace: ---- shim_bind(19, 0x279a948178c0, 28) = 0x0
[P16873:T1:murmurd] trace: ---- shim_getsockname(19, 0x279a948178b0, 0x279a9481785c) = 0x0
[P16873:T1:murmurd] trace: ---- shim_socket(INET, SOCK_CLOEXEC|DGRAM, 0) = 0x14
[P16873:T1:murmurd] trace: ---- shim_ioctl(20, OP 0x8910, 0x279a948177d0) ...
[P16873:T1:murmurd] trace: ---- return from shim_ioctl(...) = -38                                    <- ioctl returns -38
[P16873:T1:murmurd] trace: ---- shim_close(20) = 0x0
[P16873:T1:murmurd] error: Internal memory fault at 0x00000009 (IP = +0x63a30, VMID = 16873, TID = 1)

Here is the trace without Graphene. It seems something is wrong with getpeername:

[pid 14926] sendto(10, "\t\0\0\0\3\0\1COMMIT", 13, 0, NULL, 0) = 13
[pid 14926] recvfrom(10, "\24\0\0\1\0\0\0\2@\0\0\0\v\5\t\10________", 16384, 0, NULL, NULL) = 24
[pid 14926] socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 18
[pid 14926] setsockopt(18, SOL_SOCKET, SO_OOBINLINE, [1], 4) = 0
[pid 14926] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 14926] setsockopt(18, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 14926] bind(18, {sa_family=AF_INET6, sin6_port=htons(64738), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0
[pid 14926] getsockname(18, {sa_family=AF_INET6, sin6_port=htons(64738), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0
[pid 14926] getsockopt(18, SOL_IPV6, IPV6_V6ONLY, [0], [4]) = 0
[pid 14926] getpeername(18, 0x7ffce37ae630, [28]) = -1 ENOTCONN (Transport endpoint is not connected)     <- It seems that this one causes the issue
[pid 14926] getsockopt(18, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
[pid 14926] listen(18, 50)              = 0
[pid 14926] write(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 14926] sendto(10, "\r\0\0\0\3\0\1BEGIN WORK", 17, 0, NULL, 0) = 17
[pid 14926] recvfrom(10, "\24\0\0\1\0\0\0\3@\0\0\0\v\5\t\10T_______", 16384, 0, NULL, NULL) = 24
[pid 14926] sendto(10, ";\0\0\0\26INSERT INTO `murmur_slog` ("..., 63, 0, NULL, 0) = 63
[pid 14926] recvfrom(10, "\f\0\0\1\0/\0\0\0\0\0\2\0\0\0\0\27\0\0\2\3def\0\0\0\1?\0\f?"..., 16384, 0, NULL, NULL) = 70
[pid 14926] sendto(10, "\5\0\0\0\32/\0\0\0", 9, 0, NULL, 0) = 9
[pid 14926] recvfrom(10, "\7\0\0\1\0\0\0\3\0\0\0", 16384, 0, NULL, NULL) = 11
[pid 14926] sendto(10, "9\0\0\0\27/\0\0\0\0\1\0\0\0\2\0\1\3\0\0\376\0\0\1\0\0\0!Serv"..., 61, 0, NULL, 0) = 61
[pid 14926] recvfrom(10, "\7\0\0\1\0\1\0\3\0\0\0", 16384, 0, NULL, NULL) = 11
[pid 14926] sendto(10, "\5\0\0\0\31/\0\0\0", 9, 0, NULL, 0) = 9
[pid 14926] sendto(10, "\t\0\0\0\3\0\1COMMIT", 13, 0, NULL, 0) = 13
[pid 14926] recvfrom(10, "\24\0\0\1\0\0\0\2@\0\0\0\v\5\t\10________", 16384, 0, NULL, NULL) = 24
[pid 14926] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
[pid 14926] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
[pid 14926] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
[pid 14926] write(2, "<W>2021-06-07 10:18:08.927 1 => "..., 66<W>2021-06-07 10:18:08.927 1 => Server listening on 0.0.0.0:64738
) = 66
[pid 14926] getsockname(18, {sa_family=AF_INET6, sin6_port=htons(64738), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [128->28]) = 0
[pid 14926] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 19
[pid 14926] setsockopt(19, SOL_IP, IP_PKTINFO, [1], 4) = 0
[pid 14926] setsockopt(19, SOL_IPV6, IPV6_RECVPKTINFO, [1], 4) = 0
[pid 14926] getsockopt(18, SOL_IPV6, IPV6_V6ONLY, [0], [4]) = 0
[pid 14926] setsockopt(19, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 14926] bind(19, {sa_family=AF_INET6, sin6_port=htons(64738), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0
[pid 14926] setsockopt(19, SOL_IP, IP_TOS, [224], 4) = 0
[pid 14926] getsockopt(19, SOL_SOCKET, SO_PRIORITY, [0], [4]) = 0
[pid 14926] setsockopt(19, SOL_SOCKET, SO_PRIORITY, [6], 4) = 0
[pid 14926] write(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 14926] socketpair(AF_UNIX, SOCK_STREAM, 0, [20, 21]) = 0
[pid 14926] sendto(10, "\r\0\0\0\3\0\1BEGIN WORK", 17, 0, NULL, 0) = 17

Do you have an idea what's wrong and how I can solve it?

dimakuv commented 3 years ago

@dzygann This is a tough one. In particular, this error is not good:

[P16873:T1:murmurd] error: Internal memory fault at 0x00000009 (IP = +0x63a30, VMID = 16873, TID = 1)

It's unclear what happens in your application. I don't see anything particularly bad (ioctl with -38) doesn't seem to be important. I also see that the two logs (Graphene one and the native one) look quite different...

I think it could be time for debugging with GDB :( If you know how to GDB, then doing it with Graphene will be pretty straight-forward: you need to re-build Graphene in debug mode (meson build --buildtype=debug and all make invocations prepended with DEBUG=1). And then you just run under GDB like this: GDB=1 graphene-direct ./murmurd ....

dzygann commented 3 years ago

Hi, I followed the steps fromhttps://graphene.readthedocs.io/en/latest/devel/debugging.html#debugging-without-sgx-support and adjust the build meson build part like the following:

meson build -Ddirect=enabled --buildtype=debug --reconfigure

I get the following output if I run graphene with murmur:

GDB=1 graphene-direct murmurd -fg
Reading symbols from /usr/local/lib/x86_64-linux-gnu/graphene/direct/loader...
[graphene_linux_gdb.py] Loading common/pagination_gdb.py...
[graphene_linux_gdb.py] Loading common/debug_map_gdb.py...
[graphene_linux_gdb.py] Loading common/graphene.gdb...
(gdb) 

And that's it. Nothing else happens. Do I have to do further steps or did I do a mistake?

dimakuv commented 3 years ago

And that's it. Nothing else happens. Do I have to do further steps or did I do a mistake?

There is no mistake, you entered the GDB interactive debugging session :) That's how debugging in GDB works. At this point you're supposed to debug the program (by putting breakpoints and watchpoints and then typing run). But from your question looks like you haven't debugged with GDB previously.

Unfortunately, at this point I can't help more. The bug you hit looks pretty bad (and I don't know if it's the bug in murmurd or in Graphene). And the only way I can imagine to root-cause it is to debug with GDB. Unfortunately, debugging is a pretty hard thing to do, but if you want to educate yourself, you can google for "gdb debugging tutorial".

dzygann commented 3 years ago

Okay, I get the following output when I run murmur with Graphene-direct mode and enabled GDB:

Thread 1 "loader" received signal SIGSEGV, Segmentation fault.
0x00007fa7dd80ca30 in __socket_is_ipv6_v6only (hdl=0x7fa7dd5e3738) at ../LibOS/shim/src/sys/shim_socket.c:422
422         if (o->level == IPPROTO_IPV6 && o->optname == IPV6_V6ONLY) {
(gdb) n
handle_sync_signal (signum=0, info=0x0, uc=0x0) at ../Pal/src/host/Linux/db_exception.c:84
84  static void handle_sync_signal(int signum, siginfo_t* info, struct ucontext* uc) {
(gdb) n
85      int event = get_pal_event(signum);
(gdb) n
86      assert(event > 0);
(gdb) n
88      uintptr_t rip = ucontext_get_ip(uc);
(gdb) n
89      if (!ADDR_IN_PAL_OR_VDSO(rip)) {
(gdb) n
91          perform_signal_handling(event, /*is_in_pal=*/false, (PAL_NUM)info->si_addr, uc);
(gdb) n
[P2012:T1:murmurd] error: Internal memory fault at 0x00000009 (IP = +0x63a30, VMID = 2012, TID = 1)
[LWP 2020 exited]
[LWP 2018 exited]
[LWP 2017 exited]
[LWP 2016 exited]
[LWP 2012 exited]
warning: Error removing breakpoint 0
warning: Error removing breakpoint 0
warning: Error removing breakpoint 0
[Inferior 1 (process 2012) exited with code 01]

Here is a little bit more context:


(gdb) where
#0  0x00007fec9fc85a30 in __socket_is_ipv6_v6only (hdl=0x7fec9fa5c738) at ../LibOS/shim/src/sys/shim_socket.c:422
gramineproject/graphene#1  0x00007fec9fc8a12c in shim_do_getsockopt (fd=19, level=41, optname=26, optval=0x4b051180a7b0 "", optlen=0x4b051180a7b4)
    at ../LibOS/shim/src/sys/shim_socket.c:2118
gramineproject/graphene#2  0x00007fec9fc6c20c in shim_emulate_syscall (context=0x7fec9fbdff38) at ../LibOS/shim/src/shim_syscalls.c:36
gramineproject/graphene#3  0x00007fec9fc8de38 in syscalldb () at ../LibOS/shim/src/arch/x86_64/syscallas.S:173
gramineproject/graphene#4  0x00007fec9fc8de3b in __morestack () at ../LibOS/shim/src/arch/x86_64/syscallas.S:204
gramineproject/graphene#5  0x00004b050ff2538b in getsockopt () at ../sysdeps/unix/syscall-template.S:120
gramineproject/graphene#6  0x00004b05108708a1 in ?? ()
gramineproject/graphene#7  0x00004b051086dd00 in ?? ()
gramineproject/graphene#8  0x00004b0510863194 in QTcpServer::listen(QHostAddress const&, unsigned short) ()
gramineproject/graphene#9  0x00004b05115e84f4 in Server::Server(int, QObject*) ()
gramineproject/graphene#10 0x00004b05115cbd12 in Meta::boot(int) ()
gramineproject/graphene#11 0x00004b05115cc182 in Meta::bootAll() ()
gramineproject/graphene#12 0x00004b051156ebc8 in main.cold ()
gramineproject/graphene#13 0x00004b050fe4ed76 in __libc_start_main (main=0x4b0511599d30 <main>, argc=2, argv=0x4b051180af98, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x4b051180af88) at ../csu/libc-start.c:332
gramineproject/graphene#14 0x00004b051159e6ce in _start ()

(gdb) p o
$1 = (struct shim_sock_option *) 0x1
(gdb) p o->level
Cannot access memory at address 0x9
(gdb) p o->optname
Cannot access memory at address 0xd

If I'm not wrong this points to Graphene, isn't it? The method __socket_is_ipv6_v6only gets a struct shim_handle* hdl. The object itself seems to be right because the asserts are not violated. But the iteration over the pending options level or optname seems to be broken or both. Both cannot be accessed at all or by GDB. Do you have an idea why?

dimakuv commented 3 years ago

Thanks for the debugging session!

There is definitely a bug inside Graphene. You have a o object pointer pointing to 0x1 which is not valid memory. Looking at the sock->pending_options code, it looks this sub-object is freed but then later re-used (so, the case of a dangling pointer): https://github.com/oscarlab/graphene/blob/bd60bd0a42f0f29fd084d3ed3f50da64447b8595/LibOS/shim/src/sys/shim_socket.c#L1897-L1910.

I will try to debug and fix this soon-ish. Thanks again for reporting this.

dzygann commented 3 years ago

Hi @dimakuv , If you have an idea how to fix this one, even with a workaround, could you share this knowledge with me? I would implement it on my local clone. I would like to know if, mumble is working after all :)

Thanks in advance.

dimakuv commented 3 years ago

Sorry, I have plans to debug and submit a fix for this. It's non-trivial to circumvent this error... I don't have a good idea how to simply circumvent this. Maybe try different network settings in your Mumble application, if possible at all? Like different IP addresses (domain name vs IPv4 vs IPv6), different ports, etc. This may skip this code path, and you won't hit this error. But I'm doubtful.

dzygann commented 3 years ago

Hi

Sorry, I have plans to debug and submit a fix for this.

I look forward to the bug fix :) If you need some further input how I could produce the issue, let me know.

I will try your suggestions and give a feedback if one of these will work for me.

dimakuv commented 3 years ago

@dzygann Could you check https://github.com/oscarlab/graphene/pull/2495?

dzygann commented 3 years ago

Hi @dimakuv, I could start murmur without the issue anymore.

Many thanks!!!!

dimakuv commented 3 years ago

We have a PR in the old repo: https://github.com/gramineproject/graphene/pull/2495

The changes in that PR are quite trivial, but we don't have the bandwidth to re-create the PR in the new Gramine repo. We also don't have the bandwidth to review this PR right now.

So someone will re-create that old PR in this new repo at some point later. Transferred only the issue right now.

boryspoplawski commented 2 years ago

@dzygann Please verify that https://github.com/gramineproject/gramine/pull/579 fixes the issue