gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0
767 stars 261 forks source link

How to debug crashing application reading pipe (13 eacces) #1463

Closed emilhem closed 4 years ago

emilhem commented 4 years ago

An application we're trying to port to Graphene SGX crashes when it's trying to read a pipe.

Excerpt from debug output

[     1] creating pipe: pipe.srv:3ccf7db667d7ecebb192f6da3ca2eb4520e69b42e6ea1a665ab5d8b1edb62bbd
[     1] ---- shim_pipe([10, 11]) = 0
[     1] ---- shim_fcntl(10,F_GETFL,0x0) = 0
[     1] ---- shim_fcntl(10,F_SETFL,0x800) = 0
[     1] ---- shim_fcntl(11,F_GETFL,0x0) = 1
[     1] ---- shim_fcntl(11,F_SETFL,0x801) = 0
[     1] ---- shim_read(10,0x80606ba90,32) ... read
[     1] ---- return from shim_read(...) = -13

The application then crashes because of the failure reading the pipe (file descriptor 10).

The question is, how can we debug eacces on pipes? The application doesn't fail like this without Graphene SGX.

boryspoplawski commented 4 years ago

If you are familiar with gdb, you can build Graphene in debug mode (just add DEBUG=1 to make command) and run your application with GDB=1 which will open gdb with your application loaded and allow you to debug it (parts inside enclave as well, including Graphene itself).

If this application is something you can share, I could probably help you with debugging.

dimakuv commented 4 years ago

@emilhem I think I know your issue.

You're marking the created pipe ends as non-blocking. I had a bug in Graphene which I fixed as part of this PR: https://github.com/oscarlab/graphene/pull/1454. In particular, this: https://github.com/oscarlab/graphene/pull/1454/files#diff-befd1a303a9d29a9f5feea57d94d5927. We used to return "PAL_ERROR_DENIED" (which is translated to EACCES) on any unsuccessful read/write on a pipe. This is of course wrong for non-blocking pipes -- they should return EAGAIN or something similar if there is nothing to read on the pipe.

So please try https://github.com/oscarlab/graphene/pull/1454 with your application. It should work then. This PR will be merged soon (this week I hope). Please report back if this solves your issue.

emilhem commented 4 years ago

So please try #1454 with your application. It should work then. This PR will be merged soon (this week I hope). Please report back if this solves your issue.

Definitely an improvement. The application got much farther before crashing again with the same error on the same file descriptor, but on a different pid (since the application forks a lot).

[    55] ---- shim_read(10,0x801ecfd40,32) ... read
[    50] ---- shim_sched_yield() = 0
[     1] ---- shim_select(0,0x0,0x0,0x0,0x0) = -22
[     1] ---- shim_select(0,0x0,0x0,0x0,0x0) = -22
[    50] ---- shim_sched_yield() = 0
[    55] ---- return from shim_read(...) = -13

The same place as my original issue returned the following instead (probably irrelevant for this issue):

[     1] creating pipe: pipe.srv:356995fa7278abbecabe71960d75af6cda7c5977c06cab5248966d642afd2859
[     1] ---- shim_pipe([10, 11]) = 0
[     1] ---- shim_fcntl(10,F_GETFL,0x0) = 0
[     1] ---- shim_fcntl(10,F_SETFL,0x800) = 0
[     1] ---- shim_fcntl(11,F_GETFL,0x0) = 1
[     1] ---- shim_fcntl(11,F_SETFL,0x801) = 0
[     1] ---- shim_read(10,0x808f12900,32) ... read
[     1] ---- return from shim_read(...) = -11

The shim_read returned -11 instead.

emilhem commented 4 years ago

If this application is something you can share, I could probably help you with debugging.

The application is open source (Erlang/OTP) but since I'm bounded by a NDA I have to wait for approval first. Even though it's only Makefiles and manifests...

emilhem commented 4 years ago

A reoccurring error that is very visible when not using debug mode (due to massive outputs) is:

restore_checkpoint() at vma (-13)
shim_init() in restore_checkpoint (-13)
Saturation error in exit code -13, getting rounded down to 243
dimakuv commented 4 years ago

Ok, you have several unrelated issues here. Let's go through all of them:

  1. shim_read(...) = -11 is benign. -11 = EAGAIN (see e.g. https://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html). This is normal behavior, the application is supposed to try again or go do something else before retrying later.

  2. [ 55] ---- return from shim_read(...) = -13 -- this is interesting. -13 = EACCES. This probably means that the other end of the pipe broke or closed. Could you trace where this fd = 10 comes from and with which other process it wants to communicate? The other lines in your output seem irrelevant to this pipe, you'll need to weed out these irrelevant lines.

  3. restore_checkpoint() at vma (-13) -- this is our infamous error that we currently recommend to side-step (it's currently unclear how to better fix it). What most probably happens is your child process has smaller enclave size than the parent: the parent process checkpoints its state and sends this checkpoint to the child, but the child cannot reinstate it because the child doesn't have enough memory! (See https://github.com/oscarlab/graphene-tests/pull/14.) So please use the same enclave size for all your executables.

I suggest to start with fixing issue 3. It may be that you see issue 2 because of issue 3 (the child process failed, so the other end of the pipe closed unexpectedly, and your end of the pipe now sees -13).

emilhem commented 4 years ago
  1. Alright

  2. Not sure if I'm experienced enough to reach that point in the execution to debug it.

  3. They are the same size

I've gotten approval to share the Makefile and manifest here: https://github.com/emilhem/graphene-erlang

dimakuv commented 4 years ago

@emilhem I started looking into it, thanks for the GitHub repo! Also check my issues I opened in that repo.

So I found and fixed two bugs thanks to this Erlang workload: https://github.com/oscarlab/graphene/pull/1483.

At this point, there is some ECONNRESET error on one of the pipes, and I cannot go further without understanding Erlang sources... One obvious suspect is timerfd which is not implemented in Graphene and looks important for correct Erlang behavior. Do you know how critical timerfd is for Erlang?

emilhem commented 4 years ago

Just so it doesn't look like I'm ignoring dimakuv here, we've been discussing Erlang in the repo issues here: https://github.com/emilhem/graphene-erlang/issues?q=is%3Aissue

To summarize, timerfd can be disabled (the consequences for doing this is unclear right now).

emilhem commented 4 years ago

Unsure whether this issue should be closed or not. Some of the issues are resolved and the Multi-Consumer/Multi-Producer pipes in Erlang are not a problem in Graphene itself.

dimakuv commented 4 years ago

Let's close this.