gramineproject / gramine

A library OS for Linux multi-process applications, with Intel SGX support
GNU Lesser General Public License v3.0

gRPC client/server application failure #387

Closed ying2liu closed 2 years ago

ying2liu commented 2 years ago

Description of the problem

I observed failures when running a gRPC client/server application with gramine-sgx. The application works well both natively and in Gramine without SGX. I started the server side first (it listens at 127.0.0.1:50051), then ran the client side 40 times in a row. Several of the runs failed with the following error message:

rpc error: code = DeadlineExceeded desc = context deadline exceeded

The failures happen at random. I used the latest Gramine source code. The application is the grpc-go helloworld example (see the steps below).

Steps to reproduce

  1. Build the greeter_server:

    cd grpc-go/examples/helloworld/greeter_server
    go mod init server
    go mod tidy
    go build
  2. Copy the server binary to a folder and build the server with Gramine (using the attached Makefile and manifest template)
  3. Run the server using gramine-sgx:
    gramine-sgx server
  4. Build and run the greeter_client:
    cd grpc-go/examples/helloworld/greeter_client
    go mod init client
    go mod tidy
    go build
  5. Run the client side outside the TEE 40 times:
    #!/bin/bash
    for a in {1..40}
    do
       ./client
    done

Expected Results

server side:

2022/02/10 22:03:13 server listening at 127.0.0.1:50051
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:32 Received: world
2022/02/10 22:03:33 Received: world
2022/02/10 22:03:33 Received: world

client side:

2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:32 Greeting: Hello world
2022/02/10 22:03:33 Greeting: Hello world
2022/02/10 22:03:33 Greeting: Hello world
2022/02/10 22:03:33 Greeting: Hello world

Actual Results

server side:

2022/02/11 09:13:24 server listening at 127.0.0.1:50051
2022/02/11 09:13:31 Received: world
2022/02/11 09:13:31 Received: world
2022/02/11 09:13:31 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:32 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:33 Received: world
2022/02/11 09:13:38 Received: world
2022/02/11 09:13:38 Received: world
2022/02/11 09:13:38 Received: world
2022/02/11 09:13:38 Received: world
2022/02/11 09:13:38 Received: world
2022/02/11 09:13:38 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world
2022/02/11 09:13:40 Received: world

client side:

2022/02/11 09:13:31 Greeting: Hello world
2022/02/11 09:13:31 Greeting: Hello world
2022/02/11 09:13:31 Greeting: Hello world
2022/02/11 09:13:32 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:32 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:33 Greeting: Hello world
2022/02/11 09:13:34 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:35 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:36 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:37 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:38 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:38 Greeting: Hello world
2022/02/11 09:13:38 Greeting: Hello world
2022/02/11 09:13:38 Greeting: Hello world
2022/02/11 09:13:38 Greeting: Hello world
2022/02/11 09:13:38 Greeting: Hello world
2022/02/11 09:13:38 Greeting: Hello world
2022/02/11 09:13:39 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world
2022/02/11 09:13:40 Greeting: Hello world

Additional Information

Please find the manifest, Makefile, and log files attached: Makefile.zip, server.manifest.template.zip, server_debug.log.

Gramine commit hash

dbddd90bb51d3c1feff04fc5387ea37073e9321e

server_debug_new.log

mkow commented 2 years ago

@ying2liu: Please use "Preview" option on GitHub before posting, this issue is completely unreadable in its current state.

dimakuv commented 2 years ago

Do I understand correctly that @boryspoplawski was looking into this (maybe not exactly this gRPC issue, but something similar)? Is there any update?

mkow commented 2 years ago

@dimakuv: We had similar issues in the past (e.g. https://github.com/gramineproject/gramine/issues/323), but I think that this "context deadline exceeded" is just a generic timeout on something, and the cause may be unrelated to the other issues.

> Gramine commit hash
>
> 9e1d1f6

That's not true: the logs say that you're running a pretty old Gramine, even from before the epoll rework (which will probably fix this issue). Closing this issue as outdated; I will reopen it if it turns out that the current master is also broken.

ying2liu commented 2 years ago

@mkow I have collected new log using the latest code. Could you please reopen this issue?

bigdata-memory commented 2 years ago

I saw a similar problem when we were benchmarking HashiCorp Vault under gramine-sgx. The corresponding Vault message was also "context deadline exceeded". It was quite unpredictable and difficult to sort out.

lejunzhu commented 2 years ago

@ying2liu just my 2 cents, if you reduce the log level from "all" to "error" in the manifest, will that make the result different?

ying2liu commented 2 years ago

@lejunzhu It makes no difference. Even if I change the log level to "error", I still get the same failure.

mkow commented 2 years ago

Because you're still running an old Gramine build.

ying2liu commented 2 years ago

@mkow I ran the test using yesterday's code (dbddd90bb51d3c1feff04fc5387ea37073e9321e)

ying2liu commented 2 years ago

@mkow If the gRPC application works on your system, please share the commit hash; I could try that code base.

mkow commented 2 years ago

The logs you included are from some old version, not https://github.com/gramineproject/gramine/commit/dbddd90bb51d3c1feff04fc5387ea37073e9321e.

ying2liu commented 2 years ago

The log is from the same code base:

commit dbddd90bb51d3c1feff04fc5387ea37073e9321e
Author: Kailun Qin <kailun.qin@intel.com>
Date:   Fri Feb 11 09:05:24 2022 -0500

    [Pal] Add missing frees for toml_string_in()-allocated strings

From the log file, you can find:

[::] debug: Gramine was built from commit: dbddd90bb51d3c1feff04fc5387ea37073e9321e
[::] debug: Host: Linux-SGX
[::] debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7

mkow commented 2 years ago

I just downloaded the log and it says:

[...]
[::] debug: Gramine was built from commit: 66881cca331402825cf1b5f8c4f949a2c758892b
[::] debug: Host: Linux-SGX
[::] debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7
[...]

ying2liu commented 2 years ago

I don't know why we ended up downloading different log files, but I uploaded the log again under a different name: server_debug_new.log

boryspoplawski commented 2 years ago

@ying2liu both logs you've uploaded are exactly the same.

ying2liu commented 2 years ago

Yes, they are the same log file, collected at commit dbddd90bb51d3c1feff04fc5387ea37073e9321e. Did you see the log for this commit?

boryspoplawski commented 2 years ago

No, they are both for 66881cca

Gramine was built from commit: 66881cca331402825cf1b5f8c4f949a2c758892b

ying2liu commented 2 years ago

@boryspoplawski That is strange. I just attached the log file to an email I sent you. Please check whether it is the correct one.

boryspoplawski commented 2 years ago

@ying2liu I did not get any emails

dimakuv commented 2 years ago

I also see that both server_debug.log and server_debug_new.log use the Gramine commit 66881cca331402825cf1b5f8c4f949a2c758892b.

This commit is from the beginning of January:

commit 66881cca331402825cf1b5f8c4f949a2c758892b
Author: Michał Kowalczyk <mkow@invisiblethingslab.com>
Date:   Thu Jan 6 00:26:16 2022 +0100

    Change num_* to *_cnt naming for consistency

Which is quite old (we merged the epoll refactoring after this date).

@ying2liu It looks like you incorrectly installed the latest version of Gramine on your system. Maybe the system-wide Gramine binaries are "shadowed" by the locally-installed ones. Please run which gramine-sgx -- does it show the expected path to Gramine?

dimakuv commented 2 years ago

@ying2liu It came to my attention that server_debug.log and server_debug_new.log contain logs from several different Gramine runs -- these log files were simply appended to. Could you attach a fresh log containing only a single run of the latest version of Gramine?

dimakuv commented 2 years ago

I will try to reproduce this issue now.

dimakuv commented 2 years ago

I was able to reproduce the issue.

The problem is that this server Go app is statically built and thus uses raw syscall instructions, which gramine-sgx has to trap every time -- and that is slow. What is worse, this Go app (or, more precisely, the gRPC implementation in Go) issues a very large number of syscalls per client request, with the worst offenders being clock_gettime() and futex().
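
To illustrate (a hypothetical micro-benchmark, not code from this issue): even a trivial Go program leans on both of these syscalls -- time.Now() resolves to clock_gettime(), and contended locks park and wake goroutines via futex(). Natively these calls are cheap; under gramine-sgx, per the analysis above, each raw syscall instruction has to be trapped and emulated:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	start := time.Now() // clock_gettime() under the hood

	// Four goroutines contending on one mutex: contended Lock()/Unlock()
	// makes the Go runtime block and wake goroutines through futex().
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	fmt.Println("elapsed:", time.Since(start)) // clock_gettime() again
}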

So the root cause is that the server is very slow, while the client Go app has a deadline of 1 second -- which the developers of this app assumed to be more than enough, since the workload runs on localhost (both apps on the same machine).

We cannot do much about the terrible performance of the server app, so I worked around the problem by modifying the client code, increasing the timeout from 1 second to 5 seconds:

~/gramineproject/gramine/CI-Examples/go/grpc-go/examples/helloworld/greeter_client$ git diff
diff --git a/examples/helloworld/greeter_client/main.go b/examples/helloworld/greeter_client/main.go
index 4529069..211d74c 100644
--- a/examples/helloworld/greeter_client/main.go
+++ b/examples/helloworld/greeter_client/main.go
@@ -50,7 +50,7 @@ func main() {
        c := pb.NewGreeterClient(conn)

        // Contact the server and print out its response.
-       ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+       ctx, cancel := context.WithTimeout(context.Background(), 5 * time.Second)
        defer cancel()
        r, err := c.SayHello(ctx, &pb.HelloRequest{Name: *name})
        if err != nil {
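
As a follow-up, here is a minimal sketch (a hypothetical helper, not part of the example app) of how a client can tell this timeout apart from other failures, using the google.golang.org/grpc/status and codes packages; the greeter client in this issue just logs the raw error string:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithDeadline runs an RPC-shaped function under a context deadline and
// reports whether a failure was specifically a deadline expiry.
func callWithDeadline(d time.Duration, call func(ctx context.Context) error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	err := call(ctx)
	switch {
	case err == nil:
		log.Printf("call succeeded within %v", d)
	case status.Code(err) == codes.DeadlineExceeded:
		log.Printf("deadline of %v exceeded: %v", d, err)
	default:
		log.Printf("call failed: %v", err)
	}
}

func main() {
	// Simulated "slow server": waits until the context dies, then returns the
	// same gRPC status that the greeter client logs in this issue.
	slow := func(ctx context.Context) error {
		<-ctx.Done()
		return status.Error(codes.DeadlineExceeded, "context deadline exceeded")
	}
	callWithDeadline(time.Second, slow)
}

The real SayHello call can be wrapped in a closure and passed to callWithDeadline in the same way.
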
dimakuv commented 2 years ago

I'm closing this issue, since the root cause was analyzed -- see my previous comment.

ying2liu commented 2 years ago

@dimakuv Could you please keep this case open so that we can track Gramine and the workload performance?

mkow commented 2 years ago

This is a known limitation (static Go binaries are slow under Gramine); I don't see a point in keeping this open.

lejunzhu commented 2 years ago

@ying2liu if you change the manifest like this, do you still get the same error? I tried it on a Xeon Gold.

sgx.enclave_size = "8G"
sgx.thread_num = 128
#sys.stack.size = "256M"
sgx.preheat_enclave = true

You can also add timing to the client-side code, to see whether any request stays under the deadline but comes very close to it:

        start := time.Now()
        r, err := c.SayHello(ctx, &pb.HelloRequest{Name: *name})
        if err != nil {
                log.Fatalf("could not greet: %v", err)
        }
        end := time.Now()
        dur := end.Sub(start)
        log.Printf("Greeting: %s. Elapsed: %f", r.GetMessage(), dur.Seconds())
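
For context: with the client's default 1-second deadline (the value the diff above raises to 5 seconds), successful calls whose printed Elapsed creeps toward 1.0 would confirm that the failures are plain slowness rather than a hang.
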
dimakuv commented 2 years ago

> sgx.preheat_enclave = true

This is a very good idea. It should help significantly with the initialization phase and the very first client requests.

ying2liu commented 2 years ago

@lejunzhu Thank you so much for your suggestion. I will give it a try.

ying2liu commented 2 years ago

@lejunzhu This manifest change works: the test now completes successfully, without any timeout errors. Thanks!