Closed: @ying2liu closed this issue 2 years ago.
@ying2liu: Please use "Preview" option on GitHub before posting, this issue is completely unreadable in its current state.
Do I understand correctly that @boryspoplawski was looking into this (maybe not exactly this gRPC issue, but something similar)? Is there any update?
@dimakuv: We had similar issues in the past (e.g. https://github.com/gramineproject/gramine/issues/323), but I think that this "context deadline exceeded" is just a generic timeout on something, and the cause may be unrelated to the other issues.
> Gramine commit hash
That's not true, the logs say that you're running a pretty old Gramine, even from before epoll rework (which will probably fix this issue). Closing this issue as outdated and will reopen if it turns out that the current master is also broken.
@mkow I have collected new log using the latest code. Could you please reopen this issue?
I saw a similar problem when we were benchmarking HashiCorp Vault in Gramine-SGX. The corresponding Vault message was "context deadline exceeded". It was quite unpredictable and difficult to sort out.
@ying2liu just my 2 cents, if you reduce the log level from "all" to "error" in the manifest, will that make the result different?
@lejunzhu It makes no difference. Even if I change the log level to "error", I still get the same failure.
Because you're still running old Gramine.
@mkow I ran the test using yesterday's code (dbddd90bb51d3c1feff04fc5387ea37073e9321e)
@mkow If gRPC application works on your system, please share the commit hash. I could try to use that code base.
The logs you included are from some old version, not https://github.com/gramineproject/gramine/commit/dbddd90bb51d3c1feff04fc5387ea37073e9321e.
The log is from the same code base.
```
commit dbddd90bb51d3c1feff04fc5387ea37073e9321e
Author: Kailun Qin <kailun.qin@intel.com>
Date:   Fri Feb 11 09:05:24 2022 -0500

    [Pal] Add missing frees for toml_string_in()-allocated strings
```
From the log file, you can find:

```
[::] debug: Gramine was built from commit: dbddd90bb51d3c1feff04fc5387ea37073e9321e
[::] debug: Host: Linux-SGX
[::] debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7
```
I just downloaded the log and it says:
```
[...]
[::] debug: Gramine was built from commit: 66881cca331402825cf1b5f8c4f949a2c758892b
[::] debug: Host: Linux-SGX
[::] debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7
[...]
```
I don't know why we downloaded different log files, but I uploaded the log file again under a different name: server_debug_new.log
@ying2liu both logs you've uploaded are exactly the same.
Yes, they are the same log file, from commit dbddd90bb51d3c1feff04fc5387ea37073e9321e. Did you see the log for this commit?
No, they are both for 66881cca:

```
Gramine was built from commit: 66881cca331402825cf1b5f8c4f949a2c758892b
```
@boryspoplawski That is strange. I just attached the log file to the email I sent you. Please check whether it is the correct one.
@ying2liu I did not get any emails
I also see that both server_debug.log and server_debug_new.log use the Gramine commit 66881cca331402825cf1b5f8c4f949a2c758892b.
This commit is from the beginning of January:

```
commit 66881cca331402825cf1b5f8c4f949a2c758892b
Author: Michał Kowalczyk <mkow@invisiblethingslab.com>
Date:   Thu Jan 6 00:26:16 2022 +0100

    Change num_* to *_cnt naming for consistency
```
Which is quite old (we merged epoll refactoring after this date).
@ying2liu It looks like you incorrectly installed the latest version of Gramine on your system. Maybe your system-wide Gramine binaries are "shadowed" by locally installed Gramine binaries. Please run `which gramine-sgx` -- does it show the expected path to Gramine?
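A quick way to check for shadowed binaries (a sketch; adjust to your shell, and note `gramine-sgx` may be installed under a different path on your system):

```shell
# List every gramine-sgx visible on PATH; the first entry is the one that runs.
# A stale copy earlier on PATH would explain logs from an old commit.
type -a gramine-sgx 2>/dev/null || echo "gramine-sgx not found on PATH"
```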
@ying2liu It came to my attention that server_debug.log and server_debug_new.log contain logs from several different Gramine runs, so these log files were simply appended to. Could you attach a fresh log containing only a single run of the latest version of Gramine?
I will try to reproduce this issue now.
I was able to reproduce the issue.
The problem is that this `server` Go app is statically built and thus uses raw `syscall` instructions, so `gramine-sgx` has to trap each of these instructions, which is slow. What is worse, this Go app (or more precisely, the gRPC implementation in Go) issues very many syscalls per client request, with the worst offenders being `clock_gettime()` and `futex()`.
So the root cause is that the server is very slow, while the `client` Go app has a deadline of 1 second, which the developers of this app assumed to be more than enough since this workload runs on localhost (both apps on the same machine). We cannot do much about the poor performance of the `server` app, so I worked around the problem by modifying the client code, increasing the timeout from 1 second to 5 seconds:
```
~/gramineproject/gramine/CI-Examples/go/grpc-go/examples/helloworld/greeter_client$ git diff
```

```diff
diff --git a/examples/helloworld/greeter_client/main.go b/examples/helloworld/greeter_client/main.go
index 4529069..211d74c 100644
--- a/examples/helloworld/greeter_client/main.go
+++ b/examples/helloworld/greeter_client/main.go
@@ -50,7 +50,7 @@ func main() {
 	c := pb.NewGreeterClient(conn)
 	// Contact the server and print out its response.
-	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+	ctx, cancel := context.WithTimeout(context.Background(), 5 * time.Second)
 	defer cancel()
 	r, err := c.SayHello(ctx, &pb.HelloRequest{Name: *name})
 	if err != nil {
```
I'm closing this issue, since the root cause was analyzed -- see my previous comment.
@dimakuv Could you please keep this issue open so that we can track Gramine and the workload's performance?
This is a known limitation (that static go binaries are slow under Gramine), I don't see a point in keeping it open.
@ying2liu if you change the manifest like this, do you still get the same error? I tried it on a Xeon Gold.

```toml
sgx.enclave_size = "8G"
sgx.thread_num = 128
#sys.stack.size = "256M"
sgx.preheat_enclave = true
```
You can also add timing to the client-side code, to see whether any call stays just under the deadline:

```go
start := time.Now()
r, err := c.SayHello(ctx, &pb.HelloRequest{Name: *name})
if err != nil {
	log.Fatalf("could not greet: %v", err)
}
end := time.Now()
dur := end.Sub(start)
log.Printf("Greeting: %s. Elapsed: %f", r.GetMessage(), dur.Seconds())
```
> sgx.preheat_enclave = true
This is a very good idea. It should help significantly with the initialization phase and the very-first client requests.
@lejunzhu Thank you so much for your suggestion. I will give it a try.
@lejunzhu This manifest change works. The test could complete successfully without any timeout error. Thanks!
Description of the problem
I observed failures when running a gRPC client/server application using Gramine-SGX. This application works well both natively and in Gramine without SGX. I started the server side first, listening at localhost 127.0.0.1:50051, then ran the client side continuously 40 times. I received several error messages like the following:
The failure happened randomly. I used the latest Gramine source code. Here is the gRPC source code
Steps to reproduce
Build the greeter_server:
Expected Results
server side:
client side:
Actual Results
server side:
client side:
Additional Information
Please find the manifest, Makefile, and log files attached: Makefile.zip, server.manifest.template.zip, server_debug.log

Gramine commit hash
dbddd90bb51d3c1feff04fc5387ea37073e9321e
server_debug_new.log