Closed: saruvig closed this issue 2 years ago.
The log (log_flower.txt) ends abruptly at:
[P1:T1:python3.8] trace: ---- shim_munmap(0x4f4d7f000, 0x40000) ...
[P1:T1:python3.8] trace: ---- return from shim_munmap(...) = 0x0
[P1:T
Was it perhaps cut by GitHub (as it is almost 10 MB in size)? Could you look at the end of the log and copy-paste only the relevant parts?
Are you able to access it with this link?
https://github.com/saruvig/Basic_Cifar/blob/main/local_git/log.txt
I don't see anything special in that file (no error messages). The file ends with this:
[P1:T6:python3.8] trace: ---- shim_clock_gettime(1, 0x70a609420) = 0x0
[P1:T6:python3.8] trace: ---- shim_clock_gettime(1, 0x70a6093a0) = 0x0
[P1:T6:python3.8] trace: ---- shim_epoll_wait(3, 0x70b68c064, 100, 201) ...
This looks perfectly normal -- the application waits for some network connections and/or network packets.
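For reference, the shim_epoll_wait(3, ..., 100, 201) calls in the trace correspond to an application-level wait along these lines (a minimal Python sketch of the pattern, not the actual Flower/gRPC code; the address is a placeholder):

import select
import socket

# A listening socket registered with epoll; the gRPC runtime does something
# similar internally for its network connections. (select.epoll is Linux-only.)
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()

ep = select.epoll()
ep.register(listener.fileno(), select.EPOLLIN)

# Wait up to ~200 ms for connections/packets (cf. the 201 ms timeout in the
# trace), returning at most 100 ready events (cf. the 100 in the trace).
events = ep.poll(timeout=0.201, maxevents=100)
print(events)  # [] if nothing arrived within the timeout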
Yes, I could not find any error messages in the log file either. The client side itself stops running after trying to connect to the server side; I think that is why there is no error message (the StopIteration seen in the screenshot).
Any reason you think it is not able to create this connection. Do I have to make any changes in the manifest to enable such connections on SGX?
Any reason you think it is not able to create this connection. Do I have to make any changes in the manifest to enable such connections on SGX?
No, you don't have to make any changes in the manifest.
You should analyze Gramine log files of both the client and the server applications (both run inside Gramine SGX enclaves, right?). I don't see how we can currently help here.
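For what it's worth, trace-level logs like the ones above come from manifest settings along these lines (a sketch based on the Gramine manifest-syntax docs; adjust the file name to your setup):

loader.log_level = "trace"    # "trace" prints every emulated syscall
loader.log_file = "log.txt"   # without this, the log goes to stderr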
Yes, they run inside the same enclave as of now. Hence I was able to generate just the one log file. Can I run them in separate enclaves on the same VM, from different terminals?
It has 1 server and 2 clients in this framework on different terminals of the VM.
Yes, they run inside the same enclave as of now.
These two statements are contradictory. Each process (the server process, the first client process, the second client process) runs in its own SGX enclave (if this is how you set up your environment). These three processes cannot run in the same SGX enclave; that's not how Intel SGX works.
if this is how you set up your environment
Sorry, I am not sure I understand this, or how I can be sure that this is how I set it up.
I used the usual guide to set up SGX on my VM, which runs fine. To run this specific example, these are the steps I follow:
Do you use the same manifest file for these three Gramine-SGX processes? Do they all run in the same directory?
What happens here is that your loader.log_file = "log.txt" writes the output from all three processes in the same file log.txt. So your log.txt contains a mess with all three outputs interposed and mangled. You need to have three different manifest files, each with its own loader.log_file = "<unique-name>.txt".
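For illustration, the three manifests could be identical except for this one line (the file names below are only examples):

loader.log_file = "log_server.txt"    # in the server's manifest
loader.log_file = "log_client1.txt"   # in the first client's manifest
loader.log_file = "log_client2.txt"   # in the second client's manifest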
Do you use the same manifest file for these three Gramine-SGX processes? Do they all run in the same directory?
Yes, the same manifest file, in the same directory.
What happens here is that your loader.log_file = "log.txt" writes the output from all three processes in the same file log.txt. So your log.txt contains a mess with all three outputs interposed and mangled. You need to have three different manifest files, each with its own loader.log_file = "<unique-name>.txt".
Ok, I understand. Maybe to begin with I can run them from 3 different directories with three separate manifest files. If I understand correctly, this will in turn produce three separate enclaves, and I should get a separate log for all three?
If I understand correctly, this will in turn produce three separate enclaves, and I should get a separate log for all three?
Yes, there will be three processes (and thus three SGX enclaves). And you will get three logs, each in a different directory.
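A possible layout, with directory and manifest names that are purely illustrative:

flower-server/    python.manifest.template   ->  loader.log_file = "log_server.txt"
flower-client1/   python.manifest.template   ->  loader.log_file = "log_client1.txt"
flower-client2/   python.manifest.template   ->  loader.log_file = "log_client2.txt"

Each directory is built and run separately (e.g. with gramine-sgx), so each process gets its own enclave and its own log file.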
Based on the suggestions, I am now running the clients in different enclaves with two different manifest files. To start with, the server is running in native mode (without gramine or gramine-sgx). I have added the following line to the manifest file:
sgx.rpc_thread_num=3
The code seems to be in an endless waiting loop, with the server waiting on signals from the client. I am attaching the zip files of the logs from both clients. There are no error messages as such in the log files, so I am not sure how to proceed to debug this.
https://github.com/saruvig/Basic_Cifar/blob/main/lof_client1.zip https://github.com/saruvig/Basic_Cifar/blob/main/log_client2.zip
Also, would you happen to have any other simple example that I can try, with remote calls being made between server and clients, successfully running on the gramine-sgx setup? I don't see one in the examples repository.
with the server waiting on signals from the client
with remote calls being made between server and clients
What do you mean by "remote calls" exactly? Do you mean a TCP/IP channel between the server and the client?
And what do you mean by "signals from the client"? Do you mean TCP/IP connection requests from the client? Or do you literally mean Linux signals (like SIGINT, SIGCONT) from the client?
The attached logs are huge (~25MB), so I only briefly looked at them. The clients get stuck on the epoll wait:
[P1:T8:python3.8] trace: ---- return from shim_epoll_wait(...) = -514
[P1:T8:python3.8] trace: ---- shim_epoll_wait(3, 0x2e3432064, 100, 201) ...
[P1:T17:python3.8] trace: ---- shim_futex(0x27f1d1494, FUTEX_PRIVATE|FUTEX_WAIT, 31312, 0, 0x7a50, 202) ...
[P1:T14:python3.8] trace: ---- return from shim_futex(...) = 0x0
[P1:T14:python3.8] trace: ---- shim_futex(0x27f1d1494, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0x27f1d1494, 0) ...
[P1:T14:python3.8] trace: ---- return from shim_futex(...) = 0xb
[P1:T25:python3.8] trace: ---- return from shim_futex(...) = 0x0
[P1:T8:python3.8] trace: ---- return from shim_epoll_wait(...) = -514
[P1:T8:python3.8] trace: ---- shim_epoll_wait(3, 0x2e3432064, 100, 201) ...
This may be correct behavior, or it may be wrong behavior. It is hard to say anything specific, but it looks like something related to epoll.
@saruvig We found a bug in epoll that concerns interrupts (error -514 that you see in the log snippet above). This bug fix (https://github.com/gramineproject/gramine/pull/381) may fix your issue. Could you try it out or just wait a couple days until we merge this PR?
What do you mean by "remote calls" exactly? Do you mean a TCP/IP channel between the server and the client?
And what do you mean by "signals from the client"? Do you mean TCP/IP connection requests from the client? Or do you literally mean Linux signals (like SIGINT, SIGCONT) from the client?
In the Flower framework, the client and server communicate with the gRPC communication protocol; I think the gRPC interface is based on TCP connections. For this example, the clients are responsible for generating individual weight updates for the model based on their local datasets. These updates are then sent to the server, which aggregates them and returns an improved version of the model back to the clients. https://grpc.io/docs/what-is-grpc/faq/
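For context, the quickstart client roughly has this shape (a simplified sketch, not the exact quickstart code; the method signatures and the address are placeholders and depend on the Flower version):

import flwr as fl

# Client side of the quickstart. The server side is essentially
# fl.server.start_server(server_address="[::]:8080") in a separate process.
class CifarClient(fl.client.NumPyClient):
    def get_parameters(self, config=None):
        ...  # return the current local model weights as a list of NumPy arrays

    def fit(self, parameters, config):
        ...  # train on the local CIFAR-10 partition, return the updated weights

    def evaluate(self, parameters, config):
        ...  # evaluate the aggregated global model on the local test set

if __name__ == "__main__":
    # Opens a gRPC (TCP) channel to the server and keeps exchanging weights.
    fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=CifarClient())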
@saruvig We found a bug in epoll that concerns interrupts (error -514 that you see in the log snippet above). This bug fix (#381) may fix your issue. Could you try it out or just wait a couple days until we merge this PR?
Sure, ok. Thanks for the heads up.
@dimakuv -514 means that a syscall was restarted, so it has nothing to do with #381, but this may be fixed by the epoll rewrite, which we merged recently.
@saruvig Could you just try the current master?
Thanks @boryspoplawski. With the current master, the above problem did get solved and I was able to run the code. But now I want to run both the server and the clients in enclaves (so 3 processes, one each for the server and the 2 clients; earlier only the clients were in enclaves and the server was being run natively).
The three manifest files are the same as the quickstart PyTorch example, except for these two lines:
sgx.thread_num=256
sgx.rpc_thread_num=3
I am attaching the log of the client that throws the error. There are no specific 'error' messages in the log. https://github.com/saruvig/Basic_Cifar/blob/main/log_client2.txt
The three manifest files are the same as the quickstart PyTorch example, except for these two lines:
sgx.thread_num=256
sgx.rpc_thread_num=3
This configuration is wrong. Please read https://gramine.readthedocs.io/en/latest/manifest-syntax.html#number-of-rpc-threads-exitless-feature carefully. Basically, you cannot have rpc_thread_num < thread_num, otherwise your enclave threads will starve and hang.
Why do you even want to use rpc_thread_num in your workload? From what I understand, your Flower workload is multi-threaded and embarrassingly parallel, so you will not have any performance improvement with the Exitless feature (which is enabled through rpc_thread_num).
I strongly suggest removing the sgx.rpc_thread_num line from your manifest altogether.
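For illustration, either of these manifest fragments would be consistent with the rule above (the numbers are only examples):

# Option 1 (recommended here): no Exitless feature at all
sgx.thread_num = 256

# Option 2: Exitless enabled, with enough RPC threads to serve every enclave thread
sgx.thread_num = 256
sgx.rpc_thread_num = 256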
Ok. My bad, I had misunderstood the rpc_thread_num feature. I have removed it now. But unfortunately, the original problem still remains.
The problem is accept4() in Gramine returns empty peer address when connected. Therefore this line always sets the client id as "ipv6:[::]:0", and the second client is considered a reconnection of the first client. When running without Gramine, the id should be something like "ipv4:127.0.0.1:12345".
As a workaround, if you give a fake, unique id at the line above, the example can finish successfully.
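The idea of the workaround, sketched against a hypothetical gRPC servicer helper (this is not the exact Flower code; the function name is made up):

import uuid

import grpc


def client_id_from(context: grpc.ServicerContext) -> str:
    # Normally context.peer() returns something like "ipv4:127.0.0.1:12345".
    # Under Gramine with the accept4() bug it comes back empty ("ipv6:[::]:0"),
    # so both clients look identical; substitute a fake but unique id instead.
    peer = context.peer()
    if peer == "ipv6:[::]:0":
        return str(uuid.uuid4())
    return peer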
The problem is accept4() in Gramine returns empty peer address when connected.
Interesting. I guess the bug is somewhere around this line: https://github.com/gramineproject/gramine/blob/b571f2a3efda7d20db958ea1a7880b109acaa9c0/LibOS/shim/src/sys/shim_socket.c#L1013
We need to rewrite the Sockets subsystem in Gramine...
Interesting. I guess the bug is somewhere around this line:
Exactly. Should be
*addrlen = inet_copy_addr(cli_sock->domain, addr, *addrlen, &cli_sock->addr.in.conn);
@lejunzhu Maybe you could submit a PR?
Hi All,
I am trying to run the quickstart PyTorch example from the Flower GitHub repository on Gramine. The environment is Microsoft Azure. I have tested Gramine on it before and was successfully able to run the examples in the Gramine repository. Flower is a Federated Learning framework (https://github.com/adap/flower).
Below is the source code of the example: https://github.com/adap/flower/tree/main/examples/quickstart_pytorch
It has 1 server and 2 clients in this framework on different terminals of the VM, trying to communicate with one another. The example runs fine in the native setting. With SGX on, the communication fails. Below is a screenshot of the error and the log file.
log_flower.txt
Any ideas on why this could be happening? I am also attaching the manifest file I am using to run this example: manifest.txt