
Trying to run a Federated Learning PyTorch example with Flower in a Gramine-SGX environment #344

Closed saruvig closed 2 years ago

saruvig commented 2 years ago

Hi All,

I am trying to run the quickstart PyTorch example from the Flower repository on Gramine. The environment is Microsoft Azure; I have tested Gramine on it before and was able to successfully run the examples from the Gramine repository. Flower is a federated learning framework (https://github.com/adap/flower).

Below is the source code of the example: https://github.com/adap/flower/tree/main/examples/quickstart_pytorch

The framework has 1 server and 2 clients, running in different terminals of the VM and communicating with one another. The example runs fine in the native setting; with SGX on, the communication fails. Below are a screenshot of the error and the log file.

[screenshot: the client stops with a StopIteration error]

log_flower.txt

Any ideas on why this could be happening? I am also attaching the manifest file I use to run this example: manifest.txt

dimakuv commented 2 years ago

The log (log_flower.txt) ends abruptly at:

[P1:T1:python3.8] trace: ---- shim_munmap(0x4f4d7f000, 0x40000) ...
[P1:T1:python3.8] trace: ---- return from shim_munmap(...) = 0x0
[P1:T

It was probably cut by GitHub (as it is almost 10 MB in size). Could you look at the end of the log and copy-paste only the relevant parts?

saruvig commented 2 years ago

Are you able to access it with this link?

https://github.com/saruvig/Basic_Cifar/blob/main/local_git/log.txt

dimakuv commented 2 years ago

I don't see anything special in that file (no error messages). The file ends with this:

[P1:T6:python3.8] trace: ---- shim_clock_gettime(1, 0x70a609420) = 0x0
[P1:T6:python3.8] trace: ---- shim_clock_gettime(1, 0x70a6093a0) = 0x0
[P1:T6:python3.8] trace: ---- shim_epoll_wait(3, 0x70b68c064, 100, 201) ...

This looks perfectly normal -- the application waits for some network connections and/or network packets.

saruvig commented 2 years ago

Yes, I could not find any error messages in the log file either. The client side itself stops running after trying to connect to the server side; I think that's why there is no error message (the StopIteration exception, as seen in the screenshot).

Any idea why it is not able to create this connection? Do I have to make any changes in the manifest to enable such connections on SGX?

dimakuv commented 2 years ago

Any idea why it is not able to create this connection? Do I have to make any changes in the manifest to enable such connections on SGX?

No, you don't have to make any changes in the manifest.

You should analyze the Gramine log files of both the client and the server applications (both run inside Gramine SGX enclaves, right?). Otherwise I don't see how we can help here at the moment.

saruvig commented 2 years ago

Yes, they run inside the same enclave as of now. Hence, I was only able to generate the one log file. Can I run them in separate enclaves on the same VM from different terminals?

dimakuv commented 2 years ago

The framework has 1 server and 2 clients, running in different terminals of the VM.

Yes, they run inside the same enclave as of now.

These two statements are contradictory. Each process (the server process, the first client process, the second client process) runs in its own SGX enclave (if this is how you set up your environment). These three processes cannot run in the same SGX enclave; that's not how Intel SGX works.

saruvig commented 2 years ago

if this is how you set up your environment

Sorry, I am not sure I understand this, or how I can be sure that this is how I set it up.

I used the usual guide to set up SGX on my VM, which runs fine. To run this specific example, these are the steps I follow:

  1. In my example directory I run: make SGX=1
  2. Run: gramine-sgx ./pytorch ./server.py
  3. Open another terminal for client 1 and run: gramine-sgx ./pytorch ./client.py
  4. Open another terminal for client 2 and run: gramine-sgx ./pytorch ./client.py
dimakuv commented 2 years ago

Do you use the same manifest file for these three Gramine-SGX processes? Do they all run in the same directory?

What happens here is that your loader.log_file = "log.txt" writes the output from all three processes into the same file log.txt. So your log.txt contains a mess, with all three outputs interleaved and mangled.

You need to have three different manifest files, each with its own loader.log_file = "<unique-name>.txt".
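
For example (a minimal sketch; the manifest file names are illustrative), the three manifests could be identical except for this one line:

# server.manifest.template
loader.log_file = "server-log.txt"

# client1.manifest.template
loader.log_file = "client1-log.txt"

# client2.manifest.template
loader.log_file = "client2-log.txt"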

saruvig commented 2 years ago

Do you use the same manifest file for these three Gramine-SGX processes? Do they all run in the same directory?

Yes, the same manifest file, in the same directory.

What happens here is that your loader.log_file = "log.txt" writes the output from all three processes into the same file log.txt. So your log.txt contains a mess, with all three outputs interleaved and mangled.

You need to have three different manifest files, each with its own loader.log_file = "<unique-name>.txt".

Ok, I understand. Maybe to begin with I can run them from 3 different directories with three separate manifest files. If I understand correctly, this will in turn produce three separate enclaves, and I should get a separate log for each of the three?

dimakuv commented 2 years ago

If I understand correctly, this will in turn produce three separate enclaves, and I should get a separate log for each of the three?

Yes, there will be three processes (and thus three SGX enclaves). And you will get three logs, each in a different directory.
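
For example (a sketch; the directory names are illustrative), adapting the steps from above, with one terminal per process:

cd server_dir  && gramine-sgx ./pytorch ./server.py    # terminal 1
cd client1_dir && gramine-sgx ./pytorch ./client.py    # terminal 2
cd client2_dir && gramine-sgx ./pytorch ./client.py    # terminal 3

Each directory holds its own manifest with its own loader.log_file, so each process writes a separate log.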

saruvig commented 2 years ago

Based on the suggestions, I am now running the clients in different enclaves with two different manifest files. To start with, the server is running in native mode (without gramine or gramine-sgx). I have added the following line to the manifest file:

sgx.rpc_thread_num=3

The code seems to be stuck in an endless waiting loop, with the server waiting on signals from the client. I am attaching the zipped logs from both clients. There are no error messages as such in the log files, so I am not sure how to proceed with debugging this.

https://github.com/saruvig/Basic_Cifar/blob/main/lof_client1.zip https://github.com/saruvig/Basic_Cifar/blob/main/log_client2.zip

saruvig commented 2 years ago

Also, would you happen to have any other simple example with remote calls being made between server and clients that runs successfully on the gramine-sgx setup? I don't see one in the examples repository.

dimakuv commented 2 years ago

with the server waiting on signals from the client

with remote calls being made between server and clients

What do you mean by "remote calls" exactly? Do you mean a TCP/IP channel between the server and the client?

And what do you mean by "signals from the client"? Do you mean TCP/IP connection requests from the client? Or do you literally mean Linux signals (like SIGINT, SIGCONT) from the client?

The attached logs are huge (~25 MB), so I only briefly looked at them. The clients get stuck in the epoll wait:

[P1:T8:python3.8] trace: ---- return from shim_epoll_wait(...) = -514
[P1:T8:python3.8] trace: ---- shim_epoll_wait(3, 0x2e3432064, 100, 201) ...
[P1:T17:python3.8] trace: ---- shim_futex(0x27f1d1494, FUTEX_PRIVATE|FUTEX_WAIT, 31312, 0, 0x7a50, 202) ...
[P1:T14:python3.8] trace: ---- return from shim_futex(...) = 0x0
[P1:T14:python3.8] trace: ---- shim_futex(0x27f1d1494, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0x27f1d1494, 0) ...
[P1:T14:python3.8] trace: ---- return from shim_futex(...) = 0xb
[P1:T25:python3.8] trace: ---- return from shim_futex(...) = 0x0
[P1:T8:python3.8] trace: ---- return from shim_epoll_wait(...) = -514
[P1:T8:python3.8] trace: ---- shim_epoll_wait(3, 0x2e3432064, 100, 201) ...

This may be correct behavior, or it may be wrong behavior; it's hard to say anything specific. But it looks like something related to epoll.

@saruvig We found a bug in epoll that concerns interrupts (the error -514 that you see in the log snippet above). This bug fix (https://github.com/gramineproject/gramine/pull/381) may fix your issue. Could you try it out, or just wait a couple of days until we merge this PR?

saruvig commented 2 years ago

What do you mean by "remote calls" exactly? Do you mean a TCP/IP channel between the server and the client?

And what do you mean by "signals from the client"? Do you mean TCP/IP connection requests from the client? Or do you literally mean Linux signals (like SIGINT, SIGCONT) from the client?

In the Flower framework, the client and server communicate via the gRPC protocol; I think the gRPC interface is based on TCP connections. In this example, clients are responsible for generating individual weight updates for the model based on their local datasets. These updates are then sent to the server, which aggregates them and returns an improved version of the model to the clients. https://grpc.io/docs/what-is-grpc/faq/
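
To illustrate the aggregation step (a minimal sketch in plain NumPy, not Flower's actual implementation; real FedAvg also weights each client by its number of training examples):

import numpy as np

def fed_avg(client_updates):
    """Average per-layer weights received from clients (unweighted FedAvg).

    client_updates: one entry per client; each entry is a list of NumPy
    arrays, one array per model layer.
    """
    num_clients = len(client_updates)
    # Pair up corresponding layers across clients, sum them, and average.
    return [sum(layers) / num_clients for layers in zip(*client_updates)]

# Example: two clients, each sending two "layers" of weights.
update_a = [np.ones(3), np.zeros(2)]
update_b = [np.zeros(3), np.ones(2)]
print(fed_avg([update_a, update_b]))  # [0.5 0.5 0.5] and [0.5 0.5]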

@saruvig We found a bug in epoll that concerns interrupts (the error -514 that you see in the log snippet above). This bug fix (#381) may fix your issue. Could you try it out, or just wait a couple of days until we merge this PR?

Sure, ok. Thanks for the heads up.

boryspoplawski commented 2 years ago

@dimakuv -514 (ERESTARTNOHAND) means that the syscall was restarted, so this has nothing to do with #381. But it may be fixed by the epoll rewrite, which we merged recently. @saruvig Could you just try the current master?

saruvig commented 2 years ago

Thanks @boryspoplawski. With the current master, the above problem is solved and I was able to run the code. But now I want to run both the server and the clients in enclaves (so 3 processes, one each for the server and the 2 clients; earlier only the clients were in enclaves and the server was run natively).

The three manifest files are the same as in the quickstart PyTorch example, except for these two lines:

sgx.thread_num = 256
sgx.rpc_thread_num = 3

I am attaching the log of the client that throws the error. There are no specific 'error' messages in the log. https://github.com/saruvig/Basic_Cifar/blob/main/log_client2.txt

dimakuv commented 2 years ago

The three manifest files are the same as in the quickstart PyTorch example, except for these two lines: sgx.thread_num = 256, sgx.rpc_thread_num = 3

This configuration is wrong. Please read https://gramine.readthedocs.io/en/latest/manifest-syntax.html#number-of-rpc-threads-exitless-feature carefully. Basically, you cannot have rpc_thread_num < thread_num; otherwise your enclave threads will starve and hang.

Why do you even want to use rpc_thread_num in your workload? From what I understand, your Flower workload is multi-threaded and embarrassingly parallel, so you will not see any performance improvement from the Exitless feature (which is enabled through rpc_thread_num).

I strongly suggest removing the sgx.rpc_thread_num line from your manifest altogether.
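
In manifest terms (a sketch), either of these would be consistent with the documentation above:

# Recommended here: Exitless disabled, no RPC threads at all
sgx.thread_num = 256

# Only if Exitless were really needed: at least as many RPC threads
# as enclave threads
# sgx.thread_num = 256
# sgx.rpc_thread_num = 256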

saruvig commented 2 years ago

Ok, my bad, I had misunderstood the rpc_thread_num feature. I have removed it now. But unfortunately, the original problem remains.

lejunzhu commented 2 years ago

The problem is that accept4() in Gramine returns an empty peer address when a connection is accepted. Therefore this line always sets the client id to "ipv6:[::]:0", and the second client is considered a reconnection of the first client. When running without Gramine, the id is something like "ipv4:127.0.0.1:12345".

https://github.com/adap/flower/blob/71f8226f2fb56d0624427f358b6805408945cd95/src/py/flwr/server/grpc_server/flower_service_servicer.py#L89

As a workaround, if you give a fake, unique id at the line above, the example can finish successfully.
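
The hack looks roughly like this (a sketch, assuming the client id comes from gRPC's context.peer() in the servicer's Join() method as in the linked file; the fake-id scheme is mine):

import uuid

def Join(self, request_iterator, context):
    # Originally (simplified), the gRPC peer address is used as the client id.
    # Under Gramine it comes back empty ("ipv6:[::]:0") for every client,
    # so all clients collide on the same id:
    # peer = context.peer()

    # Workaround: give each connection a fake but unique id instead.
    peer = f"fake-peer-{uuid.uuid4()}"
    ...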

dimakuv commented 2 years ago

The problem is that accept4() in Gramine returns an empty peer address when a connection is accepted.

Interesting. I guess the bug is somewhere around this line: https://github.com/gramineproject/gramine/blob/b571f2a3efda7d20db958ea1a7880b109acaa9c0/LibOS/shim/src/sys/shim_socket.c#L1013

We need to rewrite the Sockets subsystem in Gramine...

lejunzhu commented 2 years ago

Interesting. I guess the bug is somewhere around this line:

https://github.com/gramineproject/gramine/blob/b571f2a3efda7d20db958ea1a7880b109acaa9c0/LibOS/shim/src/sys/shim_socket.c#L1013

Exactly. It should be:

*addrlen = inet_copy_addr(cli_sock->domain, addr, *addrlen, &cli_sock->addr.in.conn);

dimakuv commented 2 years ago

@lejunzhu Maybe you could submit a PR?