gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0

Data transmission error with Python gRPC running in graphene #2664

Closed Kanakoo closed 3 years ago

Kanakoo commented 3 years ago

Description of the problem

When running Python gRPC in Graphene with the gRPC server and client on different machines, data transmission sometimes gets corrupted when the transmitted data is large.

Steps to reproduce

The issue was tested on the latest version, v1.2-rc1. It happens with both graphene-direct and graphene-sgx when the gRPC server and client are on different machines; there is no such issue when running outside Graphene. For simplicity, reproduce it with graphene-direct:

  1. Download the example code (grpc.zip) on 2 different machines and unzip it to graphene/Examples/grpc
  2. Install dependencies on both machines:
    cd graphene/Examples/grpc
    python3.6 -m venv venv
    source venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install grpcio grpc_tools protobuf
    deactivate
  3. On the client machine, modify the IP address in graphene/Examples/grpc/scripts/grpc_client to the server's IP.
  4. On the server machine, run:
    cd graphene/Examples/grpc
    make PYTHONPATH=/usr PYTHONVERSION=python3.6
    graphene-direct python scripts/grpc_server.py
  5. On the client machine, run:
    cd graphene/Examples/grpc
    make PYTHONPATH=/usr PYTHONVERSION=python3.6
    graphene-direct python scripts/grpc_client.py

Expected results

In the example, the server sends a string of 1,000,000 'a' characters to the client. After the client finishes running, it generates the file graphene/Examples/grpc/scripts/message.txt, which contains the data received from the server. The string in the file is expected to be 1,000,000 'a' characters.

Actual results

Run grep -r "grpc" scripts/message.txt; the output is Binary file scripts/message.txt matches, which means the file contains unexpected binary data. Indeed, when I open the file and search for the keyword "grpc", I find:

...aaaaaa^@^@^^^A^E^@^@^@^A@^Kgrpc-status^A0@^Lgrpc-message^@aaaaaaaaa...

Since the issue only happens sometimes, you may not get the corrupted string in a single run; try running the client 1~10 times.
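A small helper makes it easy to check the received file programmatically after each run. This is an illustrative sketch (the function name and expected payload are assumptions based on the steps above, not part of the original example):

```python
def message_file_ok(path="scripts/message.txt", expected=b"a" * 1_000_000):
    """Return True iff the file contains exactly the expected payload.

    When the stream gets mangled, stray HTTP/2 frame bytes such as
    "grpc-status" end up in the middle of the payload, so a plain
    equality check is enough to catch the corruption.
    """
    with open(path, "rb") as f:
        return f.read() == expected
```

Calling this after each client run spots the intermittent corruption without opening the file manually.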

mkow commented 3 years ago

> The issue was tested on latest version v1.2-rc1.

Can you try the current master?

Kanakoo commented 3 years ago

> The issue was tested on latest version v1.2-rc1.
>
> Can you try the current master?

Yes, I've just tried; the issue still exists.

Kanakoo commented 3 years ago

I've found another issue. When the transmitted data is large enough (about 300 MB), the gRPC client sometimes outputs errors like these:

[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'helloworld.HelloReply.message' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
ERROR:grpc._common:Exception deserializing message!
Traceback (most recent call last):
  File "/venv/lib/python3.6/site-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
google.protobuf.message.DecodeError: Error parsing message
Traceback (most recent call last):
  File "scripts/grpc_client.py", line 27, in <module>
    run()
  File "scripts/grpc_client.py", line 19, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name = data))
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INTERNAL
        details = "Exception deserializing response!"
        debug_error_string = "None"

or

Traceback (most recent call last):
  File "scripts/grpc_client.py", line 27, in <module>
    run()
  File "scripts/grpc_client.py", line 19, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name = data))
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Failed parsing HTTP/2"
        debug_error_string = "{"created":"@1630602970.948493000","description":"Error received from peer ipv4:10.67.125.170:50051","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Failed parsing HTTP/2","grpc_status":14}"

(The above errors happen when the client sends 300 MB of data to the server.)

or

Traceback (most recent call last):
  File "scripts/grpc_client.py", line 26, in <module>
    run()
  File "scripts/grpc_client.py", line 18, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name='hello'))
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Stream removed"
        debug_error_string = "{"created":"@1630517301.980850000","description":"Error received from peer ipv4:10.67.125.170:50051","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Stream removed","grpc_status":2}"

(The above error happens when the server replies with 300 MB of data to the client.)

To reproduce this, we first need to increase gRPC's maximum message size. The grpc_server.py and grpc_client.py code in the example zip file is as below. grpc_server.py:

from concurrent import futures
import logging

import grpc
from protos import helloworld_pb2
from protos import helloworld_pb2_grpc
MAX_MESSAGE_LENGTH = 1024*1024*1024

class Greeter(helloworld_pb2_grpc.GreeterServicer):

    def SayHello(self, request, context):
        print(len(request.name))
        data = (300 * 1024 * 1024) * 'a'
        return helloworld_pb2.HelloReply(message = data)

def serve():
    options = [('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH)]
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10), options = options)
    helloworld_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()

if __name__ == '__main__':
    logging.basicConfig()
    serve()

grpc_client.py:

from __future__ import print_function

import logging

import grpc
from protos import helloworld_pb2
from protos import helloworld_pb2_grpc
MAX_MESSAGE_LENGTH = 1024*1024*1024
def run():
    # NOTE(gRPC Python Team): .close() is possible on a channel and should be
    # used in circumstances in which the with statement does not fit the needs
    # of the code.
    options = [('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH)]
    with grpc.insecure_channel('10.67.125.170:50051', options = options) as channel:
        stub = helloworld_pb2_grpc.GreeterStub(channel)
        data = (300 * 1024 * 1024) * 'a'
        response = stub.SayHello(helloworld_pb2.HelloRequest(name = data))
        print(len(response.message))
    with open('/tcf/message.txt', 'wb') as f:
        f.write(response.message.encode())

if __name__ == '__main__':
    logging.basicConfig()
    run()
mkow commented 3 years ago

Interesting, it seems like large sends/recvs/reads/writes get mangled. I'd expect gRPC to have a proper retry loop for partial operations, but on the other hand I think we also have this part done properly :)

Could you try to reproduce this issue in C, without gRPC? This would require checking which syscalls gRPC uses for the buggy transmission (you can do this in Graphene with "trace" log level) and then trying to do the same in C and verifying if it works. But remember to implement a proper retry loop, as partial reads/writes are perfectly valid on Linux.
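For reference, a proper retry loop for partial writes can be sketched in Python on a plain socket (`send_all` is an illustrative name, not a Graphene or gRPC API; the C version would loop over `send(2)` the same way):

```python
import socket

def send_all(sock: socket.socket, data: bytes) -> None:
    """Write the whole buffer, retrying on partial writes.

    send() may write fewer bytes than requested; that is perfectly
    valid on Linux, and the caller must retry with the unsent tail
    instead of moving on to the next buffer.
    """
    view = memoryview(data)
    while view:
        sent = sock.send(view)  # may be less than len(view)
        view = view[sent:]
```

The same pattern applies on the receiving side: loop over recv() until the expected byte count has been consumed.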

Kanakoo commented 3 years ago

> Interesting, seems like large sends/recvs/reads/writes get mangled. I'd expect that gRPC has proper retry loop for partial operations, but on the other hand I think we also have this part done properly :)
>
> Could you try to reproduce this issue in C, without gRPC? This would require checking which syscalls gRPC uses for the buggy transmission (you can do this in Graphene with "trace" log level) and then trying to do the same in C and verifying if it works. But remember to implement a proper retry loop, as partial reads/writes are perfectly valid on Linux.

I ran Graphene with the "trace" log level, but couldn't figure out which syscall caused the error. Server log:

...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T5:python3] trace: ---- return from shim_epoll_wait(...) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563270) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563450) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563370) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_getcpu(0x2c8ae45631cc, 0x0, 0x0) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae45631c0) = 0x0
[P14582:T5:python3] trace: ---- shim_epoll_wait(3, 0x2c8ae70f4d04, 100, 201) ...
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T5:python3] trace: ---- return from shim_epoll_wait(...) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563270) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563370) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_getcpu(0x2c8ae45631cc, 0x0, 0x0) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae45631c0) = 0x0
[P14582:T5:python3] trace: ---- shim_epoll_wait(3, 0x2c8ae70f4d04, 100, 201) ...
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
...

client log when everything is normal:

...
[P88349:T3:python3] trace: ---- shim_madvise(0x5eedbc29f000, 0x1fb000, MADV_DONTNEED) = -38
[P88349:T3:python3] debug: ---- shim_exit (returning 0)
[P88349:T3:python3] debug: Installed async event at 1630577724179050
[P88349:shim] debug: Thread exited, cleaning up
[P88349:T1:python3] trace: ---- return from shim_futex(...) = 0x0
[P88349:T1:python3] trace: ---- shim_close(4) = 0x0
[P88349:T1:python3] trace: ---- shim_close(5) = 0x0
[P88349:T1:python3] trace: ---- shim_close(3) = 0x0
[P88349:T1:python3] trace: ---- shim_munmap(0x5eedbc6b1000, 0xb0000) ...
[P88349:T1:python3] trace: ---- return from shim_munmap(...) = 0x0
[P88349:T1:python3] trace: ---- shim_mmap(0x0, 0x12c01000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x0) ...
[P88349:T1:python3] trace: ---- return from shim_mmap(...) = 0x5eed73399000
[P88349:T1:python3] trace: ---- shim_munmap(0x5eed73399000, 0x12c01000) ...
[P88349:T1:python3] trace: ---- return from shim_munmap(...) = 0x0
[P88349:T1:python3] trace: ---- shim_munmap(0x5eed85f9a000, 0x12c01000) ...
[P88349:T1:python3] trace: ---- return from shim_munmap(...) = 0x0
[P88349:T1:python3] trace: ---- shim_write(1, 0x5eedc04a9c90, 0xa) ...
314572800
[P88349:T1:python3] trace: ---- return from shim_write(...) = 0xa
[P88349:T1:python3] trace: ---- shim_rt_sigaction([SIGINT], 0x5eedc1b89970, 0x5eedc1b89a10, 0x8) = 0x0
[P88349:T1:python3] trace: ---- shim_sigaltstack(0x0, 0x5eedc1b89c00) = 0x0
[P88349:T1:python3] trace: ---- shim_sigaltstack(0xa76c90, 0x0) = 0x0
[P88349:T1:python3] trace: ---- shim_futex(0x5eedbe348948, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0x0, 0x5eedbd9f6260, 1) ...
[P88349:T1:python3] trace: ---- return from shim_futex(...) = 0x0
[P88349:T1:python3] trace: ---- shim_fstat(0, 0x5eedc1b89c20) = 0x0
[P88349:T1:python3] trace: ---- shim_fstat(0, 0x5eedc1b89c20) = 0x0
[P88349:T1:python3] trace: ---- shim_fstat(0, 0x5eedc1b89c30) = 0x0
[P88349:T1:python3] debug: ---- shim_exit_group (returning 0)
[P88349:T1:python3] debug: clearing POSIX locks for pid 1
[P88349:T1:python3] debug: sync client shutdown: closing handles
[P88349:T1:python3] debug: sync client shutdown: waiting for confirmation
[P88349:T1:python3] debug: sync client shutdown: finished
[P88349:shim] debug: Async worker thread terminated
[P88349:shim] debug: IPC worker: exiting worker thread
[P88349:T1:python3] debug: process 88349 exited with status 0

client log when error happens:

...
[P91774:T1:python3] trace: ---- shim_openat(AT_FDCWD, "/venv/lib/python3.6/site-packages/grpc/_channel.py", O_RDONLY|0x80000, 0000) = 0x3
[P91774:T1:python3] trace: ---- shim_fstat(3, 0x330b48526470) = 0x0
[P91774:T1:python3] trace: ---- shim_ioctl(3, TCGETS, 0x330b48526440) ...
[P91774:T1:python3] trace: ---- return from shim_ioctl(...) = -38
[P91774:T1:python3] trace: ---- shim_lseek(3, 0x0, SEEK_CUR) = 0x0
[P91774:T1:python3] trace: ---- shim_fcntl(3, OP 1030, 0x0) = 0x4
[P91774:T1:python3] trace: ---- shim_fcntl(4, F_GETFL, 0x7146cc) = 0x80000
[P91774:T1:python3] trace: ---- shim_newfstatat(4, "", 0x330b485262b0, 4096) = 0x0
[P91774:T1:python3] trace: ---- shim_read(4, 0x330b46b66090, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_close(4) = 0x0
[P91774:T1:python3] trace: ---- shim_lseek(3, 0x0, SEEK_SET) = 0x0
[P91774:T1:python3] trace: ---- shim_lseek(3, 0x0, SEEK_CUR) = 0x0
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b432a8d00, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b46b660b0, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b432a8d00, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b46b660b0, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: sync client: destroying handle: 0x10000015c
[P91774:T1:python3] trace: ---- shim_close(3) = 0x0
[P91774:T1:python3] trace: ---- shim_write(2, 0x330b46e6b320, 0x36c) ...
Traceback (most recent call last):
  File "scripts/grpc_client.py", line 26, in <module>
    run()
  File "scripts/grpc_client.py", line 18, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name='hello'))
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Stream removed"
        debug_error_string = "{"created":"@1630578115.422138000","description":"Error received from peer ipv4:10.67.125.170:50051","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Stream removed","grpc_status":2}"
>
[P91774:T1:python3] trace: ---- return from shim_write(...) = 0x36c
[P91774:T1:python3] trace: ---- shim_rt_sigaction([SIGINT], 0x330b48527970, 0x330b48527a10, 0x8) = 0x0
[P91774:T1:python3] trace: ---- shim_sigaltstack(0x0, 0x330b48527c00) = 0x0
[P91774:T1:python3] trace: ---- shim_sigaltstack(0xa76c90, 0x0) = 0x0
[P91774:T1:python3] trace: ---- shim_futex(0x330b44ca6948, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0x0, 0x330b43204ec0, 1) ...
[P91774:T1:python3] trace: ---- return from shim_futex(...) = 0x0
[P91774:T1:python3] trace: ---- shim_fstat(0, 0x330b48527c20) = 0x0
[P91774:T1:python3] trace: ---- shim_fstat(0, 0x330b48527c20) = 0x0
[P91774:T1:python3] trace: ---- shim_fstat(0, 0x330b48527c30) = 0x0
[P91774:T1:python3] debug: ---- shim_exit_group (returning 1)
[P91774:T1:python3] debug: clearing POSIX locks for pid 1
[P91774:T1:python3] debug: sync client shutdown: closing handles
[P91774:T1:python3] debug: sync client shutdown: waiting for confirmation
[P91774:T1:python3] debug: sync client shutdown: finished
[P91774:shim] debug: Async worker thread terminated
[P91774:shim] debug: IPC worker: exiting worker thread
[P91774:T1:python3] debug: process 91774 exited with status 1

Here is the detailed full log: grpc-log.zip

I've tried a TCP socket with sendmsg and recvmsg in C outside Graphene, but when the transmitted data is more than about 5 MB, it causes a core dump. It seems hard to get to 300 MB.
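A crash at a few megabytes often points to a large on-stack buffer in the C test. The same experiment can be sketched in Python, whose socket.sendmsg/recvmsg wrap the same syscalls; chunking the payload avoids any huge single buffer. This is an assumption-laden sketch, not the original test:

```python
import socket

CHUNK = 1 << 20  # bound each iovec at 1 MiB to avoid huge allocations

def sendmsg_all(sock, data):
    """Send the whole payload via sendmsg(), retrying partial writes."""
    view = memoryview(data)
    while view:
        # sendmsg() takes a list of buffers (iovecs) and returns the
        # total number of bytes sent, which may be a partial count.
        sent = sock.sendmsg([view[:CHUNK]])
        view = view[sent:]

def recvmsg_exact(sock, n):
    """Receive exactly n bytes via recvmsg(), retrying partial reads."""
    buf = bytearray()
    while len(buf) < n:
        data, _anc, _flags, _addr = sock.recvmsg(min(CHUNK, n - len(buf)))
        if not data:
            raise ConnectionError("peer closed before all data arrived")
        buf += data
    return bytes(buf)
```

Running sender and receiver concurrently lets the transfer size grow well past the socket buffer without any single allocation larger than CHUNK.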

jkr0103 commented 3 years ago

I root-caused the issue and fixed it; I will submit the PR on Monday. Please assign the bug to me and add the appropriate bug tag and priority.

boryspoplawski commented 3 years ago

@jkr0103 Thanks! Could you briefly describe what the issue was (or point to the offending file+line, if it was a simple issue)?

jkr0103 commented 3 years ago

If one of the `DkStreamWrite` invocations in sendmsg (https://github.com/oscarlab/graphene/blob/1a2275e00d00a29de12e2d7de840141ce264b2cb/LibOS/shim/src/sys/shim_socket.c#L1157) only did a partial write, the code simply continues writing from the next buffer (`buf[i+1]`), which is wrong. We should bail out on a partial write, as we do in recvmsg: https://github.com/oscarlab/graphene/blob/1a2275e00d00a29de12e2d7de840141ce264b2cb/LibOS/shim/src/sys/shim_socket.c#L1484-L1487
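The control flow described above can be illustrated with a small Python sketch. Here `write` stands in for `DkStreamWrite` and the function names are illustrative; this is not the actual C code from shim_socket.c:

```python
def sendmsg_buggy(write, bufs):
    """Mimics the broken sendmsg loop: after a partial write of
    bufs[i], it moves on to bufs[i+1], silently dropping the unsent
    tail of bufs[i] and corrupting the byte stream."""
    total = 0
    for buf in bufs:
        total += write(buf)  # partial-write return value not checked
    return total

def sendmsg_fixed(write, bufs):
    """Correct behavior: stop and report a short count on the first
    partial write (as the recvmsg path already did), so the caller
    can retry from the first unsent byte."""
    total = 0
    for buf in bufs:
        written = write(buf)
        total += written
        if written < len(buf):
            break  # bail out on a partial write
    return total
```

The buggy variant reports a full send while the stream is missing bytes, which matches the observed corruption: the payload continues, but with chunks dropped and later frame data (e.g. grpc-status headers) landing mid-payload.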

boryspoplawski commented 3 years ago

@jkr0103 Yeah, this must be it, thanks for debugging it.

P.S. Don't forget to add "Fixes #number" to the PR fixing it.