Kanakoo closed this issue 3 years ago.
The issue was tested on the latest version, v1.2-rc1.
Can you try the current master?
Yes, I've just tried, the issue still exists.
I've found another issue. When the transmitted data is large enough, about 300 MB, the gRPC client will sometimes output errors like these:
[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'helloworld.HelloReply.message' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
ERROR:grpc._common:Exception deserializing message!
Traceback (most recent call last):
File "/venv/lib/python3.6/site-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
google.protobuf.message.DecodeError: Error parsing message
Traceback (most recent call last):
File "scripts/grpc_client.py", line 27, in <module>
run()
File "scripts/grpc_client.py", line 19, in run
response = stub.SayHello(helloworld_pb2.HelloRequest(name = data))
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "Exception deserializing response!"
debug_error_string = "None"
or
Traceback (most recent call last):
File "scripts/grpc_client.py", line 27, in <module>
run()
File "scripts/grpc_client.py", line 19, in run
response = stub.SayHello(helloworld_pb2.HelloRequest(name = data))
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Failed parsing HTTP/2"
debug_error_string = "{"created":"@1630602970.948493000","description":"Error received from peer ipv4:10.67.125.170:50051","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Failed parsing HTTP/2","grpc_status":14}"
(The above errors happen when the client sends 300 MB of data to the server.) or
Traceback (most recent call last):
File "scripts/grpc_client.py", line 26, in <module>
run()
File "scripts/grpc_client.py", line 18, in run
response = stub.SayHello(helloworld_pb2.HelloRequest(name='hello'))
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Stream removed"
debug_error_string = "{"created":"@1630517301.980850000","description":"Error received from peer ipv4:10.67.125.170:50051","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Stream removed","grpc_status":2}"
(The above error happens when the server replies with 300 MB of data to the client.)
To reproduce, we first need to raise gRPC's maximum message size. The grpc_server.py and grpc_client.py code in the example zip file is as below. grpc_server.py:
from concurrent import futures
import logging

import grpc
from protos import helloworld_pb2
from protos import helloworld_pb2_grpc

MAX_MESSAGE_LENGTH = 1024 * 1024 * 1024


class Greeter(helloworld_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        print(len(request.name))
        data = (300 * 1024 * 1024) * 'a'
        return helloworld_pb2.HelloReply(message=data)


def serve():
    options = [('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),
               ('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH)]
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10), options=options)
    helloworld_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()


if __name__ == '__main__':
    logging.basicConfig()
    serve()
grpc_client.py:
from __future__ import print_function
import logging

import grpc
from protos import helloworld_pb2
from protos import helloworld_pb2_grpc

MAX_MESSAGE_LENGTH = 1024 * 1024 * 1024


def run():
    # NOTE(gRPC Python Team): .close() is possible on a channel and should be
    # used in circumstances in which the with statement does not fit the needs
    # of the code.
    options = [('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),
               ('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH)]
    with grpc.insecure_channel('10.67.125.170:50051', options=options) as channel:
        stub = helloworld_pb2_grpc.GreeterStub(channel)
        data = (300 * 1024 * 1024) * 'a'
        response = stub.SayHello(helloworld_pb2.HelloRequest(name=data))
        print(len(response.message))
        with open('/tcf/message.txt', 'wb') as f:
            f.write(response.message.encode())


if __name__ == '__main__':
    logging.basicConfig()
    run()
Interesting, it seems like large sends/recvs/reads/writes get mangled. I'd expect gRPC to have a proper retry loop for partial operations, but on the other hand I think we also have this part done properly :)
Could you try to reproduce this issue in C, without gRPC? This would require checking which syscalls gRPC uses for the buggy transmission (you can do this in Graphene with "trace" log level) and then trying to do the same in C and verifying if it works. But remember to implement a proper retry loop, as partial reads/writes are perfectly valid on Linux.
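For example, a proper retry loop for writes looks roughly like this (a minimal sketch in plain C over an already-open fd; illustrative only, not taken from gRPC or Graphene):

#include <errno.h>
#include <unistd.h>

/* Write the whole buffer, retrying on partial writes and EINTR. */
ssize_t write_all(int fd, const char* buf, size_t size) {
    size_t done = 0;
    while (done < size) {
        ssize_t ret = write(fd, buf + done, size - done);
        if (ret < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: just retry */
            return -1;           /* real error */
        }
        done += (size_t)ret;     /* partial write is valid: continue from the offset */
    }
    return (ssize_t)done;
}

The read side needs the same treatment (and additionally has to handle a return value of 0, i.e. EOF).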
I ran Graphene with the trace log level, but couldn't find out which syscall caused the error... server log:
...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T5:python3] trace: ---- return from shim_epoll_wait(...) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563270) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563450) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563370) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_getcpu(0x2c8ae45631cc, 0x0, 0x0) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae45631c0) = 0x0
[P14582:T5:python3] trace: ---- shim_epoll_wait(3, 0x2c8ae70f4d04, 100, 201) ...
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
[P14582:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P14582:T5:python3] trace: ---- return from shim_epoll_wait(...) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563270) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563370) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(0, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae4563240) = 0x0
[P14582:T5:python3] trace: ---- shim_getcpu(0x2c8ae45631cc, 0x0, 0x0) = 0x0
[P14582:T5:python3] trace: ---- shim_clock_gettime(1, 0x2c8ae45631c0) = 0x0
[P14582:T5:python3] trace: ---- shim_epoll_wait(3, 0x2c8ae70f4d04, 100, 201) ...
[P14582:T1:python3] trace: ---- return from shim_futex(...) = -110
[P14582:T1:python3] trace: ---- shim_clock_gettime(1, 0x2c8aea004790) = 0x0
[P14582:T1:python3] trace: ---- shim_gettimeofday(0x2c8aea004770, 0x0) = 0x0
[P14582:T1:python3] trace: ---- shim_futex(0x2c8ae61000c0, FUTEX_PRIVATE|FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 0, 0x2c8aea004760, 0x0, -1) ...
...
client log when everything is normal:
...
[P88349:T3:python3] trace: ---- shim_madvise(0x5eedbc29f000, 0x1fb000, MADV_DONTNEED) = -38
[P88349:T3:python3] debug: ---- shim_exit (returning 0)
[P88349:T3:python3] debug: Installed async event at 1630577724179050
[P88349:shim] debug: Thread exited, cleaning up
[P88349:T1:python3] trace: ---- return from shim_futex(...) = 0x0
[P88349:T1:python3] trace: ---- shim_close(4) = 0x0
[P88349:T1:python3] trace: ---- shim_close(5) = 0x0
[P88349:T1:python3] trace: ---- shim_close(3) = 0x0
[P88349:T1:python3] trace: ---- shim_munmap(0x5eedbc6b1000, 0xb0000) ...
[P88349:T1:python3] trace: ---- return from shim_munmap(...) = 0x0
[P88349:T1:python3] trace: ---- shim_mmap(0x0, 0x12c01000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x0) ...
[P88349:T1:python3] trace: ---- return from shim_mmap(...) = 0x5eed73399000
[P88349:T1:python3] trace: ---- shim_munmap(0x5eed73399000, 0x12c01000) ...
[P88349:T1:python3] trace: ---- return from shim_munmap(...) = 0x0
[P88349:T1:python3] trace: ---- shim_munmap(0x5eed85f9a000, 0x12c01000) ...
[P88349:T1:python3] trace: ---- return from shim_munmap(...) = 0x0
[P88349:T1:python3] trace: ---- shim_write(1, 0x5eedc04a9c90, 0xa) ...
314572800
[P88349:T1:python3] trace: ---- return from shim_write(...) = 0xa
[P88349:T1:python3] trace: ---- shim_rt_sigaction([SIGINT], 0x5eedc1b89970, 0x5eedc1b89a10, 0x8) = 0x0
[P88349:T1:python3] trace: ---- shim_sigaltstack(0x0, 0x5eedc1b89c00) = 0x0
[P88349:T1:python3] trace: ---- shim_sigaltstack(0xa76c90, 0x0) = 0x0
[P88349:T1:python3] trace: ---- shim_futex(0x5eedbe348948, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0x0, 0x5eedbd9f6260, 1) ...
[P88349:T1:python3] trace: ---- return from shim_futex(...) = 0x0
[P88349:T1:python3] trace: ---- shim_fstat(0, 0x5eedc1b89c20) = 0x0
[P88349:T1:python3] trace: ---- shim_fstat(0, 0x5eedc1b89c20) = 0x0
[P88349:T1:python3] trace: ---- shim_fstat(0, 0x5eedc1b89c30) = 0x0
[P88349:T1:python3] debug: ---- shim_exit_group (returning 0)
[P88349:T1:python3] debug: clearing POSIX locks for pid 1
[P88349:T1:python3] debug: sync client shutdown: closing handles
[P88349:T1:python3] debug: sync client shutdown: waiting for confirmation
[P88349:T1:python3] debug: sync client shutdown: finished
[P88349:shim] debug: Async worker thread terminated
[P88349:shim] debug: IPC worker: exiting worker thread
[P88349:T1:python3] debug: process 88349 exited with status 0
client log when the error happens:
...
[P91774:T1:python3] trace: ---- shim_openat(AT_FDCWD, "/venv/lib/python3.6/site-packages/grpc/_channel.py", O_RDONLY|0x80000, 0000) = 0x3
[P91774:T1:python3] trace: ---- shim_fstat(3, 0x330b48526470) = 0x0
[P91774:T1:python3] trace: ---- shim_ioctl(3, TCGETS, 0x330b48526440) ...
[P91774:T1:python3] trace: ---- return from shim_ioctl(...) = -38
[P91774:T1:python3] trace: ---- shim_lseek(3, 0x0, SEEK_CUR) = 0x0
[P91774:T1:python3] trace: ---- shim_fcntl(3, OP 1030, 0x0) = 0x4
[P91774:T1:python3] trace: ---- shim_fcntl(4, F_GETFL, 0x7146cc) = 0x80000
[P91774:T1:python3] trace: ---- shim_newfstatat(4, "", 0x330b485262b0, 4096) = 0x0
[P91774:T1:python3] trace: ---- shim_read(4, 0x330b46b66090, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_close(4) = 0x0
[P91774:T1:python3] trace: ---- shim_lseek(3, 0x0, SEEK_SET) = 0x0
[P91774:T1:python3] trace: ---- shim_lseek(3, 0x0, SEEK_CUR) = 0x0
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b432a8d00, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b46b660b0, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b432a8d00, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: ---- shim_read(3, 0x330b46b660b0, 0x2000) ...
[P91774:T1:python3] trace: ---- return from shim_read(...) = 0x2000
[P91774:T1:python3] trace: sync client: destroying handle: 0x10000015c
[P91774:T1:python3] trace: ---- shim_close(3) = 0x0
[P91774:T1:python3] trace: ---- shim_write(2, 0x330b46e6b320, 0x36c) ...
Traceback (most recent call last):
File "scripts/grpc_client.py", line 26, in <module>
run()
File "scripts/grpc_client.py", line 18, in run
response = stub.SayHello(helloworld_pb2.HelloRequest(name='hello'))
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/venv/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Stream removed"
debug_error_string = "{"created":"@1630578115.422138000","description":"Error received from peer ipv4:10.67.125.170:50051","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Stream removed","grpc_status":2}"
>
[P91774:T1:python3] trace: ---- return from shim_write(...) = 0x36c
[P91774:T1:python3] trace: ---- shim_rt_sigaction([SIGINT], 0x330b48527970, 0x330b48527a10, 0x8) = 0x0
[P91774:T1:python3] trace: ---- shim_sigaltstack(0x0, 0x330b48527c00) = 0x0
[P91774:T1:python3] trace: ---- shim_sigaltstack(0xa76c90, 0x0) = 0x0
[P91774:T1:python3] trace: ---- shim_futex(0x330b44ca6948, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0x0, 0x330b43204ec0, 1) ...
[P91774:T1:python3] trace: ---- return from shim_futex(...) = 0x0
[P91774:T1:python3] trace: ---- shim_fstat(0, 0x330b48527c20) = 0x0
[P91774:T1:python3] trace: ---- shim_fstat(0, 0x330b48527c20) = 0x0
[P91774:T1:python3] trace: ---- shim_fstat(0, 0x330b48527c30) = 0x0
[P91774:T1:python3] debug: ---- shim_exit_group (returning 1)
[P91774:T1:python3] debug: clearing POSIX locks for pid 1
[P91774:T1:python3] debug: sync client shutdown: closing handles
[P91774:T1:python3] debug: sync client shutdown: waiting for confirmation
[P91774:T1:python3] debug: sync client shutdown: finished
[P91774:shim] debug: Async worker thread terminated
[P91774:shim] debug: IPC worker: exiting worker thread
[P91774:T1:python3] debug: process 91774 exited with status 1
Here is the detailed full log: grpc-log.zip
I've tried a TCP socket with sendmsg and recvmsg in C outside Graphene, but when the transmitted data is more than about 5 MB, it core dumps. It seems hard to try 300 MB.
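(A plausible cause of a crash around 5 MB is a stack-allocated test buffer hitting the default 8 MB stack limit; a heap-allocated buffer with the usual retry loop can scale to 300 MB. A rough sketch of the sending side, assuming sock_fd is an already-connected blocking TCP socket; names are hypothetical, not the reporter's actual test code:)

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send `total` bytes of 'a' over a connected TCP socket, handling partial sends. */
int send_large(int sock_fd, size_t total) {
    char* buf = malloc(total);   /* heap, not stack: 300 MB does not fit on the default stack */
    if (!buf)
        return -1;
    memset(buf, 'a', total);

    int ret_val = 0;
    size_t sent = 0;
    while (sent < total) {
        struct iovec iov = { .iov_base = buf + sent, .iov_len = total - sent };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        ssize_t ret = sendmsg(sock_fd, &msg, 0);
        if (ret < 0) {
            if (errno == EINTR)
                continue;        /* retry on signal interruption */
            ret_val = -1;
            break;
        }
        sent += (size_t)ret;     /* partial send is valid: continue from the offset */
    }
    free(buf);
    return ret_val;
}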
Root-caused the issue and fixed it. I will submit the PR on Monday. Please assign the bug to me and add the appropriate bug tag and priority.
@jkr0103 Thanks! Could you briefly describe what the issue was (or point to the offending file and line, if it was a simple issue)?
If one of the DkStreamWrite invocations in sendmsg (https://github.com/oscarlab/graphene/blob/1a2275e00d00a29de12e2d7de840141ce264b2cb/LibOS/shim/src/sys/shim_socket.c#L1157) only did a partial write, the code simply continues writing from the next buffer (buf[i+1]), which is wrong. We should bail out as we do in recvmsg: https://github.com/oscarlab/graphene/blob/1a2275e00d00a29de12e2d7de840141ce264b2cb/LibOS/shim/src/sys/shim_socket.c#L1484-L1487
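To illustrate the pattern (a simplified sketch, not the actual shim_socket.c code; plain write() stands in for DkStreamWrite here):

#include <sys/uio.h>
#include <unistd.h>

/* Write an iovec array to fd, stopping on the first partial write. */
ssize_t send_iov(int fd, const struct iovec* iov, size_t iovlen) {
    ssize_t total = 0;
    for (size_t i = 0; i < iovlen; i++) {
        ssize_t written = write(fd, iov[i].iov_base, iov[i].iov_len);
        if (written < 0)
            return total > 0 ? total : -1;
        total += written;
        /* The bug: the old code kept going with iov[i+1] even after a partial
         * write, leaving a hole in the byte stream. The fix is to bail out here
         * (as recvmsg already does) and report the short count to the caller. */
        if ((size_t)written < iov[i].iov_len)
            break;
    }
    return total;
}

Continuing past a short write silently drops the unwritten tail of iov[i], which is exactly the kind of corruption the client sees as invalid protobuf data or a broken HTTP/2 frame.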
@jkr0103 Yeah, this must be it, thanks for debugging it.
P.S. Don't forget to add "Fixes #number" to the PR fixing it.
Description of the problem
When running Python gRPC in Graphene with the gRPC server and client on different machines, data transmission sometimes goes wrong when the transmitted data is large.
Steps to reproduce
The issue was tested on the latest version, v1.2-rc1. It happens with both graphene-direct and graphene-sgx when the gRPC server and client are on different machines; there is no such issue when running outside Graphene. For simplicity, reproduce it with graphene-direct:
Expected results
In the example, the server sends a string of 1,000,000 'a' characters to the client. After the client finishes running, it generates a file at graphene/Examples/grpc/scripts/message.txt containing the data received from the server. The string in the file is expected to be 1,000,000 'a' characters.
Actual results
Run grep -r "grpc" scripts/message.txt; the output is "Binary file scripts/message.txt matches", which means there is some other, unexpected string in the file (opening the file and searching for the keyword "grpc" indeed finds matches). Since the issue only happens sometimes, you may not get the wrong string in a single run; run it 1~10 times.