p4lang / p4runtime-shell

An interactive Python shell for P4Runtime
Apache License 2.0
79 stars 40 forks source link

Fix a bug that caused the controller to hang indefinitely on exit. #113

Open emilkhshiboun opened 1 year ago

emilkhshiboun commented 1 year ago

During the teardown process (i.e when issuing quit() command), the controller tries to close all resources including the StreamChannel (StreamChannel is started on controller initaliazation).

StreamChannel events are handled by separated thread, which listens for gRPC messages/replies from the P4Agent, when there is no requests, the StreamChannel blocks, which in turn blocks the thread. As part of the teardown, the controller join() the StreamChannel thread, which does not exit as there are no requests from the P4Agent and thus the controller hangs forever.

The fix uses TimeoutIterator to query StreamChannel events with given timeout, if no events are received the loops continues and thus can check for program termination

This commit adds dependecy on 'iterators' module that implments the TimeoutIterator

antoninbas commented 1 year ago

I am not able to reproduce this bug, so could you share some details? When running quit, the client should already be closing the stream. If the P4Runtime server is implemented correctly, the for loop in stream_recv should break. Are you using simple_switch_grpc?

These are the steps I used: 1) Start simple_switch_grpc with simple_switch_grpc --no-p4 --log-console 2) Start the p4runtime-shell with python3 -m p4runtime_sh --device-id 0 --election-id 0,1 --config simple_router.proto.txt,simple_router.json 3) Type quit in the shell; in my case it exited immediately

I tried the steps above a couple of times.

antonin@ubuntu-22:~$ simple_switch_grpc --no-p4 --log-console
Calling target program-options parser
Server listening on 0.0.0.0:9559
[14:03:55.511] [bmv2] [I] [thread 7705] Starting Thrift server on port 9090
[14:03:55.512] [bmv2] [I] [thread 7705] Thrift server was started
[14:03:57.481] [bmv2] [D] [thread 7723] Set default default entry for table 'ipv4_lpm': NoAction - 
[14:03:57.481] [bmv2] [D] [thread 7723] Set default default entry for table 'forward': NoAction - 
[14:03:57.481] [bmv2] [D] [thread 7723] Set default default entry for table 'send_frame': NoAction - 
[14:03:57.482] [bmv2] [D] [thread 7723] simple_switch target has been notified of a config swap
[14:04:00.392] [bmv2] [D] [thread 7724] Set default default entry for table 'ipv4_lpm': NoAction - 
[14:04:00.392] [bmv2] [D] [thread 7724] Set default default entry for table 'forward': NoAction - 
[14:04:00.392] [bmv2] [D] [thread 7724] Set default default entry for table 'send_frame': NoAction - 
[14:04:00.393] [bmv2] [D] [thread 7724] simple_switch target has been notified of a config swap
(venv) antonin@ubuntu-22:~$ python3 -m p4runtime_sh --device-id 0 --election-id 0,1 --config simple_router.proto.txt,simple_router.json
*** Welcome to the IPython shell for P4Runtime ***
P4Runtime sh >>> quit
(venv) antonin@ubuntu-22:~$ python3 -m p4runtime_sh --device-id 0 --election-id 0,1 --config simple_router.proto.txt,simple_router.json
*** Welcome to the IPython shell for P4Runtime ***
P4Runtime sh >>> quit
(venv) antonin@ubuntu-22:~$
emilkhshiboun commented 1 year ago

I am not able to reproduce this bug, so could you share some details? When running quit, the client should already be closing the stream. If the P4Runtime server is implemented correctly, the for loop in stream_recv should break. Are you using simple_switch_grpc?

These are the steps I used:

  1. Start simple_switch_grpc with simple_switch_grpc --no-p4 --log-console
  2. Start the p4runtime-shell with python3 -m p4runtime_sh --device-id 0 --election-id 0,1 --config simple_router.proto.txt,simple_router.json
  3. Type quit in the shell; in my case it exited immediately

I tried the steps above a couple of times.

antonin@ubuntu-22:~$ simple_switch_grpc --no-p4 --log-console
Calling target program-options parser
Server listening on 0.0.0.0:9559
[14:03:55.511] [bmv2] [I] [thread 7705] Starting Thrift server on port 9090
[14:03:55.512] [bmv2] [I] [thread 7705] Thrift server was started
[14:03:57.481] [bmv2] [D] [thread 7723] Set default default entry for table 'ipv4_lpm': NoAction - 
[14:03:57.481] [bmv2] [D] [thread 7723] Set default default entry for table 'forward': NoAction - 
[14:03:57.481] [bmv2] [D] [thread 7723] Set default default entry for table 'send_frame': NoAction - 
[14:03:57.482] [bmv2] [D] [thread 7723] simple_switch target has been notified of a config swap
[14:04:00.392] [bmv2] [D] [thread 7724] Set default default entry for table 'ipv4_lpm': NoAction - 
[14:04:00.392] [bmv2] [D] [thread 7724] Set default default entry for table 'forward': NoAction - 
[14:04:00.392] [bmv2] [D] [thread 7724] Set default default entry for table 'send_frame': NoAction - 
[14:04:00.393] [bmv2] [D] [thread 7724] simple_switch target has been notified of a config swap
(venv) antonin@ubuntu-22:~$ python3 -m p4runtime_sh --device-id 0 --election-id 0,1 --config simple_router.proto.txt,simple_router.json
*** Welcome to the IPython shell for P4Runtime ***
P4Runtime sh >>> quit
(venv) antonin@ubuntu-22:~$ python3 -m p4runtime_sh --device-id 0 --election-id 0,1 --config simple_router.proto.txt,simple_router.json
*** Welcome to the IPython shell for P4Runtime ***
P4Runtime sh >>> quit
(venv) antonin@ubuntu-22:~$

Hi Anton, I'm not using simple_switch_grpc, i am using custom p4 server.

I might be missing something here, but I could not find any information on the p4 specification regarding client disconnecting, what do you mean the p4server implemented incorrectly ?

From the Controller code in tear_down(), no message/signal is sent to the p4 server, so the server can not know when the client disconnected (correct me if i am wrong)

Here is my output: (running with -v)

DEBUG:root:Creating P4Runtime client
DEBUG:root:Connecting to device 1000 at 127.0.0.1:9559
DEBUG:root:Using insecure channel
DEBUG:root:Session established, client is 'primary'
DEBUG:root:Setting forwarding pipeline config
DEBUG:root:Retrieving P4Info file
DEBUG:root:Parsing P4Info message
DEBUG:asyncio:Using selector: EpollSelector
*** Welcome to the IPython shell for P4Runtime ***
DEBUG:asyncio:Using selector: EpollSelector
P4Runtime sh >>> quit()
DEBUG:root:Cleaning up stream

The above stuck forever

antoninbas commented 1 year ago

I'm not using simple_switch_grpc, i am using custom p4 server.

That could explain it. Probably a bug in the implementation of the server.

I might be missing something here, but I could not find any information on the p4 specification regarding client disconnecting, what do you mean the p4server implemented incorrectly ?

I assume you mean P4Runtime specification here, and not P4 specification. This is out-of-scope of P4Runtime, or rather it is not specific to P4Runtime. This is about correctly implementing the P4Runtime service - and more specifically the StreamChannel RPC) in your gRPC server. So it's more about gRPC bi-directional streams, and less about p4Runtime. The correct implementation will depend on 1) the programming language you use for your gRPC server, and 2) whether you use synchronous or asynchronous gRPC APIs in the server.

The reference P4Runtime server (C++) can be found here: https://github.com/p4lang/PI/blob/main/proto/server/pi_server.cpp

In my experience, it is easier to use synchronous gRPC, while asynchronous APIs require more care when dealing with streams. I don't know which one you are using in your custom P4 server.