p4lang / p4app-switchML

Switch ML Application
https://switchml.readthedocs.io/
Apache License 2.0

No traffic is communicated due to timeout #24

Open lq470379 opened 2 years ago

lq470379 commented 2 years ago

I have set up the SwitchML P4 app and controller successfully and compiled the client library for the RDMA backend, but I'm hitting an issue similar to #8: no communication happens after worker setup. I have two workers with ICRC disabled, and I have tried both the hello_world example and the allreduce benchmark. Here is the output of GLOG_logtostderr=1 GLOG_v=2 ./allreduce_benchmark for one of the workers (the other worker's output is similar, and hello_world behaves the same way):

 ../third_party/stdarg.h ?? ../../switchml_bkp.cfg
I0902 11:34:34.264950 18461 context.cc:64] Starting switchml context.
I0902 11:34:34.265347 18461 config.cc:139] Using this configuration file 'switchml.cfg'.
I0902 11:34:34.265789 18461 config.cc:216] Printing configuration
I0902 11:34:34.265799 18461 config.cc:219] 
[general]
    rank = 1
    num_workers = 2
    num_worker_threads = 4
    max_outstanding_packets = 4
    packet_numel = 64
    backend = rdma
    scheduler = fifo
    prepostprocessor = cpu_exponent_quantizer
    instant_job_completion = 0
    controller_ip_str = 10.0.0.1
    controller_port = 50099
    timeout = 10000
    timeout_threshold = 100
    timeout_threshold_increment = 100
    --(derived)--
    max_outstanding_packets_per_worker_thread = 1
I0902 11:34:34.265834 18461 config.cc:270] 
[backend.rdma]
    msg_numel = 64
    device_name = mlx5_0
    device_port_id = 1
    gid_index = 3
    --(derived)--
    num_pkts_per_msg = 1
    max_outstanding_msgs = 4
    max_outstanding_msgs_per_worker_thread = 1
I0902 11:34:34.265851 18461 rdma_backend.cc:42] Setting up worker.
I0902 11:34:34.266697 18461 rdma_endpoint.cc:65] Found Verbs device mlx5_0 with guid 0x98039b03008e0d50
I0902 11:34:34.266713 18461 rdma_endpoint.cc:65] Found Verbs device mlx5_1 with guid 0x98039b03008e0d51
I0902 11:34:34.266721 18461 rdma_endpoint.cc:79] Using Verbs device mlx5_0 gid index 3
I0902 11:34:34.290702 18461 rdma_endpoint.cc:116] GID 0 is 0x80fe 0x500d8efeff9b039a
I0902 11:34:34.290791 18461 rdma_endpoint.cc:116] GID 1 is 0x80fe 0x500d8efeff9b039a
I0902 11:34:34.290869 18461 rdma_endpoint.cc:116] GID 2 is 0 0x401a8c0ffff0000
I0902 11:34:34.290951 18461 rdma_endpoint.cc:116] GID 3 is 0 0x401a8c0ffff0000
I0902 11:34:39.579232 18461 context.cc:99] Switchml context started successfully.
Submitting 5 warmup jobs.
I0902 11:34:39.766185 18467 rdma_utils.h:193] Worker 0 bound to core 0 on NUMA node 0
I0902 11:34:39.766194 18469 rdma_utils.h:193] Worker 2 bound to core 2 on NUMA node 0
I0902 11:34:39.766386 18469 rdma_worker_thread.cc:129] Worker 2 QP 0:0x519 using rkey 5 for remote rkey 63210
I0902 11:34:39.766402 18467 rdma_worker_thread.cc:129] Worker 0 QP 0:0x517 using rkey 1 for remote rkey 63210
I0902 11:34:39.773722 18471 rdma_utils.h:193] Worker 3 bound to core 3 on NUMA node 0
I0902 11:34:39.773824 18471 rdma_worker_thread.cc:129] Worker 3 QP 0:0x51a using rkey 7 for remote rkey 63210
I0902 11:34:39.803228 18468 rdma_utils.h:193] Worker 1 bound to core 1 on NUMA node 0
I0902 11:34:39.803387 18468 rdma_worker_thread.cc:129] Worker 1 QP 0:0x518 using rkey 3 for remote rkey 63210
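To check which address the selected GID actually carries, the two values on each "GID n is ..." line can be decoded by hand. The sketch below assumes (this is not confirmed by the log itself) that the two numbers are the 64-bit halves of the 16-byte GID, printed as host-order integers on a little-endian machine. The helper name gid_from_log is made up for illustration.

```python
import ipaddress
import struct


def gid_from_log(prefix_hex: str, iface_hex: str) -> ipaddress.IPv6Address:
    """Rebuild a 16-byte GID from the two values printed in the log.

    Assumption: each value is a raw 64-bit half of the GID that was printed
    as a little-endian host integer, so packing it back as little-endian
    recovers the original byte order.
    """
    raw = struct.pack("<QQ", int(prefix_hex, 16), int(iface_hex, 16))
    return ipaddress.IPv6Address(raw)


# Values copied from the "GID 0" and "GID 3" log lines above.
gid0 = gid_from_log("0x80fe", "0x500d8efeff9b039a")
gid3 = gid_from_log("0", "0x401a8c0ffff0000")

print("GID 0 link-local:", gid0.is_link_local)
print("GID 3 IPv4-mapped address:", gid3.ipv4_mapped)  # 192.168.1.4
```

Under that assumption, GID 3 is the IPv4-mapped form of 192.168.1.4, which matches the IP the controller lists for worker 1, and GID 0 is a link-local address.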

When I interrupt the process after it makes no progress, I get:

^CSignal 2 received, preparing to exit...
I0902 11:37:39.248157 18462 context.cc:105] Stopping switchml context
I0902 11:37:39.248188 18462 scheduler.cc:48] Waking up waiting threads
I0902 11:37:39.248227 18462 rdma_backend.cc:56] Cleaning up worker.
I0902 11:37:39.248417 18462 stats.cc:97] Stats: 
    Submitted jobs: #5#
    Submitted jobs sizes: #[268435456,268435456,268435456,268435456,268435456,]#
    Submitted jobs sizes distribution: #Sum: 1342177280 Mean: 268435456.0000 Max: 268435456  Min: 268435456  Median: 268435456  Stdev: 0.0000    #
    Finished jobs: #0#
    Worker thread: #0#
        Total packets sent: #18#
        Total packets received: #0#
        Wrong packets received: #0#
        Correct packets received: #0#
        Number of timeouts: #17#
    Worker thread: #1#
        Total packets sent: #18#
        Total packets received: #0#
        Wrong packets received: #0#
        Correct packets received: #0#
        Number of timeouts: #17#
    Worker thread: #2#
        Total packets sent: #18#
        Total packets received: #0#
        Wrong packets received: #0#
        Correct packets received: #0#
        Number of timeouts: #17#
    Worker thread: #3#
        Total packets sent: #18#
        Total packets received: #0#
        Wrong packets received: #0#
        Correct packets received: #0#
        Number of timeouts: #17#
I0902 11:37:39.248509 18462 context.cc:130] Stopped switchml context
Warmup finished.
Submitting 10 jobs.
Signal handler thread is exiting
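The per-thread counters in the stats dump can be summarized with a few lines of Python. This is only a sketch: the excerpt is copied from the output above (two of the four threads), the `#value#` pattern is inferred from that output, and the function name parse_worker_stats is hypothetical.

```python
import re

# Excerpt copied from the stats dump above (threads 0 and 1 only).
STATS = """\
    Submitted jobs: #5#
    Finished jobs: #0#
    Worker thread: #0#
        Total packets sent: #18#
        Total packets received: #0#
        Number of timeouts: #17#
    Worker thread: #1#
        Total packets sent: #18#
        Total packets received: #0#
        Number of timeouts: #17#
"""


def parse_worker_stats(text: str) -> dict:
    """Collect the integer '#n#' counters listed under each 'Worker thread'."""
    threads = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"\s*(.+?): #(\d+)#\s*$", line)
        if not match:
            continue
        key, value = match.group(1), int(match.group(2))
        if key == "Worker thread":
            current = threads.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return threads


stats = parse_worker_stats(STATS)
for thread_id, counters in stats.items():
    sent = counters["Total packets sent"]
    received = counters["Total packets received"]
    timeouts = counters["Number of timeouts"]
    print(f"thread {thread_id}: sent={sent} received={received} "
          f"timeouts={timeouts} sent-timeouts={sent - timeouts}")
```

Every thread shows 18 packets sent, 0 received, and 17 timeouts. One possible reading (an interpretation, not something the log states) is that each thread sent a single original packet and then retransmitted it 17 times, with the last transmission still pending when the process was interrupted.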

Here is the output on the controller side:

SwitchML>show_switch_address

Switch MAC: 00:11:22:33:44:55 IP: 192.168.1.100

SwitchML>show_rdma_workers

                                                   Received              Sent        
 Worker ID     Worker MAC        Worker IP     Packets  /    Bytes      Packets  /    Bytes    
     0      98:03:9b:83:1a:b2   192.168.1.2       0     /      0           0     /      0      
     1      98:03:9b:8e:3d:ac   192.168.1.4       0     /      0           0     /      0      

SwitchML>show_ports

  Port Up Valid Enabled Speed  FEC    Tx Packets        Tx Bytes        Rx Packets        Rx Bytes        Rx Errors        Tx Errors        FCS Errors   
  1/0  1    1      1    100G  NONE       219             66345              68             10710              0                0                0        
  2/0  1    1      1    100G  NONE       181             60116             106              8959              0                0                0        
  3/0  1    1      1    100G  NONE       265             83132              33              6105              0                0                0        
  4/0  1    1      1    100G  NONE       178             59180             133             12990              0                0                0        

SwitchML>show_statistics

             Broadcasted         Recirculated       Retransmitted          Dropped       
  Index    Set 0     Set 1     Set 0     Set 1     Set 0     Set 1     Set 0     Set 1   
    0        0         0         0         0         0         0         0         0     
    1        0         0         0         0         0         0         0         0     
    2        0         0         0         0         0         0         0         0     
    3        0         0         0         0         0         0         0         0     
    4        0         0         0         0         0         0         0         0     
    5        0         0         0         0         0         0         0         0     
    6        0         0         0         0         0         0         0         0     
    7        0         0         0         0         0         0         0         0     

And on the switch side I get:

bf-sde.pm> show
-----+----+---+----+-------+----+--+--+---+---+---+--------+----------------+----------------+-
PORT |MAC |D_P|P/PT|SPEED  |FEC |AN|KR|RDY|ADM|OPR|LPBK    |FRAMES RX       |FRAMES TX       |E
-----+----+---+----+-------+----+--+--+---+---+---+--------+----------------+----------------+-
1/0  |23/0|132|2/ 4|100G   |NONE|Ds|Au|YES|ENB|UP |  NONE  |              68|             219|
2/0  |22/0|140|2/12|100G   |NONE|Ds|Au|YES|ENB|UP |  NONE  |             106|             181|
3/0  |21/0|148|2/20|100G   |NONE|Ds|Au|YES|ENB|UP |  NONE  |              33|             265|
4/0  |20/0|156|2/28|100G   |NONE|Ds|Au|YES|ENB|UP |  NONE  |             133|             178|

My environment is:

Switch: Wedge BF100-32x
SDE: 9.9.0
Python: 3.8
NICs: ConnectX-5

My ports.yaml has:

ports:
    1/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:8e:82:98"}
    2/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:83:1a:b2"}
    3/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:83:34:d2"}
    4/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:8e:3d:ac"}
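As a quick sanity check, the worker MACs reported by show_rdma_workers can be matched against the MACs in ports.yaml. The sketch below hard-codes both tables from the output in this post instead of reading the real files, and port_for_mac is a hypothetical helper.

```python
# MAC per front-panel port, copied from ports.yaml above.
PORTS = {
    "1/0": "98:03:9b:8e:82:98",
    "2/0": "98:03:9b:83:1a:b2",
    "3/0": "98:03:9b:83:34:d2",
    "4/0": "98:03:9b:8e:3d:ac",
}

# Worker ID -> (MAC, IP), copied from the show_rdma_workers output above.
WORKERS = {
    0: ("98:03:9b:83:1a:b2", "192.168.1.2"),
    1: ("98:03:9b:8e:3d:ac", "192.168.1.4"),
}


def port_for_mac(mac: str):
    """Return the port whose configured MAC equals `mac`, or None."""
    wanted = mac.lower()
    for port, port_mac in PORTS.items():
        if port_mac.lower() == wanted:
            return port
    return None


for worker_id, (mac, ip) in WORKERS.items():
    print(f"worker {worker_id} ({ip}, {mac}) -> port {port_for_mac(mac)}")
```

With the values above, worker 0 maps to port 2/0 and worker 1 to port 4/0, so both registered workers do have a matching ports.yaml entry. This only confirms that the two tables agree with each other, not that those MACs belong to the NIC ports actually cabled to 2/0 and 4/0.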

And finally my config file is here:

switchml.cfg.txt

My guess is that the switch data plane, as an endpoint, is unreachable for some reason (but only one packet does not time out, so I'm not sure). Is there a way to ensure connectivity between

Thank you!

lq470379 commented 2 years ago

@OasisArtisan @AmedeoSapio Also, I just noticed that if I ping between the workers and run tcpdump at the destination, the packets arrive but with all-zero values: [screenshot: Screen Shot 2022-09-02 at 4 31 58 PM]

The ICMP packets leaving the sending worker, however, look fine.

Does this mean they were malformed by the switch?

lq470379 commented 2 years ago

@OasisArtisan @AmedeoSapio I think this issue is probably caused by something simple, such as a misconfiguration, but unfortunately I have not found a solution.