sonic-net / DASH

Disaggregated APIs for SONiC Hosts
Apache License 2.0

make run-all-tests fails #218

Open mhanif opened 2 years ago

mhanif commented 2 years ago

On a clean clone of "main" made today, make run-all-tests fails with the following error. I'm not sure what I am doing wrong:

$ make run-all-tests
# Ensure P4Runtime server is listening
t=5; \
while [ ${t} -ge 1 ]; do \
    if sudo lsof -i:9559 | grep LISTEN >/dev/null; then \
        break; \
    else \
        sleep 1; \
        t=`expr $t - 1`; \
    fi; \
done; \
docker exec -w /tests/libsai/vnet_out simple_switch-mhanif ./vnet_out
lsof: no pwd entry for UID 4321
lsof: no pwd entry for UID 4321
GRPC call SetForwardingPipelineConfig 0.0.0.0:9559 => /etc/dash/dash_pipeline.json, /etc/dash/dash_pipeline_p4rt.txt
GRPC ERROR[7]: Not primary, GRPC call Write::INSERT ERROR:
table_id: 38960243 match { field_id: 1 exact { value: "\000\000<" } } action { action { action_id: 21912829 } }Failed to create Direction Lookup Entry
make: *** [Makefile:253: run-libsai-test] Error 1

chrispsommers commented 2 years ago

Hi @hanif, this is similar to https://github.com/Azure/DASH/issues/186, which you closed yesterday. The symptoms are similar, except this time you also have lsof: no pwd entry for UID 4321, which must be related to the recently merged https://github.com/Azure/DASH/pull/202

Like last time, I see GRPC ERROR[7]: Not primary, GRPC call Write::INSERT ERROR:, which I've seen in the past when there is already a P4Runtime client owning the session to the switch. Can you confirm that no extraneous docker containers are running which might have attached to the bmv2 switch? Please paste the output of docker ps -a here, thanks.

mhanif commented 2 years ago

Hi @chrispsommers, thanks a lot for looking into this. Below is the requested output:

$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED         STATUS                       PORTS     NAMES
940c99d8af69   chrissommers/dash-saithrift-bldr:220819   "./saiserver"            2 hours ago     Up 2 hours                             dash-saithrift-server-mhanif
eeb81e8cff56   chrissommers/dash-bmv2-bldr:220819        "env LD_LIBRARY_PATH…"   2 hours ago     Up 2 hours                             simple_switch-mhanif
058107d1ad3e   hello-world                               "/hello"                 5 months ago    Exited (0) 5 months ago                agitated_sinoussi
14ba71aec719   nf:latest                                 "tail -f /dev/null"      10 months ago   Exited (137) 10 months ago             fw
beb0503f1b82   endpoint:latest                           "tail -f /dev/null"      10 months ago   Exited (137) 10 months ago             ext
1bb2dabe1b0f   endpoint:latest                           "tail -f /dev/null"      10 months ago   Exited (137) 10 months ago             int
36465ac7f57f   endpoint:latest                           "tail -f /dev/null"      10 months ago   Exited (137) 10 months ago             h1
9acb562fb0f7   8425a5f345b8                              "/bin/sh -c 'apt-get…"   10 months ago   Exited (1) 10 months ago               determined_jackson
16377b7c6997   ubuntu:trusty                             "sh"                     10 months ago   Exited (0) 10 months ago               affectionate_lovelace

chrispsommers commented 2 years ago

Looks normal to me. I assume you executed the customary three commands in three consoles (make run-switch, make run-saithrift-server, make run-all-tests)? Does this happen every time?

mhanif commented 2 years ago

Yes, I ran the 3 commands in 3 separate consoles. The others behave as expected, but run-all-tests doesn't. I tried a couple of times after calling kill-all and re-running them, and I get the same result. Thanks!

chrispsommers commented 2 years ago

Looks like it happened once in the CI pipeline, for the first time ever AFAIK: https://github.com/Azure/DASH/actions/runs/3075002621/jobs/4968124219#step:13:24

It passed on a subsequent retry. I suspect a race condition when this line is executed: https://github.com/Azure/DASH/blob/17912b53472433b9bfef4d04651d8d816571c0e8/dash-pipeline/Makefile#L280

We look for a listener on the P4Runtime server socket, but there may be a delay before the server is actually "ready." Could you try inserting a sleep(3) or similar before this step and see if it makes your test runs succeed? Another experiment is to run all the steps of make run-all-tests manually, i.e. init-switch and so forth. If the manual run passes, that would also reinforce the theory that it's a timing issue that varies with CPU speed, environment, etc.
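
As an alternative to a fixed sleep, the first step that depends on the P4Runtime server could be retried a few times with a short pause in between. The sketch below is only an illustration of that idea: `retry` is a hypothetical helper that does not exist in the DASH Makefile, and whether a given step is safe to repeat is an assumption that would need checking.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD...
# Hypothetical helper, not part of the DASH Makefile.
# Runs CMD until it exits 0, sleeping DELAY seconds between failed attempts.
# Returns 0 on the first success, 1 once all ATTEMPTS runs have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # No need to sleep after the final failed attempt
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

For example, `retry 5 1 some_step` would run the placeholder command `some_step` up to five times, one second apart. Unlike a fixed sleep, this costs no extra time when the server is ready immediately, but it only helps if the retried step has no side effects when it fails partway through.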