usnistgov / ndn-dpdk

NDN-DPDK: High-Speed Named Data Networking Forwarder
https://www.nist.gov/publications/ndn-dpdk-ndn-forwarding-100-gbps-commodity-hardware
Other
131 stars 26 forks source link

Error when using the file server #76

Closed Syafaatdehaf closed 1 year ago

Syafaatdehaf commented 1 year ago

When I send NDN Video Streaming files using NDN-DPDK it always has the output as below:

Error: Interest rejected: expire @RDR-c(/8=fileserver/8=home-ndn-video-streaming-6s/8=file000.ts)
    at Object.tx (file:///usr/lib/node_modules/@ndn/cat/node_modules/@ndn/endpoint/lib/consumer_node.js:51:36)
Error: Command failed: export NDNTS_UPLINK=ndndpdk-udp: ; export NDNTS_NDNDPDK_GQLSERVER=http://127.0.0.1:3030 ; alias ndncat='npx -y -p https://ndnts-nightly.ndn.today/cat.tgz ndncat' ; ndncat get /fileserver/home-ndn-video-streaming-6s/file000.ts --ver rdr > temp/file000.ts
ndncat get-segmented <name>

retrieve segmented object

Positionals:
  name  name prefix                                          [string] [required]

Options:
  --help        Show help                                              [boolean]
  --version     Show version number                                    [boolean]
  --convention  Naming Convention revision
                                        [number] [choices: 1, 2, 3] [default: 3]
  --ver         version number or discovery method
                none: no version component
                cb
                p: send Interest with CanBePrefix
                rdr: use RDR protocol                  [string] [default: "rdr"]

Error: Interest rejected: expire @RDR-c(/8=fileserver/8=home-ndn-video-streaming-6s/8=file000.ts)
    at Object.tx (file:///usr/lib/node_modules/@ndn/cat/node_modules/@ndn/endpoint/lib/consumer_node.js:51:36)

    at checkExecSyncError (node:child_process:861:11)
    at execSync (node:child_process:932:15)
    at /home/ndn/vidstreamndn/index.js:42:5
    at Layer.handle [as handle_request] (/home/ndn/vidstreamndn/node_modules/express/lib/router/layer.js:95:5)
    at next (/home/ndn/vidstreamndn/node_modules/express/lib/router/route.js:144:13)
    at Route.dispatch (/home/ndn/vidstreamndn/node_modules/express/lib/router/route.js:114:3)
    at Layer.handle [as handle_request] (/home/ndn/vidstreamndn/node_modules/express/lib/router/layer.js:95:5)
    at /home/ndn/vidstreamndn/node_modules/express/lib/router/index.js:284:15
    at param (/home/ndn/vidstreamndn/node_modules/express/lib/router/index.js:365:14)
get file000.ts

when the interest rejected error is output, the forwarder on the server side automatically shuts down

and the output on the NDN-DPDK service is like this: 
0.0.1:49372","remote":"127.0.0.1:32939","scheme":"udp"}}
{"level":"info","ts":1688190865.002458,"logger":"socketface","msg":"face started","id":13789,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688190865.002869,"logger":"iface","msg":"face created","id":13789,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:41349","remote":"127.0.0.1:49926","scheme":"udp"}}
{"level":"info","ts":1688190877.720971,"logger":"socketface","msg":"face started","id":11970,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688190877.7213893,"logger":"iface","msg":"face created","id":11970,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:37323","remote":"127.0.0.1:41313","scheme":"udp"}}
{"level":"info","ts":1688190890.462187,"logger":"socketface","msg":"face started","id":22784,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688190890.4625895,"logger":"iface","msg":"face created","id":22784,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:49148","remote":"127.0.0.1:50284","scheme":"udp"}}
{"level":"info","ts":1688190903.1825066,"logger":"socketface","msg":"face started","id":45006,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688190903.1828682,"logger":"iface","msg":"face created","id":45006,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:37098","remote":"127.0.0.1:47066","scheme":"udp"}}
{"level":"info","ts":1688191047.1230705,"logger":"socketface","msg":"face started","id":30140,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191047.1235015,"logger":"iface","msg":"face created","id":30140,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:38893","remote":"127.0.0.1:51137","scheme":"udp"}}
{"level":"info","ts":1688191057.098063,"logger":"socketface","msg":"face started","id":8095,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191057.0986364,"logger":"iface","msg":"face created","id":8095,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:36500","remote":"127.0.0.1:44664","scheme":"udp"}}
{"level":"info","ts":1688191057.765742,"logger":"socketface","msg":"face started","id":16455,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191057.7660255,"logger":"iface","msg":"face created","id":16455,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:41477","remote":"127.0.0.1:48279","scheme":"udp"}}
{"level":"info","ts":1688191068.5392635,"logger":"socketface","msg":"face started","id":61160,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191068.5397217,"logger":"iface","msg":"face created","id":61160,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:57130","remote":"127.0.0.1:58155","scheme":"udp"}}
{"level":"info","ts":1688191075.2532852,"logger":"socketface","msg":"face started","id":13971,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191075.2537887,"logger":"iface","msg":"face created","id":13971,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:54485","remote":"127.0.0.1:51776","scheme":"udp"}}
{"level":"info","ts":1688191083.9833024,"logger":"socketface","msg":"face started","id":30255,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191083.9837615,"logger":"iface","msg":"face created","id":30255,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:49270","remote":"127.0.0.1:34813","scheme":"udp"}}
{"level":"info","ts":1688191096.703586,"logger":"socketface","msg":"face started","id":58196,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191096.7040253,"logger":"iface","msg":"face created","id":58196,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:39310","remote":"127.0.0.1:60808","scheme":"udp"}}
{"level":"info","ts":1688191558.075316,"logger":"socketface","msg":"face started","id":18527,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191558.0757403,"logger":"iface","msg":"face created","id":18527,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:60971","remote":"127.0.0.1:52906","scheme":"udp"}}
{"level":"info","ts":1688191572.6471715,"logger":"socketface","msg":"face started","id":42675,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191572.6476243,"logger":"iface","msg":"face created","id":42675,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:44392","remote":"127.0.0.1:41843","scheme":"udp"}}
{"level":"info","ts":1688191573.5770552,"logger":"socketface","msg":"face started","id":9347,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191573.5773072,"logger":"iface","msg":"face created","id":9347,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:35943","remote":"127.0.0.1:51147","scheme":"udp"}}
{"level":"info","ts":1688191583.9136796,"logger":"socketface","msg":"face started","id":57236,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191583.9140513,"logger":"iface","msg":"face created","id":57236,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:59425","remote":"127.0.0.1:35161","scheme":"udp"}}
{"level":"info","ts":1688191590.656008,"logger":"socketface","msg":"face started","id":57633,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191590.656446,"logger":"iface","msg":"face created","id":57633,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:54546","remote":"127.0.0.1:57321","scheme":"udp"}}
{"level":"info","ts":1688191599.484888,"logger":"socketface","msg":"face started","id":42160,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191599.4853387,"logger":"iface","msg":"face created","id":42160,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:47841","remote":"127.0.0.1:48348","scheme":"udp"}}
{"level":"info","ts":1688191612.1521122,"logger":"socketface","msg":"face started","id":35820,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191612.1524894,"logger":"iface","msg":"face created","id":35820,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:47803","remote":"127.0.0.1:38409","scheme":"udp"}}
{"level":"info","ts":1688191624.910554,"logger":"socketface","msg":"face started","id":42698,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191624.9109883,"logger":"iface","msg":"face created","id":42698,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:51308","remote":"127.0.0.1:35704","scheme":"udp"}}
{"level":"info","ts":1688191637.6287007,"logger":"socketface","msg":"face started","id":31083,"rx-impl":"RxEpoll","tx-impl":"TxSyscall"}
{"level":"info","ts":1688191637.629073,"logger":"iface","msg":"face created","id":31083,"socket":0,"mtu":2048,"locator":{"local":"127.0.0.1:46051","remote":"127.0.0.1:60031","scheme":"udp"}}

and this is the NDN-DPDK service output on the server node:

Starting NDN-DPDK service on 127.0.0.1:3030...
{"level":"info","ts":1688191578.8080637,"logger":"main","msg":"NDN-DPDK service starting","version":"v0.0.0-20230606231922-08d7daa0e15a","uid":0,"linux":"5.15.0-76-generic","dpdk":"23.03.0","spdk":"23.05"}
Started NDN-DPDK service on 127.0.0.1:3030.
{"level":"info","ts":1688191578.8086262,"logger":"main","msg":"GraphQL HTTP server starting","listen":"127.0.0.1:3030"}
{"level":"info","ts":1688192783.6141207,"logger":"main","msg":"activate start","role":"forwarder"}
{"level":"info","ts":1688192783.6534517,"logger":"DPDK","msg":"EAL: 3979 hugepages of size 2097152 reserved, but no mounted hugetlbfs found for that size"}
{"level":"warn","ts":1688192783.6535046,"logger":"DPDK","msg":"EAL: No free 1048576 kB hugepages reported on node 0"}
{"level":"warn","ts":1688192783.8047256,"logger":"DPDK.0","msg":"TELEMETRY: No legacy callbacks, legacy socket not created"}
{"level":"info","ts":1688192783.8047407,"logger":"ealinit","msg":"EAL ready","args":"-l 0,1,2,3,4,5,6,7,8,9 --in-memory --single-file-segments -d /usr/local/lib/dpdk/pmds-23.1 -a 00:00.0","main":0,"workers":[1,2,3,4,5,6,7,8,9],"sockets":[0],"has-hugepages":true,"has-pci":true,"iova-mode":"PA","runtime-dir":"/var/run/dpdk/rte"}
{"level":"info","ts":1688192783.811956,"logger":"SPDK.0","msg":"SPDK logger ready @/home/ndn/ndn-dpdk/dpdk/spdkenv/init.go:14"}
{"level":"info","ts":1688192783.8292758,"logger":"spdkenv","msg":"SPDK thread created","name":"W0000000000002","th":75171360704,"spdk-thread":140393021242336}
{"level":"info","ts":1688192783.830233,"logger":"ealthread","msg":"lcore requested","role":"CRYPTO","req-socket":"any","lc":3}
{"level":"info","ts":1688192783.8302443,"logger":"ealthread","msg":"lcore requested","role":"FWD","req-socket":"any","lc":4}
{"level":"info","ts":1688192783.830247,"logger":"ealthread","msg":"lcore requested","role":"RX","req-socket":0,"lc":1}
{"level":"info","ts":1688192783.8302498,"logger":"ealthread","msg":"lcore requested","role":"TX","req-socket":0,"lc":2}
{"level":"info","ts":1688192783.8302562,"logger":"ealthread","msg":"lcore requested","role":"FWD","req-socket":"any","lc":5}
{"level":"info","ts":1688192783.8302584,"logger":"ealthread","msg":"lcore requested","role":"TX","req-socket":"any","lc":6}
{"level":"info","ts":1688192783.8302605,"logger":"ealthread","msg":"lcore requested","role":"RX","req-socket":"any","lc":7}
{"level":"info","ts":1688192783.8302622,"logger":"ealthread","msg":"lcore requested","role":"FWD","req-socket":"any","lc":8}
{"level":"info","ts":1688192783.8302712,"logger":"ealthread","msg":"lcore requested","role":"FWD","req-socket":"any","lc":9}
{"level":"info","ts":1688192783.837659,"logger":"iface","msg":"TxLoop created","txl-ptr":75171294784}
{"level":"info","ts":1688192783.8376818,"logger":"iface","msg":"TxLoop created","txl-ptr":75171294592}
{"level":"info","ts":1688192783.855606,"logger":"pcct","msg":"init","pcct":75226939008,"mp":75225882048,"token-ht":75229989824}
{"level":"info","ts":1688192783.8556378,"logger":"pit","msg":"init","pcct":75226939008,"pit":75226939040}
{"level":"info","ts":1688192783.8556604,"logger":"cs","msg":"init","pcct":75226939008,"cs":75226939144,"cap-memory":32767,"cap-disk":0,"cap-indirect":32767}
{"level":"info","ts":1688192783.855688,"logger":"MinTmr","msg":"New","sched":"0x11840d8a40","slots":"4096","interval":"113333333","cb":"0xaedef0"}
{"level":"info","ts":1688192784.0809293,"logger":"pcct","msg":"init","pcct":75426168320,"mp":75425111360,"token-ht":77234867136}
{"level":"info","ts":1688192784.0809577,"logger":"pit","msg":"init","pcct":75426168320,"pit":75426168352}
{"level":"info","ts":1688192784.0809772,"logger":"cs","msg":"init","pcct":75426168320,"cs":75426168456,"cap-memory":32767,"cap-disk":0,"cap-indirect":32767}
{"level":"info","ts":1688192784.080996,"logger":"MinTmr","msg":"New","sched":"0x11fb8d8a40","slots":"4096","interval":"113333333","cb":"0xaedef0"}
{"level":"info","ts":1688192784.0955,"logger":"pcct","msg":"init","pcct":77443628672,"mp":77442571712,"token-ht":77446679488}
{"level":"info","ts":1688192784.0955122,"logger":"pit","msg":"init","pcct":77443628672,"pit":77443628704}
{"level":"info","ts":1688192784.0955272,"logger":"cs","msg":"init","pcct":77443628672,"cs":77443628808,"cap-memory":32767,"cap-disk":0,"cap-indirect":32767}
{"level":"info","ts":1688192784.0955389,"logger":"MinTmr","msg":"New","sched":"0x12082d8a40","slots":"4096","interval":"113333333","cb":"0xaedef0"}
{"level":"info","ts":1688192784.1104703,"logger":"pcct","msg":"init","pcct":77655441024,"mp":77654384064,"token-ht":77658491840}
{"level":"info","ts":1688192784.1104848,"logger":"pit","msg":"init","pcct":77655441024,"pit":77655441056}
{"level":"info","ts":1688192784.1104994,"logger":"cs","msg":"init","pcct":77655441024,"cs":77655441160,"cap-memory":32767,"cap-disk":0,"cap-indirect":32767}
{"level":"info","ts":1688192784.1105113,"logger":"MinTmr","msg":"New","sched":"0x1214cd8a40","slots":"4096","interval":"113333333","cb":"0xaedef0"}
{"level":"info","ts":1688192784.1174204,"logger":"eal","msg":"vdev initialized","name":"crypto_openssl_W000000000001M","args":"max_nb_queue_pairs=1,socket_id=0","socket":0}
{"level":"info","ts":1688192784.1175442,"logger":"iface","msg":"RxLoop created","rxl-ptr":77657749632}
{"level":"info","ts":1688192784.1175604,"logger":"iface","msg":"RxLoop created","rxl-ptr":77856963328}
{"level":"info","ts":1688192784.1177733,"logger":"main","msg":"activate success","role":"forwarder"}
{"level":"info","ts":1688192784.1204567,"logger":"FwCrypto.3","msg":"Run","fwc":"0x1220a7ac40","input":"0x1220a7a780","cryptodev":"0-0"}
{"level":"info","ts":1688192784.1244743,"logger":"FwFwd.4","msg":"Run","fwd-id":"0","fwd":"0x11808eeb40","fib":"0x1220bfff40","pit":"0x1183dffea0","cs":"0x1183dfff08","crypto":"0x1220a7a780"}
{"level":"info","ts":1688192784.1284695,"logger":"FwFwd.5","msg":"Run","fwd-id":"1","fwd":"0x11eb67a8c0","fib":"0x1220bfff40","pit":"0x118fbffe20","cs":"0x118fbffe88","crypto":"0x1220a7a780"}
{"level":"info","ts":1688192784.148493,"logger":"FwFwd.8","msg":"Run","fwd-id":"2","fwd":"0x11fb8d8800","fib":"0x1220bfff40","pit":"0x1207fffea0","cs":"0x1207ffff08","crypto":"0x1220a7a780"}
{"level":"info","ts":1688192784.1485798,"logger":"FwFwd.9","msg":"Run","fwd-id":"3","fwd":"0x12082d8800","fib":"0x1220bfff40","pit":"0x12149ffea0","cs":"0x12149fff08","crypto":"0x1220a7a780"}
{"level":"info","ts":1688192789.1637714,"logger":"eal","msg":"PCI device probed","addr":"0000:00:13.0","args":""}
{"level":"info","ts":1688192789.2170014,"logger":"ethdev","msg":"ethdev started","id":0,"name":"0000:00:13.0","driver":"net_e1000_em","mtu":1500,"rxq":1,"txq":1,"promisc":true}
{"level":"info","ts":1688192789.217084,"logger":"ethport","msg":"port opened","port":0,"rxImpl":"RxTable"}
{"level":"info","ts":1688192792.4526293,"logger":"ethport","msg":"face started","port":0,"id":18116}
{"level":"info","ts":1688192792.452938,"logger":"iface","msg":"face created","id":18116,"socket":0,"mtu":1500,"locator":{"local":"7e:99:12:b3:4c:b1","remote":"a2:4c:24:ac:53:6c","scheme":"ether"}}
{"level":"warn","ts":1688192802.741061,"logger":"DPDK","msg":"rte_pmd_memif_probe(): Failed to register mp action callback: Operation not supported"}
{"level":"info","ts":1688192802.741079,"logger":"eal","msg":"vdev initialized","name":"net_memifW000000000002C","args":"socket-abstract=no,mac=F2:6D:65:6D:69:66,id=0,role=server,bsize=1500,rsize=10,socket=/run/ndn/fileserver.sock","socket":"any"}
{"level":"info","ts":1688192802.7415445,"logger":"ethport","msg":"port opened","port":1,"rxImpl":"RxMemif"}
{"level":"info","ts":1688192802.7526214,"logger":"ethdev","msg":"ethdev started","id":1,"name":"net_memifW000000000002C","driver":"net_memif","mtu":1500,"rxq":1,"txq":1,"promisc":false}
{"level":"info","ts":1688192802.7526505,"logger":"ethport","msg":"face started","port":1,"id":57680}
{"level":"info","ts":1688192802.752664,"logger":"iface","msg":"face created","id":57680,"socket":0,"mtu":1500,"locator":{"dataroom":1500,"id":0,"ringCapacity":1024,"role":"server","scheme":"memif","socketName":"/run/ndn/fileserver.sock"}}
ndndpdk-svc@127.0.0.1:3030.service: Main process exited, code=killed, status=11/SEGV
ndndpdk-svc@127.0.0.1:3030.service: Failed with result 'signal'.
ndndpdk-svc@127.0.0.1:3030.service: Scheduled restart job, restart counter is at 4.
Stopped NDN-DPDK service on 127.0.0.1:3030.
Starting NDN-DPDK service on 127.0.0.1:3030...
{"level":"info","ts":1688192869.5693808,"logger":"main","msg":"NDN-DPDK service starting","version":"v0.0.0-20230606231922-08d7daa0e15a","uid":0,"linux":"5.15.0-76-generic","dpdk":"23.03.0","spdk":"23.05"}
Started NDN-DPDK service on 127.0.0.1:3030.
{"level":"info","ts":1688192869.5700214,"logger":"main","msg":"GraphQL HTTP server starting","listen":"127.0.0.1:3030"}

is there any solution for this error?

yoursunny commented 1 year ago

ndndpdk-svc@127.0.0.1:3030.service: Main process exited, code=killed, status=11/SEGV

This is definitely a bug. There was a similar complaint from @sankalpatimilsina12 last month, also related to a forwarder running alongside fileserver, but it happens infrequently so I can't find what's wrong.

Please explain how you setup everything: topology, activation parameters, face creation parameters, etc. If I can reproduce the bug, I would be able to fix it.

Syafaatdehaf commented 1 year ago

The things I prepared to start the NDN-DPDK

yoursunny commented 1 year ago

I can consistently reproduce this bug using the example procedure in fileserver.md. while ndncat file-client /fileserver/usr-local-bin /tmp/usr-local-bin-retrieved; do sleep 1; done would cause the crash within 5 iterations.

I'm able to obtain a full stacktrace if I run sudo gdb ndndpdk-svc rather than attaching to the process within systemd service:

Thread 19 "rte-worker-11" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff957fa600 (LWP 648610)]
CsEraseBatch_Append (kind=0xea827d "direct", entry=0x19881e580, peb=0x7fff957f4df0) at ../csrc/pcct/cs.c:12
12        PccEntry_RemoveCsEntry(pccEntry);
(gdb) bt full
#0  CsEraseBatch_Append (kind=0xea827d "direct", entry=0x19881e580, peb=0x7fff957f4df0) at ../csrc/pcct/cs.c:12
        pccEntry = 0x0
        pccEntry = <optimized out>
#1  CsEraseBatch_AddDirect (entry=0x19881e580, peb=0x7fff957f4df0) at ../csrc/pcct/cs.c:42
        cs = <optimized out>
        cs = <optimized out>
        i = <optimized out>
        indirect = <optimized out>
#2  Cs_EvictEntryDirect (entry=0x19881e580, ctx=140735701536240) at ../csrc/pcct/cs.c:73
No locals.
#3  0x0000000000ae434f in CsList_EvictBulk (csl=csl@entry=0x19d7af378, max=max@entry=64, cb=cb@entry=0xae4cf0 <Cs_EvictEntryDirect>, 
    ctx=ctx@entry=140735701536240) at ../csrc/pcct/cs-list.c:24
        entry = <optimized out>
        i = 24
        nErase = 64
        node = 0x19881da00
#4  0x0000000000ae43d9 in Cs_Evict (cs=0x19d7af308, csl=0x19d7af378, cslName=<optimized out>, evictCb=0xae4cf0 <Cs_EvictEntryDirect>)
    at ../csrc/pcct/cs.c:80
        peb = {pcct = 0x19d7af280, nEntries = 24, objs = {0x19881e340, 0x19881d7c0, 0x19881dd80, 0x198292a40, 0x1995892c0, 0x199588740, 0x199587bc0, 
            0x199587040, 0x199581a00, 0x197f77900, 0x197f78480, 0x19828bd00, 0x19821fa40, 0x1982205c0, 0x1983c2ac0, 0x198221cc0, 0x198398fc0, 
            0x198399b40, 0x1998f3500, 0x1998f2980, 0x1983fce40, 0x1983fd9c0, 0x1983fe540, 0x19d7af3a0, 0x0 <repeats 168 times>}}
#5  0x0000000000acf429 in FwFwd_RxBurst (fwd=fwd@entry=0x19daafd80, pktType=PktData, q=q@entry=0x19daafdf0, process=<optimized out>, 
    process=<optimized out>, pktType=PktData) at ../csrc/fwdp/fwd.c:46
        ctx = {fwd = 0x19daafd80, rxTime = 1450303988849089, eventKind = SGEVT_DATA, nhFlt = 4294967295, {npkt = 0x1fe3f7c40, pkt = 0x1fe3f7c40}, 
          fibEntry = 0x19718a040, fibEntryDyn = 0x19718a340, pitEntry = 0x198836100, endofSgCtx = 0x7fff957f5488, pitUp = 0x0, rxToken = {
            length = 7 '\a', value = "\273\030\001\000\000\000\001", '\000' <repeats 24 times>}, dnNonce = 0, nForwarded = 0, rxFace = 46845}
        timeSinceRx = <optimized out>
        i = 0
        now = 1450303988850409
        pkts = {0x1fe3f7c40, 0x1fbef9780, 0x1fbefbe40, 0x1fbefe500, 0x1fbf00bc0, 0x1fbf03280, 0x1fbf05940, 0x1fbf08000, 0x1fbf0a6c0, 0x1fbf0cd80, 
          0x1fbf0f440, 0x0 <repeats 45 times>, 0x2, 0x0, 0x0, 0x1, 0x7ffff7facfb0, 0x19daafda4, 0xb, 0x7fff957f5740}
        pop = {count = 1, drop = <optimized out>}
#6  0x0000000000acfa52 in FwFwd_Run (fwd=0x19daafd80) at ../csrc/fwdp/fwd.c:67
        nProcessed = 0
#7  0x00007ffff7da8175 in eal_thread_loop (arg=<optimized out>) at ../lib/eal/common/eal_common_thread.c:210
        f = <optimized out>
        fct_arg = <optimized out>
        lcore_id = 11
        cpuset = "11", '\000' <repeats 253 times>
        ret = <optimized out>
#8  0x00007ffff7dcbf2f in eal_worker_thread_loop (arg=<optimized out>) at ../lib/eal/linux/eal.c:915
No locals.
#9  0x00007ffff74b5b43 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x00007ffff7547a00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

This bug occurs only if NDN-DPDK is built in release mode. It does not occur when built in debug mode. It does not occur if I delete either -DNDEBUG or -DN_LOG_LEVEL=RTE_LOG_NOTICE from the build options. https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/mk/cflags.sh#L6-L8

I investigated a little and found that PccSlot.pccEntry field is unexpected zeroized, but I haven't figured out why.

yoursunny commented 1 year ago

I reduced the batch size of CsEraseBatch and PcctEraseBatch both to 1, making them erasing one entry at a time. https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/container/cs/enum.go#L9 https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/csrc/pcct/pcct.h#L84 The bug persists, but I'm getting different stacktraces.

Thread 20 "rte-worker-11" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffad49b600 (LWP 891251)]
0x0000000000ae4bea in PccEntry_RemoveCsEntry (entry=0x4e2000000cf1) at ../csrc/pcct/pcc-entry.h:218
218       PccEntry_ClearSlot_(entry, entry->csEntrySlot);
(gdb) bt full
#0  0x0000000000ae4bea in PccEntry_RemoveCsEntry (entry=0x4e2000000cf1) at ../csrc/pcct/pcc-entry.h:218
No locals.
#1  CsEraseBatch_Append (kind=<synthetic pointer>, entry=0x1941acff0, peb=0x7fffad495df0) at ../csrc/pcct/cs.c:12
        pccEntry = 0x4e2000000cf1
        pccEntry = <optimized out>
#2  CsEraseBatch_AddDirect (entry=0x1941ad038, peb=0x7fffad495df0) at ../csrc/pcct/cs.c:37
        indirect = 0x1941acff0
        i = 0
        cs = <optimized out>
        cs = <optimized out>
        i = <optimized out>
        indirect = <optimized out>
#3  Cs_EvictEntryDirect (entry=0x1941ad038, ctx=140736100654576) at ../csrc/pcct/cs.c:73
No locals.
#4  0x0000000000ae41ff in CsList_EvictBulk (csl=csl@entry=0x1941ad038, max=max@entry=1, cb=cb@entry=0xae4ba0 <Cs_EvictEntryDirect>,
    ctx=ctx@entry=140736100654576) at ../csrc/pcct/cs-list.c:24
        entry = <optimized out>
        i = 0
        nErase = 1
        node = 0x1941ad038
#5  0x0000000000ae4289 in Cs_Evict (cs=0x1941acfc8, csl=0x1941ad038, cslName=<optimized out>, evictCb=0xae4ba0 <Cs_EvictEntryDirect>)
    at ../csrc/pcct/cs.c:80
        peb = {pcct = 0x1941acf40, nEntries = 0, objs = {0x0 <repeats 192 times>}}
#6  0x0000000000acf2d9 in FwFwd_RxBurst (fwd=fwd@entry=0x1944ada40, pktType=PktData, q=q@entry=0x1944adab0, process=<optimized out>,
    process=<optimized out>, pktType=PktData) at ../csrc/fwdp/fwd.c:46
        ctx = {fwd = 0x1944ada40, rxTime = 1650592312430185, eventKind = SGEVT_DATA, nhFlt = 4294967295, {npkt = 0x22845c0c0, pkt = 0x22845c0c0},
          fibEntry = 0x18db87d00, fibEntryDyn = 0x18db88000, pitEntry = 0x18fed5f80, endofSgCtx = 0x7fffad496488, pitUp = 0x0, rxToken = {length = 7 '\a', value = "\022\227\001\000\000\000\001", '\000' <repeats 24 times>}, dnNonce = 0, nForwarded = 0, rxFace = 44463}
        timeSinceRx = <optimized out>
        i = 0
        now = 1650592312431553
        pkts = {0x22845c0c0, 0x0 <repeats 55 times>, 0x2, 0x0, 0x0, 0x1, 0x7ffff7facfb0, 0x1944ada64, 0xb, 0x7fffad496740}
        pop = {count = 1, drop = <optimized out>}
#7  0x0000000000acf902 in FwFwd_Run (fwd=0x1944ada40) at ../csrc/fwdp/fwd.c:67
        nProcessed = 0
#8  0x00007ffff7da8175 in eal_thread_loop (arg=<optimized out>) at ../lib/eal/common/eal_common_thread.c:210
        f = <optimized out>
        fct_arg = <optimized out>
        lcore_id = 11
        cpuset = "11", '\000' <repeats 253 times>
        ret = <optimized out>
#9  0x00007ffff7dcbf2f in eal_worker_thread_loop (arg=<optimized out>) at ../lib/eal/linux/eal.c:915
No locals.
#10 0x00007ffff74b5b43 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#11 0x00007ffff7547a00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
Thread 15 "rte-worker-10" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa37fe600 (LWP 895071)]
CsEntry_Disassoc (indirect=0x17c9c77a0) at ../csrc/pcct/cs-entry.h:122
122       for (; i < direct->nIndirects; ++i) {
(gdb) bt full
#0  CsEntry_Disassoc (indirect=0x17c9c77a0) at ../csrc/pcct/cs-entry.h:122
        direct = 0x0
        i = 0 '\000'
        direct = <optimized out>
        i = <optimized out>
#1  CsEntry_Clear (entry=0x17c9c77a0) at ../csrc/pcct/cs-entry.h:147
No locals.
#2  CsArc_MoveHandler (entry=0x17c9c77a0, src=<optimized out>, dst=<optimized out>, ctx=0) at ../csrc/pcct/cs-arc.c:36
No locals.
#3  0x0000000000ae3500 in CsArc_Replace (isB2=false, arc=0x17c9c7708) at ../csrc/pcct/cs-arc.c:80
        moving = <optimized out>
        moving = <optimized out>
#4  CsArc_Replace (isB2=false, arc=0x17c9c7708) at ../csrc/pcct/cs-arc.c:77
        moving = <optimized out>
#5  CsArc_AddNew (foundIn=<optimized out>, entry=0x1778cf240, arc=0x17c9c7708) at ../csrc/pcct/cs-arc.c:120
        deleting = <optimized out>
        nL1 = <optimized out>
        nL1 = <optimized out>
        deleting = <optimized out>
        deleting = <optimized out>
        nL1L2 = <optimized out>
        deleting = <optimized out>
#6  CsArc_Add (arc=0x17c9c7708, entry=0x1778cf240) at ../csrc/pcct/cs-arc.c:170
        foundIn = <optimized out>
#7  0x0000000000ae50f3 in Cs_PutDirect (cs=0x17c9c7708, npkt=0x20bda0940, pccEntry=0x1778cf040) at ../csrc/pcct/cs.c:154
        pkt = 0x20bda0940
        data = 0x20bda09e8
        entry = 0x1778cf240
#8  0x0000000000ae5485 in Cs_Insert (cs=0x17c9c7708, npkt=0x20bda0940, pitFound=...) at ../csrc/pcct/cs.c:263
        pcct = 0x17c9c7680
        pkt = 0x20bda0940
        data = 0x20bda09e8
        pccEntry = <optimized out>
        interest = 0x1f6114268
        direct = 0x0
#9  0x0000000000acf2d9 in FwFwd_RxBurst (fwd=fwd@entry=0x17ccc8180, pktType=PktData, q=q@entry=0x17ccc81f0, process=<optimized out>, process=<optimized out>, pktType=PktData) at ../csrc/fwdp/fwd.c:46
        ctx = {fwd = 0x17ccc8180, rxTime = 1653443313747001, eventKind = SGEVT_DATA, nhFlt = 4294967295, {npkt = 0x20bda0940, pkt = 0x20bda0940}, fibEntry = 0x18db87c40, fibEntryDyn = 0x18db87e80, pitEntry = 0x1778cf240, endofSgCtx = 0x7fffa37f9488, pitUp = 0x0, rxToken = {length = 7 '\a', value = "\346\320\004", '\000' <repeats 28 times>}, dnNonce = 0, nForwarded = 0, rxFace = 13886}
        timeSinceRx = <optimized out>
        i = 0
        now = 1653443313748897
        pkts = {0x20bda0940, 0x22d8183c0, 0x22d7d5dc0, 0x22d7d3700, 0x22d7d1040, 0x204d36040, 0x204d38700, 0x204d3adc0, 0x204d3d480, 0x204d3fb40, 0x204d42200, 0x204d448c0, 0x204d46f80, 0x204d49640, 0x64, 0x7ffff74ada61 <_IO_do_write+177>, 0x100, 0x0, 0xc792, 0x176b0f9d0, 0x176bb94c0, 0x176b0f9c0, 0x1777, 0x7ffff7ba3b8d <__rte_hash_del_key_with_hash+1917>, 0x3063643234626438, 0x7fff00000000, 0x42e17c9c7680, 0x1768b93c0, 0x1, 0x41f67953d72e0000, 0x7ffff7e0a400 <lcore_config+1920>, 0x177595080, 0x17c9c7680, 0x7fffa37f9690, 0x1, 0x0, 0x7ffff7e0a400 <lcore_config+1920>, 0xae70f3 <Pcct_RemoveToken+83>, 0x4c77e, 0x41f67953d72e0000, 0x7ffff7e0a400 <lcore_config+1920>, 0x177595080, 0x7fffa37f9680, 0xae714d <PcctEraseBatch_EraseBurst_+61>, 0x1ff, 0x0, 0x6000000000000000, 0xae8d77 <PitEntry_Finalize+503>, 0x3000000018, 0x1767bb380, 0x7fffa37f9670, 0x7fffa37f9690, 0x1768a9240, 0xa, 0x7fffa37f9740, 0x7ffff7e0a400 <lcore_config+1920>, 0x2, 0x0, 0x17c9c7680, 0x1, 0x7ffff7facfb0, 0x17ccc81a4, 0xa, 0x7fffa37f9740}
        pop = {count = 1, drop = <optimized out>}
#10 0x0000000000acf902 in FwFwd_Run (fwd=0x17ccc8180) at ../csrc/fwdp/fwd.c:67
        nProcessed = 0
#11 0x00007ffff7da8175 in eal_thread_loop (arg=<optimized out>) at ../lib/eal/common/eal_common_thread.c:210
        f = <optimized out>
        fct_arg = <optimized out>
        lcore_id = 10
        cpuset = "10", '\000' <repeats 253 times>
        ret = <optimized out>
#12 0x00007ffff7dcbf2f in eal_worker_thread_loop (arg=<optimized out>) at ../lib/eal/linux/eal.c:915
No locals.
#13 0x00007ffff74b5b43 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x00007ffff7547a00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
(gdb) p *indirect
$4 = {prev = 0x1791f51c0, next = 0x17c9c7730, pccEntry = 0x4e2000000013, zeroizeBegin_ = 0x17c9c77b8, {data = 0x0, diskSlot = 0, direct = 0x0},
  freshUntil = 0, kind = CsEntryIndirect, nIndirects = 236 '\354', arcList = CslDirectB1, zeroizeEnd_ = 0x17c9c77d0, indirect = {0x0, 0x0, 0x0, 0x0},
  diskStored = {pktLen = 0, saveTotal = 0, saveLen = {0, 0, 32768, 0, 0, 0, 4736, 63456, 32767, 0, 21824, 31884, 1, 0, 30912, 31900, 1, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 51272, 63455, 32767, 0, 1},
    headTail = "\000\000\000\000\000\000\300\000\000\000\000\000\000\000\000\000\000@\001\000\000\000\000\000\000@\000\000\000\000"}}
(gdb) p *(CsList*)indirect
$5 = {prev = 0x1791f51c0, next = 0x17c9c7730, count = 19, capacity = 20000}

It appears that the code is somehow mixing up CsEntry* and CsList* types. It is trying to erase CsList* pointer as if it is a CsEntry*. Consequently, it is interpreting invalid memory location as PccEntry* and that causes the crash.

yoursunny commented 1 year ago

So far I've found one logic error here: https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/csrc/pcct/cs.c#L201-L203

This snippet is invoked in the following condition:

  1. An initial Interest /A/1 CanBePrefix=0 MustBeFresh=1 brought back Data with same name /A/1, so that there's a direct entry at /A/1.
  2. The initial cached Data has exceeded its FreshnessPeriod.
  3. A new Interest /A/1 CanBePrefix=1 MustBeFresh=1 has a cache miss due to violating MustBeFresh.
  4. The Data reply has name /A/1/Z, so that we need to insert a direct entry at /A/1/Z and an indirect entry at /A/1.
  5. Since the PCCT design allows only one CS entry at each name, the initial direct entry at /A/1 shall be replaced with an indirect entry at /A/1.

In the quoted snippet, the mbuf on the direct entry is released via CsEntry_Clear, but the entry is not removed from its previous CsList. Effectively, the same entry would belong to both a direct CsList (T1 or T2) and the indirect CsList. The fix is calling CsArc_Remove in this snippet.

However, I'm unsure whether this is causing the crash during fileserver usage. As described above, the snippet is invoked only if the same Interest name could bring back both an exact-name Data and a longer-name Data, but the file server does not produce such Data.

yoursunny commented 1 year ago

I performed two 1-hour tests after 29161b8928d9de7e24cfceab07e96046d1d68060. The crash did not occur. Thus, I believe this bug is now fixed.