Closed Syafaatdehaf closed 1 year ago
ndndpdk-svc@127.0.0.1:3030.service: Main process exited, code=killed, status=11/SEGV
This is definitely a bug. There was a similar complaint from @sankalpatimilsina12 last month, also related to a forwarder running alongside fileserver, but it happens infrequently so I can't find what's wrong.
Please explain how you set everything up: topology, activation parameters, face creation parameters, etc. If I can reproduce the bug, I should be able to fix it.
Here is what I prepared to start NDN-DPDK:
I run NDN-DPDK in a Proxmox VM.
The topology is point-to-point.
The DPDK hugepages I set up are 8G in size.
The DPDK driver I use is IGB_UIO.
I started NDN-DPDK with this command: sudo ndndpdk-ctrl --gqlserver http://127.0.0.1:3030 systemd start
The interface I use is 00:13.0 on Node A (the server) and 00:13.0 on Node B (the client).
To activate NDN-DPDK as a forwarder, I use a JSON file like this: {}
To create the port, I used --pci 00:13.0 on both Node A and Node B, with MTU 1500.
At this point, ndnping works between Node A and Node B.
Next, I created a face to activate the file server with the command
FACEID=$(jq -n '{
scheme: "memif",
socketName: "/run/ndn/fileserver.sock",
id: 0,
role: "server",
dataroom: 1500
}' | ndndpdk-ctrl --gqlserver http://127.0.0.1:3030 create-face | tee /dev/stderr | jq -r .id)
Next, I started the file server service with the command
sudo ndndpdk-ctrl --gqlserver http://127.0.0.1:3031 systemd start
Next, I activated the file server from a JSON file with the command
sudo ndndpdk-ctrl --gqlserver http://127.0.0.1:3031/ activate-fileserver < activate/coba.json
The contents of the coba.json file:
{
"eal": {
"coresPerNuma": {
"0": 4,
"1": 0
},
"memPerNuma": {
"0": 4096,
"1": 0
},
"filePrefix": "fileserver"
},
"mempool": {
"DIRECT": {
"capacity": 65535,
"dataroom": 9146
},
"INDIRECT": {
"capacity": 65535
},
"PAYLOAD": {
"capacity": 65535,
"dataroom": 9146
}
},
"face": {
"scheme": "memif",
"socketName": "/run/ndn/fileserver.sock",
"id": 0,
"role": "client",
"dataroom": 1500
},
"fileServer": {
"nThreads": 1,
"mounts": [
{
"prefix": "/fileserver/home-ndn-coba-6s",
"path": "/home/ndn/coba/6s"
},
{
"prefix": "/fileserver/usr-local-lib",
"path": "/usr/local/lib"
},
{
"prefix": "/fileserver/usr-local-share",
"path": "/usr/local/share"
}
],
"segmentLen": 6144,
"uringCapacity": 4096
}
}
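As a quick sanity check on the two memif locators shown above (the one passed to create-face on the forwarder and the "face" section of coba.json), the following is a minimal Python sketch. It assumes that the two ends of a memif link must agree on scheme, socketName and id and must take opposite roles; the function name `memif_locators_pair_up` is hypothetical and not part of NDN-DPDK.

```python
def memif_locators_pair_up(forwarder_face, fileserver_face):
    """Return a list of problems; an empty list means the two ends look like a pair."""
    problems = []
    # Both ends are assumed to need the same scheme, socket file and interface id.
    for key in ("scheme", "socketName", "id"):
        if forwarder_face.get(key) != fileserver_face.get(key):
            problems.append(
                f"{key} differs: {forwarder_face.get(key)!r} vs {fileserver_face.get(key)!r}"
            )
    # One end is assumed to be the server and the other the client.
    roles = {forwarder_face.get("role"), fileserver_face.get("role")}
    if roles != {"server", "client"}:
        problems.append("roles must be one 'server' and one 'client'")
    return problems


# Values copied from the create-face command and from coba.json in this report.
forwarder_face = {
    "scheme": "memif",
    "socketName": "/run/ndn/fileserver.sock",
    "id": 0,
    "role": "server",
    "dataroom": 1500,
}
fileserver_face = {
    "scheme": "memif",
    "socketName": "/run/ndn/fileserver.sock",
    "id": 0,
    "role": "client",
    "dataroom": 1500,
}

print(memif_locators_pair_up(forwarder_face, fileserver_face))  # []
```

With the values from this report the check finds no mismatch, so the locators themselves do not appear to be the problem.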
When activation succeeds, the command outputs true.
Then I insert a FIB entry on the client node (Node B) with the command ndndpdk-ctrl insert-fib --name /fileserver --nh (enter the ID according to the output on the forwarder)
On the client side, I run export NDNTS_UPLINK=ndndpdk-udp:
On the client side, I run export NDNTS_NDNDPDK_GQLSERVER=http://127.0.0.1:3030/
On the client side, I then run alias ndncat='npx -y -p https://ndnts-nightly.ndn.today/cat.tgz ndncat'
I retrieve a file with the command ndncat get-segmented --ver=rdr /fileserver/home-ndn-coba-6s/hello.txt/output.m3u8 > /tmp/output.m3u8.retrieved
On success this prints nothing, and cat /tmp/output.m3u8.retrieved shows the contents of the file output.m3u8.
Sending a file of about 800 bytes succeeds, but when I try a file of approximately 2 MB, the error above appears.
I can consistently reproduce this bug using the example procedure in fileserver.md.
while ndncat file-client /fileserver/usr-local-bin /tmp/usr-local-bin-retrieved; do sleep 1; done
would cause the crash within 5 iterations.
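To record how many iterations it takes before the retrieval fails, the shell loop above can be mirrored by a small wrapper. This is only a sketch: `run_until_failure` is a hypothetical helper, and because `ndncat` is defined as a shell alias earlier in this thread, the example invocation in the comment spells out the underlying npx command instead.

```python
import subprocess
import time


def run_until_failure(argv, max_iterations=100, delay=1.0):
    """Run argv repeatedly.

    Returns the number of runs that exited with status 0 before the first
    non-zero exit, or max_iterations if every run succeeded.
    """
    for completed in range(max_iterations):
        result = subprocess.run(argv)
        if result.returncode != 0:
            return completed
        time.sleep(delay)
    return max_iterations


# Example invocation (not executed here; requires a running forwarder and file server):
# n = run_until_failure([
#     "npx", "-y", "-p", "https://ndnts-nightly.ndn.today/cat.tgz", "ndncat",
#     "file-client", "/fileserver/usr-local-bin", "/tmp/usr-local-bin-retrieved",
# ])
# print(f"{n} successful iterations before the first failure")
```

Note that a non-zero exit of the client only indicates that retrieval failed; whether the forwarder actually crashed still has to be confirmed from the service log or from gdb.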
I'm able to obtain a full stacktrace if I run sudo gdb ndndpdk-svc rather than attaching to the process within the systemd service:
Thread 19 "rte-worker-11" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff957fa600 (LWP 648610)]
CsEraseBatch_Append (kind=0xea827d "direct", entry=0x19881e580, peb=0x7fff957f4df0) at ../csrc/pcct/cs.c:12
12 PccEntry_RemoveCsEntry(pccEntry);
(gdb) bt full
#0 CsEraseBatch_Append (kind=0xea827d "direct", entry=0x19881e580, peb=0x7fff957f4df0) at ../csrc/pcct/cs.c:12
pccEntry = 0x0
pccEntry = <optimized out>
#1 CsEraseBatch_AddDirect (entry=0x19881e580, peb=0x7fff957f4df0) at ../csrc/pcct/cs.c:42
cs = <optimized out>
cs = <optimized out>
i = <optimized out>
indirect = <optimized out>
#2 Cs_EvictEntryDirect (entry=0x19881e580, ctx=140735701536240) at ../csrc/pcct/cs.c:73
No locals.
#3 0x0000000000ae434f in CsList_EvictBulk (csl=csl@entry=0x19d7af378, max=max@entry=64, cb=cb@entry=0xae4cf0 <Cs_EvictEntryDirect>,
ctx=ctx@entry=140735701536240) at ../csrc/pcct/cs-list.c:24
entry = <optimized out>
i = 24
nErase = 64
node = 0x19881da00
#4 0x0000000000ae43d9 in Cs_Evict (cs=0x19d7af308, csl=0x19d7af378, cslName=<optimized out>, evictCb=0xae4cf0 <Cs_EvictEntryDirect>)
at ../csrc/pcct/cs.c:80
peb = {pcct = 0x19d7af280, nEntries = 24, objs = {0x19881e340, 0x19881d7c0, 0x19881dd80, 0x198292a40, 0x1995892c0, 0x199588740, 0x199587bc0,
0x199587040, 0x199581a00, 0x197f77900, 0x197f78480, 0x19828bd00, 0x19821fa40, 0x1982205c0, 0x1983c2ac0, 0x198221cc0, 0x198398fc0,
0x198399b40, 0x1998f3500, 0x1998f2980, 0x1983fce40, 0x1983fd9c0, 0x1983fe540, 0x19d7af3a0, 0x0 <repeats 168 times>}}
#5 0x0000000000acf429 in FwFwd_RxBurst (fwd=fwd@entry=0x19daafd80, pktType=PktData, q=q@entry=0x19daafdf0, process=<optimized out>,
process=<optimized out>, pktType=PktData) at ../csrc/fwdp/fwd.c:46
ctx = {fwd = 0x19daafd80, rxTime = 1450303988849089, eventKind = SGEVT_DATA, nhFlt = 4294967295, {npkt = 0x1fe3f7c40, pkt = 0x1fe3f7c40},
fibEntry = 0x19718a040, fibEntryDyn = 0x19718a340, pitEntry = 0x198836100, endofSgCtx = 0x7fff957f5488, pitUp = 0x0, rxToken = {
length = 7 '\a', value = "\273\030\001\000\000\000\001", '\000' <repeats 24 times>}, dnNonce = 0, nForwarded = 0, rxFace = 46845}
timeSinceRx = <optimized out>
i = 0
now = 1450303988850409
pkts = {0x1fe3f7c40, 0x1fbef9780, 0x1fbefbe40, 0x1fbefe500, 0x1fbf00bc0, 0x1fbf03280, 0x1fbf05940, 0x1fbf08000, 0x1fbf0a6c0, 0x1fbf0cd80,
0x1fbf0f440, 0x0 <repeats 45 times>, 0x2, 0x0, 0x0, 0x1, 0x7ffff7facfb0, 0x19daafda4, 0xb, 0x7fff957f5740}
pop = {count = 1, drop = <optimized out>}
#6 0x0000000000acfa52 in FwFwd_Run (fwd=0x19daafd80) at ../csrc/fwdp/fwd.c:67
nProcessed = 0
#7 0x00007ffff7da8175 in eal_thread_loop (arg=<optimized out>) at ../lib/eal/common/eal_common_thread.c:210
f = <optimized out>
fct_arg = <optimized out>
lcore_id = 11
cpuset = "11", '\000' <repeats 253 times>
ret = <optimized out>
#8 0x00007ffff7dcbf2f in eal_worker_thread_loop (arg=<optimized out>) at ../lib/eal/linux/eal.c:915
No locals.
#9 0x00007ffff74b5b43 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x00007ffff7547a00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
This bug occurs only if NDN-DPDK is built in release mode.
It does not occur when built in debug mode.
It does not occur if I delete either -DNDEBUG or -DN_LOG_LEVEL=RTE_LOG_NOTICE from the build options.
https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/mk/cflags.sh#L6-L8
I investigated a little and found that the PccSlot.pccEntry field is unexpectedly zeroed, but I haven't figured out why.
I reduced the batch size of CsEraseBatch and PcctEraseBatch both to 1, so that they erase one entry at a time.
https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/container/cs/enum.go#L9
https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/csrc/pcct/pcct.h#L84
The bug persists, but I'm getting different stacktraces.
Thread 20 "rte-worker-11" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffad49b600 (LWP 891251)]
0x0000000000ae4bea in PccEntry_RemoveCsEntry (entry=0x4e2000000cf1) at ../csrc/pcct/pcc-entry.h:218
218 PccEntry_ClearSlot_(entry, entry->csEntrySlot);
(gdb) bt full
#0 0x0000000000ae4bea in PccEntry_RemoveCsEntry (entry=0x4e2000000cf1) at ../csrc/pcct/pcc-entry.h:218
No locals.
#1 CsEraseBatch_Append (kind=<synthetic pointer>, entry=0x1941acff0, peb=0x7fffad495df0) at ../csrc/pcct/cs.c:12
pccEntry = 0x4e2000000cf1
pccEntry = <optimized out>
#2 CsEraseBatch_AddDirect (entry=0x1941ad038, peb=0x7fffad495df0) at ../csrc/pcct/cs.c:37
indirect = 0x1941acff0
i = 0
cs = <optimized out>
cs = <optimized out>
i = <optimized out>
indirect = <optimized out>
#3 Cs_EvictEntryDirect (entry=0x1941ad038, ctx=140736100654576) at ../csrc/pcct/cs.c:73
No locals.
#4 0x0000000000ae41ff in CsList_EvictBulk (csl=csl@entry=0x1941ad038, max=max@entry=1, cb=cb@entry=0xae4ba0 <Cs_EvictEntryDirect>,
ctx=ctx@entry=140736100654576) at ../csrc/pcct/cs-list.c:24
entry = <optimized out>
i = 0
nErase = 1
node = 0x1941ad038
#5 0x0000000000ae4289 in Cs_Evict (cs=0x1941acfc8, csl=0x1941ad038, cslName=<optimized out>, evictCb=0xae4ba0 <Cs_EvictEntryDirect>)
at ../csrc/pcct/cs.c:80
peb = {pcct = 0x1941acf40, nEntries = 0, objs = {0x0 <repeats 192 times>}}
#6 0x0000000000acf2d9 in FwFwd_RxBurst (fwd=fwd@entry=0x1944ada40, pktType=PktData, q=q@entry=0x1944adab0, process=<optimized out>,
process=<optimized out>, pktType=PktData) at ../csrc/fwdp/fwd.c:46
ctx = {fwd = 0x1944ada40, rxTime = 1650592312430185, eventKind = SGEVT_DATA, nhFlt = 4294967295, {npkt = 0x22845c0c0, pkt = 0x22845c0c0},
fibEntry = 0x18db87d00, fibEntryDyn = 0x18db88000, pitEntry = 0x18fed5f80, endofSgCtx = 0x7fffad496488, pitUp = 0x0, rxToken = {length = 7 '\a', value = "\022\227\001\000\000\000\001", '\000' <repeats 24 times>}, dnNonce = 0, nForwarded = 0, rxFace = 44463}
timeSinceRx = <optimized out>
i = 0
now = 1650592312431553
pkts = {0x22845c0c0, 0x0 <repeats 55 times>, 0x2, 0x0, 0x0, 0x1, 0x7ffff7facfb0, 0x1944ada64, 0xb, 0x7fffad496740}
pop = {count = 1, drop = <optimized out>}
#7 0x0000000000acf902 in FwFwd_Run (fwd=0x1944ada40) at ../csrc/fwdp/fwd.c:67
nProcessed = 0
#8 0x00007ffff7da8175 in eal_thread_loop (arg=<optimized out>) at ../lib/eal/common/eal_common_thread.c:210
f = <optimized out>
fct_arg = <optimized out>
lcore_id = 11
cpuset = "11", '\000' <repeats 253 times>
ret = <optimized out>
#9 0x00007ffff7dcbf2f in eal_worker_thread_loop (arg=<optimized out>) at ../lib/eal/linux/eal.c:915
No locals.
#10 0x00007ffff74b5b43 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#11 0x00007ffff7547a00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
Thread 15 "rte-worker-10" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa37fe600 (LWP 895071)]
CsEntry_Disassoc (indirect=0x17c9c77a0) at ../csrc/pcct/cs-entry.h:122
122 for (; i < direct->nIndirects; ++i) {
(gdb) bt full
#0 CsEntry_Disassoc (indirect=0x17c9c77a0) at ../csrc/pcct/cs-entry.h:122
direct = 0x0
i = 0 '\000'
direct = <optimized out>
i = <optimized out>
#1 CsEntry_Clear (entry=0x17c9c77a0) at ../csrc/pcct/cs-entry.h:147
No locals.
#2 CsArc_MoveHandler (entry=0x17c9c77a0, src=<optimized out>, dst=<optimized out>, ctx=0) at ../csrc/pcct/cs-arc.c:36
No locals.
#3 0x0000000000ae3500 in CsArc_Replace (isB2=false, arc=0x17c9c7708) at ../csrc/pcct/cs-arc.c:80
moving = <optimized out>
moving = <optimized out>
#4 CsArc_Replace (isB2=false, arc=0x17c9c7708) at ../csrc/pcct/cs-arc.c:77
moving = <optimized out>
#5 CsArc_AddNew (foundIn=<optimized out>, entry=0x1778cf240, arc=0x17c9c7708) at ../csrc/pcct/cs-arc.c:120
deleting = <optimized out>
nL1 = <optimized out>
nL1 = <optimized out>
deleting = <optimized out>
deleting = <optimized out>
nL1L2 = <optimized out>
deleting = <optimized out>
#6 CsArc_Add (arc=0x17c9c7708, entry=0x1778cf240) at ../csrc/pcct/cs-arc.c:170
foundIn = <optimized out>
#7 0x0000000000ae50f3 in Cs_PutDirect (cs=0x17c9c7708, npkt=0x20bda0940, pccEntry=0x1778cf040) at ../csrc/pcct/cs.c:154
pkt = 0x20bda0940
data = 0x20bda09e8
entry = 0x1778cf240
#8 0x0000000000ae5485 in Cs_Insert (cs=0x17c9c7708, npkt=0x20bda0940, pitFound=...) at ../csrc/pcct/cs.c:263
pcct = 0x17c9c7680
pkt = 0x20bda0940
data = 0x20bda09e8
pccEntry = <optimized out>
interest = 0x1f6114268
direct = 0x0
#9 0x0000000000acf2d9 in FwFwd_RxBurst (fwd=fwd@entry=0x17ccc8180, pktType=PktData, q=q@entry=0x17ccc81f0, process=<optimized out>, process=<optimized out>, pktType=PktData) at ../csrc/fwdp/fwd.c:46
ctx = {fwd = 0x17ccc8180, rxTime = 1653443313747001, eventKind = SGEVT_DATA, nhFlt = 4294967295, {npkt = 0x20bda0940, pkt = 0x20bda0940}, fibEntry = 0x18db87c40, fibEntryDyn = 0x18db87e80, pitEntry = 0x1778cf240, endofSgCtx = 0x7fffa37f9488, pitUp = 0x0, rxToken = {length = 7 '\a', value = "\346\320\004", '\000' <repeats 28 times>}, dnNonce = 0, nForwarded = 0, rxFace = 13886}
timeSinceRx = <optimized out>
i = 0
now = 1653443313748897
pkts = {0x20bda0940, 0x22d8183c0, 0x22d7d5dc0, 0x22d7d3700, 0x22d7d1040, 0x204d36040, 0x204d38700, 0x204d3adc0, 0x204d3d480, 0x204d3fb40, 0x204d42200, 0x204d448c0, 0x204d46f80, 0x204d49640, 0x64, 0x7ffff74ada61 <_IO_do_write+177>, 0x100, 0x0, 0xc792, 0x176b0f9d0, 0x176bb94c0, 0x176b0f9c0, 0x1777, 0x7ffff7ba3b8d <__rte_hash_del_key_with_hash+1917>, 0x3063643234626438, 0x7fff00000000, 0x42e17c9c7680, 0x1768b93c0, 0x1, 0x41f67953d72e0000, 0x7ffff7e0a400 <lcore_config+1920>, 0x177595080, 0x17c9c7680, 0x7fffa37f9690, 0x1, 0x0, 0x7ffff7e0a400 <lcore_config+1920>, 0xae70f3 <Pcct_RemoveToken+83>, 0x4c77e, 0x41f67953d72e0000, 0x7ffff7e0a400 <lcore_config+1920>, 0x177595080, 0x7fffa37f9680, 0xae714d <PcctEraseBatch_EraseBurst_+61>, 0x1ff, 0x0, 0x6000000000000000, 0xae8d77 <PitEntry_Finalize+503>, 0x3000000018, 0x1767bb380, 0x7fffa37f9670, 0x7fffa37f9690, 0x1768a9240, 0xa, 0x7fffa37f9740, 0x7ffff7e0a400 <lcore_config+1920>, 0x2, 0x0, 0x17c9c7680, 0x1, 0x7ffff7facfb0, 0x17ccc81a4, 0xa, 0x7fffa37f9740}
pop = {count = 1, drop = <optimized out>}
#10 0x0000000000acf902 in FwFwd_Run (fwd=0x17ccc8180) at ../csrc/fwdp/fwd.c:67
nProcessed = 0
#11 0x00007ffff7da8175 in eal_thread_loop (arg=<optimized out>) at ../lib/eal/common/eal_common_thread.c:210
f = <optimized out>
fct_arg = <optimized out>
lcore_id = 10
cpuset = "10", '\000' <repeats 253 times>
ret = <optimized out>
#12 0x00007ffff7dcbf2f in eal_worker_thread_loop (arg=<optimized out>) at ../lib/eal/linux/eal.c:915
No locals.
#13 0x00007ffff74b5b43 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x00007ffff7547a00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
(gdb) p *indirect
$4 = {prev = 0x1791f51c0, next = 0x17c9c7730, pccEntry = 0x4e2000000013, zeroizeBegin_ = 0x17c9c77b8, {data = 0x0, diskSlot = 0, direct = 0x0},
freshUntil = 0, kind = CsEntryIndirect, nIndirects = 236 '\354', arcList = CslDirectB1, zeroizeEnd_ = 0x17c9c77d0, indirect = {0x0, 0x0, 0x0, 0x0},
diskStored = {pktLen = 0, saveTotal = 0, saveLen = {0, 0, 32768, 0, 0, 0, 4736, 63456, 32767, 0, 21824, 31884, 1, 0, 30912, 31900, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 51272, 63455, 32767, 0, 1},
headTail = "\000\000\000\000\000\000\300\000\000\000\000\000\000\000\000\000\000@\001\000\000\000\000\000\000@\000\000\000\000"}}
(gdb) p *(CsList*)indirect
$5 = {prev = 0x1791f51c0, next = 0x17c9c7730, count = 19, capacity = 20000}
It appears that the code is somehow mixing up the CsEntry* and CsList* types.
It is trying to erase a CsList* pointer as if it were a CsEntry*.
Consequently, it interprets an invalid memory location as a PccEntry*, and that causes the crash.
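The two gdb printouts above can be cross-checked with a little arithmetic. The following Python sketch assumes an x86-64 little-endian layout and that CsList's count and capacity are adjacent 32-bit fields occupying the same 8 bytes as CsEntry's pccEntry pointer (both structs start with prev and next); this is an inference from the gdb output, not something checked against the headers, and the function name is hypothetical.

```python
import struct
from typing import Tuple


def decode_as_cslist_counters(pcc_entry_value: int) -> Tuple[int, int]:
    """Reinterpret a 64-bit 'pccEntry' value as two little-endian uint32 fields."""
    raw = struct.pack("<Q", pcc_entry_value)      # the 8 bytes as stored in memory
    count, capacity = struct.unpack("<II", raw)   # assumed CsList.count, CsList.capacity
    return count, capacity


# pccEntry from `p *indirect` in the third stacktrace
print(decode_as_cslist_counters(0x4E2000000013))  # (19, 20000)
# pccEntry from frame #1 of the second stacktrace
print(decode_as_cslist_counters(0x4E2000000CF1))  # (3313, 20000)
```

The first result matches count = 19 and capacity = 20000 from `p *(CsList*)indirect`, which supports the reading that the bogus PccEntry* is really a pair of CsList counters.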
So far I've found one logic error here: https://github.com/usnistgov/ndn-dpdk/blob/99c0c942eb2a38041aef2202b864a0dcf5a22741/csrc/pcct/cs.c#L201-L203
This snippet is invoked in the following condition:
1. Interest /A/1 CanBePrefix=0 MustBeFresh=1 brought back Data with the same name /A/1, so that there's a direct entry at /A/1.
2. Interest /A/1 CanBePrefix=1 MustBeFresh=1 has a cache miss due to violating MustBeFresh.
3. That Interest brought back Data /A/1/Z, so that we need to insert a direct entry at /A/1/Z and an indirect entry at /A/1.
4. The direct entry at /A/1 shall be replaced with an indirect entry at /A/1.
In the quoted snippet, the mbuf on the direct entry is released via CsEntry_Clear, but the entry is not removed from its previous CsList.
Effectively, the same entry would belong to both a direct CsList (T1 or T2) and the indirect CsList.
The fix is calling CsArc_Remove in this snippet.
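To illustrate why an entry linked into two lists at once is dangerous, here is a toy Python model of an intrusive circular doubly-linked list. It is not NDN-DPDK's code: the `Node`, `append`, `remove`, `walk` and `demo` names are hypothetical, and whether the real crash follows exactly this path is an assumption. It only shows that appending a node to a second list without unlinking it from the first can make a walk over the first list step onto the second list's head, as if the head were an entry.

```python
class Node:
    """Intrusive doubly-linked node; a list head is a Node used as a sentinel."""

    def __init__(self, label):
        self.label = label
        self.prev = self
        self.next = self


def append(head, node):
    """Link node at the tail of the list owned by head (no unlinking is done)."""
    last = head.prev
    node.prev, node.next = last, head
    last.next = node
    head.prev = node


def remove(node):
    """Unlink node from the list it is currently on."""
    node.prev.next = node.next
    node.next.prev = node.prev
    node.prev = node.next = node


def walk(head, limit=10):
    """Follow next pointers from head until head is reached again (or limit)."""
    labels = []
    cur = head.next
    while cur is not head and len(labels) < limit:
        labels.append(cur.label)
        cur = cur.next
    return labels


def demo(remove_first):
    direct = Node("direct-list-head")
    indirect = Node("indirect-list-head")
    a, b, c = Node("a"), Node("b"), Node("c")
    for n in (a, b, c):
        append(direct, n)
    if remove_first:
        remove(b)          # analogous to the proposed fix: unlink before re-linking
    append(indirect, b)    # move b onto the other list
    return walk(direct)


print(demo(remove_first=True))        # ['a', 'c']
print(demo(remove_first=False)[:3])   # ['a', 'b', 'indirect-list-head']
```

Without the removal, the walk over the direct list never returns to its own head and instead visits the other list's head as though it were an entry, which resembles the CsList*-as-CsEntry* symptom seen in gdb.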
However, I'm unsure whether this is causing the crash during fileserver usage. As described above, the snippet is invoked only if the same Interest name could bring back both an exact-name Data and a longer-name Data, but the file server does not produce such Data.
I performed two 1-hour tests after 29161b8928d9de7e24cfceab07e96046d1d68060. The crash did not occur. Thus, I believe this bug is now fixed.
When I send NDN video streaming files using NDN-DPDK, I always get the output below:
When the "interest rejected" error is printed, the forwarder on the server side shuts down automatically, and this is the NDN-DPDK service output on the server node:
Is there any solution for this error?