Closed eugeneia closed 10 months ago
ping @alexandergall
I can confirm this issue also pops up in a production environment. I don't think the problem is in the `resize` method, though. What seems to happen is that the table gets corrupted before the flow set's `expire_records` is called. The corruption overwrites the `HASH_MAX` value (as well as the key and value) of some entries, so when the table is walked by the `next_entry` method, there appear to be valid entries in the table. This can be verified by inserting an `assert(self.table.occupancy > 0)`.
What happens next is that a bogus record may look like an idle flow and is removed by a call to `remove_ptr()`. This triggers a table resize (shrink) because the occupancy is really zero. The "The key is already present in ctable" assertion is then triggered while the old table is copied to the new one, when multiple corrupt records exist that happen to have the same key.
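A simplified model of that failure mode (in Python, not Snabb's Lua `ctable`): the copy performed during a resize assumes every occupied slot in the old table holds a distinct key, so two corrupt records sharing the same bogus key trip the uniqueness assertion.

```python
def copy_entries(old_entries):
    """Rebuild a table from (key, value) pairs, asserting key uniqueness
    the way a resize copy does."""
    new_table = {}
    for key, value in old_entries:
        # The resize copy makes the same assumption: no key occurs twice
        # among the occupied slots of the old table.
        assert key not in new_table, "The key is already present in ctable"
        new_table[key] = value
    return new_table

# A healthy table copies fine...
copy_entries([(1, "a"), (2, "b")])

# ...but two corrupt records that happen to share a key do not.
try:
    copy_entries([(0, "corrupt"), (0, "corrupt")])
except AssertionError as e:
    print("assertion fired:", e)
```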
I was only able to reproduce the problem in a multi-process setup. I also have strong evidence that it is related to memory-mapped huge pages: the effect seems to disappear if `ctable` memory is allocated from non-huge-page memory, i.e. by setting `local try_huge_pages = false` in `lib.ctable`. This could indicate a conflict with the DMA packet memory of another process, which also uses huge pages.
The corruption occurs in chunks of 64 bytes. For example:

```
00 00 00 00 00 00 00 00 00 00 00 00 20 00 92 32
55 00 00 00 15 84 00 00 00 00 00 00 07 48 00 00
00 00 00 07 00 00 00 00 00 00 00 00 00 00 05 1C
00 00 01 1C 92 0E 32 04 FF 00 00 00 E2 4A 5E E1
```
I have identified this as a "completion queue entry" (CQE) as described in section 7.12.1.1 of the PRM. For example, the bytes `07 48` at offset 0x1C indicate `l4_ok`, `l3_ok`, `l2_ok`, no fragmentation, TCP header with ACK, IPv4. The high nibble of the last byte is the opcode, in this case `0xE`, which indicates "Responder error". The error syndrome is in byte 0x37; the value `0x04` indicates "Local Protection Error".
It is as yet unclear why this error is raised and why the CQE is posted to an address far outside the CQ. Also interesting is that this error is usually followed by more CQEs that carry `0x22` as the syndrome, which is a generic "Abort error", e.g.:

```
00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 99
55 00 00 00 15 84 00 00 00 00 00 00 07 48 00 00
00 00 00 07 00 00 00 00 00 00 00 00 00 00 05 1C
00 00 01 1C 00 0E 99 22 00 00 00 00 E2 4B 87 E1
```
The `wqe_counter` at offset 0x3C is monotonically increasing across subsequent CQEs, i.e. the errors are generated for a batch of WQEs.
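The fields discussed above can be picked straight out of the two 64-byte dumps. This is a byte-picking sketch, not a full CQE parser; the offsets (syndrome at 0x37, `wqe_counter` at 0x3C, opcode in the high nibble of byte 0x3F) are the ones identified from the PRM.

```python
def decode_err_cqe(cqe):
    """Extract the error-reporting fields of a 64-byte CQE."""
    assert len(cqe) == 64
    return {
        "opcode": cqe[0x3F] >> 4,                              # 0xE: Responder error
        "syndrome": cqe[0x37],                                 # 0x04 / 0x22
        "wqe_counter": int.from_bytes(cqe[0x3C:0x3E], "big"),  # per-WQE counter
    }

# The two dumps quoted above.
cqe1 = bytes.fromhex(
    "00000000000000000000000020009232"
    "55000000158400000000000007480000"
    "0000000700000000000000000000051C"
    "0000011C920E3204FF000000E24A5EE1")
cqe2 = bytes.fromhex(
    "00000000000000000000000020000099"
    "55000000158400000000000007480000"
    "0000000700000000000000000000051C"
    "0000011C000E992200000000E24B87E1")

print(decode_err_cqe(cqe1))  # opcode 0xE, syndrome 0x04 (Local Protection Error)
print(decode_err_cqe(cqe2))  # opcode 0xE, syndrome 0x22 (Abort), wqe_counter + 1
```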
It also seems that the corruption either occurs very early on and then only once, or not at all.
The source of the problem is that a non-clean shutdown of a Snabb process does not properly shut down the NIC. The NIC continues to receive packets and writes the CQEs to the physical memory pages that were assigned to it by the process that has exited. The same pages can be re-mapped by a new process, which leads to the corruption.
The generic shutdown mechanism has a provision to unset bus mastering even on a non-orderly shutdown of a worker process, in `lib.hardware.pci.shutdown()`. However, a trivial bug has prevented this mechanism from working. #1516 hopefully fixes this.
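Unsetting bus mastering amounts to clearing the Bus Master Enable bit (bit 2 of the 16-bit PCI command register at config-space offset 0x04), which stops the device from initiating any further DMA. A minimal sketch of the bit manipulation follows; the sysfs path in the comment is illustrative only and is not Snabb's actual shutdown code.

```python
BUS_MASTER = 1 << 2  # Bus Master Enable bit in the PCI command register

def clear_bus_master(command):
    """Return the PCI command-register value with bus mastering (DMA)
    disabled, leaving all other bits untouched."""
    return command & ~BUS_MASTER

# e.g. a typical enabled value 0x0406 becomes 0x0402
print(hex(clear_bus_master(0x0406)))

# On Linux this could be applied through the device's config-space file,
# roughly (illustrative, not tested against hardware):
#   with open("/sys/bus/pci/devices/<BDF>/config", "r+b") as f:
#       f.seek(0x04)
#       cmd = int.from_bytes(f.read(2), "little")
#       f.seek(0x04)
#       f.write(clear_bus_master(cmd).to_bytes(2, "little"))
```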
Hit this non-deterministic failure while hacking. This issue is a note in case it pops up again.