Closed eugeneia closed 10 months ago
ping @alexandergall
I can confirm this issue also pops up in a production environment. I don't think the problem is in the `resize` method, though. What seems to happen is that the table gets corrupted before the flow set's `expire_records` is called. The corruption overwrites the `HASH_MAX` value (as well as the key and value) of some entries, so when the table is walked by the `next_entry` method, there appear to be valid entries in the table. This can be verified by inserting an `assert(self.table.occupancy > 0)`.
What happens next is that a bogus record may look like an idle flow and is removed by a call to `remove_ptr()`. This triggers a table resize (shrink) because the occupancy is really zero. The "The key is already present in ctable" assertion is then triggered while the old table is copied to the new one, when multiple corrupt records exist that happen to have the same key.
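A simplified model of that failure mode (in Python, not Snabb's Lua `ctable`): the copy performed during a resize assumes every occupied slot in the old table holds a distinct key, so two corrupt records sharing the same bogus key trip the uniqueness assertion.

```python
def copy_entries(old_entries):
    """Rebuild a table from (key, value) pairs, asserting key uniqueness
    the way a resize copy does."""
    new_table = {}
    for key, value in old_entries:
        # The resize copy makes the same assumption: no key occurs twice
        # among the occupied slots of the old table.
        assert key not in new_table, "The key is already present in ctable"
        new_table[key] = value
    return new_table

# A healthy table copies fine...
copy_entries([(1, "a"), (2, "b")])

# ...but two corrupt records that happen to share a key do not.
try:
    copy_entries([(0, "corrupt"), (0, "corrupt")])
except AssertionError as e:
    print("assertion fired:", e)
```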
I was only able to reproduce the problem in a multi-process setup. I also have strong evidence that it is related to memory-mapped huge pages: the effect seems to disappear if `ctable` memory is allocated from non-huge-page memory, i.e. by setting `local try_huge_pages = false` in `lib.ctable`. This could indicate a conflict with the DMA packet memory of another process, which also uses huge pages.
The corruption occurs in chunks of 64 bytes. For example:

```
00 00 00 00 00 00 00 00 00 00 00 00 20 00 92 32
55 00 00 00 15 84 00 00 00 00 00 00 07 48 00 00
00 00 00 07 00 00 00 00 00 00 00 00 00 00 05 1C
00 00 01 1C 92 0E 32 04 FF 00 00 00 E2 4A 5E E1
```
I have identified this as a "completion queue entry" (CQE) as described in section 7.12.1.1 of the PRM. For example, the bytes `07 48` at offset 0x1C indicate `l4_ok`, `l3_ok`, `l2_ok`, no fragmentation, TCP header with ACK, IPv4. The high nibble of the last byte is the opcode, in this case `0xE`, which indicates "Responder error". The error syndrome is in byte 0x37; the value `0x04` indicates "Local Protection Error".
It is as yet unclear why this error is raised and why the CQE is posted to an address far outside the CQ. Also interesting is that this error is usually followed by more CQEs that carry `0x22` as the syndrome, which is a generic "Abort error", e.g.:

```
00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 99
55 00 00 00 15 84 00 00 00 00 00 00 07 48 00 00
00 00 00 07 00 00 00 00 00 00 00 00 00 00 05 1C
00 00 01 1C 00 0E 99 22 00 00 00 00 E2 4B 87 E1
```
The `wqe_counter` at offset 0x3C is monotonically increasing across subsequent CQEs, i.e. the errors are generated for a batch of WQEs.
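The fields discussed above can be picked straight out of the two 64-byte dumps. This is a byte-picking sketch, not a full CQE parser; the offsets (syndrome at 0x37, `wqe_counter` at 0x3C, opcode in the high nibble of byte 0x3F) are the ones identified from the PRM.

```python
def decode_err_cqe(cqe):
    """Extract the error-reporting fields of a 64-byte CQE."""
    assert len(cqe) == 64
    return {
        "opcode": cqe[0x3F] >> 4,                              # 0xE: Responder error
        "syndrome": cqe[0x37],                                 # 0x04 / 0x22
        "wqe_counter": int.from_bytes(cqe[0x3C:0x3E], "big"),  # per-WQE counter
    }

# The two dumps quoted above.
cqe1 = bytes.fromhex(
    "00000000000000000000000020009232"
    "55000000158400000000000007480000"
    "0000000700000000000000000000051C"
    "0000011C920E3204FF000000E24A5EE1")
cqe2 = bytes.fromhex(
    "00000000000000000000000020000099"
    "55000000158400000000000007480000"
    "0000000700000000000000000000051C"
    "0000011C000E992200000000E24B87E1")

print(decode_err_cqe(cqe1))  # opcode 0xE, syndrome 0x04 (Local Protection Error)
print(decode_err_cqe(cqe2))  # opcode 0xE, syndrome 0x22 (Abort), wqe_counter + 1
```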
It also seems that the corruption either occurs very early on and then only once, or not at all.
The source of the problem is that a non-clean shutdown of a Snabb process does not properly shut down the NIC. The NIC continues to receive packets and writes the CQEs to the physical memory pages that were assigned to it by the process that has exited. The same pages can be re-mapped by a new process, which leads to the corruption.
The generic shutdown mechanism has a provision to unset bus mastering even on a non-orderly shutdown of a worker process, in `lib.hardware.pci.shutdown()`. However, a trivial bug has prevented this mechanism from working. #1516 hopefully fixes this.
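Unsetting bus mastering amounts to clearing the Bus Master Enable bit (bit 2 of the 16-bit PCI command register at config-space offset 0x04), which stops the device from initiating any further DMA. A minimal sketch of the bit manipulation follows; the sysfs path in the comment is illustrative only and is not Snabb's actual shutdown code.

```python
BUS_MASTER = 1 << 2  # Bus Master Enable bit in the PCI command register

def clear_bus_master(command):
    """Return the PCI command-register value with bus mastering (DMA)
    disabled, leaving all other bits untouched."""
    return command & ~BUS_MASTER

# e.g. a typical enabled value 0x0406 becomes 0x0402
print(hex(clear_bus_master(0x0406)))

# On Linux this could be applied through the device's config-space file,
# roughly (illustrative, not tested against hardware):
#   with open("/sys/bus/pci/devices/<BDF>/config", "r+b") as f:
#       f.seek(0x04)
#       cmd = int.from_bytes(f.read(2), "little")
#       f.seek(0x04)
#       f.write(clear_bus_master(cmd).to_bytes(2, "little"))
```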
Hit this non-deterministic failure while hacking. This issue is a note in case it pops up again.