intel / hyperscan

High-performance regular expression matching library
https://www.hyperscan.io
Other
4.84k stars 722 forks source link

Alignment issues when using database mmapped from a file #376

Open vstakhov opened 2 years ago

vstakhov commented 2 years ago

I have received several reports from Rspamd users about crashes when using Hyperscan database that is mmapped from a file (to reduce memory footprint when using the same database from multiple workers). One of the reports with some details could be found here.

In general, I load a serialized database, then dump it to a memory buffer (properly aligned) and then mmap and cast to hs_database_t. The pointer of the database itself is properly aligned on page boundary as guaranteed by mmap. So far, I'm not quite sure about what's going on in this case.

hongyang7 commented 2 years ago

In theory, maskBase (the fat-teddy table header) should be located at a 64-byte aligned address.

While in this problematic case, address of maskBase is 0x40(%r9),

(gdb) disassemble $pc-64,$pc+64
Dump of assembler code from 0x7fd9ab635e7f to 0x7fd9ab635eff:
   ......
=> 0x00007fd9ab635ebf <fdr_exec_fat_teddy_msks3+2751>:  vmovdqa 0x40(%r9),%ymm7
   ......

where %r9 is 0x7fd9a63551b0 (16-byte aligned):

(gdb) info registers 
......
r9             0x7fd9a63551b0      140572773142960
......

So maskBase is 16-byte aligned.

For further investigation, we wonder if they can provide:

  1. More information that can yield some addresses of fat-teddy structure for current issue case, such as:

    We want to see more results of address alignment for critical internal headers within database.

  2. Same address results under a normal run case (like in rspamd 3.3).

vstakhov commented 2 years ago

Would it help if they provide the whole database, as it can be easily extracted from the filesystem? The address of maskBase also lays within the mmapped region, so it might be easier to investigate.

hongyang7 commented 2 years ago

Ok, we can take a look at that first.

vstakhov commented 2 years ago

Well, I have received a file and another core from a different user. Attaching meaningful pieces:

   0x00007f6877bee7da <+4442>:  jne    0x7f6877bedf0a <fdr_exec_fat_teddy_msks3+2186>
   0x00007f6877bee7e0 <+4448>:  jmp    0x7f6877bedfb0 <fdr_exec_fat_teddy_msks3+2352>
   0x00007f6877bee7e5 <+4453>:  mov    -0x4(%rax,%rcx,1),%edx
   0x00007f6877bee7e9 <+4457>:  mov    %edx,-0x4(%r12,%rcx,1)
   0x00007f6877bee7ee <+4462>:  mov    (%rax),%eax
   0x00007f6877bee7f0 <+4464>:  mov    %eax,(%r12)
   0x00007f6877bee7f4 <+4468>:  vmovdqa 0x1c0(%rsp),%xmm1
   0x00007f6877bee7fd <+4477>:  movabs $0xf0f0f0f0f0f0f0f,%rax
   0x00007f6877bee807 <+4487>:  vinserti128 $0x1,%xmm1,%ymm1,%ymm1
   0x00007f6877bee80d <+4493>:  vinserti128 $0x1,%xmm5,%ymm5,%ymm5
   0x00007f6877bee813 <+4499>:  movl   $0xffffffff,0x138(%rsp)
   0x00007f6877bee81e <+4510>:  vmovq  %rax,%xmm0
   0x00007f6877bee823 <+4515>:  mov    0x180(%rsp),%rax
   0x00007f6877bee82b <+4523>:  vpsrlq $0x4,%ymm1,%ymm2
   0x00007f6877bee830 <+4528>:  mov    0x160(%rsp),%r14
   0x00007f6877bee838 <+4536>:  vpbroadcastq %xmm0,%ymm0
=> 0x00007f6877bee83d <+4541>:  vmovdqa 0x60(%rax),%ymm7
   0x00007f6877bee842 <+4546>:  vmovdqa 0x40(%rax),%ymm6
   0x00007f6877bee847 <+4551>:  vpand  %ymm0,%ymm1,%ymm1

%rax 0x7f6862c5daf0

hs_db is allocated at the address 0x7f6862c53000 and has size 47744 bytes (so the address 0x7f6862c5daf0 is within that space).

b8412d44b6968ea6395a1d75893b9fe32da815aeed1e2c3f56c287c030501663.unser.zip

Here is the file that caused this crash.

Please let me know if I can ask for more details. I've asked to add debuginfo for the Hyperscan library, so I hope it would help when this issue will be reproduced next time.

vstakhov commented 2 years ago

Some more addresses: https://dpaste.com/DSE4H6V9V

fdr = 0x7efe3c43d1d0
ftable = 0x7efe3c43d110

Full RoseEngine dump at frame 7:

(gdb) p *t
$2 = {pureLiteral = 0 '\000', noFloatingRoots = 0 '\000', requiresEodCheck = 1 '\001', hasOutfixesInSmallBlock = 1 '\001', runtimeImpl = 0 '\000', mpvTriggeredByLeaf = 0 '\000', canExhaust = 1 '\001', hasSom = 0 '\000',
  somHorizon = 0 '\000', mode = 1, historyRequired = 23, ekeyCount = 9, lkeyCount = 0, lopCount = 0, ckeyCount = 0, logicalTreeOffset = 0, combInfoMapOffset = 0, dkeyCount = 5, dkeyLogSize = 32, invDkeyOffset = 9668,
  somLocationCount = 0, somLocationFatbitSize = 32, rolesWithStateCount = 0, stateSize = 0, anchorStateSize = 1, tStateSize = 0, scratchStateSize = 1, smallWriteOffset = 31104, amatcherOffset = 9728, ematcherOffset = 24512,
  fmatcherOffset = 16576, drmatcherOffset = 0, sbmatcherOffset = 0, longLitTableOffset = 0, amatcherMinWidth = 3, fmatcherMinWidth = 2, eodmatcherMinWidth = 1, amatcherMaxBiAnchoredWidth = 4294967295,
  fmatcherMaxBiAnchoredWidth = 4294967295, reportProgramOffset = 1172, reportProgramCount = 13, delayProgramOffset = 6332, anchoredProgramOffset = 5108, activeArrayCount = 1, activeLeftCount = 0, queueCount = 1,
  activeQueueArraySize = 32, eagerIterOffset = 0, handledKeyCount = 0, handledKeyFatbitSize = 32, leftOffset = 0, roseCount = 0, eodProgramOffset = 9648, flushCombProgramOffset = 0, lastFlushCombProgramOffset = 0,
  lastByteHistoryIterOffset = 0, minWidth = 1, minWidthExcludingBoundaries = 1, maxBiAnchoredWidth = 4294967295, anchoredDistance = 23, anchoredMinDistance = 0, floatingDistance = 4294967295, floatingMinDistance = 0,
  smallBlockDistance = 32, floatingMinLiteralMatchOffset = 1, nfaInfoOffset = 30988, initialGroups = 1, floating_group_mask = 1, size = 198544, delay_count = 8, delay_fatbit_size = 32, anchored_count = 3,
  anchored_fatbit_size = 32, maxFloatingDelayedMatch = 4294967295, delayRebuildLength = 22, stateOffsets = {history = 4, exhausted = 27, exhausted_size = 2, logicalVec = 29, logicalVec_size = 0, combVec = 29, combVec_size = 0,
    activeLeafArray = 1, activeLeafArray_size = 1, activeLeftArray = 2, activeLeftArray_size = 0, leftfixLagTable = 2, anchorState = 2, groups = 3, groups_size = 1, longLitState = 2, longLitState_size = 0, somLocation = 0,
    somValid = 0, somWritable = 0, somMultibit_size = 0, nfaStateBegin = 29, end = 30}, boundary = {reportEodOffset = 0, reportZeroOffset = 0, reportZeroEodOffset = 0}, totalNumLiterals = 134, asize = 6848, outfixBeginQueue = 0,
  outfixEndQueue = 1, leftfixBeginQueue = 1, initMpvNfa = 4294967295, rosePrefixCount = 0, activeLeftIterOffset = 0, ematcherRegionSize = 23, somRevCount = 0, somRevOffsetOffset = 0, longLitStreamState = 0, state_init = {
    s_u64a_offset = 0, s_u64a_count = 0, s_u32_offset = 0, s_u32_count = 0, s_u16_offset = 0, s_u16_count = 0, s_u8_count = 1, s_u8_offset = 31040}}
hongyang7 commented 2 years ago
fdr = 0x7efe3c43d1d0
ftable = 0x7efe3c43d110

Well, from these 2 addresses, both 16-byte aligned, there might be something unexpected happen. Hyperscan compilation and (de)serialization should make sure these internal critical pointers all aligned with 64 bytes.

To get a further insight, it would be great if they can provide more info, especially about database header before and after deserialization:

struct hs_database {
    u32 magic;
    u32 version;
    u32 length;
    u64a platform;
    u32 crc32;
    u32 reserved0;
    u32 reserved1;
    u32 bytecode;    // offset relative to db start
    u32 padding[16];
    char bytes[];
};

We need below 4 values around above struct:

  1. All these 4 values can be printed at the end of functions https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/database.c#L243 and https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/database.c#L200 as I see the rspamd 3.4 code uses both of the 2 deserialization APIs.

  2. For comparison, we also need the same 4 values printed just after a database is created, at the end of function: https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/compiler/compiler.cpp#L471

All these results help us to see whether the 64-byte alignment guarantee works good.

vstakhov commented 2 years ago

Well, it seems I've figured out what's happening. The problem is that you have a structure that is aligned on a lower boundary requirement than it's individual fields. To fix it you insert some padding on deserialisation. However, this fix works merely for that specific alignment of a base structure but not for anything else.

For example, let's see what happens if you have a structure base that is aligned on 8 bytes bondary to the hs_deserialize_database_at. In this case, you insert some padding to have specific fields aligned by 32/64 bytes. Subsequently, if this structure is remapped to an address on 4k boundary, that efficiently moves those fields away from the proper alignment due to paddings inserted.

For my particular case, the solution was simple - to provide the same alignment to the hs_deserialize_database_at as it would be used afterwards after mmap call. However, it might be worth either to mention in the documentation or (even better) require alignment of the target structure at the boundary more or equal comparing to any of the enclosing members. The former option is more safe than artificial padding but it might break compatibility somehow...

hongyang7 commented 2 years ago

I think you're right about the root cause. I just ignored the influence of mmap as we rarely do the database movement/mapping thing like in the real production environment.

Let me put it more specific here.

The hyperscan database can be located at any alignment, but Hyperscan runtime must correctly work on many 64-byte aligned structures within the database. This is accomplished by determining the very first 64-byte aligned address in a database.

Specifically, the member char bytes[] of struct hs_database can be used to locate the head position of the real 64-byte-aligned space operated by Hyperscan runtime. Based on proper calculation (shown below), the real head position of the 64-byte-aligned space may fall into member u32 padding[16].

    // we need to align things manually
    size_t shift = (uintptr_t)db->bytes & 0x3f;
    db->bytecode = offsetof(struct hs_database, bytes) - shift; // this offset result may fall into the padding space
    char *bytecode = (char *)db + db->bytecode;
    assert(ISALIGNED_CL(bytecode));

This calculation will be surely done under 2 cases: when database is firstly generated, and when database is deserialized to some place. In either case, we imagine users won't move/copy the database object any more.

Theoretically, hyperscan cannot guarantee the usability of the database if it's simply moved to a new location by operations like memory copy. etc. Even when the alignment of new location is same as original location, we still cannot guarantee that.

For example, given an original db space address 0x10 (16-byte aligned), the calculated cache-aligned offset will be 0x30, thus the real cache-aligned address is 0x10 + 0x30 = 0x40 (64-byte aligned). If we directly copy the db to another 16-byte aligned address, saying 0x30, the expected"cache-aligned" address will be 0x30 + 0x30 = 0x60 (32-byte aligned, not what we want).

Hint: I think perhaps only when original space is 64-byte aligned and new location is multiples of 64-byte (64/128/192/256/..., not simply larger than 64) aligned can the usability be guaranteed. Because only this case won't be affected by the problem in the example (I'll double check on this).

Due to all these complexity, currently users had better to make sure their runtime db is just the db got directly from hs_compile() APIs, or from hs_deserialization_database() APIs.

Since db movement (copy/mapping) might happen in real use cases, we may consider to provide something like a database copy API in the future to cover the alignment issue, or just mention this issue more clearly in our document as you suggested.

For the rspamd mmap issue, I think the idea to use same alignment for both deserialization and after mmap is basically OK. But I‘m not sure whether the values of page_size can satisfy above Hint.