indexdata / idzebra

Search engine for structured data
http://www.indexdata.com/zebra
GNU General Public License v2.0

Buffer overflow when indexing more than 10000000 records #44

Closed jprante closed 1 year ago

jprante commented 1 year ago

Version: Zebra 2.2.5

When indexing more than 10000000 records, zebraidx aborts with a buffer overflow:

$ coredumpctl dump --output zebraidx.core
           PID: 3326585 (zebraidx)
           UID: 1000 (joerg)
           GID: 1000 (joerg)
        Signal: 6 (ABRT)
     Timestamp: Sun 2022-12-11 22:39:42 CET (8h ago)
  Command Line: zebraidx -c /etc/zebra/zebra.cfg -l /data/lvi/zebra/log/zebra.log -g lvi update /data/lvi/import
    Executable: /usr/bin/zebraidx-2.0
 Control Group: /user.slice/user-1000.slice/session-1476.scope
          Unit: session-1476.scope
         Slice: user-1000.slice
       Session: 1476
     Owner UID: 1000 (joerg)
       Boot ID: 9a7e081eb20049a1bc8f80405315ec86
    Machine ID: b28b86bd42c0463bb5e713fdb934d355
      Hostname: zeus
       Storage: /var/lib/systemd/coredump/core.zebraidx.1000.9a7e081eb20049a1bc8f80405315ec86.3326585.1670794782000000.zst (present)
     Disk Size: 44.6M
       Message: Process 3326585 (zebraidx) of user 1000 dumped core.

                Module linux-vdso.so.1 with build-id 41ada1ba656c6fd3ad2afaac7d5798df7e29138b
                Module mod-text.so with build-id 269ce6318c23abdcda1203595a5a905e894bf1ef
                Module mod-safari.so with build-id f00dad88165a5a15bb9699f47ebc13d9cc6279dd
                Module libexpat.so.1 with build-id c13fed6a7eb00fb957bc772e969d7d768b97ed9b
                Module mod-grs-xml.so with build-id 221924047c09667c912dbea58049fe7ef9810bfc
                Module mod-grs-regx.so with build-id c4b1a09d46d1e56f0c0bfaeaabaa5e9d192c4631
                Module mod-grs-marc.so with build-id 3f2367951b3a5e5d3d7de2ab08f5940e70b73dcd
                Module mod-dom.so with build-id 0d68d267c23d69afa316324b457410313f75436d
                Module mod-alvis.so with build-id 6fb53edf3e4efcf4a98d499d70bf5cdb70e97c17
                Module libffi.so.8 with build-id 48e3675db4765a2e42729140922e11a10016f7ab
                Module libhogweed.so.6 with build-id 97ce01a5c43483f58a364086c521ec45dc1d3a3a
                Module libnettle.so.8 with build-id a5e63d290dbce2f78dfdfde45b9865adbf312515
                Module libtasn1.so.6 with build-id 3d3a2f6f0d4a70919496afe25e329abd189b7882
                Module libunistring.so.2 with build-id 15e34cdfafa3547f9c700489b842ceb86f6fb73e
                Module libidn2.so.0 with build-id 958c50fc94ecb196b24f3619762e7ec3f28a5b40
                Module libp11-kit.so.0 with build-id 5e20d86b92c9f913571338c18cb70f74da7d3c0e
                Module ld-linux-x86-64.so.2 with build-id df9c6b298bf5e3c1d0eb6a0911f3f561908a704d
                Module libgcc_s.so.1 with build-id 9526c65fed0e95fbb6b988476cc811ca19d5c9c9
                Module libstdc++.so.6 with build-id abd5d7149726b0410af7af2e9a59491942605ddd
                Module liblzma.so.5 with build-id 330eb2fe0769e5466e2e0ac1b158e1e8452738c9
                Module libgnutls.so.30 with build-id 23d3a604a15f3f1f2293e0b5423b8e31a8b7de30
                Module libc.so.6 with build-id b439a356c78dfa4bd24c75a16f564540db2a30ad
                Module libcrypt.so.2 with build-id 6ce4e5eb200e61d07398af52f8bcb316cf8466e0
                Module libbz2.so.1 with build-id dc9cd83a2cf3038bd3f04cc111c32e2c5698b5d3
                Module libz.so.1 with build-id f3d999799f183753842b16d4b510e983a1aba620
                Module libm.so.6 with build-id c0eb573a2171d96b1aa970edb07f3368573bf845
                Module libicudata.so.67 with build-id a980f32bc1fc2ee613ed6123767f7432721156e6
                Module libicuuc.so.67 with build-id 37500757998bd80043c2d781673567eff6777273
                Module libicui18n.so.67 with build-id 859328d730d641ef2be196b432bb27061b80b84b
                Module libxml2.so.2 with build-id 22a5cc77ed905c1c0e3450d081f3bafffa789b4d
                Module libxslt.so.1 with build-id cdad1bad6f8bf3fad76523c54df03607b7fa391e
                Module libexslt.so.0 with build-id f0395c9288c25c3c8b1dce6b472315282ef64843
                Module libyaz.so.5 with build-id 008c6fed8d968484ba07a8d3242791bf886550e6
                Module libyaz_icu.so.5 with build-id be120492428f37261f79774ac4a21722fd3722b2
                Module libyaz_server.so.5 with build-id 21bad638b3b938cf62b7327b5207e5942cada89c
                Module libidzebra-2.0.so.0 with build-id 8437f46f0f39948b6d966e058387a5fa2d649b8a
                Module zebraidx-2.0 with build-id 22bad8487f3263914bd928578c705155610c4453
                Stack trace of thread 3326585:
                #0  0x00007f74a41e054c __pthread_kill_implementation (libc.so.6 + 0xa154c)
                #1  0x00007f74a4193ce6 raise (libc.so.6 + 0x54ce6)
                #2  0x00007f74a41677f3 abort (libc.so.6 + 0x287f3)
                #3  0x00007f74a41d4547 __libc_message (libc.so.6 + 0x95547)
                #4  0x00007f74a429c2ca __fortify_fail (libc.so.6 + 0x15d2ca)
                #5  0x00007f74a429ad06 __chk_fail (libc.so.6 + 0x15bd06)
                #6  0x00007f74a41cd0ef _IO_str_chk_overflow (libc.so.6 + 0x8e0ef)
                #7  0x00007f74a41d84e1 _IO_default_xsputn (libc.so.6 + 0x994e1)
                #8  0x00007f74a41c2276 __vfprintf_internal (libc.so.6 + 0x83276)
                #9  0x00007f74a41cd194 __vsprintf_internal (libc.so.6 + 0x8e194)
                #10 0x00007f74a429a821 __sprintf_chk (libc.so.6 + 0x15b821)
                #11 0x00007f74a68a6ca2 data1_mk_tag_data_zint (libidzebra-2.0.so.0 + 0x4eca2)
                #12 0x00007f74a6891b1a zebraExplain_writeDatabase (libidzebra-2.0.so.0 + 0x39b1a)
                #13 0x00007f74a6891ef0 zebraExplain_flush (libidzebra-2.0.so.0 + 0x39ef0)
                #14 0x00007f74a689254b zebra_end_transaction (libidzebra-2.0.so.0 + 0x3a54b)
                #15 0x00007f74a68926ee zebra_end_trans (libidzebra-2.0.so.0 + 0x3a6ee)
                #16 0x0000000000402b34 main (zebraidx-2.0 + 0x2b34)
                #17 0x00007f74a417eeb0 __libc_start_call_main (libc.so.6 + 0x3feb0)
                #18 0x00007f74a417ef60 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x3ff60)
                #19 0x0000000000402ea5 _start (zebraidx-2.0 + 0x2ea5)

                Stack trace of thread 3326586:
                #0  0x00007f74a41db39a __futex_abstimed_wait_common (libc.so.6 + 0x9c39a)
                #1  0x00007f74a41ddba0 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x9eba0)
                #2  0x00007f74a6890ff3 thread_func (libidzebra-2.0.so.0 + 0x38ff3)
                #3  0x00007f74a41de802 start_thread (libc.so.6 + 0x9f802)
                #4  0x00007f74a417e450 __clone3 (libc.so.6 + 0x3f450)
                ELF object binary architecture: AMD x86-64
More than one entry matches, ignoring rest.

A workaround is in https://github.com/indexdata/idzebra/pull/43, but the better fix would be to use a larger buffer.
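To illustrate what "a larger buffer" could mean, here is a minimal sketch that sizes the buffer for the worst case of the integer type rather than picking a fixed small number. The macro DEC_BUF_SIZE and the helper ll_to_string are hypothetical names, not part of Zebra, and long long is only assumed as a stand-in for Zebra's zint type.

```c
#include <limits.h>
#include <stdio.h>

/* Conservative upper bound on the bytes needed to print a signed integer
 * of type T in decimal: about 0.302 digits per bit, plus one for rounding,
 * one for the sign and one for the terminating NUL. */
#define DEC_BUF_SIZE(T) (sizeof(T) * CHAR_BIT * 302 / 1000 + 3)

enum { LL_BUF_SIZE = DEC_BUF_SIZE(long long) };

/* Hypothetical helper: prints value into a buffer that is large enough
 * for any long long. Returns the number of characters written. */
static int ll_to_string(long long value, char buf[LL_BUF_SIZE])
{
    return snprintf(buf, LL_BUF_SIZE, "%lld", value);
}
```

With a 64-bit long long this gives a 22-byte buffer, enough for the 20 characters of the most negative value plus the terminator.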

With the workaround, I get the following messages:

08:25:13-14/12 zebraidx(3988166) [log] Records: 28529000 i/u/d 28529000/0/0
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'recordCountActual', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'recordBytes', chars to print = 12, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 9, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 9, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'dococcurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'termoccurrences', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'id', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'attributeDetailsId', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'id', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'attributeDetailsId', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'id', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'attributeDetailsId', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'id', chars to print = 8, available = 8
08:25:13-14/12 zebraidx(3988166) [warn] buffer overflow for tag 'attributeDetailsId', chars to print = 8, available = 8
08:25:25-14/12 zebraidx(3988166) [log] Merge 0.6% completed; 26 minutes remaining

My recommendation is to always use snprintf instead of sprintf throughout the project.
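A minimal sketch of that recommendation (not the patch that was eventually merged): snprintf takes the buffer size and returns the length the full output would have needed, so truncation can be detected and reported instead of writing past the buffer. The helper name format_ll_checked is hypothetical, and long long is assumed here as a stand-in for Zebra's zint type.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of Zebra's API.
 * Formats value into buf without ever writing more than size bytes.
 * Returns 0 if the whole number fit, -1 on an output error, and otherwise
 * the number of characters the full output would have needed. */
static int format_ll_checked(char *buf, size_t size, long long value)
{
    int needed = snprintf(buf, size, "%lld", value);

    if (needed < 0)
        return -1;              /* output error reported by snprintf */
    if ((size_t) needed >= size)
        return needed;          /* truncated: buf holds only a prefix */
    return 0;
}
```

A caller can log a warning or allocate a larger buffer when the return value is positive, rather than aborting.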

adamdickmeiss commented 1 year ago

Can you check whether the branch https://github.com/indexdata/idzebra/tree/issue-44-overflow-zint-data1-node fixes it for you?

adamdickmeiss commented 1 year ago

As mentioned in #43, the problem is likely NOT the number of records but the stats number "recordBytes", which exceeds 1e12 and overflows the 12-byte buffer (DATA1_LOCALDATA).
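To make the size arithmetic concrete: a 12-byte buffer holds at most 11 characters plus the terminating NUL, so a value that prints as 12 or more digits no longer fits. The sketch below uses hypothetical helper names and measures the printed length with snprintf(NULL, 0, ...), which writes nothing and only returns the required length.

```c
#include <stddef.h>
#include <stdio.h>

/* Number of characters needed to print value in decimal,
 * not counting the terminating NUL. */
static int decimal_length(long long value)
{
    return snprintf(NULL, 0, "%lld", value);
}

/* Returns 1 if value plus its terminating NUL fits in buf_size bytes. */
static int fits_in_buffer(long long value, size_t buf_size)
{
    int len = decimal_length(value);
    return len >= 0 && (size_t) len < buf_size;
}
```

For example, the record count 28529000 from the log needs 8 characters and fits in 12 bytes, while a byte count with 12 digits does not.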

adamdickmeiss commented 1 year ago

This should be fixed in master now that #47 is merged.

jprante commented 1 year ago

The patch helped, thanks!