intel / QAT_Engine

Intel QuickAssist Technology (QAT) OpenSSL Engine (an OpenSSL plug-in engine) which provides cryptographic acceleration for both hardware and optimized software on Intel QuickAssist Technology enabled Intel platforms. https://developer.intel.com/quickassist
BSD 3-Clause "New" or "Revised" License

memory corruption when using qat service #79

Open tokers opened 6 years ago

tokers commented 6 years ago

Hello!

qat driver configure:

./configure --enable-icp-hb-fail-sim

qat engine configure:

./configure --with-openssl_dir=/path/to/openssl --with-openssl_install_dir=/path/to/openssl_install --with-qat_dir=/path/to/qat --enable-usdm --enable-upstream_driver
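
For reference, a quick way to confirm the engine itself loads against the OpenSSL build (the engine id depends on the engine version; ours registers as qat, newer releases may use qatengine):

/path/to/openssl_install/bin/openssl engine -t -c qat

which should report [ available ] when the engine initializes correctly.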

lsmod (irrelevant entries removed):

qat_dh895xcc            6586  1
usdm_drv               82195  19
intel_qat             152264  56 qat_dh895xcc,usdm_drv
sha512_generic          4943  0
uio                     7430  37 intel_qat
dh_generic              2535  1 intel_qat
authencesn              4740  0
authenc                 4119  2 intel_qat,authencesn
rsa_generic             8940  1 intel_qat
mpi                    12055  2 dh_generic,rsa_generic
asn1_decoder            2882  1 rsa_generic
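
As an extra sanity check, the adf_ctl utility that ships with the driver can confirm the device state:

./adf_ctl status

Each device should be listed with state: up.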

In addition, we disabled the intel_iommu boot option.
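
(If it is useful for reproducing: the setting can be double-checked with cat /proc/cmdline; intel_iommu=off, or typically no intel_iommu entry at all, means the IOMMU is not enabled.)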

We are using a QAT dh895xcc series card for our own OpenResty/Nginx service. When we pour traffic into Nginx with the qat service enabled, it crashes constantly. Some backtraces look like:

2018/07/30 19:15:56 [alert] 273240#273240: backtrace: [00] nginx: worker process(ngx_backtrace+0x1f) [0x4595a4]
2018/07/30 19:15:56 [alert] 273240#273240: backtrace: [01] nginx: worker process() [0x459838]
2018/07/30 19:15:56 [alert] 273240#273240: backtrace: [02] /lib64/libpthread.so.0(+0xf7e0) [0x7f7d1b3d67e0]
2018/07/30 19:15:56 [alert] 273240#273240: backtrace: [03] [0x7f7d44fac29d]
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x1
2018/07/30 19:15:50 [alert] 273235#273235: worker process 273235 exited on signal 11 (core dumped)
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [00] nginx: worker process(ngx_backtrace+0x1f) [0x4595a4]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [01] nginx: worker process() [0x459838]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [02] /lib64/libpthread.so.0(+0xf7e0) [0x7f7d1b3d67e0]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [03] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0x2bedd) [0x7f7d1af42edd]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [04] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0x30fdf) [0x7f7d1af47fdf]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [05] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0x46851) [0x7f7d1af5d851]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [06] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0xa736) [0x7f7d1af21736]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [07] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0x4714f) [0x7f7d1af5e14f]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [08] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0x14681) [0x7f7d1af2b681]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [09] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(+0xbdaa) [0x7f7d1af22daa]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [10] /usr/local/marco/luajit/lib/libluajit-5.1.so.2(lua_pcall+0x2d) [0x7f7d1af3137d]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [11] nginx: worker process() [0x554056]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [12] nginx: worker process(ngx_http_lua_cache_loadbuffer+0x4e) [0x55422c]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [13] nginx: worker process(ngx_http_lua_filter_set_by_lua_inline+0x8b) [0x549bd4]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [14] nginx: worker process() [0x52c2d9]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [15] nginx: worker process() [0x4debc1]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [16] nginx: worker process(ngx_http_core_rewrite_phase+0x21) [0x470a40]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [17] nginx: worker process(ngx_http_core_run_phases+0x90) [0x47093b]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [18] nginx: worker process(ngx_http_handler+0x1b1) [0x4708a9]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [19] nginx: worker process(ngx_http_process_request+0x314) [0x47ef5b]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [20] nginx: worker process() [0x47d9f6]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [21] nginx: worker process() [0x47cf83]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [22] nginx: worker process() [0x47bd7b]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [23] nginx: worker process() [0x45eb61]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [24] nginx: worker process(ngx_process_events_and_timers+0xd3) [0x44f29b]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [25] nginx: worker process() [0x45c42b]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [26] nginx: worker process(ngx_spawn_process+0x656) [0x458eda]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [27] nginx: worker process() [0x45b570]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [28] nginx: worker process(ngx_master_process_cycle+0x296) [0x45ac59]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [29] nginx: worker process(main+0x564) [0x41cef0]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [30] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f7d1a160d5d]
2018/07/30 19:15:50 [alert] 273235#273235: backtrace: [31] nginx: worker process() [0x41c799]

The first backtrace shows an attempt to free a pointer with an invalid address (0x1). For the second backtrace, after my own analysis, the crash happens while LuaJIT is restoring a stack snapshot (exiting a trace back to the interpreter). I have also emailed the LuaJIT community about this issue. By the way, when I disable the JIT compiler, this type of segmentation fault disappears.
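
For anyone who wants to reproduce the workaround: jit.off() is the standard LuaJIT call for this, and in OpenResty it can be applied process-wide from nginx.conf, e.g.

init_by_lua_block { jit.off() }

With that in place only the interpreter runs, and this type of crash no longer appears.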

After I disable the qat service, our service works well. I don't know whether the qat service is causing some memory corruption.

Do you have any ideas about this issue?

Regards,
Alex Zhang

stevelinsell commented 6 years ago

Hi Alex,

I'm not familiar with LuaJIT, but my understanding from a quick bit of reading is that the nginx module for OpenResty uses LuaJIT co-routines to run asynchronously and avoid blocking. Internally, LuaJIT uses longjmp within its co-routine implementation to yield. The trouble is that OpenSSL, when running asynchronously with the QAT Engine, is also using its own co-routines for each connection. These co-routines (referred to as fibres) allocate their own stacks (from the heap), switch to those stacks when running, and likewise use longjmp to switch out of the co-routine back to the standard stack. Perhaps there is some bad interaction between the two co-routine implementations, possibly related to both using longjmp and something getting confused or corrupted with respect to the stack? That is purely a guess, though. Unfortunately I'm not aware of anyone here running QuickAssist with nginx and also using LuaJIT, so I may be of limited help. I'll let you know if I find out anything further.
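
To make the fibre mechanism concrete, here is a minimal standalone sketch of the ASYNC_JOB API the engine runs on (illustrative only, not QAT Engine source; OpenSSL 1.1.0+, link with -lcrypto):

#include <stdio.h>
#include <openssl/async.h>

/* Job body: runs on a private, heap-allocated stack. A real engine
 * would pause while waiting for the hardware to respond. */
static int job_fn(void *arg)
{
    (void)arg;
    printf("job: on the fibre's own stack\n");
    ASYNC_pause_job();  /* switch back out to the caller's stack */
    printf("job: resumed where it left off\n");
    return 1;
}

int main(void)
{
    ASYNC_JOB *job = NULL;
    ASYNC_WAIT_CTX *ctx = ASYNC_WAIT_CTX_new();
    int ret = 0;

    if (ctx == NULL)
        return 1;

    /* ASYNC_PAUSE means the job yielded; calling again resumes it on
     * its saved stack. ASYNC_FINISH ends the loop. */
    while (ASYNC_start_job(&job, ctx, &ret, job_fn, NULL, 0) == ASYNC_PAUSE)
        printf("main: job paused, back on the normal stack\n");

    ASYNC_WAIT_CTX_free(ctx);
    return 0;
}

You can see how another component that also saves and restores stack contexts (like LuaJIT) could interact badly with this if the two ever interleave on the same thread.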

Steve.

tokers commented 6 years ago

@stevelinsell Thanks for the response. I think it is unrelated to the co-routines, since the problem disappeared when I disabled LuaJIT's just-in-time mode. I will ask the OpenResty community about this issue for some help. Anyway, thanks again!

tokers commented 6 years ago

@stevelinsell I want to try a different QAT driver version, and I found two qat1.7 drivers:

Are there any other qat1.7 drivers?

stevelinsell commented 6 years ago

Hi Alex,

I believe that, at this moment in time, only three versions of the QAT1.7 driver have been made available on 01.org. Starting with the oldest:

Only L.1.0.3-00042 and L.4.2.0-00022 remain available for download.

Kind Regards,

Steve.