openbmc / bmcweb

A do everything Redfish, KVM, GUI, and DBus webserver for OpenBMC
Apache License 2.0
154 stars 131 forks source link

Stress testing virtual media over websocket, bmcweb buffer overflow #203

Closed LeeTroy closed 2 years ago

LeeTroy commented 3 years ago

Describe the bug Stress testing virtual media over websocket, bmcweb got coredump. We're still investigating if this is a hardware issue, so I post here to check anyone meet this issue before.

Environment We're running these test with both Aspeed AST2600-A1/A2 evaluation board. OpenBMC revision is 01eac59fbe806a32ab857dbceb69b8ede949e724

To Reproduce

  1. Mount virtual media from WebGUI, we tested with CentOS 8.3 DVD1 ISO (9GB)
  2. Goto bmc debug console or ssh into shell
  3. Check the nbd device is mounted over nbd-proxy by fdisk -l /dev/nbd0
  4. Dumping data from nbd device by dd if=/dev/nbd0 of=/dev/null bs=10M count=1K, repeatedly until it failed.

Is this a regression Normally, the virtual media works fine. This issue only happens under stress dumping data from nbd device.

root@evb-ast2600:~# fdisk -l /dev/nbd0                                                                                      
Disk /dev/nbd0: 8835 MB, 9264168960 bytes, 18094080 sectors
1126 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Device    Boot StartCHS    EndCHS        StartLBA     EndLBA    Sectors  Size Id Type
/dev/nbd0p1 *  0,0,1       1023,63,32           0   18094079   18094080 8835M  0 Empty
/dev/nbd0p2    1023,254,63 1023,254,63      24300      44867      20568 10.0M ef EFI (FAT-12/16/32)
root@evb-ast2600:~# dd if=/dev/nbd0 of=/dev/null bs=3M count=1K
1024+0 records in
1024+0 records out
root@evb-ast2600:~# dd if=/dev/nbd0 of=/dev/null bs=10M count=1K
[  919.907477] block nbd0: shutting down sockets
[  919.911338] block nbd0: Receive data failed (result -32)
[  919.919004] blk_update_request: I/O error, dev nbd0, sector 9848320 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  921.207036] blk_update_request: I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.218338] Buffer I/O error on dev nbd0, logical block 0, async page read
[  921.226163] blk_update_request: I/O error, dev nbd0, sector 1 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.237330] Buffer I/O error on dev nbd0, logical block 1, async page read
[  921.245080] blk_update_request: I/O error, dev nbd0, sector 2 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.256254] Buffer I/O error on dev nbd0, logical block 2, async page read
[  921.264010] blk_update_request: I/O error, dev nbd0, sector 3 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.275178] Buffer I/O error on dev nbd0, logical block 3, async page read
[  921.282996] blk_update_request: I/O error, dev nbd0, sector 4 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.294157] Buffer I/O error on dev nbd0, logical block 4, async page read
[  921.301962] blk_update_request: I/O error, dev nbd0, sector 5 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.313117] Buffer I/O error on dev nbd0, logical block 5, async page read
[  921.320931] blk_update_request: I/O error, dev nbd0, sector 6 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.332113] Buffer I/O error on dev nbd0, logical block 6, async page read
[  921.339914] blk_update_request: I/O error, dev nbd0, sector 7 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.351092] Buffer I/O error on dev nbd0, logical block 7, async page read
[  921.358941] blk_update_request: I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  921.370124] Buffer I/O error on dev nbd0, logical block 0, async page read
[  921.377938] Buffer I/O error on dev nbd0, logical block 1, async page read
[  921.385797]  nbd0: unable to read partition table
[  921.392808]  nbd0: unable to read partition table
[  921.421280]  nbd0: unable to read partition table
[  921.431394]  nbd0: unable to read partition table
[  938.035090] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.042011] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.048959] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.055928] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.062853] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.069829] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.076793] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.083698] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.090627] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.097641] print_req_error: 64 callbacks suppressed
[  938.097651] blk_update_request: I/O error, dev nbd0, sector 9848576 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.115392] blk_update_request: I/O error, dev nbd0, sector 9848832 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.127582] blk_update_request: I/O error, dev nbd0, sector 9849088 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.139844] blk_update_request: I/O error, dev nbd0, sector 9849344 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.152032] blk_update_request: I/O error, dev nbd0, sector 9849600 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.164194] blk_update_request: I/O error, dev nbd0, sector 9849856 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.176375] blk_update_request: I/O error, dev nbd0, sector 9850112 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[  938.188614] blk_update_request: I/O error, dev nbd0, sector 9850368 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
[  938.200805] blk_update_request: I/O error, dev nbd0, sector 9850624 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[  938.213108] block nbd0: Connection timed out, retrying (0/1 alive)
[  938.220108] blk_update_request: I/O error, dev nbd0, sector 9851392 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 0
root@evb-ast2600:~# journalctl -u bmcweb
-- Journal begins at Thu 1970-01-01 00:00:06 UTC, ends at Tue 2021-05-04 02:36:27 UTC. --
Apr 19 00:42:29 evb-ast2600 systemd[1]: Started Start bmcweb server.
May 04 02:22:18 evb-ast2600 bmcweb[179]: pam_succeed_if(webserver:auth): requirement "user ingroup redfish" was met by user "root"
May 04 02:35:43 evb-ast2600 bmcweb[179]: terminate called after throwing an instance of 'boost::wrapexcept<std::length_error>'
May 04 02:35:43 evb-ast2600 bmcweb[179]:   what():  buffer overflow
May 04 02:35:55 evb-ast2600 systemd[1]: bmcweb.service: Main process exited, code=dumped, status=6/ABRT
May 04 02:35:55 evb-ast2600 bmcweb[334]: nbd-proxy: wait: Interrupted system call
May 04 02:35:55 evb-ast2600 systemd[1]: bmcweb.service: Failed with result 'core-dump'.

With bmcweb-logging enabled:

bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 16bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":115] Read done.  Read 280 bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 16bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 16bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 16bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 16bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 16bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":75] inputBuffer empty.  Bailing out
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: (2021-05-04 05:00:36) [DEBUG "vm_websocket.hpp":84] Wrote 65536bytes
bmcweb[298]: terminate called after throwing an instance of 'boost::wrapexcept<std::length_error>'
bmcweb[298]:   what():  buffer overflow
systemd[1]: bmcweb.service: Main process exited, code=dumped, status=6/ABRT
bmcweb[496]: nbd-proxy: wait: Interrupted system call
systemd[1]: bmcweb.service: Failed with result 'core-dump'.
LeeTroy commented 3 years ago

Found the issue, refer to the following code segment: https://github.com/openbmc/bmcweb/blob/ffed87b5ad1797ca966d030e7f97977028d258fa/include/vm_websocket.hpp#L200-L212

While doWrite is an async operation, there are 16 bytes to-be-consumed data in the inputBuffer but the next message (128K+16Byte) coming before the inputBuffer finished writing. Then the inputBuffer->prepare() generate the buffer overflow exception without proper try/catch block to handle it, causing the bmcweb coredump.

L200: Should check with capacity() - size() in case there are some unprocessed data. Enlarge the inputBuffer size might be needed as well.

Not seeing the same issue in AST2500, perhaps AST2600 has dual core and it runs faster with bmcweb + nbd_proxy causing timing issue?

Reference: https://github.com/boostorg/beast/blob/47ea201d6ea8ef51bacf14a037808c05ca1f354e/include/boost/beast/core/impl/flat_static_buffer.ipp#L40-L60

LeeTroy commented 3 years ago

The other problem here is

May 04 02:35:55 evb-ast2600 bmcweb[334]: nbd-proxy: wait: Interrupted system call

Which introduced from: https://github.com/openbmc/bmcweb/blob/ffed87b5ad1797ca966d030e7f97977028d258fa/include/vm_websocket.hpp#L37-L42

The proxy.wait() were interrupted by it's SIGCHILD, might need to use proxy.wait(ec) and check the error code for retry.

edtanous commented 3 years ago

Both of these suggested patches seem reasonable to me; Can you get gerrit reviews opened for them please? I will review them once they're there.

LeeTroy commented 3 years ago

Sure, I'm still running some overnight stress tests, I'll submit to gerrit when I feel confident.

The input buffer size is trickier, I've seen there are more than 200K unprocess data in inputBuffer when BMC is under heavy loaded. However, it might not happen in normal use cases.

edtanous commented 3 years ago

Have your patches been submitted?

edtanous commented 3 years ago

Ping again for patch submission.

singhscp commented 3 years ago

We're seeing the same issue with Nuvoton Poleg BMC. If I increase the input buffer size sufficiently, then I don't see the original issue but websocket read fails. Here's a section of debug output I get at the time of failure, this is with build based on openbmc version 2.9.

Jul 12 22:18:30 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":77] inputBuffer empty. Bailing out Jul 12 22:18:30 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":86] Wrote 65536bytes Jul 12 22:18:30 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":86] Wrote 65536bytes Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":86] Wrote 16bytes Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":77] inputBuffer empty. Bailing out Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":86] Wrote 65536bytes Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:30) [DEBUG "vm_websocket.hpp":86] Wrote 65536bytes Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:31) [DEBUG "vm_websocket.hpp":86] Wrote 16bytes Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:31) [DEBUG "vm_websocket.hpp":77] inputBuffer empty. Bailing out Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:31) [ERROR "websocket.hpp":210] doRead error asio.ssl:336130329 Jul 12 22:18:31 R650xs kernel: unregistered gadget driver 'configfs-gadget' Jul 12 22:18:31 R650xs kernel: block nbd0: shutting down sockets Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:31) [DEBUG "vm_websocket.hpp":117] Read done. Read 0 bytes Jul 12 22:18:31 R650xs bmcweb[933]: (2021-07-12 22:18:31) [ERROR "vm_websocket.hpp":121] Couldn't read from VM port: asio.misc:2

edtanous commented 3 years ago

Does anyone have interest in getting the above patches submitted?

LeeTroy commented 3 years ago

Oops, I forgot about this after I applied a patch internally, sorry.

As @singhscp mentioned, I never came out a reasonable value for the buffer, it always failed in some way (especially on multi-core soc). I increase the size to (128K+16Byte)*4, is that okay?

singhscp commented 3 years ago

I had also made the buffer 4 times the original size and I didn't see the buffer overrun issue after that.

LeeTroy commented 3 years ago

Submitted: https://gerrit.openbmc-project.xyz/c/openbmc/bmcweb/+/45475

Please help to review, thanks!

edtanous commented 2 years ago

The aforementioned review is merged. Please reopen if there are any new issues.