linux-nvme / nvme-cli

NVMe management command line interface.
https://nvmexpress.org
GNU General Public License v2.0
1.45k stars 650 forks source link

Race condition leading to segfault when subsystem/controller removed during nvme list #2435

Closed smithkyl closed 1 month ago

smithkyl commented 1 month ago

Hit this running Debian 12 components (libnvme 1.3, nvme-cli 2.3), though it looks like this bug should still apply just with some different function names.

Was running "nvme list-subsys -o json" in a loop while subsystem was removed when the command cored with the following backtrace:

#0  __strlen_avx2
#1  json_object_new_string
#2  json_print_nvme_subsystem_multipath
#3  json_print_nvme_subsystem_list
#4  nvme_show_subsystem_list
#5  list_subsys
#6  handle_plugin
#7  main

Code was specifically trying to add the state field to the json blob, but the state returned was NULL. libjson-c doesn't NULL-check the string, but does immediately run strlen on it, causing the segfault and the stack above. Because a subsystem/controller was being disconnected/removed, I believe the removal hit a gap in time where kernel structures existed during the scan, but didn't when we went to query state.

After reading through code on master here, I believe this can still affect every print function calling nvme_ctrl_get_state, as none of the functions (or at least none of the print functions I've looked at) check for NULL from that call. That said, I think this only affects the state member, as the rest of the members look like they are typically gotten during allocation with allocation failing if population fails.

With all that in mind, I was wondering if folks thought that the bug should be handled here in a check-all-strings-for-NULL type fix or if the fix should be pushed down to libnvme with nvme_ctrl_get_state doing something like return c->state ? : "";

igaw commented 1 month ago

This sounds very familiar. I think we solved this bug in a newer version. Could you test if this is still present in the latest release?

fda312dc3aec ("json: fix seg. fault converting NULL to JSON string")

smithkyl commented 1 month ago

Sorry for the trouble. I had missed the second #define when reading through thinking that was a libjson-c function. Thanks for the pointer.