microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0
778 stars 211 forks source link

cchost exits/crashes when joining a node using a read only snapshot mounted via blobfuse2 backed storage #6466

Open gaurav137 opened 1 week ago

gaurav137 commented 1 week ago

When hosting the read only snapshot directory backed by a blobfuse2 mount point the cchost process seems to exit/crash after logging "All sanity checks passed". Need to debug this more. Filing the bug for tracking purposes.

 2024-09-04T10:28:39.592537Z        100 [info ] ../src/host/main.cpp:444             | Registering RPC interface debug_interface, on  0.0.0.0:8082
2024-09-04T10:28:39.592581Z        100 [info ] ../src/host/socket.h:49              | TCP RPC Client listening on 0.0.0.0:8082
2024-09-04T10:28:39.592595Z        100 [info ] ../src/host/main.cpp:458             | Registered RPC interface debug_interface, on  0.0.0.0:8082
2024-09-04T10:28:39.592605Z        100 [info ] ../src/host/main.cpp:444             | Registering RPC interface primary_rpc_interface, on  0.0.0.0:8080
2024-09-04T10:28:39.592632Z        100 [info ] ../src/host/socket.h:49              | TCP RPC Client listening on 0.0.0.0:8080
2024-09-04T10:28:39.592643Z        100 [info ] ../src/host/main.cpp:458             | Registered RPC interface primary_rpc_interface, on  0.0.0.0:8080
2024-09-04T10:28:39.592719Z        100 [info ] ../src/host/main.cpp:593             | Startup host time: 2024-09-04 10:28:39.592716
2024-09-04T10:28:39.592758Z        100 [info ] ../src/host/main.cpp:651             | Creating new node - join existing network at host.docker.internal:59583
2024-09-04T10:28:41.966691Z        100 [info ] ../src/host/main.cpp:696             | Found latest snapshot file: /mnt/ro-snapshots/snapshots/snapshot_7_8.committed (size: 3168)
2024-09-04T10:28:41.966726Z        100 [info ] ../src/host/main.cpp:716             | Initialising enclave: enclave_create_node
2024-09-04T10:28:41.968036Z        100 [debug] ../src/host/enclave.h:239            | Padding config of size 1912 to 1912 bytes
2024-09-04T10:28:41.968084Z        100 [debug] ../src/host/enclave.h:248            | Padding startup snapshot of size 3168 to 3168 bytes
2024-09-04T10:28:42.067360Z        0   [debug] ../src/enclave/main.cpp:56           | All sanity check tests passed
root@cc8e54fc8945:/app#
gaurav137 commented 1 week ago

Update. Running cchost in gdb gave this output:

2024-09-05T07:48:19.978464Z        100 [fail ] ../src/host/main.cpp:276             | DEPRECATED: Enclave path was specified in config file! This should be removed from the config, and passed directly to the CLI instead
Dwarf Error: DW_FORM_strx1 found in non-DWO CU [in module /usr/lib/ccf/libjs_generic.virtual.so]
2024-09-05T07:48:20.265254Z        100 [info ] ../src/host/ledger.h:1067            | Recovering main ledger directory /mnt/storage/ledger
2024-09-05T07:48:21.273732Z        100 [info ] ../src/host/ledger.h:1114            | Main ledger directory /mnt/storage/ledger is empty: no ledger file to recover
2024-09-05T07:48:21.274045Z        100 [info ] ../src/host/snapshots.h:198          | Snapshots will be stored in existing directory: /mnt/storage/snapshots
2024-09-05T07:48:21.274523Z        100 [info ] ../src/host/lfs_file_handler.h:23    | Clearing contents from existing directory .index
2024-09-05T07:48:21.275173Z        100 [info ] ../src/host/socket.h:49              | TCP Node Server listening on 0.0.0.0:8081
2024-09-05T07:48:21.275210Z        100 [info ] ../src/host/main.cpp:444             | Registering RPC interface debug_interface, on  0.0.0.0:8082
2024-09-05T07:48:21.275234Z        100 [info ] ../src/host/socket.h:49              | TCP RPC Client listening on 0.0.0.0:8082
2024-09-05T07:48:21.275249Z        100 [info ] ../src/host/main.cpp:458             | Registered RPC interface debug_interface, on  0.0.0.0:8082
2024-09-05T07:48:21.275288Z        100 [info ] ../src/host/main.cpp:444             | Registering RPC interface primary_rpc_interface, on  0.0.0.0:8080
2024-09-05T07:48:21.275323Z        100 [info ] ../src/host/socket.h:49              | TCP RPC Client listening on 0.0.0.0:8080
2024-09-05T07:48:21.275356Z        100 [info ] ../src/host/main.cpp:458             | Registered RPC interface primary_rpc_interface, on  0.0.0.0:8080
2024-09-05T07:48:21.275422Z        100 [info ] ../src/host/main.cpp:593             | Startup host time: 2024-09-05 07:48:21.275419
2024-09-05T07:48:21.275459Z        100 [info ] ../src/host/main.cpp:651             | Creating new node - join existing network at host.docker.internal:55532
2024-09-05T07:48:23.459607Z        100 [info ] ../src/host/main.cpp:696             | Found latest snapshot file: /mnt/ro-snapshots/snapshots/snapshot_15_16.committed (size: 63406)
2024-09-05T07:48:23.459759Z        100 [info ] ../src/host/main.cpp:716             | Initialising enclave: enclave_create_node
[New Thread 0x7fbff1fde700 (LWP 444)]
2024-09-05T07:48:23.736428Z        0   [debug] ../src/enclave/main.cpp:56           | All sanity check tests passed

Thread 1 "cchost" received signal SIGSEGV, Segmentation fault.

0x00007fbff612e56f in nlohmann::json_abi_v3_11_3::detail::lexer<nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >, void>, nlohmann::json_abi_v3_11_3::detail::iterator_input_adapter<unsigned char const*> >::get() () from /usr/lib/ccf/libjs_generic.virtual.so

(gdb)
achamayou commented 1 week ago
(gdb) bt
#0  0x00007fbff612e56f in nlohmann::json_abi_v3_11_3::detail::lexer<nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >, void>, nlohmann::json_abi_v3_11_3::detail::iterator_input_adapter<unsigned char const*> >::get() () from /usr/lib/ccf/libjs_generic.virtual.so
#1  0x00007fbff612b11e in nlohmann::json_abi_v3_11_3::detail::lexer<nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >, void>, nlohmann::json_abi_v3_11_3::detail::iterator_input_adapter<unsigned char const*> >::scan() () from /usr/lib/ccf/libjs_generic.virtual.so
#2  0x00007fbff6127bf3 in nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >, void> nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >, void>::parse<unsigned char const*>(unsigned char const*, unsigned char const*, std::__1::function<bool (int, nlohmann::json_abi_v3_11_3::detail::parse_event_t, nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >, void>&)>, bool, bool) () from /usr/lib/ccf/libjs_generic.virtual.so
#3  0x00007fbff61c7259 in ccf::deserialise_snapshot(std::__1::shared_ptr<ccf::kv::Store> const&, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > const&, std::__1::vector<std::__1::unique_ptr<ccf::kv::ConsensusHook, std::__1::default_delete<ccf::kv::ConsensusHook> >, std::__1::allocator<std::__1::unique_ptr<ccf::kv::ConsensusHook, std::__1::default_delete<ccf::kv::ConsensusHook> > > >&, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >*, bool, std::__1::optional<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > >) () from /usr/lib/ccf/libjs_generic.virtual.so
#4  0x00007fbff6221d17 in ccf::NodeState::initialise_startup_snapshot(bool) () from /usr/lib/ccf/libjs_generic.virtual.so
#5  0x00007fbff621daad in ccf::NodeState::launch_node() () from /usr/lib/ccf/libjs_generic.virtual.so
#6  0x00007fbff621a710 in ccf::NodeState::initiate_quote_generation()::{lambda(ccf::QuoteInfo const&, ccf::pal::snp::EndorsementEndpointsConfiguration const&)#1}::operator()(ccf::QuoteInfo const&, ccf::pal::snp::EndorsementEndpointsConfiguration const&) const () from /usr/lib/ccf/libjs_generic.virtual.so
#7  0x00007fbff61b8a20 in ccf::NodeState::create(StartType, StartupConfig&&, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >&&) () from /usr/lib/ccf/libjs_generic.virtual.so
#8  0x00007fbff61b5d61 in ccf::Enclave::create_new_node(StartType, StartupConfig&&, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >&&, unsigned char*, unsigned long, unsigned long*, unsigned char*, unsigned long, unsigned long*) () from /usr/lib/ccf/libjs_generic.virtual.so
#9  0x00007fbff61a6053 in enclave_create_node () from /usr/lib/ccf/libjs_generic.virtual.so
#10 0x00005555555ea4b9 in host::Enclave::create_node(EnclaveConfig const&, StartupConfig const&, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >&&, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >&, std::__1::vector<unsigned char, std::__1::allocator<unsigned char> >&, StartType, LoggerLevel, unsigned long, void*) ()
#11 0x00005555555c2971 in main ()
achamayou commented 4 days ago

The snapshot file appears to be corrupt, it is not clear it was corrupted during snapshot production, writing, or perhaps as result of non-atomic copies in blobfuse. One thing is clear though: the snapshot parser is insufficiently defensive.