On helios-1.0.21316, using this build:
$ cockroach version
Build Tag: v21.2.9
Build Time: 2022/04/28 04:02:42
Distribution: OSS
Platform: illumos amd64 (x86_64-pc-solaris2.11)
Go Version: go1.16.10
C Compiler: gcc 10.3.0
Build Commit ID: 11787edfcfc157a0df951abc34684e4e18b3ef20
Build Type: release
(based on oxidecomputer/garbage-compactor#6)
I was trying to reproduce the test failures mentioned in #1223 using:
$ time while time cargo test -p omicron-nexus --lib -- db:: ; do :; done
warning: skipping duplicate package `header-check` found at `/home/dap/.cargo/git/checkouts/propolis-12517f89d3d9f483/c0776be/crates/bhyve-api/header-check`
Finished test [unoptimized + debuginfo] target(s) in 0.40s
Running unittests src/lib.rs (target/debug/deps/omicron_nexus-2954a006aadad259)
running 85 tests
test db::collection_attach::test::test_verify_query ... ok
test db::collection_attach::test::test_attach_missing_collection_fails ... ok
test db::collection_detach::test::test_verify_query ... ok
test db::collection_detach::test::test_detach_deleted_resource_fails ... ok
test db::collection_attach::test::test_attach_deleted_resource_fails ... ok
test db::collection_detach::test::test_detach_missing_collection_fails ... ok
test db::collection_attach::test::test_attach_multiple_times ... ok
test db::collection_detach::test::test_detach_missing_resource_fails ... ok
test db::collection_detach::test::test_detach_once ... ok
test db::collection_attach::test::test_attach_without_update_filter ... ok
test db::collection_attach::test::test_attach_while_already_attached ... ok
test db::collection_detach_many::test::test_verify_query ... ok
test db::collection_attach::test::test_attach_beyond_capacity_fails ... ok
test db::collection_attach::test::test_attach_once ... ok
test db::collection_insert::test::test_verify_query ... ok
test db::datastore::region::test::test_extent_count ... ok
test db::collection_detach::test::test_detach_without_update_filter ... ok
test db::collection_attach::test::test_attach_missing_resource_fails ... ok
test db::collection_detach::test::test_detach_while_already_detached ... ok
test db::collection_attach::test::test_attach_once_synchronous ... ok
test db::collection_attach::test::test_attach_with_filters ... ok
test db::collection_detach_many::test::test_detach_filter_collection ... ok
test db::collection_detach_many::test::test_detach_missing_collection_fails ... ok
test db::collection_detach_many::test::test_detach_deleted_resource ... ok
test db::collection_detach_many::test::test_detach_many ... ok
test db::collection_detach_many::test::test_detach_missing_resource_succeeds ... ok
test db::collection_detach_many::test::test_detach_once ... ok
test db::datastore::test::test_deallocate_external_ip_is_idempotent ... ok
test db::collection_detach_many::test::test_detach_once_synchronous ... ok
test db::datastore::test::test_deallocate_external_ip_by_instance_id_is_idempotent ... ok
test db::collection_detach_many::test::test_detach_while_already_detached ... ok
test db::datastore::test::test_external_ip_check_constraints ... ok
test db::datastore::test::test_rack_initialize_is_idempotent ... ok
test db::datastore::test::test_queries_do_not_require_full_table_scan ... ok
test db::fixed_data::silo_user::test::test_silo_user_ids_are_valid ... ok
test db::fixed_data::role_builtin::test::test_fixed_role_data ... ok
test db::fixed_data::test::test_builtin_fleet_id_is_valid ... ok
test db::fixed_data::user_builtin::test::test_builtin_user_ids_are_valid ... ok
test db::datastore::test::test_project_creation ... ok
test db::collection_insert::test::test_collection_present ... ok
test db::collection_insert::test::test_collection_not_present ... ok
test db::datastore::test::test_region_allocation ... ok
test db::datastore::test::test_region_allocation_is_idempotent ... ok
test db::datastore::test::test_region_allocation_not_enough_datasets ... ok
test db::datastore::test::test_region_allocation_out_of_space_fails ... ok
test db::datastore::test::test_service_upsert ... ok
test db::datastore::test::test_session_methods ... ok
test db::datastore::test::test_sled_ipv6_address_allocation ... ok
test db::datastore::test::test_ssh_keys ... ok
test db::datastore::test::test_table_scan ... ok
test db::lookup::test::test_lookup ... ok
test db::explain::test::test_explain_async ... ok
test db::explain::test::test_explain_full_table_scan ... ok
test db::pagination::test::test_paginated_multicolumn_descending ... ok
test db::queries::network_interface::tests::test_first_available_address ... ok
test db::pagination::test::test_paginated_single_column_descending ... ok
test db::pagination::test::test_paginated_single_column_ascending ... ok
test db::pagination::test::test_paginated_multicolumn_ascending ... ok
test db::queries::external_ip::tests::test_ensure_pool_exhaustion_does_not_use_other_pool ... ok
test db::queries::external_ip::tests::test_ephemeral_and_snat_ips_do_not_overlap ... ok
test db::queries::external_ip::tests::test_insert_external_ip_is_idempotent ... ok
test db::queries::external_ip::tests::test_insert_external_ip_for_service_is_idempoent ... ok
test db::queries::external_ip::tests::test_insert_external_ip_for_service_is_idempotent_even_when_full ... ok
test db::queries::external_ip::tests::test_next_external_ip_allocation_and_exhaustion ... ok
test db::queries::network_interface::tests::test_last_address_offset ... ok
test db::queries::external_ip::tests::test_next_external_ip_for_service ... ok
test db::queries::external_ip::tests::test_next_external_ip_is_restricted_to_pools ... ok
test db::queries::external_ip::tests::test_next_external_ip_is_restricted_to_projects ... ok
test db::queries::external_ip::tests::test_next_external_ip_out_of_order_allocation_ok ... ok
test db::queries::external_ip::tests::test_next_external_ip_with_ephemeral_takes_whole_port_range ... ok
test db::queries::network_interface::tests::test_detect_ip_exhaustion ... ok
test db::queries::network_interface::tests::test_insert_multiple_vpcs_fails ... ok
test db::queries::network_interface::tests::test_insert_no_instance_fails ... ok
test db::queries::network_interface::tests::test_insert_multiple_vpc_subnets_succeeds ... ok
test db::queries::network_interface::tests::test_insert_request_exact_ip ... ok
test db::queries::network_interface::tests::test_insert_request_same_ip_fails ... ok
test db::queries::network_interface::tests::test_insert_running_instance_fails ... ok
test db::queries::network_interface::tests::test_insert_sequential_ip_allocation ... ok
test db::queries::network_interface::tests::test_insert_same_vpc_subnet_fails ... ok
test db::queries::network_interface::tests::test_limit_number_of_interfaces_per_instance_query ... ok
test db::queries::network_interface::tests::test_insert_with_duplicate_name_fails ... ok
test db::queries::next_item::tests::test_wrapping_next_item_query ... ok
test db::saga_recovery::test::test_failure_during_saga_can_be_recovered ... ok
test db::queries::vpc_subnet::test::test_filter_conflicting_vpc_subnet_ranges_query ... ok
test db::saga_recovery::test::test_successful_saga_does_not_replay_during_recovery ... ok
test result: ok. 85 passed; 0 failed; 0 ignored; 0 measured; 52 filtered out; finished in 23.09s
real 0m23.645s
user 1m53.948s
sys 0m50.398s
warning: skipping duplicate package `header-check` found at `/home/dap/.cargo/git/checkouts/propolis-12517f89d3d9f483/c0776be/crates/bhyve-api/header-check`
Finished test [unoptimized + debuginfo] target(s) in 0.40s
Running unittests src/lib.rs (target/debug/deps/omicron_nexus-2954a006aadad259)
running 85 tests
test db::collection_attach::test::test_verify_query ... ok
test db::collection_detach::test::test_detach_deleted_resource_fails ... ok
test db::collection_detach::test::test_verify_query ... ok
test db::collection_detach::test::test_detach_missing_resource_fails ... ok
test db::collection_attach::test::test_attach_missing_collection_fails ... ok
test db::collection_attach::test::test_attach_without_update_filter ... ok
test db::collection_detach::test::test_detach_missing_collection_fails ... ok
test db::collection_detach::test::test_detach_without_update_filter ... ok
test db::collection_attach::test::test_attach_with_filters ... ok
test db::collection_attach::test::test_attach_deleted_resource_fails ... ok
test db::collection_attach::test::test_attach_once_synchronous ... ok
test db::collection_detach_many::test::test_verify_query ... ok
test db::collection_detach::test::test_detach_once ... ok
test db::collection_attach::test::test_attach_while_already_attached ... ok
test db::collection_insert::test::test_verify_query ... ok
test db::datastore::region::test::test_extent_count ... ok
test db::collection_attach::test::test_attach_multiple_times ... ok
test db::collection_detach::test::test_detach_while_already_detached ... ok
test db::collection_attach::test::test_attach_once ... ok
test db::collection_attach::test::test_attach_missing_resource_fails ... ok
test db::collection_attach::test::test_attach_beyond_capacity_fails ... ok
test db::collection_detach_many::test::test_detach_deleted_resource ... ok
test db::collection_detach_many::test::test_detach_filter_collection ... ok
test db::collection_detach_many::test::test_detach_many ... ok
test db::collection_insert::test::test_collection_present ... ok
test db::datastore::test::test_deallocate_external_ip_is_idempotent ... ok
test db::collection_detach_many::test::test_detach_missing_collection_fails ... ok
test db::datastore::test::test_deallocate_external_ip_by_instance_id_is_idempotent ... ok
test db::datastore::test::test_queries_do_not_require_full_table_scan ... ok
test db::datastore::test::test_project_creation ... ok
test db::datastore::test::test_session_methods ... FAILED
test db::collection_detach_many::test::test_detach_while_already_detached ... ok
test db::collection_detach_many::test::test_detach_once_synchronous ... ok
test db::fixed_data::role_builtin::test::test_fixed_role_data ... ok
test db::fixed_data::silo_user::test::test_silo_user_ids_are_valid ... ok
test db::fixed_data::test::test_builtin_fleet_id_is_valid ... ok
test db::fixed_data::user_builtin::test::test_builtin_user_ids_are_valid ... ok
test db::collection_detach_many::test::test_detach_missing_resource_succeeds ... ok
test db::datastore::test::test_rack_initialize_is_idempotent ... ok
test db::collection_detach_many::test::test_detach_once ... ok
test db::collection_insert::test::test_collection_not_present ... ok
test db::datastore::test::test_region_allocation ... ok
test db::datastore::test::test_region_allocation_is_idempotent ... ok
test db::datastore::test::test_region_allocation_not_enough_datasets ... ok
test db::datastore::test::test_service_upsert ... ok
test db::datastore::test::test_region_allocation_out_of_space_fails ... ok
test db::datastore::test::test_sled_ipv6_address_allocation ... ok
test db::datastore::test::test_table_scan ... ok
test db::datastore::test::test_ssh_keys ... ok
test db::explain::test::test_explain_async ... ok
test db::explain::test::test_explain_full_table_scan ... ok
test db::lookup::test::test_lookup ... ok
test db::pagination::test::test_paginated_multicolumn_ascending ... ok
test db::pagination::test::test_paginated_multicolumn_descending ... ok
test db::queries::network_interface::tests::test_first_available_address ... ok
test db::pagination::test::test_paginated_single_column_ascending ... ok
test db::pagination::test::test_paginated_single_column_descending ... ok
test db::queries::external_ip::tests::test_ensure_pool_exhaustion_does_not_use_other_pool ... ok
test db::queries::external_ip::tests::test_ephemeral_and_snat_ips_do_not_overlap ... ok
test db::queries::external_ip::tests::test_insert_external_ip_for_service_is_idempotent_even_when_full ... ok
test db::queries::external_ip::tests::test_insert_external_ip_is_idempotent ... ok
test db::queries::external_ip::tests::test_insert_external_ip_for_service_is_idempoent ... ok
test db::queries::external_ip::tests::test_next_external_ip_allocation_and_exhaustion ... ok
test db::queries::external_ip::tests::test_next_external_ip_for_service ... ok
test db::queries::network_interface::tests::test_last_address_offset ... ok
test db::queries::network_interface::tests::test_insert_no_instance_fails ... ok
test db::queries::external_ip::tests::test_next_external_ip_is_restricted_to_projects ... ok
test db::queries::external_ip::tests::test_next_external_ip_is_restricted_to_pools ... ok
test db::queries::external_ip::tests::test_next_external_ip_out_of_order_allocation_ok ... ok
test db::queries::external_ip::tests::test_next_external_ip_with_ephemeral_takes_whole_port_range ... ok
test db::queries::network_interface::tests::test_insert_multiple_vpcs_fails ... ok
test db::queries::network_interface::tests::test_detect_ip_exhaustion ... ok
test db::queries::network_interface::tests::test_insert_multiple_vpc_subnets_succeeds ... ok
test db::queries::network_interface::tests::test_insert_request_exact_ip ... ok
test db::queries::network_interface::tests::test_insert_request_same_ip_fails ... ok
test db::queries::network_interface::tests::test_insert_running_instance_fails ... ok
test db::queries::network_interface::tests::test_insert_same_vpc_subnet_fails ... ok
test db::queries::network_interface::tests::test_insert_sequential_ip_allocation ... ok
test db::queries::network_interface::tests::test_insert_with_duplicate_name_fails ... ok
test db::queries::network_interface::tests::test_limit_number_of_interfaces_per_instance_query ... ok
test db::queries::vpc_subnet::test::test_filter_conflicting_vpc_subnet_ranges_query ... ok
test db::queries::next_item::tests::test_wrapping_next_item_query ... ok
test db::saga_recovery::test::test_failure_during_saga_can_be_recovered ... ok
test db::saga_recovery::test::test_successful_saga_does_not_replay_during_recovery ... ok
test db::datastore::test::test_external_ip_check_constraints has been running for over 60 seconds
The first full run took 23s (see above). The second run has been hung for 17+ hours. Where is it?
$ ptree 753
535 /usr/sbin/sshd
750 /usr/sbin/sshd -R
752 /usr/sbin/sshd -R
753 -bash
4378 /home/dap/.rustup/toolchains/nightly-2022-09-27-x86_64-unknown-illumos/bin/carg
4379 /home/dap/omicron-cockroachdb/target/debug/deps/omicron_nexus-2954a006aadad259
4679 cockroach version --build-tag
It's been stuck for 17 hours:
$ ps -opid,stime,etime,args -p 4679
PID STIME ELAPSED COMMAND
4679 17:15:29 17:33:39 cockroach version --build-tag
Based on prstat, it's not on CPU, and it has used very little CPU time:
$ ptime -p 4679
4679: cockroach version --build-tag
real 17:34:08.114941095
user 0.079358992
sys 0.059242551
and those "user" and "sys" numbers aren't going up. What's it doing?
$ pstack 4679
4679: cockroach version --build-tag
--------------------- thread# 1 / lwp# 1 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b440) + 10
fffffc7fee1c7782 sem_wait (c15b440) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124cfe5 runtime.lock2 () + 185
000000000127f21b runtime.printlock () + 5b
000000000127e010 runtime.dopanic_m () + 290
000000000127d9c7 runtime.fatalthrow.func1 () + 67
00000000012b4bb2 runtime.systemstack () + 52
000000000127d6b4 runtime.throw () + 74
0000000001268e5e runtime.(*sweepLocked).sweep () + 99e
0000000001259d25 runtime.(*mcentral).cacheSpan () + 4a5
000000000125908a runtime.(*mcache).refill () + aa
000000000124e54d runtime.(*mcache).nextFree () + 8d
000000000124ec30 runtime.mallocgc () + 530
0000000001250d65 runtime.mapassign () + 485
0000000004e2f63d github.com/aws/aws-sdk-go/aws/endpoints.init () + 4d19d
000000000128e949 runtime.doInit () + 129
000000000128e89e runtime.doInit () + 7e
000000000128e89e runtime.doInit () + 7e
000000000128e89e runtime.doInit () + 7e
000000000128e89e runtime.doInit () + 7e
000000000128e89e runtime.doInit () + 7e
00000000012801c5 runtime.main () + 205
00000000012b6d41 runtime.goexit () + 1
--------------------- thread# 2 / lwp# 2 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b640) + 10
fffffc7fee1c7782 sem_wait (c15b640) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124cfe5 runtime.lock2 () + 185
000000000128c82d runtime.sysmon () + 12d
0000000001283197 runtime.mstart1 () + 97
00000000012830e6 runtime.mstart0 () + 66
00000000012b4a85 runtime.mstart () + 5
fffffc7fee1e0e4c _thrp_setup (fffffc7fed240240) + 6c
fffffc7fee1e1160 _lwp_start ()
--------------------- thread# 3 / lwp# 3 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b500) + 10
fffffc7fee1c7782 sem_wait (c15b500) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
--------------------- thread# 4 / lwp# 4 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b480) + 10
fffffc7fee1c7782 sem_wait (c15b480) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
--------------------- thread# 5 / lwp# 5 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b4c0) + 10
fffffc7fee1c7782 sem_wait (c15b4c0) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
0000000001284766 runtime.templateThread () + 86
0000000001283197 runtime.mstart1 () + 97
00000000012830e6 runtime.mstart0 () + 66
00000000012b4a85 runtime.mstart () + 5
fffffc7fee1e0e4c _thrp_setup (fffffc7fed241a40) + 6c
fffffc7fee1e1160 _lwp_start ()
--------------------- thread# 6 / lwp# 6 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b680) + 10
fffffc7fee1c7782 sem_wait (c15b680) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
--------------------- thread# 7 / lwp# 7 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b540) + 10
fffffc7fee1c7782 sem_wait (c15b540) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
000000000128509b runtime.stoplockedm () + 7b
000000000128702f runtime.schedule () + 4f
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
0000000001291a90 runtime.selectgo () + 7b0
00000000012ae3b2 runtime.ensureSigM.func1 () + 1f2
00000000012b6d41 runtime.goexit () + 1
--------------------- thread# 8 / lwp# 8 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b580) + 10
fffffc7fee1c7782 sem_wait (c15b580) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d386 runtime.notetsleep_internal () + 66
000000000124d687 runtime.notetsleepg () + 67
00000000012b228b os/signal.signal_recv () + ab
0000000001d27a65 os/signal.loop () + 25
00000000012b6d41 runtime.goexit () + 1
--------------------- thread# 9 / lwp# 9 ---------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b5c0) + 10
fffffc7fee1c7782 sem_wait (c15b5c0) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 10 / lwp# 10 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b600) + 10
fffffc7fee1c7782 sem_wait (c15b600) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001285365 runtime.startlockedm () + 85
000000000128706e runtime.schedule () + 8e
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 11 / lwp# 11 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b700) + 10
fffffc7fee1c7782 sem_wait (c15b700) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 12 / lwp# 12 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b780) + 10
fffffc7fee1c7782 sem_wait (c15b780) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 13 / lwp# 13 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b6c0) + 10
fffffc7fee1c7782 sem_wait (c15b6c0) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124cfe5 runtime.lock2 () + 185
00000000012849b0 runtime.startm () + 50
0000000001284fe6 runtime.wakep () + 66
0000000001286c19 runtime.resetspinning () + 59
00000000012872a7 runtime.schedule () + 2c7
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 14 / lwp# 14 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b740) + 10
fffffc7fee1c7782 sem_wait (c15b740) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124cfe5 runtime.lock2 () + 185
000000000124d7ab runtime.lockWithRank () + 2b
000000000124ce34 runtime.lock () + 34
000000000127dd3e runtime.startpanic_m () + 17e
000000000127d9a5 runtime.fatalthrow.func1 () + 45
000000000127d93e runtime.fatalthrow () + 5e
000000000127d6b4 runtime.throw () + 74
0000000001269505 runtime.(*mspan).reportZombies () + 345
000000000126881a runtime.(*sweepLocked).sweep () + 35a
000000000125a18f runtime.(*mcentral).uncacheSpan () + cf
0000000001259694 runtime.(*mcache).releaseAll () + 134
00000000012597a6 runtime.(*mcache).prepareForSweep () + 46
00000000012adb6f runtime.gcMarkTermination.func4.1 () + 2f
000000000128368a runtime.forEachP () + 12a
00000000012adbcd runtime.gcMarkTermination.func4 () + 2d
00000000012b4bb2 runtime.systemstack () + 52
000000000125e435 runtime.gcMarkDone () + 2d5
000000000125f7ac runtime.gcBgMarkWorker () + 30c
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 15 / lwp# 15 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b7c0) + 10
fffffc7fee1c7782 sem_wait (c15b7c0) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 16 / lwp# 16 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b800) + 10
fffffc7fee1c7782 sem_wait (c15b800) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 17 / lwp# 17 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b940) + 10
fffffc7fee1c7782 sem_wait (c15b940) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001285365 runtime.startlockedm () + 85
000000000128706e runtime.schedule () + 8e
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 18 / lwp# 18 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b880) + 10
fffffc7fee1c7782 sem_wait (c15b880) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
0000000001247377 runtime.chanrecv () + 5f7
0000000001246d2b runtime.chanrecv1 () + 2b
00000000043b511e github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1 () + 1de
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 19 / lwp# 19 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b8c0) + 10
fffffc7fee1c7782 sem_wait (c15b8c0) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 20 / lwp# 20 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b840) + 10
fffffc7fee1c7782 sem_wait (c15b840) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
-------------------- thread# 21 / lwp# 21 --------------------
fffffc7fee1e11a7 lwp_park (0, 0, 0)
fffffc7fee1d39a0 sema_wait (c15b900) + 10
fffffc7fee1c7782 sem_wait (c15b900) + 22
00000000012b853a runtime.asmsysvicall6 () + 5a
000000000124d273 runtime.notesleep () + 93
00000000012832b9 runtime.mPark () + 39
0000000001284892 runtime.stopm () + 92
0000000001286047 runtime.findrunnable () + a07
0000000001287277 runtime.schedule () + 297
000000000128788e runtime.park_m () + 18e
00000000012b4b23 runtime.mcall () + 63
000000000125f5b8 runtime.gcBgMarkWorker () + 118
00000000012b6d41 runtime.goexit () + 1
Most interesting to me are thread 1 (which appears to be blocked on a lock while panicking from the Go allocator) and thread 14 (which appears to be blocked on a lock while panicking during garbage collection). Both panicking threads are stuck waiting on runtime locks, which presumably is why the process never finishes panicking and exiting. I've saved a core file, though I'm not sure what we'll be able to do with it, nor what other live state we might be able to gather here. In case it's useful, here are the open files:
$ pfiles 4679
4679: cockroach version --build-tag
Current rlimit: 65536 file descriptors
0: S_IFCHR mode:0620 dev:543,0 ino:727759809 uid:501 gid:7 rdev:169,2
O_RDWR|O_NOCTTY|O_LARGEFILE
/dev/pts/2
offset:37700
1: S_IFIFO mode:0000 dev:540,0 ino:8440 uid:501 gid:10 rdev:0,0
O_RDWR
2: S_IFIFO mode:0000 dev:540,0 ino:8441 uid:501 gid:10 rdev:0,0
O_RDWR
3: S_IFDOOR mode:0444 dev:544,0 ino:48 uid:0 gid:0 rdev:546,0
O_RDONLY|O_LARGEFILE FD_CLOEXEC door to nscd[313]
4: S_IFPORT mode:0000 dev:554,0 uid:501 gid:10 size:0
5: S_IFIFO mode:0000 dev:540,0 ino:8441 uid:501 gid:10 rdev:0,0
O_RDWR FD_CLOEXEC
stdout and stderr are pipes to the Rust test runner, so we won't see the output until/unless I kill this process. In case that doesn't happen for whatever reason, I've also saved a core file of the Rust program, though again I'm not sure what we'll be able to do with that.
I'm not sure what other state will be useful, so I'm going to collect a bunch:
The test program is logging to this file (found by looking at pfiles 4379, as well as by looking at the most recently modified files in the temporary directory):
$ bunyan /dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_external_ip_check_constraints.4379.28.log
[2022-10-26T00:15:28.648515742Z] INFO: test_external_ip_check_constraints/4379 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/.tmpwULUmx
[2022-10-26T00:15:28.648902154Z] INFO: test_external_ip_check_constraints/4379 on ivanova: cockroach: copying from seed directory (/home/dap/omicron-cockroachdb/target/debug/build/nexus-test-utils-294866f3742b6400/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/.tmpwULUmx/data)
[2022-10-26T00:15:28.692127693Z] INFO: test_external_ip_check_constraints/4379 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/.tmpwULUmx/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/.tmpwULUmx/listen-url
$ find /dangerzone/omicron_tmp/.tmpwULUmx
/dangerzone/omicron_tmp/.tmpwULUmx
/dangerzone/omicron_tmp/.tmpwULUmx/cockroachdb_stderr
/dangerzone/omicron_tmp/.tmpwULUmx/cockroachdb_stdout
/dangerzone/omicron_tmp/.tmpwULUmx/data
/dangerzone/omicron_tmp/.tmpwULUmx/data/000006.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/LOCK
/dangerzone/omicron_tmp/.tmpwULUmx/data/000008.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/STORAGE_MIN_VERSION
/dangerzone/omicron_tmp/.tmpwULUmx/data/CURRENT
/dangerzone/omicron_tmp/.tmpwULUmx/data/marker.format-version.000005.006
/dangerzone/omicron_tmp/.tmpwULUmx/data/cockroach.listen-addr
/dangerzone/omicron_tmp/.tmpwULUmx/data/000009.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-stderr.ivanova.dap.2022-10-24T22_41_08Z.016655.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-health.ivanova.dap.2022-10-24T22_41_09Z.016655.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/heap_profiler
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/heap_profiler/memprof.2022-10-24T22_41_18.527.64308488.pprof
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-pebble.ivanova.dap.2022-10-24T22_41_08Z.016655.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/goroutine_dump
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-security.ivanova.dap.2022-10-24T22_41_09Z.016655.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-health.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/inflight_trace_dump
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-sql-schema.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach.ivanova.dap.2022-10-24T22_41_08Z.016655.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-pebble.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-sql-schema.ivanova.dap.2022-10-24T22_41_08Z.016655.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-stderr.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/logs/cockroach-security.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/000007.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/cockroach.advertise-sql-addr
/dangerzone/omicron_tmp/.tmpwULUmx/data/cockroach.sql-addr
/dangerzone/omicron_tmp/.tmpwULUmx/data/temp-dirs-record.txt
/dangerzone/omicron_tmp/.tmpwULUmx/data/cockroach.http-addr
/dangerzone/omicron_tmp/.tmpwULUmx/data/OPTIONS-000003
/dangerzone/omicron_tmp/.tmpwULUmx/data/000004.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/MANIFEST-000001
/dangerzone/omicron_tmp/.tmpwULUmx/data/000005.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/000002.log
/dangerzone/omicron_tmp/.tmpwULUmx/data/cockroach.advertise-addr
/dangerzone/omicron_tmp/.tmpwULUmx/data/auxiliary
/dangerzone/omicron_tmp/.tmpwULUmx/data/marker.manifest.000001.MANIFEST-000001
$ cat /dangerzone/omicron_tmp/.tmpwULUmx/cockroachdb_stderr
$ cat /dangerzone/omicron_tmp/.tmpwULUmx/cockroachdb_stdout
I didn't find anything of particular note in the logs. I think this makes sense because we haven't actually started CockroachDB yet -- this is all copied from the seed directory. But I'll tar up the whole directory while it remains hung, just in case:
$ tar czf issue-1876-while-hung.tgz /dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_external_ip_check_constraints.4379.28.log /dangerzone/omicron_tmp/.tmpwULUmx
I'm about out of ideas for what to collect from the running state. I'm going to try sending SIGABRT to the Go process; I believe the Go runtime dumps stack traces for all goroutines when it receives SIGABRT. I expect at that point we'll also get any stdout/stderr that was emitted.
Well, I killed the process with:
$ kill -ABRT 4679
$
and the test run immediately completed with:
test db::datastore::test::test_external_ip_check_constraints ... FAILED
failures:
---- db::datastore::test::test_session_methods stdout ----
log file: "/dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_session_methods.4379.37.log"
note: configured to log to "/dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_session_methods.4379.37.log"
thread 'db::datastore::test::test_session_methods' panicked at 'called `Result::unwrap()` on an `Err` value: BadVersion { expected: "v22.1.9-dirty", found: Err(error Some(2) when checking CockroachDB version) }', /home/dap/omicron-cockroachdb/test-utils/src/dev/mod.rs:142:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- db::datastore::test::test_external_ip_check_constraints stdout ----
log file: "/dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_external_ip_check_constraints.4379.28.log"
note: configured to log to "/dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_external_ip_check_constraints.4379.28.log"
thread 'db::datastore::test::test_external_ip_check_constraints' panicked at 'called `Result::unwrap()` on an `Err` value: BadVersion { expected: "v22.1.9-dirty", found: Err(error Some(2) when checking CockroachDB version) }', /home/dap/omicron-cockroachdb/test-utils/src/dev/mod.rs:142:42
failures:
db::datastore::test::test_external_ip_check_constraints
db::datastore::test::test_session_methods
test result: FAILED. 83 passed; 2 failed; 0 ignored; 0 measured; 52 filtered out; finished in 64223.65s
error: test failed, to rerun pass `-p omicron-nexus --lib`
real 1070m24.249s
user 1m49.638s
sys 0m50.461s
real 1070m47.894s
user 3m43.586s
sys 1m40.859s
And there's nothing new in the test log. I forgot that we don't dump the stdout/stderr of that command on failure -- we really should.
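For illustration, here's a minimal sketch of what that could look like. This is not the actual omicron test-utils code -- the helper name, error format, and expected-version string are made up -- just a plain std::process::Command invocation that keeps the child's stdout/stderr and includes them in the error:

use std::process::Command;

// Hypothetical helper, not the real omicron version check: run
// `cockroach version --build-tag` and surface the child's output on failure
// instead of discarding it.
fn check_cockroach_version(expected: &str) -> Result<(), String> {
    let output = Command::new("cockroach")
        .args(["version", "--build-tag"])
        .output()
        .map_err(|e| format!("failed to run cockroach: {}", e))?;

    if !output.status.success() {
        // Include whatever the child wrote so a crashed or killed
        // `cockroach version` leaves some evidence in the test failure.
        return Err(format!(
            "cockroach version failed (status {:?}):\nstdout:\n{}\nstderr:\n{}",
            output.status.code(),
            String::from_utf8_lossy(&output.stdout),
            String::from_utf8_lossy(&output.stderr),
        ));
    }

    let found = String::from_utf8_lossy(&output.stdout).trim().to_string();
    if found != expected {
        return Err(format!(
            "bad version: expected {:?}, found {:?}",
            expected, found
        ));
    }
    Ok(())
}

fn main() {
    match check_cockroach_version("v22.1.9-dirty") {
        Ok(()) => println!("version ok"),
        Err(e) => eprintln!("{}", e),
    }
}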
I've also tar'd up the relevant test directories, to the extent that they still exist:
$ tar czf issue-1876-after-kill.tgz "/dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_session_methods.4379.37.log" /dangerzone/omicron_tmp/.tmpzE0ilb /dangerzone/omicron_tmp/omicron_nexus-2954a006aadad259-test_external_ip_check_constraints.4379.28.log /dangerzone/omicron_tmp/.tmpwULUmx
tar: /dangerzone/omicron_tmp/.tmpzE0ilb: No such file or directory
tar: /dangerzone/omicron_tmp/.tmpwULUmx: No such file or directory
The individual core files are too large for GitHub, and it also doesn't support tarballs, so I've put this on our lab machine "catacomb" in /data/staff/core/issues/omicron-1876. I can make these publicly available if anybody outside Oxide wants to have a look.
I'd say this is indeed illumos#15254 (#1146) and possibly a second bug in the Go runtime that I'll file separately.
From the core file: in thread 1, we can find the argument to throw to see what the error was. The throw is in runtime.(*sweepLocked).sweep -- is it one of the failure modes described in #1146? mdb doesn't show function arguments in the stack trace, but we can look at the call to runtime.throw just before the return address runtime.(*sweepLocked).sweep+0x99e shown in the stack trace. The first argument to throw is usually a statically-defined string. Sure enough:
> ::nm ! grep sweepLocked.*sweep
0x00000000012684c0|0x0000000000000cfc|FUNC |LOCL |0x0 |15 |runtime.(*sweepLocked).sweep
0x000000000868a700|0x0000000000000050|OBJT |LOCL |0x0 |18 |runtime.(*sweepLocked).sweep.stkobj
> 0x00000000012684c0+0x99e::dis
runtime.(*sweepLocked).sweep+0x96a: movzwl %ax,%eax
runtime.(*sweepLocked).sweep+0x96d: movq %rax,(%rsp)
runtime.(*sweepLocked).sweep+0x971: call +0x16a4a <runtime.printuint>
runtime.(*sweepLocked).sweep+0x976: call +0x16645 <runtime.printnl>
runtime.(*sweepLocked).sweep+0x97b: nopl 0x0(%rax,%rax)
runtime.(*sweepLocked).sweep+0x980: call +0x163fb <runtime.printunlock>
runtime.(*sweepLocked).sweep+0x985: leaq +0x4e49cdf(%rip),%rax <0x60b2b2b>
runtime.(*sweepLocked).sweep+0x98c: movq %rax,(%rsp)
runtime.(*sweepLocked).sweep+0x990: movq $0x20,0x8(%rsp)
runtime.(*sweepLocked).sweep+0x999: call +0x147e2 <runtime.throw>
runtime.(*sweepLocked).sweep+0x99e: incq %rax
runtime.(*sweepLocked).sweep+0x9a1: movq 0x38(%rbx),%rcx
runtime.(*sweepLocked).sweep+0x9a5: addq $0x7,%rcx
runtime.(*sweepLocked).sweep+0x9a9: shrq $0x3,%rcx
runtime.(*sweepLocked).sweep+0x9ad: cmpq %rcx,%rax
runtime.(*sweepLocked).sweep+0x9b0: jae -0x62c <runtime.(*sweepLocked).sweep+0x38a>
runtime.(*sweepLocked).sweep+0x9b6: movq 0x50(%rbx),%rcx
runtime.(*sweepLocked).sweep+0x9ba: movzbl (%rax,%rcx),%ecx
runtime.(*sweepLocked).sweep+0x9be: movq 0x48(%rbx),%r10
runtime.(*sweepLocked).sweep+0x9c2: movzbl (%rax,%r10),%r10d
runtime.(*sweepLocked).sweep+0x9c7: notl %r10d
The call site is at +0x999 (one instruction before the return address of +0x99e), and shortly before that we load a non-local address into %rax and put it onto the stack, followed by a length of 0x20:
> 0x60b2b2b,0t32/c
0x60b2b2b: sweep increased allocation count
(I cheated and figured out the string length here ahead of time. The actual string is not NUL-terminated in memory, so if we told mdb to print it as a string, we'd get this message run together with whatever unrelated strings follow it.)
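As a quick cross-check on those two stack slots: a Go string is passed as a (data pointer, length) pair, and 0x20 is exactly the length of the message mdb printed. This is just arithmetic, nothing read from the core file:

fn main() {
    // The call site stores the string's address at (%rsp) and 0x20 at
    // 0x8(%rsp); Go passes a string as a (data pointer, length) pair.
    let msg = "sweep increased allocation count";
    assert_eq!(msg.len(), 0x20); // 32 bytes, matching the length argument
    println!("{} -> {:#x} bytes", msg, msg.len());
}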
So far, this looks like one of the variants of #1146. Let's take a look at this span. I loaded the same ::typedef data that I used in #1146, then grabbed the argument to sweep() from the stack to print it as an mspan. (Note that while doing this, I think I found a bug in pstack: it prints the frame pointers incorrectly for some reason. The pstack output shows the frame pointer for sweep() as 0000000001268e5e, while mdb shows it as 000000c000b03d20. The latter makes sense; the former doesn't.) The argument wasn't quite where I expected it to be, so I dumped the memory around the frame pointer:
> 000000c000b03d20-0t64,16/nap
0xc000b03ce0:
0xc000b03ce0: 0xfffffc7fe96e6130
0xc000b03ce8: 0
0xc000b03cf0: 0x160125999b
0xc000b03cf8: 0xfffffc7fee95acb8
0xc000b03d00: 0
0xc000b03d08: 0
0xc000b03d10: 0
0xc000b03d18: 7
0xc000b03d20: 0xc000b03da0
0xc000b03d28: runtime.(*mcentral).cacheSpan+0x4a5
0xc000b03d30: 0xc000b03d60
0xc000b03d38: 0xfffffc7fe96e6001
0xc000b03d40: 0xb00000000
0xc000b03d48: 0x101259133
0xc000b03d50: 0x59
0xc000b03d58: 0x10000000a
0xc000b03d60: 0xfffffc7fe96e60b0
0xc000b03d68: 0xfffffc7feed3a300
0xc000b03d70: 0xfffffc7fe96e60b0
0xc000b03d78: runtime.(*mcentral).uncacheSpan+0xb5
0xc000b03d80: cockroach`runtime.mheap_+0x11348
0xc000b03d88: 0xfffffc7feefe0760
I expected the argument to be one word ahead of the return address (the one into cacheSpan), but what's there isn't a sweepLocked. A few pointers ahead of that, though, is 0xfffffc7fee95acb8, which does look like a valid mspan_t:
> 0xfffffc7fee95acb8::print -at mspan_t
fffffc7fee95acb8 mspan_t {
fffffc7fee95acb8 void *next = 0
fffffc7fee95acc0 void *prev = 0
fffffc7fee95acc8 void *list = 0
fffffc7fee95acd0 uintptr_t startAddr = 0xc000a74000
fffffc7fee95acd8 uintptr_t npages = 0x1
fffffc7fee95ace0 void *manualFreeList = 0
fffffc7fee95ace8 uintptr_t freeindex = 0x1
fffffc7fee95acf0 uintptr_t nelems = 0x9
fffffc7fee95acf8 uint64_t allocCache = 0x7ffffffffffffff2
fffffc7fee95ad00 void *allocBits = 0xfffffc7fef0c3110
fffffc7fee95ad08 void *gcmarkBits = 0xfffffc7feebe04b8
fffffc7fee95ad10 uint32_t sweepgen = 0xd
fffffc7fee95ad14 uint32_t divMul = 0x492493
fffffc7fee95ad18 uint16_t allocCount = 0x4
fffffc7fee95ad1a uint8_t spanclass = 0x3e
fffffc7fee95ad1b uint8_t state = 0x1
fffffc7fee95ad1c uint8_t needzero = 0x1
fffffc7fee95ad1e uint16_t allocCountBeforeCache = 0
fffffc7fee95ad20 uintptr_t elemsize = 0x380
fffffc7fee95ad28 uintptr_t limit = 0xc000a75f80
}
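For reference, here is a hypothetical Rust mirror of the mspan_t layout I'm using. The fields and offsets come straight from the ::print output above, the padding byte at +0x65 is inferred, and this is just a reading aid, not the actual ::typedef input:

// Hypothetical #[repr(C)] mirror of the mspan_t layout printed above;
// offsets are relative to the base address fffffc7fee95acb8.
#[repr(C)]
struct MspanT {
    next: u64,                     // +0x00
    prev: u64,                     // +0x08
    list: u64,                     // +0x10
    start_addr: u64,               // +0x18
    npages: u64,                   // +0x20
    manual_free_list: u64,         // +0x28
    freeindex: u64,                // +0x30
    nelems: u64,                   // +0x38
    alloc_cache: u64,              // +0x40
    alloc_bits: u64,               // +0x48  checked for corruption below
    gcmark_bits: u64,              // +0x50  checked for corruption below
    sweepgen: u32,                 // +0x58
    div_mul: u32,                  // +0x5c
    alloc_count: u16,              // +0x60
    spanclass: u8,                 // +0x62
    state: u8,                     // +0x63
    needzero: u8,                  // +0x64
    _pad: u8,                      // +0x65 (not shown by ::print)
    alloc_count_before_cache: u16, // +0x66
    elemsize: u64,                 // +0x68
    limit: u64,                    // +0x70
}

fn main() {
    // Sanity-check that the repr(C) layout matches the offsets above.
    assert_eq!(std::mem::size_of::<MspanT>(), 0x78);
    assert_eq!(std::mem::offset_of!(MspanT, alloc_bits), 0x48);
    assert_eq!(std::mem::offset_of!(MspanT, gcmark_bits), 0x50);
    assert_eq!(std::mem::offset_of!(MspanT, elemsize), 0x68);
}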
So: does allocBits look corrupted by the same root cause as #1146? Recall that the signature is a repeating 32-byte pattern, the last 16 bytes of which are always zero, and at least some of the leading 16 bytes are not zero.
> 0xfffffc7fef0c3110,0t96::dump -g8 -e
fffffc7fef0c3110: 000000000000001a 0000000000000000
fffffc7fef0c3120: 0000000000000001 0000000000000000
fffffc7fef0c3130: ffffffffffffffff 6dda74db7555556d
fffffc7fef0c3140: 000002aaab69935a 0000000000000000
fffffc7fef0c3150: 0000000000000000 0000000000000000
fffffc7fef0c3160: 100218000009996a 0388000002001080
Not so much. What about gcmarkBits?
> fffffc7feebe04b8,0t96::dump -g 8 -e
fffffc7feebe04b8: 0000000000000000 3fd5feb0ebf0b599
fffffc7feebe04c8: 0000000000000000 0000000000000000
fffffc7feebe04d8: 0000000000000000 3fd5feb0ebf0b599
fffffc7feebe04e8: 0000000000000000 0000000000000000
fffffc7feebe04f8: 0000000000000000 3fd5feb0ebf0b599
fffffc7feebe0508: 0000000000000000 0000000000000000
Bingo. How long is this run? I used my new tool:
$ cargo run --release 2>/dev/null < $HOME/issues/omicron-1876-from-catacomb/memory.txt
found run of 37 at addr c000c22420:
found run of 2034 at addr fffffc7feebe01c0:
Sure enough, that second hit refers to a similarly-sized run (it covers most of the 64 KiB mapping) that starts shortly before the address in question.
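The tool itself isn't included here, but a minimal sketch of that kind of scanner might look like the following. It assumes input in the same addr: qword qword format as the ::dump -g8 -e output above, treats each pair of 16-byte lines as one 32-byte block, and reports runs of consecutive blocks matching the signature (some nonzero bytes in the leading 16, all zeros in the trailing 16). The real tool's input handling and run-length units may differ.

use std::io::{self, BufRead};

// Sketch of a scanner for the corruption signature: repeating 32-byte blocks
// whose trailing 16 bytes are zero and whose leading 16 bytes are not all
// zero.  Reads `addr: qword qword` lines on stdin; for simplicity it ignores
// gaps between mappings.
fn main() {
    // Parse each line into (address, first qword, second qword).
    let stdin = io::stdin();
    let mut lines16: Vec<(u64, u64, u64)> = Vec::new();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        let mut fields = line.split_whitespace();
        let addr = match fields
            .next()
            .and_then(|a| u64::from_str_radix(a.trim_end_matches(':'), 16).ok())
        {
            Some(a) => a,
            None => continue, // skip blank or non-dump lines
        };
        let vals: Vec<u64> =
            fields.filter_map(|f| u64::from_str_radix(f, 16).ok()).collect();
        if vals.len() == 2 {
            lines16.push((addr, vals[0], vals[1]));
        }
    }

    // Walk the dump two 16-byte lines (one 32-byte block) at a time.
    let mut run_start: Option<u64> = None;
    let mut run_len = 0;
    let mut i = 0;
    while i + 1 < lines16.len() {
        let (addr, lo0, lo1) = lines16[i];
        let (_, hi0, hi1) = lines16[i + 1];
        let matches = (lo0 != 0 || lo1 != 0) && hi0 == 0 && hi1 == 0;
        if matches {
            if run_start.is_none() {
                run_start = Some(addr);
            }
            run_len += 1;
            i += 2;
        } else {
            if run_len > 1 {
                if let Some(start) = run_start {
                    println!("found run of {} at addr {:x}", run_len, start);
                }
            }
            run_start = None;
            run_len = 0;
            i += 1;
        }
    }
    if run_len > 1 {
        if let Some(start) = run_start {
            println!("found run of {} at addr {:x}", run_len, start);
        }
    }
}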
This has all the hallmarks of illumos#15254 and is consistent with the analysis in #1146 for the "sweep increased allocation count" failure mode. So I think it's fair to say we hit this bug here, at which point all bets are off.
Remember that we also had thread 14 panicking from reportZombies -- another failure mode we've analyzed from the same problem. Again, looking through the stack (this time in the reportZombies frame), we find a valid-looking mspan (in the right place this time) whose allocBits and gcmarkBits are both in that same region:
> fffffc7fe65ffd10,0t4/np
0xfffffc7fe65ffd10:
0xfffffc7fe65ffdf8
runtime.(*sweepLocked).sweep+0x35a
0xfffffc7fe96e53f0
0xfffffc7fef06bba0
> 0xfffffc7fe96e53f0::print -at mspan_t
fffffc7fe96e53f0 mspan_t {
fffffc7fe96e53f0 void *next = 0
fffffc7fe96e53f8 void *prev = 0
fffffc7fe96e5400 void *list = 0
fffffc7fe96e5408 uintptr_t startAddr = 0xc000ff0000
fffffc7fe96e5410 uintptr_t npages = 0x1
fffffc7fe96e5418 void *manualFreeList = 0
fffffc7fe96e5420 uintptr_t freeindex = 0x1
fffffc7fe96e5428 uintptr_t nelems = 0x38
fffffc7fe96e5430 uint64_t allocCache = 0x7fffffffffffffff
fffffc7fe96e5438 void *allocBits = 0xfffffc7feebe01c8
fffffc7fe96e5440 void *gcmarkBits = 0xfffffc7feebe01c0
fffffc7fe96e5448 uint32_t sweepgen = 0x9
fffffc7fe96e544c uint32_t divMul = 0x1c71c72
fffffc7fe96e5450 uint16_t allocCount = 0x1
fffffc7fe96e5452 uint8_t spanclass = 0x16
fffffc7fe96e5453 uint8_t state = 0x1
fffffc7fe96e5454 uint8_t needzero = 0
fffffc7fe96e5456 uint16_t allocCountBeforeCache = 0
fffffc7fe96e5458 uintptr_t elemsize = 0x90
fffffc7fe96e5460 uintptr_t limit = 0xc000ff1f80
}
So we appear to be panicking twice in this same Go process in two different ways resulting from the %ymm0 corruption issue. I have not dug into the hang itself, but at this point we're far enough off the rails that I'm not sure how interesting the rest of it is. I'm filing a Go bug in case the Go team wants to dig further into the hang.
Filed golang/go#57420.
Also filed https://www.illumos.org/issues/15276 for the pstack issue.
While trying to reproduce #1223 with CockroachDB 22.1.9, I ran into a case where a test hung because the cockroach version command that we run during the test suite itself hung, apparently while panicking. Details coming.