oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Deleted instances remain as running in CRDB after propolis zone is already removed #3207

Open askfongjojo opened 1 year ago

askfongjojo commented 1 year ago

This happened when I used a script to invoke oxide instance stop for 10 instances successively in a loop, and then deleted them with the CLI in another loop. The requests came back with success responses (ignore the two destroyed instances - they will be covered in another ticket):

pisces-2:tests angela$ for i in {1..12}; do oxide instance delete --project try --instance sysbench-mysql-$i; done
success
()
success
()
success
()
success
()
success
()
success
()
success
()
success
()
success
()
error
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "730032c9-f928-4a5b-9244-bc36c14096ef", "content-length": "158", "date": "Tue, 23 May 2023 22:41:07 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "instance cannot be deleted in state \"destroyed\"", request_id: "730032c9-f928-4a5b-9244-bc36c14096ef" }
error
Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "dd8a7e48-c584-4fa8-afc8-f4fa7c0c9c7f", "content-length": "160", "date": "Tue, 23 May 2023 22:41:07 GMT"}; value: Error { error_code: Some("ObjectNotFound"), message: "not found: instance with name \"sysbench-mysql-11\"", request_id: "dd8a7e48-c584-4fa8-afc8-f4fa7c0c9c7f" }
success
()

However, these instances are now "shell" records in the instance table - the propolis zones are already gone from the sled:

root@[fd00:1122:3344:106::6]:32221/omicron> select name, state, id, active_sled_id, active_propolis_id from instance where name like 'sysbench%' and state = 'running';
        name        |   state   |                  id                  |            active_sled_id            |          active_propolis_id
--------------------+-----------+--------------------------------------+--------------------------------------+---------------------------------------
  sysbench-mysql-5  | running   | 0dcd991a-36de-4a15-a999-a6fda4efcc92 | aa7c82d9-6e59-406e-b1e3-3648890b4bec | 21078e19-c1e0-4aeb-98aa-5c337352353d
  sysbench-mysql-2  | running   | 2ab5e713-5d25-486a-9083-d5b7cd1e870d | 4f4d89de-b2ba-4a96-b585-1ef2b7a31942 | 06995cdb-54cc-4923-aea8-0fc6f617277c
  sysbench-mysql-9  | running   | 32643226-e283-4fb6-8789-627f6d7383c9 | 434d5f7d-4dd7-42d2-9637-66406ad8d90e | bfe65687-6506-49f0-aa99-0dfc4d03134d
  sysbench-mysql-3  | running   | 4e63cbb7-a36a-4407-a04e-e5a54902b66d | 434d5f7d-4dd7-42d2-9637-66406ad8d90e | 04a7822b-696a-46eb-9a05-001db70b5ab8
  sysbench-mysql-4  | running   | 8ec571b6-1be2-4d56-90aa-0119db534e26 | 434d5f7d-4dd7-42d2-9637-66406ad8d90e | 4b41405d-e38b-444e-b734-39a1859ecfa2
  sysbench-mysql-6  | running   | b44622c9-fb61-4be1-b396-08c9525a2f71 | ec2016cd-6614-45bc-8108-50d5fe04947f | da6443bc-3e3c-4482-ac01-47c0f28f7d41
  sysbench-mysql-8  | running   | baad7458-72d7-4887-9672-b0ced70165de | 72208f88-1b50-4415-8106-e7f1be075f72 | 59846b67-ee59-43ba-bc15-89de88669b51
  sysbench-mysql-12 | running   | caac0098-45f3-4d41-b93c-40622a3b80f7 | 7430cdc2-5bbc-4a19-83af-1821192d787b | 26e0f80e-c8d0-4787-a47c-95b8d0147c62
  sysbench-mysql-7  | running   | eeb46abd-ab90-46e8-ad3e-20a45975bbbf | aa7c82d9-6e59-406e-b1e3-3648890b4bec | 9e88dbaa-822c-425b-b566-6b843956a4c3
  sysbench-mysql-1  | running   | f908e4dd-f6f9-4bcc-8be0-38226b02de60 | 7430cdc2-5bbc-4a19-83af-1821192d787b | 5f00ab9f-5c68-4e71-8ed9-95b46c664924
(12 rows)

I looked into one of the instances, sysbench-mysql-7, which is located on sled gc11. It was set to stopped at 22:40:03:

22:40:03.053Z INFO 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): instance updated by sled agent
    instance_id = eeb46abd-ab90-46e8-ad3e-20a45975bbbf
    new_state = stopped
    propolis_id = 9e88dbaa-822c-425b-b566-6b843956a4c3
22:40:03.057Z DEBG 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): received new runtime state from sled agent
    instance_id = eeb46abd-ab90-46e8-ad3e-20a45975bbbf
    runtime_state = InstanceRuntimeState { run_state: Stopped, sled_id: aa7c82d9-6e59-406e-b1e3-3648890b4bec, propolis_id: 9e88dbaa-822c-425b-b566-6b843956a4c3, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:103::24]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(4), memory: ByteCount(17179869184), hostname: "sysbench-mysql7", gen: Generation(8), time_updated: 2023-05-23T22:40:03.055815407Z }
22:40:07.620Z INFO 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): instance updated by sled agent
    instance_id = eeb46abd-ab90-46e8-ad3e-20a45975bbbf
    new_state = stopped
    propolis_id = 9e88dbaa-822c-425b-b566-6b843956a4c3

The zone was removed at 22:40:08 according to the sled-agent log for gc11:

21:57:38.679Z INFO SledAgent (InstanceManager): Started propolis in zone: oxz_propolis-server_9e88dbaa-822c-425b-b566-6b843956a4c3
    instance_id = eeb46abd-ab90-46e8-ad3e-20a45975bbbf
22:40:08.497Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_9e88dbaa-822c-425b-b566-6b843956a4c3
    instance_id = eeb46abd-ab90-46e8-ad3e-20a45975bbbf

but the delete saga was not created until 44 seconds later, at 22:40:52, presumably because Nexus was busy handling deletion requests for other instances:

22:40:52.231Z DEBG 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): authorize result
    action = Delete
    actor = Some(Actor::SiloUser { silo_user_id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000, .. })
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    local_addr = 172.30.1.5:443
    method = DELETE
    remote_addr = 172.20.17.42:61913
    req_id = d1b9476f-784b-4bc0-9d59-c551592ac7a2
    resource = Instance { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: 549f44d7-b0cd-4532-b94d-9cae7baab5d3, lookup_type: ByName("try") }, key: eeb46abd-ab90-46e8-ad3e-20a45975bbbf, lookup_type: ByName("sysbench-mysql-7") }
    result = Ok(())
    uri = /v1/instances/sysbench-mysql-7?project=try
22:40:52.231Z DEBG 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): saga create
    dag = {"end_node":9,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[3,4,null],[4,5,null],[5,6,null],[6,7,null],[8,0,null],[7,9,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_delete.v2p_ensure_undo","label":"V2PEnsureUndo","name":"v2p_ensure_undo"}},{"Action":{"action_name":"instance_delete.v2p_ensure","label":"V2PEnsure","name":"v2p_ensure"}},{"Action":{"action_name":"instance_delete.delete_asic_configuration","label":"DeleteAsicConfiguration","name":"delete_asic_configuration"}},{"Action":{"action_name":"instance_delete.instance_delete_record","label":"InstanceDeleteRecord","name":"no_result1"}},{"Action":{"action_name":"instance_delete.delete_network_interfaces","label":"DeleteNetworkInterfaces","name":"no_result2"}},{"Action":{"action_name":"instance_delete.deallocate_external_ip","label":"DeallocateExternalIp","name":"no_result3"}},{"Action":{"action_name":"instance_delete.virtual_resources_account","label":"VirtualResourcesAccount","name":"no_result4"}},{"Action":{"action_name":"instance_delete.sled_resources_account","label":"SledResourcesAccount","name":"no_result5"}},{"Start":{"params":{"authz_instance":{"key":"eeb46abd-ab90-46e8-ad3e-20a45975bbbf","lookup_type":{"ByName":"sysbench-mysql-7"},"parent":{"key":"549f44d7-b0cd-4532-b94d-9cae7baab5d3","lookup_type":{"ByName":"try"},"parent":{"key":"001de000-5110-4000-8000-000000000000","lookup_type":{"ById":"001de000-5110-4000-8000-000000000000"},"parent":null}}},"instance":{"identity":{"description":"jammy load generator","id":"eeb46abd-ab90-46e8-ad3e-20a45975bbbf","name":"sysbench-mysql-7","time_created":"2023-05-23T21:57:24.310610Z","time_deleted":null,"time_modified":"2023-05-23T21:57:24.310610Z"},"project_id":"549f44d7-b0cd-4532-b94d-9cae7baab5d3","runtime_state":{"dst_propolis_id":null,"gen":8,"hostname":"sysbench-mysql7","memory":17179869184,"migration_id":null,"ncpus":4,"propolis_gen":2,"propolis_id":"9e88dbaa-822c-425b-b566-6b843956a4c3","propolis_ip":"fd00:1122:3344:103::24/128","sled_id":"aa7c82d9-6e59-406e-b1e3-3648890b4bec","state":"stopped","time_updated":"2023-05-23T22:40:03.055815Z"},...
22:40:54.908Z DEBG 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): authorize result
    action = Delete
    actor = Some(Actor::SiloUser { silo_user_id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000, .. })
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    resource = Instance { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: 549f44d7-b0cd-4532-b94d-9cae7baab5d3, lookup_type: ByName("try") }, key: eeb46abd-ab90-46e8-ad3e-20a45975bbbf, lookup_type: ByName("sysbench-mysql-7") }
    result = Ok(())
    saga_id = 5a04bdd4-7936-4bbf-a3ba-df1a54a10518
    saga_name = instance-delete
    saga_node = InstanceDeleteRecord
22:40:54.932Z DEBG 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): authorize result
    action = Modify
    actor = Some(Actor::SiloUser { silo_user_id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000, .. })
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    resource = Instance { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: 549f44d7-b0cd-4532-b94d-9cae7baab5d3, lookup_type: ByName("try") }, key: eeb46abd-ab90-46e8-ad3e-20a45975bbbf, lookup_type: ByName("sysbench-mysql-7") }
    result = Ok(())
    saga_id = 5a04bdd4-7936-4bbf-a3ba-df1a54a10518
    saga_name = instance-delete
    saga_node = DeleteNetworkInterfaces
22:40:54.957Z DEBG 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): authorize result
    action = Read
    actor = Some(Actor::SiloUser { silo_user_id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000, .. })
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    resource = Instance { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: 549f44d7-b0cd-4532-b94d-9cae7baab5d3, lookup_type: ByName("try") }, key: eeb46abd-ab90-46e8-ad3e-20a45975bbbf, lookup_type: ByName("sysbench-mysql-7") }
    result = Ok(())
    saga_id = 5a04bdd4-7936-4bbf-a3ba-df1a54a10518
    saga_name = instance-delete
    saga_node = SledResourcesAccount
22:40:54.966Z INFO 496eabae-738d-4497-a183-1b75bf912c7c (ServerContext): request completed
    local_addr = 172.30.1.5:443
    method = DELETE
    remote_addr = 172.20.17.42:61913
    req_id = d1b9476f-784b-4bc0-9d59-c551592ac7a2
    response_code = 204
    uri = /v1/instances/sysbench-mysql-7?project=try

I didn't find any saga error in the log but the saga also doesn't exist in the database:

root@[fd00:1122:3344:106::6]:32221/omicron> select * from saga where id = '5a04bdd4-7936-4bbf-a3ba-df1a54a10518';
  id | creator | time_created | name | saga_dag | saga_state | current_sec | adopt_generation | adopt_time
-----+---------+--------------+------+----------+------------+-------------+------------------+-------------
(0 rows)

At 22:42:40, I tried to recreate a VM with the same name, but that failed with an object-already-exists error. These VMs are now ghost records that cannot be removed: they have no backend, so they can't be stopped or deleted.

The complete nexus log is located in catacomb.eng.oxide.computer:/data/staff/dogfood/may-23/.

gjcolombo commented 1 year ago

The instance state transitions above and in the logs look more or less correct to me, although I'm similarly at a loss as to why the saga table query didn't return the deletion saga.

It's currently expected that an Instance record will continue to have an "active" sled ID and Propolis ID even when the Propolis zone has been torn down; #2315 proposes to change that behavior. It otherwise looks like the system behaved as intended: Nexus eventually moved the instance to Stopped, and the deletion saga successfully ran to completion. After that point the instance's time_deleted should have been set, and any attempt to query the instance should have shown it as deleted.

The saga execution log for the saga that tried to re-create the instance, on the other hand, has this suspicious-looking entry:

{"msg":"recording saga event","v":0,"name":"nexus","level":20,"time":"2023-05-23T15:42:40.371974129-07:00","hostname":"oxz_nexus","pid":4365,"component":"SecStore","component":"nexus","component":"ServerContext","name":"496eabae-738d-4497-a183-1b75bf912c7c","event_type":"Failed(ActionFailed { source_error: Object {\"ObjectAlreadyExists\": Object {\"object_name\": String(\"sysbench-mysql-7\"), \"type_name\": String(\"disk\")}} })","node_id":"2","saga_id":"dec6089b-5291-42ca-ba84-6874c2dd44a0"}

This suggests the already-extant object is a disk named sysbench-mysql-7 and not an instance. I don't know of a mechanism to ask Nexus to delete an instance's attached disks on instance deletion, so I think this is expected if the two instance create requests both asked to create a disk with this name.

@askfongjojo What mechanism did you use to try to re-create the instance? (Nexus doesn't appear to log the bodies of incoming HTTP requests, and I can't quite work out what value the "disks" parameter to the create request had in this case.) The instance creation API allows callers to specify whether they want to create a new disk or merely attach an existing one (in the InstanceDiskAttachment component); if you asked to "create" the disk both times, switching the re-creation attempt's disk request type to "attach" might resolve the issue.

askfongjojo commented 1 year ago

This suggests the already-extant object is a disk named sysbench-mysql-7 and not an instance. I don't know of a mechanism to ask Nexus to delete an instance's attached disks on instance deletion, so I think this is expected if the two instance create requests both asked to create a disk with this name.

Yeah, this can be confusing because I gave the disk the same name as the instance. The script I used creates both the disk and the instance, and I just reran it as-is, expecting the disk creation to fail but the instance creation to succeed. This is something I did many times last week with a loop count of 3; the only difference this time is that I used 12:

#!/bin/bash

export OXIDE_HOST=http://172.20.26.2/
export OXIDE_TOKEN=$1

if [[ -z $2 ]]; then
  app=sysbench-mysql
else
  app=$2
fi

for i in {1..12}; do
oxide api '/v1/disks?project=try' --method POST --input - <<EOF
  {
    "name": "$app-$i",
    "description": "ubuntu 22.04 mysql instance $i",
    "size": $(( 1024*1024*1024*100 )),
    "disk_source": {
      "type": "image",
      "image_id": "8ce4e6e9-afa6-4a6b-ab67-71d720655047"
    }
  }  
EOF

oxide api '/v1/instances?project=try' --method POST --input - <<EOF
{
  "name": "$app-$i",
  "description": "jammy load generator",
  "hostname": "$app$i",
  "ncpus": 4,
  "memory": $(( 1024*1024*1024*16 )),
  "disks": [
    {
      "type": "attach",
      "name": "$app-$i"
    }
  ],
  "network_interfaces": {
    "type": "default"
  },
  "external_ips": [
    {
      "type": "ephemeral",
      "pool_name": "default"
    }
  ],
  "user_data": "$(cat test.yaml | base64)",
  "start": true
}
EOF
done

The time_deleted is NULL for this instance, so the saga either strangely declared itself done without updating the database, or somehow set the instance state back to running:

root@[fd00:1122:3344:106::6]:32221/omicron> select * from instance where name = 'sysbench-mysql-7';
                   id                  |       name       |     description      |         time_created         |        time_modified         | time_deleted |              project_id              |                                                                                                                                                                                                                                                                                                                                                                                                                           user_data                                                                                                                                                                                                                                                                                                                                                                                                                           |  state  |      time_state_updated       | state_generation |            active_sled_id            |          active_propolis_id          |   active_propolis_ip   | target_propolis_id | migration_id | propolis_generation | ncpus |   memory    |    hostname
---------------------------------------+------------------+----------------------+------------------------------+------------------------------+--------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------------------------+------------------+--------------------------------------+--------------------------------------+------------------------+--------------------+--------------+---------------------+-------+-------------+------------------
  eeb46abd-ab90-46e8-ad3e-20a45975bbbf | sysbench-mysql-7 | jammy load generator | 2023-05-23 21:57:24.31061+00 | 2023-05-23 21:57:24.31061+00 | NULL         | 549f44d7-b0cd-4532-b94d-9cae7baab5d3 | #cloud-config\nmanage_etc_hosts: true\nruncmd:\n- \'curl https://gist.githubusercontent.com/askfongjojo/119fee7e6f53a43933342763b3428c04/raw/903cf93c021910545625377a6eb9a326d0869531/sysbench-userdata | /bin/bash -s\'\n- \'curl -o /home/ubuntu/sysbench-mysql-test.sh https://gist.githubusercontent.com/askfongjojo/6632a3f110d1189c30ce2190ff93e46e/raw/b50fdf52fb92fe970c50e4c4aad50d210065472b/sysbench-mysql-test\'\n- chmod +x /home/ubuntu/sysbench-mysql-test.sh\n- chown ubuntu:ubuntu /home/ubuntu/sysbench-mysql-test.sh\n- [ sh, -xc, "nohup /usr/share/sysbench/tpcc.lua --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=password --mysql-db=sbtest --time=1800 --threads=8 --report-interval=60 --tables=3 --scale=10 --db-driver=mysql prepare" ]\n- echo \'********** custom setup completed **********\'\n | running | 2023-05-23 21:57:44.642639+00 |                4 | aa7c82d9-6e59-406e-b1e3-3648890b4bec | 9e88dbaa-822c-425b-b566-6b843956a4c3 | fd00:1122:3344:103::24 | NULL               | NULL         |                   1 |     4 | 17179869184 | sysbench-mysql7
(1 row)

The real problem is that there is no way to stop the instance to make any changes.

askfongjojo commented 1 year ago

Ah, I realized my bug description was unclear about the issue. I've updated it to describe the split-brain problem.

gjcolombo commented 1 year ago

Thanks for the extra details! I also realized just now that I missed the Nexus log lines that show that there was an instance-already-exists conflict when trying to re-create sysbench-mysql-7. Mea culpa for that one; I'm not sure why I concluded the disk conflict was the interesting bit.

CRDB's apparent behavior here scares me. The instance record in your comment here has a state_generation of 4, which is what I'd expect for a brand-new instance that's running for the very first time and hasn't changed state since it began to run. But the logs indicate that Nexus thinks this instance is stopped and that it successfully wrote that state change back to the database.

If I'm understanding things right, that means we've got two separate cases where Nexus recorded that it did something (wrote 'stopping'/'stopped' states for the instance, executed an instance delete saga) that should have been visible in CRDB, and two separate lines of observation (manual queries of CRDB, Nexus thinking the instance isn't deleted) that indicate that those writes aren't visible to subsequent reads. I can't imagine anything is going to work if CRDB's not persisting things Nexus thinks it's persisting.

If all this is correct, I think the next step here would be to see if the CRDB zone has logs or other context that indicate that it's having a problem.

gjcolombo commented 1 year ago

I have functionally zero CockroachDB expertise but do see the following unusual events in its logs from around the time the instance re-creation attempt failed:

W230523 22:42:36.866723 938 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 297  pusher ‹6a43640d›: finished waiting after 329.98s for pushee ‹fea211f7›
W230523 22:42:36.866769 3896 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 299  pusher ‹1786d14a›: finished waiting after 329.98s for pushee ‹fea211f7›
W230523 22:42:36.866755 1953 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 298  pusher ‹8dd4557c›: finished waiting after 330.03s for pushee ‹fea211f7›
W230523 22:42:36.866788 1901 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 300  pusher ‹3f1e1039›: finished waiting after 320.01s for pushee ‹fea211f7›
W230523 22:42:36.866808 1965 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 301  pusher ‹64676875›: finished waiting after 320.00s for pushee ‹fea211f7›
W230523 22:42:36.866838 1967 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 302  pusher ‹94501441›: finished waiting after 320.01s for pushee ‹fea211f7›
W230523 22:42:36.866830 1950 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 303  pusher ‹5b00071a›: finished waiting after 320.01s for pushee ‹fea211f7›
W230523 22:42:36.866883 3802 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 304  pusher ‹d7275bcf›: finished waiting after 308.83s for pushee ‹fea211f7›
W230523 22:42:36.866892 1425 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 305  pusher ‹97ded7a6›: finished waiting after 320.00s for pushee ‹fea211f7›
W230523 22:42:36.866978 1967 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 306  slow range RPC: have been waiting 320.01s (1 attempts) for RPC PushTxn(94501441->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867010 1953 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 308  slow range RPC: have been waiting 330.03s (1 attempts) for RPC PushTxn(8dd4557c->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.866994 1950 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 309  slow range RPC: have been waiting 320.01s (1 attempts) for RPC PushTxn(5b00071a->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867045 3896 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 310  slow range RPC: have been waiting 329.98s (1 attempts) for RPC PushTxn(1786d14a->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867004 3802 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 307  slow range RPC: have been waiting 308.83s (1 attempts) for RPC PushTxn(d7275bcf->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.866999 1965 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 311  slow range RPC: have been waiting 320.00s (1 attempts) for RPC PushTxn(64676875->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.866992 938 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 312  slow range RPC: have been waiting 329.98s (1 attempts) for RPC PushTxn(6a43640d->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867101 1425 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 313  slow range RPC: have been waiting 320.00s (1 attempts) for RPC PushTxn(97ded7a6->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867113 1901 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 314  slow range RPC: have been waiting 320.01s (1 attempts) for RPC PushTxn(3f1e1039->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.873449 1967 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:60196,user=root] 315  slow range RPC: have been waiting 320.12s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 94501441], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.873921 1953 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:61138,user=root] 316  slow range RPC: have been waiting 330.09s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 8dd4557c], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.876522 3896 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:46871,user=root] 317  slow range RPC: have been waiting 330.09s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 1786d14a], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.876781 1901 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:33865,user=root] 318  slow range RPC: have been waiting 320.12s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 3f1e1039], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.876954 3802 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:44855,user=root] 319  slow range RPC: have been waiting 308.95s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: d7275bcf], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.877093 1950 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:55201,user=root] 320  slow range RPC: have been waiting 320.13s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 5b00071a], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.878565 938 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:41806,user=root] 321  slow range RPC: have been waiting 330.09s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 6a43640d], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.880642 1965 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:40753,user=root] 322  slow range RPC: have been waiting 320.11s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 64676875], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.882630 1425 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:47622,user=root] 323  slow range RPC: have been waiting 320.11s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 97ded7a6] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: WriteTooOldError: write for key /Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0 at timestamp 1684881426.741633018,0 too old; wrote at 1684881753.880126241,3: "sql txn" meta={id=97ded7a6 key=/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0 pri=0.10225726 epo=0 ts=1684881753.880126241,3 min=1684881426.741633018,0 seq=0} lock=true stat=PENDING rts=1684881426.741633018,0 wto=false gul=1684881427.241633018,0)›

To be clear, I have no idea what any of this really means or what was being waited on. But it looks like this is one of only two time ranges in the logs where "slow range RPC" errors occurred (the other was around 21:01:43Z), and the 320-second interval preceding these messages would have included the times where we tried to stop and delete these instances.

askfongjojo commented 1 year ago

Yes, the saga table recorded nothing for a whole 6-minute window during which there were many orchestration activities:

root@[fd00:1122:3344:106::6]:32221/omicron> select time_created, adopt_time, saga_state from saga where date_trunc('second', "time_created")>= '2023-05-23 22:35:00' order by 1;
          time_created          |          adopt_time           | saga_state
--------------------------------+-------------------------------+-------------
  2023-05-23 22:36:13.380571+00 | 2023-05-23 22:36:13.380571+00 | done
  2023-05-23 22:42:37.581953+00 | 2023-05-23 22:42:37.581953+00 | done
  2023-05-23 22:42:37.900628+00 | 2023-05-23 22:42:37.900628+00 | done
  2023-05-23 22:42:38.25941+00  | 2023-05-23 22:42:38.25941+00  | done
 ...

The CRDB SMF service log showed that it started again at 22:35:41, though nothing in it indicated why. The timestamps are very confusing, though, as if it uses the local TZ in one place and UTC timestamps in another.

CockroachDB node starting at 2023-05-23 16:54:12.092161156 +0000 UTC (took 1.5s)
build:               OSS v22.1.9-dirty @ 2022/10/26 21:17:46 (go1.17.13)
webui:               http://[fd00:1122:3344:106::6]:8080
sql:                 postgresql://root@[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable
sql (JDBC):          jdbc:postgresql://[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable&user=root
RPC client flags:    /opt/oxide/cockroachdb/bin/cockroach <client cmd> --host=[fd00:1122:3344:106::6]:32221 --insecure
logs:                /data/logs
temp dir:            /data/cockroach-temp598479943
external I/O path:   /data/extern
store[0]:            path=/data
storage engine:      pebble
clusterID:           3ca480c2-56af-46cf-a6d0-72dfebc97ba6
status:              initialized new cluster
nodeID:              1
initiating graceful shutdown of server
server drained and shutdown completed
[ May 23 15:35:40 Stopping because all processes in service exited. ]
[ May 23 15:35:40 Executing stop method (:kill). ]
[ May 23 15:35:40 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
+ set -o errexit
+ set -o pipefail
+ . /lib/svc/share/smf_include.sh
++ SMF_EXIT_OK=0
++ SMF_EXIT_NODAEMON=94
++ SMF_EXIT_ERR_FATAL=95
++ SMF_EXIT_ERR_CONFIG=96
++ SMF_EXIT_MON_DEGRADE=97
++ SMF_EXIT_MON_OFFLINE=98
++ SMF_EXIT_ERR_NOSMF=99
++ SMF_EXIT_ERR_PERM=100
++ svcprop -c -p config/listen_addr svc:/oxide/cockroachdb:default
+ LISTEN_ADDR=fd00:1122:3344:106::6
++ svcprop -c -p config/listen_port svc:/oxide/cockroachdb:default
+ LISTEN_PORT=32221
++ svcprop -c -p config/store svc:/oxide/cockroachdb:default
+ DATASTORE=/data
++ svcprop -c -p config/datalink svc:/oxide/cockroachdb:default
+ DATALINK=oxControlService2
++ svcprop -c -p config/gateway svc:/oxide/cockroachdb:default
+ GATEWAY=fd00:1122:3344:106::1
+ [[ oxControlService2 == unknown ]]
+ [[ fd00:1122:3344:106::1 == unknown ]]
+ ipadm show-addr oxControlService2/ll
ADDROBJ           TYPE     STATE        ADDR
oxControlService2/ll addrconf ok        fe80::8:20ff:fe49:6741%oxControlService2/10
+ ipadm show-addr oxControlService2/omicron6
ADDROBJ           TYPE     STATE        ADDR
oxControlService2/omicron6 static ok    fd00:1122:3344:106::6/64
+ route get -inet6 default -inet6 fd00:1122:3344:106::1
   route to: default
destination: default
       mask: default
    gateway: fd00:1122:3344:106::1
  interface: oxControlService2
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
       0         0         0         0         0         0      9000         0 
+ args=('--insecure' '--listen-addr' "[$LISTEN_ADDR]:$LISTEN_PORT" '--store' "$DATASTORE")
+ exec /opt/oxide/cockroachdb/bin/cockroach start-single-node --insecure --listen-addr '[fd00:1122:3344:106::6]:32221' --store /data
[ May 23 15:35:40 Method "start" exited with status 0. ]
*
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
* 
* This mode is intended for non-production testing only.
* 
* In this mode:
* - Your cluster is open to any client that can access fd00:1122:3344:106::6.
* - Intruders with access to your machine or network can observe client-server traffic.
* - Intruders can log in without password and read or write any data in the cluster.
* - Intruders can consume all your server's resources and cause unavailability.
*
*
* INFO: To start a secure server without mandating TLS for clients,
* consider --accept-sql-without-tls instead. For other options, see:
* 
* - https://go.crdb.dev/issue-v/53404/v22.1
* - https://www.cockroachlabs.com/docs/v22.1/secure-a-cluster.html
*
CockroachDB node starting at 2023-05-23 22:35:41.741418229 +0000 UTC (took 1.0s)
build:               OSS v22.1.9-dirty @ 2022/10/26 21:17:46 (go1.17.13)
webui:               http://[fd00:1122:3344:106::6]:8080
sql:                 postgresql://root@[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable
sql (JDBC):          jdbc:postgresql://[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable&user=root
RPC client flags:    /opt/oxide/cockroachdb/bin/cockroach <client cmd> --host=[fd00:1122:3344:106::6]:32221 --insecure
logs:                /data/logs
temp dir:            /data/cockroach-temp1934682193
external I/O path:   /data/extern
store[0]:            path=/data
storage engine:      pebble
clusterID:           3ca480c2-56af-46cf-a6d0-72dfebc97ba6
status:              restarted pre-existing node
nodeID:              1

There is some matching info from another log file:

root@oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd:~# tail -10 /data/logs/cockroach.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log
W230523 22:35:40.269931 437285 sql/sqlstats/persistedsqlstats/flush.go:116 ⋮ [n1,server drain process] 15760  ‹failed to flush statement statistics›: flush statement ‹4025915790256006928›'s statistics: ‹insert-stmt-stats›: cannot acquire lease when draining
W230523 22:35:40.270375 437285 sql/sqlstats/persistedsqlstats/flush.go:116 ⋮ [n1,server drain process] 15761  ‹failed to flush transaction statistics›: flushing transaction ‹11006041404254834895›'s statistics: ‹insert-txn-stats›: cannot acquire lease when draining
I230523 22:35:40.272237 437285 1@server/drain.go:285 ⋮ [n1,server drain process] 15762  drain remaining: 0
W230523 22:35:40.273187 438084 kv/kvserver/intentresolver/intent_resolver.go:793 ⋮ [-] 15763  failed to gc transaction record: could not GC completed transaction anchored at ‹/Table/43/1/1/2023-05-23T22:00:00Z/"D0\x80K\x8a^_\x84"/"$ internal-find-scheduled-jobs"/1›: ‹node unavailable; try another peer›
I230523 22:35:40.273769 970 sql/stats/automatic_stats.go:489 ⋮ [n1] 15764  quiescing auto stats refresher
W230523 22:35:40.273824 449 sql/sqlliveness/slinstance/slinstance.go:243 ⋮ [n1] 15765  exiting heartbeat loop
I230523 22:35:40.273946 224 jobs/registry.go:1205 ⋮ [n1] 15766  AUTO SPAN CONFIG RECONCILIATION job 867665352511619073: stepping through state succeeded with error: <nil>
W230523 22:35:40.273749 975 jobs/registry.go:811 ⋮ [n1] 15767  canceling all adopted jobs due to stopper quiescing
W230523 22:35:40.274424 224 kv/txn.go:728 ⋮ [n1] 15768  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: ‹log-job›: context canceled
I230523 22:35:40.281730 1 1@cli/start.go:899 ⋮ [n1] 15769  server drained and shutdown completed
davepacheco commented 1 year ago

I spent this afternoon looking at this without too much to show for it. I'll summarize what I did find below. Since I'm about out of time and I believe this system will be shut down tonight for other reasons, I figured I'd collect what I could. Here are log files from Nexus and CockroachDB, plus a cockroach debug zip file:

dap@catacomb /data/staff/core/issues/omicron-3207 $ ls -l
total 71682
-rw-rw-r--+  1 dap      staff    10777316 May 24 23:52 dap-debug-zip.zip
-rw-rw-r--+  1 dap      staff    25900155 May 24 23:52 omicron-3207.tgz

These were collected on gc19 (BRM42220017) using:

$ tar czvf omicron-3207.tgz /zone/oxz_nexus/root/var/svc/log /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/var/svc/log

and

/opt/oxide/cockroachdb/bin/cockroach debug zip --host '[fd00:1122:3344:106::6]' --port 32221  --insecure /data/dap-debug-zip.zip

I also tried to collect the whole Cockroach store directory. I can't tell if it worked:

BRM42220017 # tar czf /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/dap-data-dir.tgz /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data
tar: cockroach-health.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_41Z.004049.log: filename is greater than 100
tar: cockroach-health.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_12Z.007568.log: filename is greater than 100
tar: cockroach-pebble.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log: filename is greater than 100
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-security.log: symbolic link too long
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-telemetry.log: symbolic link too long
tar: cockroach-pebble.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_40Z.004049.log: filename is greater than 100
tar: cockroach-stderr.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_40Z.004049.log: filename is greater than 100
tar: cockroach-sql-schema.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log: filename is greater than 100
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-stderr.log: symbolic link too long
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-pebble.log: symbolic link too long
tar: cockroach-stderr.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log: filename is greater than 100
tar: cockroach-telemetry.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-24T06_37_42Z.004049.log: filename is greater than 100
tar: cockroach-security.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_12Z.007568.log: filename is greater than 100
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-sql-schema.log: symbolic link too long
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-health.log: symbolic link too long
tar: cockroach-telemetry.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_41Z.004049.log: filename is greater than 100
tar: cockroach-telemetry.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-24T22_39_41Z.004049.log: filename is greater than 100
BRM42220017 # echo $?
1
BRM42220017 # 

I suspect we got the main logs but not the ones that were called out. Anyway, that's here now:

dap@catacomb ~ $ ls -l /data/staff/core/issues/omicron-3207/
total 163819
-rw-rw-r--+  1 dap      staff    48147833 May 25 00:00 dap-data-dir.tgz
-rw-rw-r--+  1 dap      staff    10777316 May 24 23:52 dap-debug-zip.zip
-rw-rw-r--+  1 dap      staff    25900155 May 24 23:52 omicron-3207.tgz

Finally, the shell history for the CockroachDB zone is there now too:

dap@catacomb /data/staff/core/issues/omicron-3207 $ ls -l
total 163828
-rw-rw-r--+  1 dap      staff    48147833 May 25 00:00 dap-data-dir.tgz
-rw-rw-r--+  1 dap      staff    10777316 May 24 23:52 dap-debug-zip.zip
-rw-rw-r--+  1 dap      staff    25900155 May 24 23:52 omicron-3207.tgz
-rw-rw-r--+  1 dap      staff      13362 May 25 00:01 shell_history_cockroachdb.txt
davepacheco commented 1 year ago

I confirmed these observations that @askfongjojo and @gjcolombo found:

I have also checked:

But there's also a bunch of weirdness here:

W230523 22:42:36.873449 1967 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:60196,user=root] 315  slow range RPC: have been waiting 320.12s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 94501441], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›

I also discovered that bb8 has an error_sink that we could be using to log errors. This might help us understand how Nexus responded to the CockroachDB restart.
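
For reference, here's a minimal sketch of what wiring that up might look like, assuming bb8's ErrorSink trait and Builder::error_sink; the LoggingErrorSink type and build_pool helper are hypothetical illustrations, not Omicron code:

use bb8::{ErrorSink, ManageConnection, Pool};

// Hypothetical sink that logs every error the pool hits while creating or
// recycling connections (in Nexus this would presumably go through slog).
#[derive(Debug, Clone, Copy)]
struct LoggingErrorSink;

impl<E: std::fmt::Display> ErrorSink<E> for LoggingErrorSink {
    fn sink(&self, error: E) {
        eprintln!("database connection pool error: {}", error);
    }

    fn boxed_clone(&self) -> Box<dyn ErrorSink<E>> {
        Box::new(*self)
    }
}

// Hypothetical helper showing where the sink would be attached; `M` stands in
// for whatever ManageConnection implementation the application uses.
async fn build_pool<M>(manager: M) -> Result<Pool<M>, M::Error>
where
    M: ManageConnection,
    M::Error: std::fmt::Display,
{
    Pool::builder()
        .error_sink(Box::new(LoggingErrorSink))
        .build(manager)
        .await
}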

davepacheco commented 1 year ago

Tried again with tar's E flag and there's now a dap-data-dir-take2.tgz file in that directory that should have all the files.

davepacheco commented 1 year ago

On some level, there are basically three possibilities:

  1. The INSERTs/UPDATEs never really happened (i.e., the server never acknowledged them to the client). I've gone through the code path that inserts saga records and they do seem to propagate errors correctly, but it's always possible I've missed something. It may be worth digging a bit into what the commit semantics are for the normal path where we just issue queries without BEGIN/COMMIT (see the sketch just after this list). If I remember right, PostgreSQL calls this autocommit, where each statement becomes its own transaction, but it's possible there's some other behavior configured here on the CockroachDB, Diesel, bb8, or omicron sides. Maybe we're issuing a query that completes but doesn't commit, then Cockroach restarts and the connection gets dropped without the transaction ever being committed?
  2. Something undid them. (This seems unlikely. I don't think there's any code that deletes anything from the saga tables. I also checked the Cockroach console looking for delete queries in the last two days and did not find any. I also don't think there's any code to undo a soft delete on the instance table, though it's certainly possible something is updating that record without proper synchronization.)
  3. CockroachDB is returning the wrong data. I wish we had more direct evidence of this.
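
To illustrate the distinction in possibility 1, here is a rough sketch using Diesel 2.x-style APIs (my own example, not Omicron's code, with placeholder queries): a bare statement is its own implicit transaction and should be durable as soon as it returns, while a statement issued inside transaction() only becomes durable if the closing COMMIT actually happens.

use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_query;

fn example(conn: &mut PgConnection) -> QueryResult<()> {
    // "Autocommit" path: this statement is its own transaction and should be
    // durable as soon as execute() returns Ok.
    sql_query("UPDATE instance SET state = 'stopped' WHERE id = '...'").execute(conn)?;

    // Explicit-transaction path: the same statement succeeds inside the
    // transaction, but nothing is durable until the closure returns Ok and the
    // COMMIT is issued. If the connection is dropped (or the server restarts)
    // before that COMMIT, the write is lost even though execute() reported
    // success to the caller.
    conn.transaction::<_, diesel::result::Error, _>(|conn| {
        sql_query("UPDATE instance SET state = 'stopped' WHERE id = '...'").execute(conn)?;
        Ok(())
    })
}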
davepacheco commented 1 year ago

The recent discussions around async cancellation got me wondering if that behavior could cause something like what we're seeing here. Diesel itself is sync, so if there's a cancellation problem on the database side of Nexus, it'd presumably have to be in async-bb8-diesel. I did find something a little fishy in transaction_async. Suppose something in the Nexus DataStore invoked transaction_async. The usual path would be:

  1. From Nexus at-large, enter a DataStore method.
  2. Use DataStore::pool_authorized() to get back a reference to the underlying pool.
  3. Invoke pool.transaction_async(...), which is AsyncConnection::transaction_async(). Note that AsyncConnection is impl'd directly for the Pool. I found the code flow after this point a little confusing, so I'm writing it down. Inside transaction_async():
     a. The code calls self.get_owned_connection(), which, for a Pool, ends up calling bb8::Pool::get_owned(). This returns a static-lifetime object called bb8::PooledConnection that wraps async-bb8-diesel's own Connection type and impls Deref for it. It looks to me like we will have exclusive access to the connection until the PooledConnection is dropped. This is important!
     b. The code then calls run_with_shared_connection, which calls as_sync_conn() on the async_bb8_diesel::Connection. This acquires a mutex, getting out whatever connection type async_bb8_diesel::Connection itself wraps.
     c. Still inside run_with_shared_connection, with the underlying connection lock held and a reference to the underlying connection type, we execute the SQL to begin a transaction. After that, we drop the lock. At this point I worried that something else could come in and execute other SQL on this connection. I don't believe this is possible for the reason mentioned above: the pool considers this connection in-use and will not give it out to anybody else. (I'm not sure why we need the Mutex here.)
     d. It's instructive to note that on failure at this point, we return from transaction_async(). This should cause the PooledConnection to be dropped and the connection returned to the pool. I would imagine that in most cases we failed because we failed to execute the "begin transaction" SQL for some reason, so the connection ought to be in a pretty normal state. I'm not positive this is always safe, though: suppose we did carry out the "begin transaction" but then failed on some other kind of error? Then we will have returned a connection to the pool that is still in a transaction.
     e. Assuming we didn't fail: we call as_async_conn() on the PooledConnection to deref it to the underlying connection type and wrap that in an async_bb8_diesel::Connection. I think at this point we've got an async_bb8_diesel::Connection wrapping another async_bb8_diesel::Connection wrapping the real connection type, but I am not positive about that. Anyway, we finally invoke the async Rust function that the application (Nexus) provided to do whatever it is we wanted to do inside a transaction.
     f. Critically: there's an await point here. If we were cancelled at this point, we'd wind up dropping a bunch of these connection wrappers, including the PooledConnection. This would cause the connection to get returned to the pool... but we have neither committed nor rolled back the transaction!
     g. At this point, I fear that if anything else ever acquired this connection from the pool, it could run a bunch of SQL, get results back, and believe that it had fully carried out a bunch of SQL operations, when in fact it was all still provisional on the outer transaction being committed. And that's never going to happen, because we've lost track of the fact that we're in a transaction in the first place. Note that this can all happen even if the victim code path itself tries to use a transaction, because transactions can be nested, creating savepoints. When the victim code thinks it's committing its transaction, it's merely committing its savepoint. It still hasn't carried anything out because the outermost transaction remains open.

I didn't see anything in the code path that returns a PooledConnection to the pool that terminates any open transactions (or resets any other connection state, for that matter).
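
To make the failure mode concrete, here is a stripped-down, self-contained model of steps (f) and (g) above. This is not the real async-bb8-diesel code; the pool and connection types are toys of my own invention, but they show how cancelling the future at the await point leaves the "transaction" open when the connection goes back to the pool:

use std::sync::{Arc, Mutex};
use tokio::time::{sleep, timeout, Duration};

/// Stand-in for a database connection: it only records whether a transaction
/// is currently open on it.
#[derive(Default, Debug)]
struct FakeConn {
    open_txn: bool,
}

/// Stand-in for a connection pool with a single shared connection.
#[derive(Clone)]
struct FakePool(Arc<Mutex<FakeConn>>);

impl FakePool {
    /// Rough model of transaction_async(): BEGIN, run the caller's future,
    /// then COMMIT. If the caller's future is cancelled at the await point,
    /// neither the COMMIT nor any ROLLBACK ever runs.
    async fn transaction_async<F, Fut>(&self, body: F)
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = ()>,
    {
        self.0.lock().unwrap().open_txn = true; // "BEGIN"
        body().await; // <-- cancellation here skips everything below
        self.0.lock().unwrap().open_txn = false; // "COMMIT"
    }
}

#[tokio::main]
async fn main() {
    let pool = FakePool(Arc::new(Mutex::new(FakeConn::default())));

    // Simulate a client timeout (or any other cancellation): the outer timeout
    // drops the transaction future while it is parked inside the body.
    let _ = timeout(
        Duration::from_millis(10),
        pool.transaction_async(|| async {
            // The application-provided body, which happens to take a while.
            sleep(Duration::from_secs(60)).await;
        }),
    )
    .await;

    // The "connection" is back in the pool still inside a transaction; any
    // later borrower's writes would be provisional on a COMMIT that will
    // never come.
    println!(
        "connection returned to pool with open_txn = {}",
        pool.0.lock().unwrap().open_txn
    );
}

Running this prints open_txn = true: exactly the state described in (g), where the next user of the connection unknowingly operates inside someone else's never-committed transaction.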

At this point this is all a guess based on reading the code. I don't see what would prevent it from happening. Could it have actually happened here? To simplify things, what would have to happen is:

I don't see any reason yet why this couldn't have happened. The next question is whether this would leave any evidence that we could look for. I will go back through the data I collected above and see if any of it is inconsistent with this explanation.

davepacheco commented 1 year ago

The only thing I've found above that might be inconsistent with this explanation was that I wrote:

I don't see any open database transactions in SHOW TRANSACTIONS (which makes it seem less likely that this data is sitting in some uncommitted transaction)

but it's not impossible. I was investigating many hours after this had happened. We hadn't done #3232 yet, so we'd have no log entry if connections had been closed and re-established for whatever reason. I also don't think we'd expect any log entries on cancellation, at least not in async-bb8-diesel or Dropshot. (We should add them in Dropshot at least, but after oxidecomputer/dropshot#701 this can't happen in Nexus anyway.)

davepacheco commented 1 year ago

I have written a much simpler program using Dropshot, diesel, and async-bb8-diesel to reproduce the failure mode I described above. It's exactly as feared. It leads to behavior that looks a lot like what we saw here: the program behaves in every way as though the data was written to the database, but if you inspect the database, the writes never made it there. I've filed oxidecomputer/async-bb8-diesel#47 for this.

Talking it over with @ahl, we realized that there should be some evidence of this: there must have been some Dropshot request that was cancelled in transaction_async(). In that case, we ought to have log messages from whatever authz checks were done for this request, and we ought not to have any log message showing that the HTTP request completed. I will try to look for this later today or tomorrow.

davepacheco commented 1 year ago

I think this might have happened rather a lot of times. So many that I doubted my methodology, but I think this might actually make sense.

I unpacked the Nexus log files we saved earlier and summarized the request ids from "authorize result" log messages:

$ grep -h '"authorize result"' oxide-nexus\:default.log*  | json -ga req_id > authorize_req_ids
$ sort authorize_req_ids > authorize_req_ids_sorted

Then I dumped request ids from "request completed" messages:

$ grep -h 'request completed' oxide-nexus\:default.log* | json -ga req_id | sort -u > done_request_ids

Then I used comm to grab the ones in the first bucket but not the second, and for each one, grabbed the "authorize result" log entries and pulled out the HTTP method and URI to see what was going on:

 $ comm -23 authorize_req_ids_sorted done_request_ids | grep -v '^$' | while read req_id; do grep -h $req_id oxide-nexus\:default.log* | grep '"authorize result"' | json -ga req_id method uri;  done | sort -u
071d5877-f36d-46e3-a929-400099e8113f POST //v1/instances?project=try
08ff808c-0083-449d-86af-bf56e5151cc8 POST //v1/instances?project=try
0a383bd0-60e4-4178-a74a-69ccd373b8f0 POST //v1/instances?project=try
0b304ae3-dc3f-4ddd-b4a6-70d10591cb7c POST //v1/instances?project=try
0c6e42e8-e196-40fa-8956-78755c6f2bbd GET https://venus.oxide-preview.com/v1/vpcs/a4ef8295-0150-4d9f-a719-e7ac1cf0c236
12058a3c-9952-4e3f-9af0-14263cbe79d8 POST //v1/instances?project=try
17df4b7e-f39a-48eb-83aa-7ab05a456c69 POST /v1/snapshots?project=alan
1905c7e0-a7b2-4502-9778-a3333f88495d POST //v1/instances?project=try
277cb6fa-4ffb-434f-8257-1b3874ce739e POST //v1/instances?project=try
2843e388-9adf-4b3a-8764-79f7ec4a364f GET https://venus.oxide-preview.com/v1/system/metrics/cpus_provisioned?id=001de000-5110-4000-8000-000000000000&start_time=2023-05-23T19%3A56%3A17.274Z&end_time=2023-05-23T20%3A56%3A17.274Z
312d44f4-506b-4d41-ad9d-5a9b3145a369 POST /v1/instances?project=ni
3fa23e80-cf80-47e3-80a2-2fccd1cada3b POST //v1/disks?project=try
4fe29269-a4c8-4ad2-840f-5bb3ade8c6ef POST //v1/instances?project=try
53f59f57-8ed2-40b9-8056-8100fd1c6658 POST /v1/instances?project=alan
554b4ffb-c120-425e-9b8d-e9b3602e759f GET https://venus.oxide-preview.com/v1/instances/ekke/external-ips?project=andyway
5a6529e0-2bb2-49a0-a64d-23a3cc689ff3 GET https://venus.oxide-preview.com/v1/system/metrics/virtual_disk_space_provisioned?id=001de000-5110-4000-8000-000000000000&start_time=2023-05-23T19%3A56%3A17.274Z&end_time=2023-05-23T20%3A56%3A17.274Z
5e25eb09-9319-43be-ad65-80b2545a2b96 POST //v1/instances?project=try
6a23ea6d-2f85-4193-86a2-e586c7f7bfa4 POST //v1/instances?project=try
7536bded-024f-4d2b-8b8a-242c6745e05a GET https://venus.oxide-preview.com/v1/system/metrics/ram_provisioned?id=001de000-5110-4000-8000-000000000000&start_time=2023-05-23T19%3A56%3A17.274Z&end_time=2023-05-23T20%3A56%3A17.274Z
7588cfc7-bf97-472c-bb39-143043cf258c POST /v1/instances?project=ni
76369d33-9de7-4c34-8f60-2acdf747a112 POST //v1/instances?project=try
82788e0d-f141-4b8c-bc8c-fdf2501e2554 DELETE https://venus.oxide-preview.com/v1/instances/ekke?project=andyway
83c58eb7-1fe6-4bcf-b878-f5748b5ae076 POST //v1/instances?project=try
95c259a0-9fb2-4d23-9827-ebb0001b4923 POST /v1/instances?project=ni
acf5b610-a6e2-4636-b060-4194bbd16da7 POST //v1/instances?project=try
ae6f52fc-9211-426f-b20f-468835755046 POST //v1/instances?project=try
b2b56af1-06cc-41a9-8cca-0e04d4c19786 POST https://venus.oxide-preview.com/v1/instances/ekkke32/start?project=ni
b4fcff81-f8f2-4e69-9e79-aba35a27d274 POST //v1/instances?project=try
b9fae27b-4e1f-43e2-b0ca-793b9498345a POST //v1/instances?project=try
c1d87a62-0beb-453e-8245-290d50361a75 DELETE https://venus.oxide-preview.com/v1/instances/ekke?project=andyway
c37b6312-446a-4daa-8c0d-bd82e055f793 POST //v1/instances?project=try
c383c8cc-3756-4765-83fd-ef852e8304ab GET https://venus.oxide-preview.com/v1/vpc-subnets/39dd79e5-78cb-4c7d-b35d-e19ade6d6a71
c40ba69f-48c7-44b2-a26f-0496334a2718 POST //v1/instances?project=try
ce83b28a-66f7-4827-abfa-a8d91ca77f9b POST //v1/instances?project=try
d6cd6013-0e42-4322-ad89-5683368bdeb4 POST //v1/instances?project=try
dc42d065-48cb-4882-b059-6493d25e1b5b POST //v1/instances?project=try
dd86b5aa-4906-4a4b-8d80-ff8ea4d47b14 POST /v1/instances?project=ni
e2ac1ec6-f8ba-4064-ad5d-cbb613a1980e POST /v1/snapshots?project=alan
e3450323-99c8-448d-87a0-44b709496aa8 POST //v1/instances?project=try
e5587b8f-4436-4dcc-bfb1-731b1a693ab6 POST //v1/instances?project=try
e7f34a9a-682f-4ac7-8f42-5accba540aaf POST /v1/instances?project=alan
ea9bbce9-8c28-4bc2-8e63-fb37e8974013 POST //v1/disks?project=try
f258e9d4-73cd-4b2d-b2d7-2604e6b969fb POST //v1/instances?project=try
f45d2a5e-dc11-4fd7-98ef-483334a8a639 POST //v1/instances?project=try
fb36dfd8-4dd9-4d16-a32d-66e7a5e1db1b DELETE https://venus.oxide-preview.com/v1/instances/ekke?project=andyway

That's way more than I expected. But these are mostly pretty long-running operations like creating and destroying Instances, Snapshots, etc. @ahl points out that there was a period where the CLI timeout was generally too short. It seems like that could cause this problem if any of those things were using transaction_async (and not in a saga).

davepacheco commented 1 year ago

Okay, there's one call to transaction_async() inside DataStore::role_asgn_list_for(), which is called during authorization for any authorized request. So I think this means almost any cancelled request could trigger this. I don't think we'd have log entries to confirm this because we only have "debug"-level logging here, so we don't have the message that gets logged when authorization starts.

askfongjojo commented 1 year ago

there was a period where the CLI timeout was generally too short

I can confirm that the CLI requests for instance creation all hit the "operation timed out" error back when I filed this issue. Those requests would continue and eventually succeed.

I also hit Ctrl-C at least twice to terminate some API calls. I don't know whether this had the same end result as a request timing out (i.e., that the async process already had a life of its own and wouldn't cleanly cancel).