askfongjojo opened this issue 1 year ago
The instance state transitions above and in the logs look more or less correct to me, although I'm similarly at a loss as to why the `saga` table query didn't return the deletion saga.

It's currently expected that an `Instance` record will continue to have an "active" sled ID and Propolis ID even when the Propolis zone has been torn down; #2315 proposes to change that behavior. It otherwise looks like the system behaved as intended: Nexus eventually moved the instance to Stopped, and the deletion saga successfully ran to completion; after that point the instance's `time_deleted` should have been set, and the instance should have appeared deleted to any subsequent query.
The saga execution log for the saga that tried to re-create the instance, on the other hand, has this suspicious-looking entry:
{"msg":"recording saga event","v":0,"name":"nexus","level":20,"time":"2023-05-23T15:42:40.371974129-07:00","hostname":"oxz_nexus","pid":4365,"component":"SecStore","component":"nexus","component":"ServerContext","name":"496eabae-738d-4497-a183-1b75bf912c7c","event_type":"Failed(ActionFailed { source_error: Object {\"ObjectAlreadyExists\": Object {\"object_name\": String(\"sysbench-mysql-7\"), \"type_name\": String(\"disk\")}} })","node_id":"2","saga_id":"dec6089b-5291-42ca-ba84-6874c2dd44a0"}
This suggests the already-extant object is a disk named `sysbench-mysql-7` and not an instance. I don't know of a mechanism to ask Nexus to delete an instance's attached disks on instance deletion, so I think this is expected if the two instance create requests both asked to create a disk with this name.
@askfongjojo What mechanism did you use to try to re-create the instance? (Nexus doesn't appear to log the bodies of incoming HTTP requests, and I can't quite work out what value the "disks" parameter to the create request had in this case.) The instance creation API allows callers to specify whether they want to create a new disk or merely attach an existing one (in the `InstanceDiskAttachment` component); if you asked to "create" the disk both times, switching the re-creation attempt's disk request type to "attach" might resolve the issue.
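For illustration, the "attach" variant of that parameter would look something like this (a fragment sketched from the disk name in this report, not a complete request body):

```json
"disks": [
  {
    "type": "attach",
    "name": "sysbench-mysql-7"
  }
]
```

With `"type": "create"`, by contrast, the entry also carries the disk-creation fields (size, disk source, etc.), and re-running it fails once the disk already exists.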
> This suggests the already-extant object is a disk named `sysbench-mysql-7` and not an instance. I don't know of a mechanism to ask Nexus to delete an instance's attached disks on instance deletion, so I think this is expected if the two instance create requests both asked to create a disk with this name.
Yeah, this can be confusing as I gave the disk the same name as the instance. The script I used creates both the disk and the instance, and I just reran it as-is, expecting the disk creation to fail but the instance creation to succeed. This is something I've done many times over the last week with a loop count of 3; the only difference this time is that the loop count is 12:
#!/bin/bash
export OXIDE_HOST=http://172.20.26.2/
export OXIDE_TOKEN=$1
if [[ -z $2 ]]; then
app=sysbench-mysql
else
app=$2
fi
for i in {1..12}; do
oxide api '/v1/disks?project=try' --method POST --input - <<EOF
{
"name": "$app-$i",
"description": "ubuntu 22.04 mysql instance $i",
"size": $(( 1024*1024*1024*100 )),
"disk_source": {
"type": "image",
"image_id": "8ce4e6e9-afa6-4a6b-ab67-71d720655047"
}
}
EOF
oxide api '/v1/instances?project=try' --method POST --input - <<EOF
{
"name": "$app-$i",
"description": "jammy load generator",
"hostname": "$app$i",
"ncpus": 4,
"memory": $(( 1024*1024*1024*16 )),
"disks": [
{
"type": "attach",
"name": "$app-$i"
}
],
"network_interfaces": {
"type": "default"
},
"external_ips": [
{
"type": "ephemeral",
"pool_name": "default"
}
],
"user_data": "$(cat test.yaml | base64)",
"start": true
}
EOF
done
The `time_deleted` is NULL for this instance, so the saga strangely declared itself done without updating the database, or something actually updated the instance state back to `running` somehow.
root@[fd00:1122:3344:106::6]:32221/omicron> select * from instance where name = 'sysbench-mysql-7';
id | name | description | time_created | time_modified | time_deleted | project_id | user_data | state | time_state_updated | state_generation | active_sled_id | active_propolis_id | active_propolis_ip | target_propolis_id | migration_id | propolis_generation | ncpus | memory | hostname
---------------------------------------+------------------+----------------------+------------------------------+------------------------------+--------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------------------------------+------------------+--------------------------------------+--------------------------------------+------------------------+--------------------+--------------+---------------------+-------+-------------+------------------
eeb46abd-ab90-46e8-ad3e-20a45975bbbf | sysbench-mysql-7 | jammy load generator | 2023-05-23 21:57:24.31061+00 | 2023-05-23 21:57:24.31061+00 | NULL | 549f44d7-b0cd-4532-b94d-9cae7baab5d3 | #cloud-config\nmanage_etc_hosts: true\nruncmd:\n- \'curl https://gist.githubusercontent.com/askfongjojo/119fee7e6f53a43933342763b3428c04/raw/903cf93c021910545625377a6eb9a326d0869531/sysbench-userdata | /bin/bash -s\'\n- \'curl -o /home/ubuntu/sysbench-mysql-test.sh https://gist.githubusercontent.com/askfongjojo/6632a3f110d1189c30ce2190ff93e46e/raw/b50fdf52fb92fe970c50e4c4aad50d210065472b/sysbench-mysql-test\'\n- chmod +x /home/ubuntu/sysbench-mysql-test.sh\n- chown ubuntu:ubuntu /home/ubuntu/sysbench-mysql-test.sh\n- [ sh, -xc, "nohup /usr/share/sysbench/tpcc.lua --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=password --mysql-db=sbtest --time=1800 --threads=8 --report-interval=60 --tables=3 --scale=10 --db-driver=mysql prepare" ]\n- echo \'********** custom setup completed **********\'\n | running | 2023-05-23 21:57:44.642639+00 | 4 | aa7c82d9-6e59-406e-b1e3-3648890b4bec | 9e88dbaa-822c-425b-b566-6b843956a4c3 | fd00:1122:3344:103::24 | NULL | NULL | 1 | 4 | 17179869184 | sysbench-mysql7
(1 row)
The real problem is that there is no way to stop the instance to make any changes.
Ah, I realize my bug description was unclear about the issue. I've updated it to indicate the split-brain issue.
Thanks for the extra details! I also realized just now that I missed the Nexus log lines that show that there was an instance-already-exists conflict when trying to re-create `sysbench-mysql-7`. Mea culpa for that one; I'm not sure why I concluded the disk conflict was the interesting bit.
CRDB's apparent behavior here scares me. The instance record in your comment here has a `state_generation` of 4, which is what I'd expect for a brand-new instance that's running for the very first time and hasn't changed state since it began to run. But the logs indicate that Nexus thinks this instance is stopped and that it successfully wrote that state change back to the database.
If I'm understanding things right, that means we've got two separate cases where Nexus recorded that it did something (wrote 'stopping'/'stopped' states for the instance, executed an instance delete saga) that should have been visible in CRDB, and two separate lines of observation (manual queries of CRDB, Nexus thinking the instance isn't deleted) that indicate that those writes aren't visible to subsequent reads. I can't imagine anything is going to work if CRDB's not persisting things Nexus thinks it's persisting.
If all this is correct, I think the next step here would be to see if the CRDB zone has logs or other context that indicate that it's having a problem.
I have functionally zero CockroachDB expertise but do see the following unusual events in its logs from around the time the instance re-creation attempt failed:
W230523 22:42:36.866723 938 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 297 pusher ‹6a43640d›: finished waiting after 329.98s for pushee ‹fea211f7›
W230523 22:42:36.866769 3896 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 299 pusher ‹1786d14a›: finished waiting after 329.98s for pushee ‹fea211f7›
W230523 22:42:36.866755 1953 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 298 pusher ‹8dd4557c›: finished waiting after 330.03s for pushee ‹fea211f7›
W230523 22:42:36.866788 1901 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 300 pusher ‹3f1e1039›: finished waiting after 320.01s for pushee ‹fea211f7›
W230523 22:42:36.866808 1965 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 301 pusher ‹64676875›: finished waiting after 320.00s for pushee ‹fea211f7›
W230523 22:42:36.866838 1967 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 302 pusher ‹94501441›: finished waiting after 320.01s for pushee ‹fea211f7›
W230523 22:42:36.866830 1950 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 303 pusher ‹5b00071a›: finished waiting after 320.01s for pushee ‹fea211f7›
W230523 22:42:36.866883 3802 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 304 pusher ‹d7275bcf›: finished waiting after 308.83s for pushee ‹fea211f7›
W230523 22:42:36.866892 1425 kv/kvserver/txnwait/queue.go:574 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 305 pusher ‹97ded7a6›: finished waiting after 320.00s for pushee ‹fea211f7›
W230523 22:42:36.866978 1967 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 306 slow range RPC: have been waiting 320.01s (1 attempts) for RPC PushTxn(94501441->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867010 1953 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 308 slow range RPC: have been waiting 330.03s (1 attempts) for RPC PushTxn(8dd4557c->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.866994 1950 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 309 slow range RPC: have been waiting 320.01s (1 attempts) for RPC PushTxn(5b00071a->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867045 3896 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 310 slow range RPC: have been waiting 329.98s (1 attempts) for RPC PushTxn(1786d14a->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867004 3802 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 307 slow range RPC: have been waiting 308.83s (1 attempts) for RPC PushTxn(d7275bcf->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.866999 1965 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 311 slow range RPC: have been waiting 320.00s (1 attempts) for RPC PushTxn(64676875->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.866992 938 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 312 slow range RPC: have been waiting 329.98s (1 attempts) for RPC PushTxn(6a43640d->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867101 1425 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 313 slow range RPC: have been waiting 320.00s (1 attempts) for RPC PushTxn(97ded7a6->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.867113 1901 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,s1,r149/1:‹/Table/19{1-4}›] 314 slow range RPC: have been waiting 320.01s (1 attempts) for RPC PushTxn(3f1e1039->fea211f7) [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›) to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.PushTxnResponse›
W230523 22:42:36.873449 1967 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:60196,user=root] 315 slow range RPC: have been waiting 320.12s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 94501441], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.873921 1953 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:61138,user=root] 316 slow range RPC: have been waiting 330.09s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 8dd4557c], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.876522 3896 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:46871,user=root] 317 slow range RPC: have been waiting 330.09s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 1786d14a], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.876781 1901 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:33865,user=root] 318 slow range RPC: have been waiting 320.12s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 3f1e1039], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.876954 3802 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:44855,user=root] 319 slow range RPC: have been waiting 308.95s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: d7275bcf], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.877093 1950 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:55201,user=root] 320 slow range RPC: have been waiting 320.13s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 5b00071a], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.878565 938 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:41806,user=root] 321 slow range RPC: have been waiting 330.09s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 6a43640d], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.880642 1965 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:40753,user=root] 322 slow range RPC: have been waiting 320.11s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 64676875], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
W230523 22:42:36.882630 1425 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:47622,user=root] 323 slow range RPC: have been waiting 320.11s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 97ded7a6] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: WriteTooOldError: write for key /Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0 at timestamp 1684881426.741633018,0 too old; wrote at 1684881753.880126241,3: "sql txn" meta={id=97ded7a6 key=/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0 pri=0.10225726 epo=0 ts=1684881753.880126241,3 min=1684881426.741633018,0 seq=0} lock=true stat=PENDING rts=1684881426.741633018,0 wto=false gul=1684881427.241633018,0)›
To be clear, I have no idea what any of this really means or what was being waited on. But it looks like this is one of only two time ranges in the logs where "slow range RPC" errors occurred (the other was around 21:01:43Z), and the 320-second interval preceding these messages would have included the times where we tried to stop and delete these instances.
Yes, the saga table recorded nothing for a whole 6-minute window when there were many orchestration activities:
root@[fd00:1122:3344:106::6]:32221/omicron> select time_created, adopt_time, saga_state from saga where date_trunc('second', "time_created")>= '2023-05-23 22:35:00' order by 1;
time_created | adopt_time | saga_state
--------------------------------+-------------------------------+-------------
2023-05-23 22:36:13.380571+00 | 2023-05-23 22:36:13.380571+00 | done
2023-05-23 22:42:37.581953+00 | 2023-05-23 22:42:37.581953+00 | done
2023-05-23 22:42:37.900628+00 | 2023-05-23 22:42:37.900628+00 | done
2023-05-23 22:42:38.25941+00 | 2023-05-23 22:42:38.25941+00 | done
...
The CRDB SMF service log showed that it started again at 22:35:41, though nothing indicated why. The timestamps are very confusing, as if it uses the local TZ in one place and UTC in another:
CockroachDB node starting at 2023-05-23 16:54:12.092161156 +0000 UTC (took 1.5s)
build: OSS v22.1.9-dirty @ 2022/10/26 21:17:46 (go1.17.13)
webui: http://[fd00:1122:3344:106::6]:8080
sql: postgresql://root@[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable
sql (JDBC): jdbc:postgresql://[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable&user=root
RPC client flags: /opt/oxide/cockroachdb/bin/cockroach <client cmd> --host=[fd00:1122:3344:106::6]:32221 --insecure
logs: /data/logs
temp dir: /data/cockroach-temp598479943
external I/O path: /data/extern
store[0]: path=/data
storage engine: pebble
clusterID: 3ca480c2-56af-46cf-a6d0-72dfebc97ba6
status: initialized new cluster
nodeID: 1
initiating graceful shutdown of server
server drained and shutdown completed
[ May 23 15:35:40 Stopping because all processes in service exited. ]
[ May 23 15:35:40 Executing stop method (:kill). ]
[ May 23 15:35:40 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
+ set -o errexit
+ set -o pipefail
+ . /lib/svc/share/smf_include.sh
++ SMF_EXIT_OK=0
++ SMF_EXIT_NODAEMON=94
++ SMF_EXIT_ERR_FATAL=95
++ SMF_EXIT_ERR_CONFIG=96
++ SMF_EXIT_MON_DEGRADE=97
++ SMF_EXIT_MON_OFFLINE=98
++ SMF_EXIT_ERR_NOSMF=99
++ SMF_EXIT_ERR_PERM=100
++ svcprop -c -p config/listen_addr svc:/oxide/cockroachdb:default
+ LISTEN_ADDR=fd00:1122:3344:106::6
++ svcprop -c -p config/listen_port svc:/oxide/cockroachdb:default
+ LISTEN_PORT=32221
++ svcprop -c -p config/store svc:/oxide/cockroachdb:default
+ DATASTORE=/data
++ svcprop -c -p config/datalink svc:/oxide/cockroachdb:default
+ DATALINK=oxControlService2
++ svcprop -c -p config/gateway svc:/oxide/cockroachdb:default
+ GATEWAY=fd00:1122:3344:106::1
+ [[ oxControlService2 == unknown ]]
+ [[ fd00:1122:3344:106::1 == unknown ]]
+ ipadm show-addr oxControlService2/ll
ADDROBJ TYPE STATE ADDR
oxControlService2/ll addrconf ok fe80::8:20ff:fe49:6741%oxControlService2/10
+ ipadm show-addr oxControlService2/omicron6
ADDROBJ TYPE STATE ADDR
oxControlService2/omicron6 static ok fd00:1122:3344:106::6/64
+ route get -inet6 default -inet6 fd00:1122:3344:106::1
route to: default
destination: default
mask: default
gateway: fd00:1122:3344:106::1
interface: oxControlService2
flags: <UP,GATEWAY,DONE,STATIC>
recvpipe sendpipe ssthresh rtt,ms rttvar,ms hopcount mtu expire
0 0 0 0 0 0 9000 0
+ args=('--insecure' '--listen-addr' "[$LISTEN_ADDR]:$LISTEN_PORT" '--store' "$DATASTORE")
+ exec /opt/oxide/cockroachdb/bin/cockroach start-single-node --insecure --listen-addr '[fd00:1122:3344:106::6]:32221' --store /data
[ May 23 15:35:40 Method "start" exited with status 0. ]
*
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
*
* This mode is intended for non-production testing only.
*
* In this mode:
* - Your cluster is open to any client that can access fd00:1122:3344:106::6.
* - Intruders with access to your machine or network can observe client-server traffic.
* - Intruders can log in without password and read or write any data in the cluster.
* - Intruders can consume all your server's resources and cause unavailability.
*
*
* INFO: To start a secure server without mandating TLS for clients,
* consider --accept-sql-without-tls instead. For other options, see:
*
* - https://go.crdb.dev/issue-v/53404/v22.1
* - https://www.cockroachlabs.com/docs/v22.1/secure-a-cluster.html
*
CockroachDB node starting at 2023-05-23 22:35:41.741418229 +0000 UTC (took 1.0s)
build: OSS v22.1.9-dirty @ 2022/10/26 21:17:46 (go1.17.13)
webui: http://[fd00:1122:3344:106::6]:8080
sql: postgresql://root@[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable
sql (JDBC): jdbc:postgresql://[fd00:1122:3344:106::6]:32221/defaultdb?sslmode=disable&user=root
RPC client flags: /opt/oxide/cockroachdb/bin/cockroach <client cmd> --host=[fd00:1122:3344:106::6]:32221 --insecure
logs: /data/logs
temp dir: /data/cockroach-temp1934682193
external I/O path: /data/extern
store[0]: path=/data
storage engine: pebble
clusterID: 3ca480c2-56af-46cf-a6d0-72dfebc97ba6
status: restarted pre-existing node
nodeID: 1
There is some matching info from another log file:
root@oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd:~# tail -10 /data/logs/cockroach.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log
W230523 22:35:40.269931 437285 sql/sqlstats/persistedsqlstats/flush.go:116 ⋮ [n1,server drain process] 15760 ‹failed to flush statement statistics›: flush statement ‹4025915790256006928›'s statistics: ‹insert-stmt-stats›: cannot acquire lease when draining
W230523 22:35:40.270375 437285 sql/sqlstats/persistedsqlstats/flush.go:116 ⋮ [n1,server drain process] 15761 ‹failed to flush transaction statistics›: flushing transaction ‹11006041404254834895›'s statistics: ‹insert-txn-stats›: cannot acquire lease when draining
I230523 22:35:40.272237 437285 1@server/drain.go:285 ⋮ [n1,server drain process] 15762 drain remaining: 0
W230523 22:35:40.273187 438084 kv/kvserver/intentresolver/intent_resolver.go:793 ⋮ [-] 15763 failed to gc transaction record: could not GC completed transaction anchored at ‹/Table/43/1/1/2023-05-23T22:00:00Z/"D0\x80K\x8a^_\x84"/"$ internal-find-scheduled-jobs"/1›: ‹node unavailable; try another peer›
I230523 22:35:40.273769 970 sql/stats/automatic_stats.go:489 ⋮ [n1] 15764 quiescing auto stats refresher
W230523 22:35:40.273824 449 sql/sqlliveness/slinstance/slinstance.go:243 ⋮ [n1] 15765 exiting heartbeat loop
I230523 22:35:40.273946 224 jobs/registry.go:1205 ⋮ [n1] 15766 AUTO SPAN CONFIG RECONCILIATION job 867665352511619073: stepping through state succeeded with error: <nil>
W230523 22:35:40.273749 975 jobs/registry.go:811 ⋮ [n1] 15767 canceling all adopted jobs due to stopper quiescing
W230523 22:35:40.274424 224 kv/txn.go:728 ⋮ [n1] 15768 failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: ‹log-job›: context canceled
I230523 22:35:40.281730 1 1@cli/start.go:899 ⋮ [n1] 15769 server drained and shutdown completed
I spent this afternoon looking at this without too much to show for it. I'll summarize what I did find below. Since I'm about out of time and I believe this system will be shut down tonight for other reasons, I figured I'd collect what I could. Here are log files from Nexus and CockroachDB, plus a `cockroach debug zip` file:
dap@catacomb /data/staff/core/issues/omicron-3207 $ ls -l
total 71682
-rw-rw-r--+ 1 dap staff 10777316 May 24 23:52 dap-debug-zip.zip
-rw-rw-r--+ 1 dap staff 25900155 May 24 23:52 omicron-3207.tgz
These were collected on gc19 (BRM42220017) using:
$ tar czvf omicron-3207.tgz /zone/oxz_nexus/root/var/svc/log /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/var/svc/log
and
/opt/oxide/cockroachdb/bin/cockroach debug zip --host '[fd00:1122:3344:106::6]' --port 32221 --insecure /data/dap-debug-zip.zip
I also tried to collect the whole Cockroach store directory. I can't tell if it worked:
BRM42220017 # tar czf /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/dap-data-dir.tgz /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data
tar: cockroach-health.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_41Z.004049.log: filename is greater than 100
tar: cockroach-health.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_12Z.007568.log: filename is greater than 100
tar: cockroach-pebble.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log: filename is greater than 100
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-security.log: symbolic link too long
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-telemetry.log: symbolic link too long
tar: cockroach-pebble.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_40Z.004049.log: filename is greater than 100
tar: cockroach-stderr.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_40Z.004049.log: filename is greater than 100
tar: cockroach-sql-schema.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log: filename is greater than 100
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-stderr.log: symbolic link too long
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-pebble.log: symbolic link too long
tar: cockroach-stderr.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_11Z.007568.log: filename is greater than 100
tar: cockroach-telemetry.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-24T06_37_42Z.004049.log: filename is greater than 100
tar: cockroach-security.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T16_54_12Z.007568.log: filename is greater than 100
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-sql-schema.log: symbolic link too long
tar: /pool/ext/49d7dcae-9a35-43e6-87b6-97648a9d72cd/zone/oxz_cockroachdb_oxp_49d7dcae-9a35-43e6-87b6-97648a9d72cd/root/data/logs/cockroach-health.log: symbolic link too long
tar: cockroach-telemetry.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-23T22_35_41Z.004049.log: filename is greater than 100
tar: cockroach-telemetry.oxzcockroachdboxp49d7dcae-9a35-43e6-87b6-97648a9d72cd.root.2023-05-24T22_39_41Z.004049.log: filename is greater than 100
BRM42220017 # echo $?
1
BRM42220017 #
I suspect we got the main logs but not the ones that were called out. Anyway, that's here now:
dap@catacomb ~ $ ls -l /data/staff/core/issues/omicron-3207/
total 163819
-rw-rw-r--+ 1 dap staff 48147833 May 25 00:00 dap-data-dir.tgz
-rw-rw-r--+ 1 dap staff 10777316 May 24 23:52 dap-debug-zip.zip
-rw-rw-r--+ 1 dap staff 25900155 May 24 23:52 omicron-3207.tgz
Finally, the shell history for the CockroachDB zone is there now too:
dap@catacomb /data/staff/core/issues/omicron-3207 $ ls -l
total 163828
-rw-rw-r--+ 1 dap staff 48147833 May 25 00:00 dap-data-dir.tgz
-rw-rw-r--+ 1 dap staff 10777316 May 24 23:52 dap-debug-zip.zip
-rw-rw-r--+ 1 dap staff 25900155 May 24 23:52 omicron-3207.tgz
-rw-rw-r--+ 1 dap staff 13362 May 25 00:01 shell_history_cockroachdb.txt
I confirmed these observations that @askfongjojo and @gjcolombo found:
I have also checked:

- `saga` vs. `saga_node_event`, so whatever behavior this is seems to be consistent on a per-saga basis
- `SHOW TRANSACTIONS` (which makes it seem less likely that this data is sitting in some uncommitted transaction)
- `AS OF SYSTEM TIME` with timestamps around the event as well as around now, and the state is unchanged (rules out that the system clock is somehow unintentionally selecting an older version)
- `uname -a` on this system appears to contain the fix for illumos#15254 (the infamous ymm issue)

But there's also a bunch of weirdness here:
W230523 22:42:36.873449 1967 kv/kvclient/kvcoord/dist_sender.go:1611 ⋮ [n1,client=[fd00:1122:3344:106::4]:60196,user=root] 315 slow range RPC: have been waiting 320.12s (1 attempts) for RPC Get [‹/Table/191/1/"89369400ad040eb1d9b3ee5ca45b7ea85f647e31"/0›,‹/Min›), [txn: 94501441], [can-forward-ts] to r149:‹/Table/19{1-4}› [(n1,s1):1, next=2, gen=56]; resp: ‹(err: <nil>), *roachpb.GetResponse›
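As an aside, the `AS OF SYSTEM TIME` check from the list above can be reproduced with a query of roughly this shape (the offset is illustrative; CockroachDB accepts negative interval strings here):

```sql
-- Read the instance row as of 10 minutes ago and compare it with a
-- current read; if the two match, we aren't accidentally being served
-- an older MVCC version of the row.
SELECT id, state, state_generation, time_deleted
  FROM instance AS OF SYSTEM TIME '-10m'
 WHERE name = 'sysbench-mysql-7';
```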
I also discovered that bb8 has an `error_sink` that we could be using to log errors. This might help us understand how Nexus responded to the CockroachDB restart.
Tried again with tar's `E` flag and there's now a `dap-data-dir-take2.tgz` file in that directory that should have all the files.
On some level, there are basically three possibilities:
The recent discussions around async cancellation got me wondering if that behavior could cause something like we're seeing here. Diesel itself is sync, so if there's a cancellation problem on the database side of Nexus, it'd presumably have to be in async-bb8-diesel. I did find something a little fishy in transaction_async
. Suppose something in the Nexus DataStore
invoked transaction_async
. The usual path would be:
DataStore
method.DataStore::pool_authorized()
to get back a reference to the underlying pool.pool.transaction_async(...)
, which is AsyncConnection::transaction_async()
. Note that AsyncConnection
is impl'd directly for the Pool. I found the code flow after this point a little confusing so I'm writing this down: inside transaction_async()
:
a. The code calls self.get_owned_connection()
, which, for a Pool
, ends up calling bb8::Pool::get_owned()
. This returns a static-lifetime object called bb8::PooledConnection
that wraps async-bb8-diesel's own Connection
type and impls Deref
for it. It looks to me like we will have exclusive access to the connection until the PooledConnection
is dropped. This is important!
b. The code then calls `run_with_shared_connection`, which calls `as_sync_conn()` on the `async_bb8_diesel::Connection`. This acquires a mutex, getting out whatever connection type `async_bb8_diesel::Connection` itself wraps.
c. Still inside `run_with_shared_connection`, with the underlying connection lock held and a reference to the underlying connection type, we execute the SQL to begin a transaction. After that, we drop the lock. At this point I worried that something else could come in and execute other SQL on this connection. I don't believe this is possible for the reason mentioned above: the pool considers this connection in-use and will not give it out to anybody else. (I'm not sure why we need the `Mutex` here.)
d. It's instructive to note that on failure at this point, we return from `transaction_async()`. This should cause the `PooledConnection` to be dropped and the connection returned to the pool. I would imagine that in most cases we failed because we failed to execute the "begin transaction" SQL for some reason, so the connection ought to be in a pretty normal state. I'm not positive this is always safe, though: suppose we did carry out the "begin transaction" but then failed on some other kind of error? Then we will have returned a connection to the pool that is still in a transaction.
e. Assuming we didn't fail: we call `as_async_conn()` on the `PooledConnection` to deref it to the underlying connection type and wrap that in an `async_bb8_diesel::Connection`. I think at this point we've got an `async_bb8_diesel::Connection` wrapping another `async_bb8_diesel::Connection` wrapping the real connection type, but I am not positive about that. Anyway, we finally invoke the async Rust function that the application (Nexus) provided to do whatever it is we wanted to do inside a transaction.
f. Critically: there's an await point here. If we were cancelled at this point, we'd wind up dropping a bunch of these connection wrappers, including the `PooledConnection`. This would cause the connection to get returned to the pool... but we have neither committed nor rolled back the transaction!
g. At this point, I fear that if anything else ever acquired this connection from the pool, it could run a bunch of SQL and get results back and believe that it had fully carried out a bunch of SQL operations, when in fact it was all still provisional on the outer transaction being committed. And that's never going to happen, because we've lost track of the fact that we're in a transaction in the first place. Note that this can all happen even if the victim code path itself tries to use a transaction, because transactions can be nested, creating savepoints. When the victim code thinks it's committing its transaction, it's merely committing its savepoint. It still hasn't carried anything out because the outermost transaction remains open.

I didn't see anything in the code path that returns a `PooledConnection` to the pool that terminates any open transactions (or resets any other connection state, for that matter).
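The failure mode in steps (a)-(g) can be modeled with a short std-only program. To be clear, this is a hedged sketch, not the real async-bb8-diesel code: `Pool`, `Conn`, `PooledConn`, and `transaction` below are hypothetical stand-ins. The point is only that dropping the future at the await point runs the checkout guard's `Drop`, returning a connection whose transaction was never committed or rolled back.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Stand-in for a database connection: tracks only whether a
// transaction was begun and never closed.
#[derive(Default)]
struct Conn {
    in_transaction: bool, // true after "BEGIN", false after "COMMIT"
}

// Stand-in for the connection pool: a shared stack of idle connections.
type Pool = Rc<RefCell<Vec<Conn>>>;

// Checkout guard: on drop, returns the connection to the pool as-is,
// mirroring the observation that nothing resets connection state on return.
struct PooledConn {
    pool: Pool,
    conn: Option<Conn>,
}

impl Drop for PooledConn {
    fn drop(&mut self) {
        if let Some(conn) = self.conn.take() {
            self.pool.borrow_mut().push(conn);
        }
    }
}

// A future that is Pending on first poll: our stand-in for the await
// point where the caller's transaction body runs.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 { Poll::Ready(()) } else { self.0 = true; Poll::Pending }
    }
}

// Stand-in for transaction_async(): check out, BEGIN, await body, COMMIT.
async fn transaction(pool: Pool) {
    let mut guard = PooledConn { pool: pool.clone(), conn: pool.borrow_mut().pop() };
    guard.conn.as_mut().unwrap().in_transaction = true; // "BEGIN"
    YieldOnce(false).await; // cancellation can strike here
    guard.conn.as_mut().unwrap().in_transaction = false; // "COMMIT"
}

// A waker that does nothing, so we can poll the future by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Returns (connection is back in the pool, it still has an open transaction).
fn run_cancellation_demo() -> (bool, bool) {
    let pool: Pool = Rc::new(RefCell::new(vec![Conn::default()]));
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(transaction(pool.clone()));
    assert!(fut.as_mut().poll(&mut cx).is_pending()); // parked at the await point
    drop(fut); // "cancellation": Drop impls run, conn goes back to the pool
    let pool = pool.borrow();
    (pool.len() == 1, pool[0].in_transaction)
}

fn main() {
    let (returned, open_txn) = run_cancellation_demo();
    // Both are true: the pool will happily hand this connection, with its
    // open transaction, to the next unsuspecting caller.
    println!("returned={returned} open_txn={open_txn}");
}
```

The demo prints `returned=true open_txn=true`: the connection is back in the pool and still inside an open transaction, which is exactly the hazard described in step (g).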
At this point this is all a guess based on reading the code. I don't see what would prevent it from happening. Could it have actually happened here? To simplify things, what would have to happen is:
I don't see any reason yet why this couldn't have happened. The next question is whether this would have left any evidence that we could look for. I will go back through the data I collected above and see if any of it is inconsistent with this explanation.
The only thing I've found above that might be inconsistent with this explanation was that I wrote:
> I don't see any open database transactions in SHOW TRANSACTIONS (which makes it seem less likely that this data is sitting in some uncommitted transaction)
but it's not impossible. I was investigating many hours after this had happened. We hadn't done #3232 yet so we'd have no log entry if connections had been closed and re-established for whatever reason. I also don't think we'd expect any log entries on cancellation, at least not in async-bb8-diesel nor dropshot. (We should add them in Dropshot at least, but after oxidecomputer/dropshot#701 this can't happen in Nexus anyway.)
I have written a much simpler program using Dropshot, diesel, and async-bb8-diesel to reproduce the failure mode I described above. It's exactly as feared. It leads to behavior that looks a lot like what we saw here: the program behaves in every way as though the data was written to the database, but if you inspect the database, the writes never made it there. I've filed oxidecomputer/async-bb8-diesel#47 for this.
Talking it over with @ahl, we realized that there should be some evidence of this: there must have been some Dropshot request that was cancelled in `transaction_async()`. In that case, we ought to have log messages from whatever authz checks were done for this request, and we ought not to have any log message showing that the HTTP request completed. I will try to look for this later today or tomorrow.
I think this might have happened rather a lot of times. So many that I doubted my methodology, but I think this might actually make sense.
I unpacked the Nexus log files we saved earlier and summarized the request ids from "authorize result" log messages:
$ grep -h '"authorize result"' oxide-nexus\:default.log* | json -ga req_id > authorize_req_ids
$ sort authorize_req_ids > authorize_req_ids_sorted
Then I dumped request ids from "request completed" messages:
$ grep -h 'request completed' oxide-nexus\:default.log* | json -ga req_id | sort -u > done_request_ids
Then I used `comm` to grab the ones in the first bucket but not the second, and for each one grabbed the "authorize result" log entries and pulled out the HTTP method and URI to see what was going on:
$ comm -23 authorize_req_ids_sorted done_request_ids | grep -v '^$' | while read req_id; do grep -h $req_id oxide-nexus\:default.log* | grep '"authorize result"' | json -ga req_id method uri; done | sort -u
071d5877-f36d-46e3-a929-400099e8113f POST //v1/instances?project=try
08ff808c-0083-449d-86af-bf56e5151cc8 POST //v1/instances?project=try
0a383bd0-60e4-4178-a74a-69ccd373b8f0 POST //v1/instances?project=try
0b304ae3-dc3f-4ddd-b4a6-70d10591cb7c POST //v1/instances?project=try
0c6e42e8-e196-40fa-8956-78755c6f2bbd GET https://venus.oxide-preview.com/v1/vpcs/a4ef8295-0150-4d9f-a719-e7ac1cf0c236
12058a3c-9952-4e3f-9af0-14263cbe79d8 POST //v1/instances?project=try
17df4b7e-f39a-48eb-83aa-7ab05a456c69 POST /v1/snapshots?project=alan
1905c7e0-a7b2-4502-9778-a3333f88495d POST //v1/instances?project=try
277cb6fa-4ffb-434f-8257-1b3874ce739e POST //v1/instances?project=try
2843e388-9adf-4b3a-8764-79f7ec4a364f GET https://venus.oxide-preview.com/v1/system/metrics/cpus_provisioned?id=001de000-5110-4000-8000-000000000000&start_time=2023-05-23T19%3A56%3A17.274Z&end_time=2023-05-23T20%3A56%3A17.274Z
312d44f4-506b-4d41-ad9d-5a9b3145a369 POST /v1/instances?project=ni
3fa23e80-cf80-47e3-80a2-2fccd1cada3b POST //v1/disks?project=try
4fe29269-a4c8-4ad2-840f-5bb3ade8c6ef POST //v1/instances?project=try
53f59f57-8ed2-40b9-8056-8100fd1c6658 POST /v1/instances?project=alan
554b4ffb-c120-425e-9b8d-e9b3602e759f GET https://venus.oxide-preview.com/v1/instances/ekke/external-ips?project=andyway
5a6529e0-2bb2-49a0-a64d-23a3cc689ff3 GET https://venus.oxide-preview.com/v1/system/metrics/virtual_disk_space_provisioned?id=001de000-5110-4000-8000-000000000000&start_time=2023-05-23T19%3A56%3A17.274Z&end_time=2023-05-23T20%3A56%3A17.274Z
5e25eb09-9319-43be-ad65-80b2545a2b96 POST //v1/instances?project=try
6a23ea6d-2f85-4193-86a2-e586c7f7bfa4 POST //v1/instances?project=try
7536bded-024f-4d2b-8b8a-242c6745e05a GET https://venus.oxide-preview.com/v1/system/metrics/ram_provisioned?id=001de000-5110-4000-8000-000000000000&start_time=2023-05-23T19%3A56%3A17.274Z&end_time=2023-05-23T20%3A56%3A17.274Z
7588cfc7-bf97-472c-bb39-143043cf258c POST /v1/instances?project=ni
76369d33-9de7-4c34-8f60-2acdf747a112 POST //v1/instances?project=try
82788e0d-f141-4b8c-bc8c-fdf2501e2554 DELETE https://venus.oxide-preview.com/v1/instances/ekke?project=andyway
83c58eb7-1fe6-4bcf-b878-f5748b5ae076 POST //v1/instances?project=try
95c259a0-9fb2-4d23-9827-ebb0001b4923 POST /v1/instances?project=ni
acf5b610-a6e2-4636-b060-4194bbd16da7 POST //v1/instances?project=try
ae6f52fc-9211-426f-b20f-468835755046 POST //v1/instances?project=try
b2b56af1-06cc-41a9-8cca-0e04d4c19786 POST https://venus.oxide-preview.com/v1/instances/ekkke32/start?project=ni
b4fcff81-f8f2-4e69-9e79-aba35a27d274 POST //v1/instances?project=try
b9fae27b-4e1f-43e2-b0ca-793b9498345a POST //v1/instances?project=try
c1d87a62-0beb-453e-8245-290d50361a75 DELETE https://venus.oxide-preview.com/v1/instances/ekke?project=andyway
c37b6312-446a-4daa-8c0d-bd82e055f793 POST //v1/instances?project=try
c383c8cc-3756-4765-83fd-ef852e8304ab GET https://venus.oxide-preview.com/v1/vpc-subnets/39dd79e5-78cb-4c7d-b35d-e19ade6d6a71
c40ba69f-48c7-44b2-a26f-0496334a2718 POST //v1/instances?project=try
ce83b28a-66f7-4827-abfa-a8d91ca77f9b POST //v1/instances?project=try
d6cd6013-0e42-4322-ad89-5683368bdeb4 POST //v1/instances?project=try
dc42d065-48cb-4882-b059-6493d25e1b5b POST //v1/instances?project=try
dd86b5aa-4906-4a4b-8d80-ff8ea4d47b14 POST /v1/instances?project=ni
e2ac1ec6-f8ba-4064-ad5d-cbb613a1980e POST /v1/snapshots?project=alan
e3450323-99c8-448d-87a0-44b709496aa8 POST //v1/instances?project=try
e5587b8f-4436-4dcc-bfb1-731b1a693ab6 POST //v1/instances?project=try
e7f34a9a-682f-4ac7-8f42-5accba540aaf POST /v1/instances?project=alan
ea9bbce9-8c28-4bc2-8e63-fb37e8974013 POST //v1/disks?project=try
f258e9d4-73cd-4b2d-b2d7-2604e6b969fb POST //v1/instances?project=try
f45d2a5e-dc11-4fd7-98ef-483334a8a639 POST //v1/instances?project=try
fb36dfd8-4dd9-4d16-a32d-66e7a5e1db1b DELETE https://venus.oxide-preview.com/v1/instances/ekke?project=andyway
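(As a toy illustration of the `comm -23` step above, with hypothetical file names: both inputs must be sorted, and `-23` suppresses lines unique to the second file and lines common to both, leaving only the lines unique to the first file, i.e. requests that were authorized but never completed.)

```shell
# hypothetical request ids; comm requires both inputs to be sorted
printf 'req-a\nreq-b\nreq-c\n' > /tmp/started_sorted
printf 'req-b\n' > /tmp/completed_sorted
comm -23 /tmp/started_sorted /tmp/completed_sorted
# prints req-a and req-c: started but never completed
```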
That's way more than I expected. But these are mostly pretty long-running operations like creating and destroying Instances, Snapshots, etc. @ahl points out that there was a period where the CLI timeout was generally too short. It seems like that could cause this problem if any of those things were using `transaction_async` (and not in a saga).
Okay, there's one call to `transaction_async()` inside `DataStore::role_asgn_list_for()`, which is called during authorization for any authorized request. So I think this means almost any cancelled request could trigger this. I don't think we'd have log entries to confirm this, because we only have "debug"-level logging here, so we don't have the message that gets logged when authorization starts.
> there was a period where the CLI timeout was generally too short

I can confirm that the CLI requests for instance create all hit the `operation timed out` error back when I filed this issue. Those requests would continue and succeed eventually.
I also hit Ctrl-C at least twice to terminate some API calls. I don't know if this had the same end result as a request timing out (that the async process already had a life of its own and wouldn't cleanly cancel).
This happened when I used a script to invoke `oxide instance stop` for 10 instances successively in a loop, and then delete them with the CLI in another loop. The requests came back with success responses (ignore the two destroyed instances - they will be covered in another ticket):

However, these instances are now "shell" records in the instance table - the propolis zones are already gone from the sled:
I looked into one of the instances, sysbench-mysql-7, which is located on sled gc11. It was set to `stopped` at 22:40:03. The zone was removed at 22:40:08, according to the sled-agent log for gc11:
but the delete saga was created 44 seconds later, at 22:40:52, presumably because Nexus was busy handling deletion requests for other instances:
I didn't find any saga error in the log, but the saga also doesn't exist in the database:
At 22:42:40, I tried to recreate a VM with the same name, but that failed with an `object already exists` error. These VMs are now ghost records and cannot be removed, because the VMs don't have a backend and can't be stopped/deleted.

The complete nexus log is located in catacomb.eng.oxide.computer:/data/staff/dogfood/may-23/.