oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Internal error in bootstrap agent accessing `/components` #2655

Open bnaecker opened 1 year ago

bnaecker commented 1 year ago

Alan and I are dogfooding on Rack 2. We were doing something else, but got confused about the endpoints in the bootstrap agent. To verify, we tried to hit another endpoint that should exist, and picked /components. We ran:

BRM42220070 # curl http://[fdb0:a840:2504:154::1]:80/components
{
  "request_id": "c1baf3ba-081b-48f9-9d67-8f90dbcf151b",
  "error_code": "Internal",
  "message": "Internal Server Error"
}
BRM42220070 #

And see this in the bootstrap agent logs:

{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"1986-12-28T01:56:27.134627922Z","hostname":"BRM42220070","pid":682,"uri":"/components","method":"GET","req_id":"c1baf3ba-081b-48f9-9d67-8f90dbcf151b","remote_addr":"[fdb0:a840:2504:154::1]:62763","local_addr":"[fdb0:a840:2504:154::1]:80","component":"dropshot (BootstrapAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error accessing version information: Malformed version in artifact /opt/oxide/crucible_pantry.tar.gz: Missing 'pkg'","response_code":"500"}

I don't know what's supposed to be happening here, but the 500 and the error about a malformed artifact seem unexpected.

smklein commented 1 year ago

I think this response code is arguably correct, if unsatisfying.

Here are the contents of your oxide.json (I assume):

{"v":"1","t":"layer"}

In #2448, we added the ability to "stamp" versions onto zone images.

This is a necessary mechanism for labeling zone images with versions, and for having them self-report those versions.

If we stamp a zone image, it'll have the following format:

{"v":"1","t":"layer", "pkg":"crucible_pantry", "version":"0.0.0"}

This is the format that the /components endpoint was added to parse -- it will let wicket query for existing versions of software on the sled.
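To make the failure mode concrete, here is a minimal sketch of how a reader of oxide.json could end up reporting "Missing 'pkg'" for an unstamped image. This is not the actual omicron code: `MetadataError`, `StampedLayer`, `parse_flat_object`, and `read_stamp` are hypothetical names, and the tiny hand-rolled parser only handles flat objects of plain strings like the two examples above.

```rust
use std::collections::HashMap;

/// Hypothetical error type; the variant names are assumptions, not omicron's.
#[derive(Debug, PartialEq)]
enum MetadataError {
    /// The input is not a flat `{"key":"value", ...}` object.
    Malformed,
    /// A field required for a stamped image is absent.
    MissingField(&'static str),
}

/// Hypothetical view of the version information in a stamped `oxide.json`.
#[derive(Debug, PartialEq)]
struct StampedLayer {
    pkg: String,
    version: String,
}

/// Strip surrounding whitespace and double quotes from one key or value.
fn unquote(s: &str) -> Result<String, MetadataError> {
    s.trim()
        .strip_prefix('"')
        .and_then(|rest| rest.strip_suffix('"'))
        .map(str::to_string)
        .ok_or(MetadataError::Malformed)
}

/// Parse a flat JSON object whose keys and values are all plain strings.
/// Deliberately naive: no escapes, no nesting, and no commas or colons
/// inside strings. That is enough for the examples in this thread only.
fn parse_flat_object(input: &str) -> Result<HashMap<String, String>, MetadataError> {
    let body = input
        .trim()
        .strip_prefix('{')
        .and_then(|rest| rest.strip_suffix('}'))
        .ok_or(MetadataError::Malformed)?;

    let mut fields = HashMap::new();
    for pair in body.split(',') {
        if pair.trim().is_empty() {
            continue; // tolerates an empty object, `{}`
        }
        let (key, value) = pair.split_once(':').ok_or(MetadataError::Malformed)?;
        fields.insert(unquote(key)?, unquote(value)?);
    }
    Ok(fields)
}

/// Require both `pkg` and `version`; an unstamped image fails on `pkg` first.
fn read_stamp(input: &str) -> Result<StampedLayer, MetadataError> {
    let mut fields = parse_flat_object(input)?;
    let pkg = fields
        .remove("pkg")
        .ok_or(MetadataError::MissingField("pkg"))?;
    let version = fields
        .remove("version")
        .ok_or(MetadataError::MissingField("version"))?;
    Ok(StampedLayer { pkg, version })
}
```

With this sketch, the unstamped `{"v":"1","t":"layer"}` yields `MissingField("pkg")`, while the stamped form parses into a `pkg` and `version` pair.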

However, there's an unfortunate reality: right now, I don't believe anything is stamping versions onto these tarballs. We've discussed doing this in permission-slip prior to constructing a TUF repo, but it hasn't happened yet.

TL;DR: The error message sucks, but this is not surprising to me. We want versions to be stamped on tarballs more regularly.

smklein commented 1 year ago

As a follow-up on the error specifically - I do kinda think that if the sled agent is operating on unversioned software, we should consider that an error. A reasonable follow-up may be to ensure that all zone images produced have at least a placeholder version, until they are passed through the release engineering process?
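One possible shape for the placeholder idea, as a rough sketch only: fill in `pkg` and `version` at build time when they are absent, and leave any existing stamp alone. The function name `stamp_placeholder`, the constant `PLACEHOLDER_VERSION`, and the choice of "0.0.0" as the placeholder are all assumptions for illustration; nothing here reflects how the packaging tooling actually works.

```rust
use std::collections::BTreeMap;

/// Assumed placeholder value; "0.0.0" mirrors the stamped example in this thread.
const PLACEHOLDER_VERSION: &str = "0.0.0";

/// Add `pkg` and `version` to the metadata fields only when they are absent,
/// so a real version stamped by release engineering is never overwritten.
/// Returns `true` if any field was added.
fn stamp_placeholder(fields: &mut BTreeMap<String, String>, pkg_name: &str) -> bool {
    let mut changed = false;
    for (key, default) in [("pkg", pkg_name), ("version", PLACEHOLDER_VERSION)] {
        if !fields.contains_key(key) {
            fields.insert(key.to_string(), default.to_string());
            changed = true;
        }
    }
    changed
}
```

Making the operation a no-op on already-stamped metadata means it could run unconditionally during image construction without interfering with a later, real stamp.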