tinkerbell / pbnj

Service for interacting with BMCs
Apache License 2.0
109 stars 38 forks source link

Non-deterministic behavior in serialized contexts #130

Open chrisdoherty4 opened 2 years ago

chrisdoherty4 commented 2 years ago

This issue is based solely on observations and code study. As with any non-deterministic behavior, it can be difficult to pinpoint and correct so more eyes and validation would be useful.

PBnJ operates on requests asynchronously. When sending multiple, serialized requests to PBnJ it accepts them but has no context. As requests are serviced asynchronously its left to the go runtime to perform non-deterministic scheduling of the goroutines resulting in non-deterministic behavior. For example, sending a boot device request followed by a power on request can result in machines starting but not with the expected boot device.

I don't think this is a bug, I think its a design flaw. PBnJ would benefit from a deterministic API of sorts removing the need for consumers to synchronize their requests.

Possible Solution

(1) Introduce synchronous APIs. This would ensure RPCs don't return until the action has actually been carried out greatly improving the consumer experience.

(2) Offer an API that allows specifying multiple actions per request. This would allow PBnJ to operate on the actions asynchronously still but provide the context so the actions can be made serially.

(3) The most complicated. PBnJ could manage a queue per BMC. When a request is received and no queue is present for the endpoint create a new one and hold it in memory in an LRU cache, possibly with timeouts too (just to control the memory footprint more intentionally). Add the task to the queue and have a job management construct with N workers plucking from the various queues. You can't start the next item in the queue until the previous one for that queue has finished but you can pluck from other BMC queues. This hinges on some sort of static data for BMCs available in a request so you can lookup the queue. I suspect the IP address would be static enough given it makes little sense for BMCs to be dynamic (maybe people use a fancy dynamic DNS setup?). This is akin to a job management system.

Steps to Reproduce (for bugs)

Its a race condition so hard to reproduce. Run boot device and power on requests in that order enough times and you'll probably see it.

Context

On EKS-A we observed, when beginning the provisioning process, that machines booted into disks despite being asked to PXE boot. The BMC capability offered by EKS-A ensures we can provision machines even if they have an existing image on the OS.

jacobweinstock commented 2 years ago

Thanks for the details around this @chrisdoherty4. What are the current work arounds? Does EKS-A check the result of each request before moving on? A sync API wouldn't solve this issue if the client is just firing these calls off without waiting.

Moving to an entirely synchronous API has been brought up before. I think its a good starting point to discuss in further detail.

chrisdoherty4 commented 2 years ago

EKS-A's workaround is Rufio 😆 . Rufio is designed to avoid this particular problem. That said, a time.Sleep(time.Second) in the client would likely do the job but obvs isn't good.

With respect to PBnJ, a synchronous API would ensure a client knows a request has hit a BMC because PBnJ wouldn't respond until that's done. Its the difference between PBnJ making the statement "I'm working on it" (async) vs "I've done it" (sync). By having an async operation performed out of order relative to other requests, PBnJ forces clients to have to inspect BMC state before sending a subsequent request to perform another action which forces complexity into the client that could otherwise be handled by PBnJ.

My concern with a sync API, while simple, is around BMC responsiveness. I recall a chat we had around experiencing BMCs taking quite some time, too long for a synchronous API to typically want to hold.