oxidecomputer / transceiver-control

Crate for controlling optical transceivers over the network
Mozilla Public License 2.0
7 stars 0 forks source link

Need a design for SP -> Host information flow #91

Open Aaron-Hartwig opened 1 year ago

Aaron-Hartwig commented 1 year ago

We don't really have a design for the SP -> Host class of requests yet to my knowledge. We could have the host poll regularly (via a new HostRequest message )and get the updates in bulk, but I'd rather we build out SpRequest messages to indicate presence changes, interrupts from modules, and hardware faults. This also feels like an important thing to solve imminently, so I'd propose we add it to the FCS milestone.

@bnaecker thoughts?

bnaecker commented 1 year ago

How apropos. I'm currently adding a polling loop into Dendrite now, which fetches the state every so often. I might argue that's good enough at this point, but I don't really know what's involved in getting the SP to emit these updates. Does that happen in the FPGA? Or is there a polling loop in the SP code? Another question is, would the SP emitting these messages impact correctness, or just latency? I suppose if state is changing more quickly than the polling loop, we'd lose resolution around those transients. But for steady-state behavior we'd learn what we needed.

Aaron-Hartwig commented 1 year ago

It's polling the whole way down right now, so doing that at another layer would keep the model simple and that has been good enough. The SP just polls the FPGAs to get updated state, which it would then pass on to Dendrite at some frequency. The only impact here is latency. Any issues which require immediate action (thermal/power problems) have action taken directly by the SP or FPGA without needing Dendrite in the loop anyway.

bnaecker commented 1 year ago

So, thinking through the design a bit. If we have the SP poll, we need to decide at least two things: (1) how often to poll and (2) where to send messages. I think that means we need a host request message for configuring those things, something like:

struct NotificationSetup {
    pub interval: std::time::Duration,
    pub address: std::net::SocketAddrV6,
}

After the SP acknowledges that, it would need to continually send messages, though we may want it to decide to remove that subscriber at some point if enough of the messages fail or are not acknowledged. It seems like we'd also want a way to deregister as a subscriber too.

Does that all make sense and sound reasonable? I think my worry is that it's a good bit of extra code and state to put into the SP. We'd need to keep track of subscribers, though I guess that could be limited to at most one at a time. (Or some other small number.) We'd need logic for polling, and for handling the case where we fail to send updates, possibly removing the subscriber at some point.

Actually, now that I think about it, we'd also need to decide which states we want to be notified about. Is it just status changes?

Another issue that comes to mind is that we might need to put the logic for polling into the host-side anyway. The main reason is so that we could tell the difference between getting no updates from the SP, and a network partition that prevents any updates from getting through.

Anyway, that's just a few random thoughts, let me know if that shakes anything loose.

Aaron-Hartwig commented 1 year ago

If we have the SP poll, we need to decide at least two things: (1) how often to poll and (2) where to send messages.

Why does this need to be configurable? Also, isn't there only one place to send the message (the attached Scrimlet)?

After the SP acknowledges that, it would need to continually send messages, though we may want it to decide to remove that subscriber at some point if enough of the messages fail or are not acknowledged. It seems like we'd also want a way to deregister as a subscriber too.

Pub/sub seems complicated given we already have a model for this type of SP/Host communication (one request, one response).

Does that all make sense and sound reasonable?

Not really, I don't think we are on the same page here. I must be missing something.

I don't see why this warrants an entirely new paradigm to what we already have in place. My vision (which I acknowledge is not up to date with where you are at currently) would be to choose one of the following:

Lets sync on this sometime soon so we can get on the same page, I think I'm lacking a piece of understanding that you have.

bnaecker commented 1 year ago

@Aaron-Hartwig and I talked about this in person today. We're both in agreement that this should be deferred for now. We'll wait until we have good reason that polling isn't sufficient, which is what Dendrite and the Tofino SDE do today. That'll hopefully inform the design of the updates, both to the message protocol and the SP code as well.