nerves-hub / nerves_hub_web

Manage firmware updates for Nerves devices
https://nerves-hub.org/
Apache License 2.0
175 stars 63 forks source link

Fix crash on missing firmware metadata for joining #1358

Open lawik opened 1 week ago

lawik commented 1 week ago

This might not be worth doing. I was doing a bad job mocking firmware information on MIX_TARGET=host.

joshk commented 6 days ago

@jjcarstens thoughts / feedback?

jjcarstens commented 6 days ago

Personally, I don't think it's needed. If we're marking a successful update, we should have the metadata and uuid. If we don't, we should blow up cause that's totally against expectations. It seems more detrimental to have these events with a nil firmware uuid and the reason it is missing is totally hidden. I think this was just the result of the host test setup

lawik commented 6 days ago

I had this code path randomly blowing up entirely without doing any updates. Just my "host" device joining and not having fully populated nerves runtime kv. I fixed it by setting the UUID. But the telemetry event crashing the DeviceChannel connection seemed bad. I couldn't keep the device on so I wouldn't have been able to ship a fix I think.

lawik commented 22 hours ago

@jjcarstens you are entirely right that this was weird for a firmware update success.

It was also weird that a nil firmware would trigger this behavior. That's a thing we will see with onboarding via shared secrets.

Turns out the current code just interpreted a non-matching firmware metadata structure as the firmware having changed. This should be a better fix.

jjcarstens commented 22 hours ago

That doesn't sound quite right. Firmware tracking is integral to the firmware update server. Even if it is shared secret connection, the firmware metadata should still be reported with the connection. This still seems like we're making an affordance for a situational that should be exceptional

lawik commented 22 hours ago

Okay. I see your point. I'm not sure I agree that starting from unregistered firmware is exceptional. It can happen whenever someone connects a device running firmware that wasn't also uploaded.

So the initial fix removed the uuid requirement for approving a firmware update. But that I recognize was the wrong place.

This changes so that it doesn't break a device's ability to join if the firmware it reports is missing in the database.

I can see why it would be desirable to be strict on the production side but I think it is a problem if that's a requirement when someone is getting started. This means that we can't set people up with generic firmware as a starting point for then getting their device onto a NervesHub instance. And I've already had a user of NervesHub say that they wanted to do essentially that. Ship a starting firmware that can later be changed.

The current behavior is that the device gets registered but doesn't stay online because the join crashes. So the device enters the system with firmware_metadata: nil. If this situation comes up it is not possible to send firmware to the device because it can't maintain a connection.

Happy to find another angle to ensure we don't lose strictness for production use but I think it will be quite limiting for onboarding if firmware needs to be known up front.

Josh and I did just discuss having some kind of flag that distinguishes between dev/lab Products and serious production products. This might be one of the behaviors that would depend on a setting like that.

lawik commented 21 hours ago

I will work through setting up the required bits, might tell me something about how I could make it convenient without trading off any rigeur.

jjcarstens commented 18 hours ago

I think I see the misunderstanding here. There is no coupling for the firmware UUID needing to exist in the database. A device can report any firmware uuid in the metadata.

The initial situation of the PR was that no metadata was reported because it was a mocked situation running on host. Missing the required metadata entirely is what I consider the exceptional case and don't think we should make any affordances for it. Maybe we can adjust the error handling of it to be easier to deduce, but I think we need to prevent the connection and not silently allow it through with missing crucial details

lawik commented 18 hours ago

Ah yes. I will try to verify with a smaller case that I'm running into a problem where firmware_metadata missing from the db is preventing my join.