openshwprojects / OpenBK7231T_App

Open source firmware (Tasmota/Esphome replacement) for BK7231T, BK7231N, BL2028N, T34, XR809, W800/W801, W600/W601, BL602 and LN882H
https://openbekeniot.github.io/webapp/devicesList.html

Duplicate MAC addresses after OTA upgrade to 1.15.1 #443

Open ReanuKeeves01 opened 1 year ago

ReanuKeeves01 commented 1 year ago

Describe the bug

I updated one of two RGB bulbs last night to the latest OpenBK (1.15.1). After the OTA and a reboot, both lamps came back online. Both web interfaces worked nicely, but neither lamp reported MQTT statuses anymore. So I checked my settings and did a fresh restart. Lamp 1 came online, started reporting MQTT again, and is fully working.

Light number 2 however, after rebooting is available and online through MQTT, I can change and read its status as usual. The web interface however doesn't work anymore. I thought perhaps it's gotten a new IP/DHCP lease, so checked LanScan, but this 2nd bulb simply doesn't show up anymore.

Any way I can troubleshoot this without hard resetting the bulb?

Also, what would be the correct steps/process for hard resetting OpenBK bulbs? I just realized I don't know how to reset them.

Firmware:

openshwprojects commented 1 year ago

This sounds like a serious issue. Can you specify which version was stable and when the issue started? We need to trace back the commits.

@ReanuKeeves01 is it possible that it's an issue related to the new LED effect lerp? Have you tried disabling it?

Edit: Hmm, my table bulb with yesterday's build is still reachable...

IDEA: Do 5 quick power on/off cycles of the bulb, connect to the AP, disable the smooth LED transitions, reboot the bulb, and tell me if it still has issues. NOTE: in safe AP mode the bulb will not show any colors, because in SAFE mode the outputs are disabled and it's not possible to control pins.

ReanuKeeves01 commented 1 year ago

Gotcha, I will have to try this tonight. Just left for the office for a couple of hours. I did not try the new effect, so I doubt it has anything to do with that. One thing I just remembered: when I was trying last night to get the MQTT of that bulb up and running again, I might have switched flag 2 on for a minute or two. I'm sure I turned it back off again, but could this perhaps have something to do with it?

One other thing I just noticed. It's already 'fixed' again, so I can't say 100% if this is accurate. I checked the client list on my router. I see 'Cactus lamp', which has IP x.x.2.60 (and always had), so I tried visiting that one to make sure other updated bulbs still work. It loads, but strangely enough it's bearing the name/title of the 2nd RGB bulb that's missing, which sat at x.x.2.48 before.

So in short, I visited the expected IP of one OpenBK instance, where I expected this 'Cactus lamp', yet I got the settings/menu for the RGB lamp, which had a totally different IP before. I'm not sure if it's a glitch or I'm being stupid. 2 minutes later the page went unresponsive again, and a few minutes after that the original 'Cactus lamp' showed up again under that IP.

Could this perhaps be an IP assignment issue or something of the like? Perhaps hostnames that are identical? One thing I will surely do tonight is set them all up with reserved IPs and make sure their hostnames/MACs are all unique.

openshwprojects commented 1 year ago

One thing to check would be MAC address duplication. If there are two devices with the same MAC, very bad things will happen.

It is also possible that there is a router issue: it has too many clients, or the DHCP lease has ended and the device somehow got a different IP.

Of course, it's also possible that there is some kind of bug, so please report anything you find and make sure you include information about version and config.

But in general, during the last few days, the only big change was that LED color lerp... Right now I can't think of anything else that could have broken things.

ReanuKeeves01 commented 1 year ago

Alright, I will do some further investigation tonight. I need to properly assess them and then check all the variables to really get down to the possible issue.

Will report back once I know more.

ReanuKeeves01 commented 1 year ago

@openshwprojects ok, this was bothering me a bit more than I thought, so I did some further investigation over VPN, and can confirm it is indeed a duplicate MAC address issue. I now have three bulbs, all with the same MAC address: c8:47:8c:00:00:00

When I switch them on/off separately from each other, I can reach them one by one. The minute a second one is switched on it goes haywire and the web interface isn't reachable anymore.

I tried manually adjusting the MAC address in the config (tried this on both bulbs, both on 1.15.1). After submitting, it defaults back to the original MAC:

Screenshot 2022-11-08 at 10 00 06

tldr: I now have 3 bulbs/instances of OpenBK all bearing the same MAC address, adjusting the MAC manually to unique values doesn't seem to work for me.
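
For anyone wanting to check their own network for the same symptom, here is a minimal sketch that flags MAC addresses shared by more than one IP. It assumes the router's client list has been copied by hand into (IP, MAC) pairs; the helper name `find_duplicate_macs`, the IPs, and the fourth MAC are hypothetical and only there for illustration.

```python
from collections import defaultdict


def find_duplicate_macs(clients):
    """Return {mac: [ips]} for every MAC that appears on more than one IP.

    `clients` is an iterable of (ip, mac) pairs, e.g. copied by hand from a
    router's client list. MACs are normalized so that letter case and
    '-' vs ':' separators do not hide a duplicate.
    """
    ips_by_mac = defaultdict(set)
    for ip, mac in clients:
        normalized = mac.strip().lower().replace("-", ":")
        ips_by_mac[normalized].add(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Hypothetical client list: the IPs and the last MAC are made up.
clients = [
    ("10.0.0.11", "C8:47:8C:00:00:00"),
    ("10.0.0.12", "c8:47:8c:00:00:00"),
    ("10.0.0.13", "c8-47-8c-00-00-00"),
    ("10.0.0.20", "02:11:22:33:44:55"),
]

duplicates = find_duplicate_macs(clients)
for mac, ips in duplicates.items():
    print(f"{mac} is shared by {', '.join(ips)}")
```

Note that a router may only show one entry per MAC at any given moment, so the list may need to be collected while switching the devices on one at a time.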

GravityRZ commented 1 year ago

This looks like the same MAC address bug we had before. The old bug was that if you flashed a bulb with old firmware (before 106) to a new version, the RF partition was partly overwritten. The "Restore RF config" button would fix this (do not do it yet).

@openshwprojects so the RF config can apparently still get changed.

My test light is running well on 1.15.2. I think I flashed from 1.14.142 to 1.15.2.

ReanuKeeves01 commented 1 year ago

@GravityRZ I still have a bulb on the 'older' firmware, so I can check later which version I upgraded from.

My steps for doing OTA were:

  1. Through the web-app -> OTA tab -> drag n drop .rbl file
  2. I then clicked 'Start safe OTA (keep LittleFS data)'
  3. Everything starts processing and the flashing starts. It then does a countdown for a reboot, device reboots, the page now mentions 'restoring' ... after which it says 'Sequence completed' - that's the end.
  4. I now check the web interface -> works -> restart the device from the interface

this is where problems started.

openshwprojects commented 1 year ago

Huh... you keep FS data? But do you have any? Why not quick OTA?

@btsimonh, there have been multiple reports of that. It seems that using "keep FS data" to update from a before-small-LFS version to an after-small-LFS version breaks the partition...

The only thing you can do now @ReanuKeeves01 is to use my RF restore tool...

I am beginning to understand. It seems that @btsimonh change of LFS handling with the addition of "safe OTA" may break things.... still, I don't know why...

GravityRZ commented 1 year ago

That's how I did it also.

Had the problem when I upgraded from 1.12.xx to 1.14.

You can probably fix your problem with the RF config restore, but wait until somebody tells you it is safe.

Then hit the button, wait 5 minutes (you do not get any message), and after that reboot. After that you can change the MAC back to the original.

ReanuKeeves01 commented 1 year ago

@openshwprojects to be honest, I wasn't quite sure which option I should use when doing an OTA upgrade. I assumed that would be the safe option, whereas the other option sounds 'scary' (delete all LittleFS data).

Perhaps an entry in the readme/instructions on how to do a safe and proper OTA would be helpful for future issues?

I'm also not entirely sure I understand how the RF RESTORE works or should work. Seeing it's only 3 bulbs at the moment, isn't it easier to hard reset them, and set them up afresh? Or will this duplicate mac issue still remain?

openshwprojects commented 1 year ago

@ReanuKeeves01 MAC issue will remain until you either use our RF restore button or restore the RF section through the backup of original firmware, if you have one.

It's not about the lack of a readme. It's simply a byproduct of the LFS system change that we did not expect. I would need to consult @btsimonh about that, because he is the maintainer of LFS and OTA and I don't know much about it.

You can TRY doing restore RF partition on one of the bulbs and report here results.

ReanuKeeves01 commented 1 year ago

@openshwprojects I unfortunately don't have a backup, unless it's saved somewhere on the device while doing an OTA upgrade.

I'm happy to try the RF Restore on one of the bulbs and see how that works out. Just to be on the safe side, we are referring to this button/option right?

Screenshot 2022-11-08 at 13 48 11

As there's a 'restore fsblock' on the OTA tab. And there's also a 'Restore fsblock' on the Filesystem-tab.

In the event this doesn't work, does it mean the device is 'bricked', or could I re-flash them or something? Apologies for the ignorance.

openshwprojects commented 1 year ago

This button, Restore RF Config. It requires a manual restart afterwards.

ReanuKeeves01 commented 1 year ago

@openshwprojects Ok, gotcha! I will give this a try once I'm back home in a few hours. Although I can do the RESTORE remotely, I have no means to remotely reboot the device without physical access. will follow up with results.

Fingers crossed :-)

GravityRZ commented 1 year ago

@ReanuKeeves01 do one light at a time (switch the others with the same MAC off), reboot, and change the MAC back to what it was (it's probably in your DHCP assignment list).

I never assigned the light a permanent IP address in the beginning, so I thought I had lost the original MAC. It turned out that my router had cached it, so I could restore the original.

@openshwprojects how long does the RF restore process take? Since there is no popup when it is done, it would be nice to know after approximately how many minutes we can restart the device.

FYI, I also used the safe OTA method (which apparently is not that safe, haha).

ReanuKeeves01 commented 1 year ago

@GravityRZ is it important at all to use the original MAC address? Or can we just rely on / fall back to a random one? I have no way to recover the original MAC addresses; by now they all show up in the client list and the historic list with the same wrong MAC (ending in 0000000).

I don't really care much about which MAC it gets assigned. The minute I get them back up and running with unique MAC addresses and IPs, I plan to use a DHCP reservation to avoid further issues...
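
If a replacement MAC ever has to be picked by hand, one common convention is a random unicast address with the "locally administered" bit set, so it cannot collide with a vendor-assigned one. The sketch below only illustrates that convention; `random_local_mac` is a hypothetical helper and this is not a description of how OpenBK itself generates addresses.

```python
import secrets


def random_local_mac():
    """Return a random unicast, locally administered MAC as 'xx:xx:xx:xx:xx:xx'."""
    octets = bytearray(secrets.token_bytes(6))
    # Bit 1 of the first octet set   -> locally administered.
    # Bit 0 of the first octet clear -> unicast (not multicast).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02x}" for b in octets)


print(random_local_mac())
```

Each call gives a different address, so generating one per bulb and then pinning it with a DHCP reservation keeps the devices distinguishable.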

openshwprojects commented 1 year ago

@ReanuKeeves01 you can just reboot by "restart" button... you don't need physical access for that.

@GravityRZ you can just click "read RF partition" and see if it changed and got its TLV header back.

ReanuKeeves01 commented 1 year ago

@GravityRZ @openshwprojects

Ok, success (with one instance at least).

Steps I've taken:

  1. Use the 'Restore RF config' button on the FLASH tab (gotta love the extra warning you've put in place ^^)
  2. Go get a drink, and wait for approx. 5 minutes.
  3. Refresh the normal web interface and hit RESTART
  4. Light goes off, and comes back on almost instantly
  5. Device not reachable on the original IP anymore
  6. Checking my Router's client list I can see a new IP/MAC combo joined the network (GOOD!)
  7. Success - device got a new (random?) mac address and was assigned a new IP
Screenshot 2022-11-08 at 17 17 07

I will now do these steps carefully for all 3 bulbs and hopefully all is back to working order.

Note to self: Use the QUICK OTA option next time

PS: Perhaps it's a good idea to put some sort of label/warning/notice on the SAFE OTA option. Clearly it's not that safe, even though that is why I chose that option in the first place (safe = good, right?)

openshwprojects commented 1 year ago

The [[EDIT] safe not quick] OTA bug was TOTALLY not expected. I still don't know why it happened. I am waiting for @btsimonh comment on that issue.

Hmm, the restore RF indeed randomizes the MAC. I added that in the Vue code a few days ago.

GravityRZ commented 1 year ago

You mean the SAFE OTA bug, right?

> The quick OTA bug was TOTALLY not expected. I still don't know why it happened. I am waiting for @btsimonh comment on that issue.
>
> Hmm the restore RF indeed randomizes MAC, I have added it in the VUE code few days ago.

ReanuKeeves01 commented 1 year ago

@GravityRZ I most definitely used the SAFE OTA button.

For my own sanity, I have added a simple entry in my TamperMonkey browser extension that appends a line to the SAFE button and makes it red... this helps me sleep better knowing I won't use that option again next time. ^^

Screenshot 2022-11-08 at 17 44 14

GravityRZ commented 1 year ago

Maybe change the subject of this thread to "MAC address erased after OTA upgrade". That way it is easier to spot, and people with specific knowledge about it will react.

ReanuKeeves01 commented 1 year ago

@openshwprojects should I keep this issue open while the possibility of it occurring to others still exists?

btsimonh commented 1 year ago

Yes, I can imagine that the backup/restore of LFS could be an issue if it reads 512k and then tries to write 512k back to an address only 32k before the end of the OTA partition. The RF data is the 'next' block of flash :(. I'll check the code. It makes no sense to try to upload an LFS volume which does not match the size that the device believes it should be... and the RF and config partitions could probably do with a little more base-level protection against overwrite.

btsimonh commented 1 year ago

Thanks to all who reported this one. I found and resolved the flash overwrite. The fix is included in the latest -alpha (you can find it in the releases list itself, not under the latest release). The issue was the use of the button to back up and restore LFS. It was able to back up 512k, and then attempted to restore 512k to a much later address in flash. Now the OTA routines restrict writing to within the OTA area.
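
To make the arithmetic concrete, here is a minimal sketch of the overflow and of the kind of bounds check described above. It is not the firmware code: the layout is hypothetical (offsets are relative to the start of the OTA area, and the 1024 KiB size is an assumed value), and `overflow_bytes` and `check_write` are made-up helper names. Only the 512k write and the 32k distance from the end come from the description.

```python
KIB = 1024

# Hypothetical layout, as offsets from the start of the OTA area.
OTA_START = 0
OTA_END = 1024 * KIB  # exclusive end (assumed size); the RF data would follow


def overflow_bytes(write_start, write_len, region_end):
    """Number of bytes of a write that would land at or past region_end."""
    return max(0, write_start + write_len - region_end)


def check_write(write_start, write_len, region_start=OTA_START, region_end=OTA_END):
    """Raise ValueError unless the whole write stays inside the region."""
    if write_len < 0 or write_start < region_start:
        raise ValueError("write starts outside the region")
    if write_start + write_len > region_end:
        raise ValueError("write would run past the end of the region")
    return True


# The scenario from the thread: 512k restored to an address 32k before the end.
restore_start = OTA_END - 32 * KIB
restore_len = 512 * KIB

spill = overflow_bytes(restore_start, restore_len, OTA_END)
print(f"{spill // KIB} KiB would land past the OTA area")  # prints "480 KiB ..."

try:
    check_write(restore_start, restore_len)
except ValueError as err:
    print(f"rejected: {err}")
```

With a check like this in front of every write, the oversized restore is refused instead of spilling into whatever follows the OTA area.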

But also, LFS backup will not be required until we add another 100k to the OTA image, since LFS is now small and at the end of the OTA area.

Also, the webapp now has a download of a .tar of LFS, and dropping a .tar will upload the content to LFS. A new backup/OTA/restore will come when required, which will back up to tar and restore from tar rather than using direct flash block access.

GravityRZ commented 1 year ago

Most lights are restored by using the restore button, but when you compare an affected light and an unaffected one you still see differences at the end (line 01E0). The affected ones have more FFFFs on the last line (01E0).

Is anything known about what is overwritten and how bad this is, e.g. do we need to fix this eventually?

see this thread where i dumped the data https://github.com/openshwprojects/OpenBK7231T_App/issues/430
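
A comparison like the one described above can be sketched as follows, assuming both RF partition dumps are available as raw bytes. The helper name `differing_rows` and the sample data are hypothetical; real dumps would be read from files instead.

```python
def differing_rows(dump_a: bytes, dump_b: bytes, row_size: int = 16):
    """Return the start offsets of the rows in which the two dumps differ."""
    length = max(len(dump_a), len(dump_b))
    rows = []
    for offset in range(0, length, row_size):
        if dump_a[offset:offset + row_size] != dump_b[offset:offset + row_size]:
            rows.append(offset)
    return rows


# Made-up data: 512 bytes, with one row of the "affected" dump erased to 0xFF.
healthy = bytes(range(256)) * 2
affected = healthy[:0x1E0] + b"\xff" * 16 + healthy[0x1F0:]

for offset in differing_rows(healthy, affected):
    print(f"{offset:04X}: {healthy[offset:offset + 16].hex()} vs {affected[offset:offset + 16].hex()}")
```

Printing the offsets in the same four-digit hex form as the dump view makes it easy to match a differing row to a line such as 01E0.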