openshwprojects / OpenBK7231T_App

Open source firmware (Tasmota/Esphome replacement) for BK7231T, BK7231N, BL2028N, T34, XR809, W800/W801, W600/W601 and BL602
https://openbekeniot.github.io/webapp/devicesList.html
1.34k stars 228 forks source link

Alternate sprintf causing crash under. #395

Closed valeklubomir closed 1 year ago

valeklubomir commented 1 year ago

Both platforms BK7231N and BK7231T cause crash under new sprintf2. Probably missing some formatting options used in the code. Example: image

valeklubomir commented 1 year ago

@openshwprojects I hope this was not released.

valeklubomir commented 1 year ago

I upgraded 5 deviced and all are bricked. Forcing SAFE mode. After disabling MQTT devices stays active and accesible over network. But OTA fails. end reboot device. image

Tested on BK7231N and BL7231T

valeklubomir commented 1 year ago

PLEEEEEEEEEEEASE do not present the new commits as release, at least until 10-20 experienced user/developers can prove that is is working. Mark the as test release.

valeklubomir commented 1 year ago

OTA fails because snprintf2 with "%u" is not supported.

valeklubomir commented 1 year ago

More format missing: cJSON: "%1.15g", "%1.17g", "%04x" drv_itr: %lX %X rest interface : %.*s memtest: %08X

openshwprojects commented 1 year ago

That's strange, I enabled it in one of the latest commit, but I can undo the release.

It was tested on Windows, BL602 and BK7231T.

It was never supposed to be cover all formats, used formats should have a stubs or something.

Without this sprintf fix, with current codebase, on BK7231T I get a crash on "testLog" command and on "startDriver testPower" commands every time. This is why it was hastily pushed. I apologize for inconvenience.

We really DO need some kind of separate release and dev branches (both with autobuild), but still, what do we do about BK7231T being unable to use with original sprintf now?

EDIT: It seems indeed HTTP OTA could have failed, I have added a temporary fix, but need to look deeper into that sprintf2. The Javascript Panel OTA seems to be working for me, I already tested that.

It seems that the only possible fix for BK7231T instability now is using my sprintf replacement only for addLogAdv and HTTP htprintf256 ...

EDIT2: Can you try disabling drivers and do OTA through Javascript Panel in non-safe mode?

EDIT3: After a brief look, I can see it was indeed a bit of premature release. Still, without replacing sprintf at least in addLogAdv and HTTP, people will get crashes on any power driver running (BL0937, BL0942, etc) on T, or at least I am getting that on TWO separate WB3S devices. We need to solve it ASAP

openshwprojects commented 1 year ago

One of the other reasons sprintf was hastily pushed: https://github.com/openshwprojects/OpenBK7231T_App/issues/387

openshwprojects commented 1 year ago

I have did a clear build and cleared 100% of memory of WB3S (this button with Chinese character on bkWriter 1.60 does that, it's a erase button, it can erase given range of flash, including whole configuration) and flashed with latest version (with sprintf disabled again), I entered WiFi creds, then I loaded this config: image then type in console: "startDriver testPower" and device is dead within one second. You can also use "testLog" command, it will do the same.

@valeklubomir , I don't want to make the same mistake with hasty changes, but what do you think? Shall I, for now, just change the vsnprintf used in addLogAdv and httpPrintf255 or whatever it was called? Does the same crash happen for you?

valeklubomir commented 1 year ago

@openshwprojects It was just inconvinience for me. For others it is catastrophic. Took me one hour to resolved cables and flash all devices back. Crash occured on all my devices: Plug BL0937, Plug BL0942, Touch panel TuyaMCU, Ceiling list PWM, Light bulb PWM. I was too hasty also, usually I flash first devices which are connected to programmer. And check them. Now it was late and wanted to test new snprintf2, and flashed them all, without waiting for the test devices. Bad decision.

I will definitely keep one BK7231N and T as test devices in the sandbox, and once version does pass all tests, then upload to other. Actually this could be new feature like auto-diagnostic. Single command executed which will test all critical point which we collected so far.

I had to break open one crashed device, and this process can not be repeated very often. It has poor internal design from point of view, when trying repairs, it was constructed like one way device, once broken it is trash.

How does work OTA through Javascript Panel in non-safe mode? We could use OTA in safe mode. Can we upload firmware in AP mode/safe mode?

Regarding bk_writer, I used it twice. I rather not do full flash erase. Because also MAC is erased. And then I have to select new, usually do MAC+1 from other device.

EDIT: TestDriver crashed on my BK7231T within one second. I was trying to locate the point, but it was very variable, probably different thread than RunEverySecond.

valeklubomir commented 1 year ago

Problem is the variable arguments list in the snprintf2. You need to pull all variables from stack before return from function. When then number of %fnt and number of parameters passed to function do not match then stack after fuction call executes return to wrong call address. If we find % and do not recognise format we still have to pull parameter from stack. May be this way we can filter all undefined formats one by one.

openshwprojects commented 1 year ago

That's not good, I apologize for that sprintf2 issue, it didn't occur to me and, as said before, it was a hastily decision because it fixes issues in addLogAdv. I will say again - current build (idk exactly where it started, maybe LWIP, maybe not) on BK7231T will crash immediatelly (maybe within one second) for me on "startDrivers TestPower" and "testLog" commands, while it's working indeed way better if I just replace addLogAdv asnd htPrintf255 vsnprintf with my version.

Can you check if the same crash occurs to you or is it specific to my two WB3S devices? Can you check if the crassh occurs if you do a vsnprintf resplacement in those two places I mentioned?

How does work OTA through Javascript Panel in non-safe mode? We could use OTA in safe mode. Can we upload firmware in AP mode/safe mode? @btsimonh knows more about it, but in safes mode the resst interface is not initializesd. We could add a button in safe mode that says "Init REST functionality" or just add a console command. This is becauses the Safe Mode was designed to run with all modules/features turned off so we can recover device from bad config (eg. remove some pins from Configure Module). It should be possible to use JS OTA in Safe Mode if we just had a button or console command to init rest. One also might need to alter the function for "Get My IP" to return correctly the IP of device in AP mode. Of course, the javascrsipt panel also requires network connection, but for me it's not a problem, I have WiFi and cable both connected.

The ideas with removing arguments from stack when found unsupported parm is also very good!

PS: @valeklubomir special thanks for figuring out the trick with disabling TCP send from MQTT LWIP "publish" and "sub" calls. I think you know what I am talking about. I did that for BL602 and stability of this platform has incresaed.

btsimonh commented 1 year ago

clear explanation of float vs double in va_args:

https://stackoverflow.com/questions/23836118/variadic-function-va-arg-doesnt-work-with-float-while-printf-does-what-the https://stackoverflow.com/questions/11270588/variadic-function-va-arg-doesnt-work-with-float

so... in any function using va_args, you MUST extract floats as double ....

openshwprojects commented 1 year ago

Well @btsimonh , my current proposal is simple - use the vsnprintf replacement for addLogAdv and for htPrintf255 or whatever it was called. I am outside of home now and I don't have access to BK devices, I will be back today evening.

If you can double-check if my idea would work, please do. I don't want to repeat the mistake with commiting untested solution.

The main question still lingers - why has T platform stability decreased, If it did... to be honest, at the moment I am not 100% sure if I was testing the @valeklubomir new power metering features on N or on T... but if I remember correctly, during yesterday tests, my 4CH relay WB3S managed to crash even without enabling testPower driver and without testLog command, just by playing with MQTT, what is a really new to me on T.

I will do more tests...

valeklubomir commented 1 year ago

I don't know if this matters here, but may be some idea could be sparked:

btsimonh commented 1 year ago

@valeklubomir - please could you pull and test this branch: https://github.com/btsimonh/OpenBK7231T_App/tree/testfloats

I have used a fairly well supported nanoprintf from here: https://github.com/charlesnicholson/nanoprintf

The only down side at the moment is it seems to default %f to 6 decimal places if no precision is given. i.e. 3.14 prints as 3.140000

If you take a look and think it's a workable approach, tonight I will make _wrap functions for vsnprintf, snprintf, sprintf, and propose application.mk changes for T and N. I don't have any of the other platforms (and have not had any experience of them), so don't know if the __wrap will work there.

(if you are not familiar with the wrap stuff, I wrote it up in the wiki: https://github.com/openshwprojects/OpenBK7231T_App/wiki/T-&-N-Hardware)

@openshwprojects - "#define printf addLog" may have had wide ranging consequences - like vsnprintf() -> vsnaddLog() See replacement in new_common.h... this commit is probably essential to include. I'm not even sure we have addLog()???

btsimonh commented 1 year ago

hmm... other platforms (BL602, W800) do not have inittypes.h? https://pubs.opengroup.org/onlinepubs/009695399/basedefs/inttypes.h.html

In file included from ../sharedAppContainer/sharedApp/src/new_common.c:17:0: ../sharedAppContainer/sharedApp/src/nanoprintf.h:60:22: fatal error: inttypes.h: No such file or directory

include

openshwprojects commented 1 year ago

@btsimonh that is not possible: image

define does not expand substrings. It only replaces if string matches 1:1 the given one. Other than that, I will look into it. I don't know about inttypes, tough.

valeklubomir commented 1 year ago

@btsimonh What should I test ?

EDIT: executing command testFloats does not crash.

btsimonh commented 1 year ago

yep, basically that, and have a read of that repo - see if you like it. I know you found an alternate vnsprintf at some point, I found one, and did not like it (completely unknown source), @openshwprojects found one and it was buggy and supported not a lot. So your opinion on this one would be good :). If you think it's likely to have no major flaws for us, I'll do the wrapping so we change EVERY sprintf, vsprintf, snprintf and vsnprintf in the whole codebase, and try on my device which has wired serial :).

btsimonh commented 1 year ago

@btsimonh that is not possible:

put it back, and see strange warnings from inclusion of nanoprintf.h (line ~13) I swear something VERY strange was going on....

openshwprojects commented 1 year ago

@btsimonh I am not convinced it is possible for #define to act that way. Please see: https://ideone.com/y3XE9U

The range of sprintf support is one thing, the stack size required and speed is a totally other thing. I'd prefer to have as much lightweight spritnf as possible.

I have been considering using a 'light' sprintf/vsnprintf for addLogV and htPritnf255 and keeping old sprintf for some rare cases...

Also: image note of warning: I noticed that changing all sprintf/vsnprintf etc in entire code base (not just App) might even break wifi connect , furthermore it seems to be breaking wifi connect permanently until full memory clear of BK.

btsimonh commented 1 year ago

the stack size required and speed is a totally other thing

this one is designed for embedded (see repo linked above) - plus by wrapping and not referencing real functions, we will save some flash :).

let me try here. If the original functions have an issue, I'd rather nothing used them.

valeklubomir commented 1 year ago

Runnung TestDriver on T. It look like SW deadlock, endless loop. After Command Start TestPower is executed, timer executed once "Main_OnEverySecond" completely and then everything is "frozen". I will add Button/LED to simulate Button/Relay, to check status of QuickRunFrame. But MCU does not freeze it has to run always, except deep HW failure. CPU has exception to detect failure regarding access to memory. I thing we got it, Because exceptions for Undefined instruction, prefetch abort, data abort end in endless loop: image Examinig STACK frame could help us to locate program section when it occured.

EDIT: N Version has handlers for this exceptions.

btsimonh commented 1 year ago

I just did the wrapping, and my device is running. TestFloats and TestLog are as expected, no crash. Ref vectors: I looked at trying to modify our vectors, but was unsuccessful. I seemed unable to replace them - even played with VTOR.

startdriver testpower Also survives for me. (I guess it's simulating something? I see mqtt traffic...)

will push on the same branch. You will need:

CFLAGS += -DWRAP_PRINTF=1
LFLAGS += -Wl,-wrap,vsnprintf
LFLAGS += -Wl,-wrap,snprintf
LFLAGS += -Wl,-wrap,sprintf
LFLAGS += -Wl,-wrap,vsprintf

in application.mk

btsimonh commented 1 year ago

@valeklubomir - please pull my changes and modify your application.mk, and try again. I want to hear some positive results! :). I have a little time tonight. Wife is out for a few hours, so let me know...

btsimonh commented 1 year ago

Now running the same thing on N. Again, all tests run and no death (yet). (no mqtt logs on N - I guess they are in the OS?)

Had to exclude printf.c in application.mk

mods:

SRC_C += ./beken378/driver/uart/uart_bk.c
#SRC_C += ./beken378/driver/uart/printf.c
SRC_C += ./beken378/driver/wdt/wdt.c
LFLAGS += -Wl,-wrap,malloc -Wl,-wrap,_malloc_r -Wl,-wrap,free -Wl,-wrap,_free_r -Wl,-wrap,zalloc -Wl,-wrap,calloc -Wl,-wrap,realloc  -Wl,-wrap,_realloc_r
#LFLAGS += -Wl,-wrap,printf -Wl,-wrap,vsnprintf -Wl,-wrap,snprintf -Wl,-wrap,sprintf -Wl,-wrap,puts

CFLAGS += -DWRAP_PRINTF=1
LFLAGS += -Wl,-wrap,vsnprintf
LFLAGS += -Wl,-wrap,snprintf
LFLAGS += -Wl,-wrap,sprintf
LFLAGS += -Wl,-wrap,vsprintf
valeklubomir commented 1 year ago

@btsimonh TestDriver on T version is stable.

btsimonh commented 1 year ago

both T & N also survive IRBlast test (button on remote held down for 2 mins, whilst getting http logs, etc.,) and wifi did not die, and I assume mqtt ok!!!!

btsimonh commented 1 year ago

@openshwprojects - i have to pop out for 40 mins. But after I get back I can prepare a cleaner PR, and PRs for the two SDKS? let me know... Good team effort.... br, Simon

Edit: doing clean builds on both just in case...

openshwprojects commented 1 year ago

@btsimonh please do prepare a PR. If there is anything to check for me, also do tell. I was less active today. I will have more time tomorrow.

EDIT: some other work done, I added ability to restore RF config in web app for people who lost their MACs

btsimonh commented 1 year ago

@valeklubomir - PR is in, sdks updated. Only remaining question is other platforms, and do they need the wrapping.

valeklubomir commented 1 year ago

Found difference between T and N regarding printf. N version has local copy of printf.c at platforms/bk7231n/bk7231n_os/beken378/driver/uart/printf.c T version did not have it so GCC used version from LIBC May be that was the difference. But now we have new printf nanoprint may be better results.

valeklubomir commented 1 year ago

I can confirm this version is working on all my N and T device:

openshwprojects commented 1 year ago

@valeklubomir we have a report that led color MQTT topic has now invalid string value, see latest issue

also please note: I had a %f crash on N platform already months ago.

this looks so stack-heavy: https://github.com/openshwprojects/OpenBK7231N/blob/master/platforms/bk7231n/bk7231n_os/beken378/driver/uart/printf.c also that nested fmt call with two buffers of 43 bytes

btsimonh commented 1 year ago

printf.c is now excluded from the N build. I think we can close for now - maybe raise another issue for porting to W600, W800, etc.?