nielsonm236 / NetMod-ServerApp

Reprogramming the Web_Relay_Con V2.0 HW-584 Network Module

Bug: Browser-UPG load, pin 16 can't be used after power cycle #222

Closed by dbaker1 2 weeks ago

dbaker1 commented 1 month ago

Using the Browser only UPG build, pin 16 cannot be turned ON with the 31/50/51 REST commands. Channel 16 can however be turned OFF using 30/50/51 once it has been turned on with 55. 55 appears to be the only way it can be turned on.

Everything works fine with the standard browser only build.

dbaker1 commented 1 month ago

After repeated firmware flashes to try to get it working, I used the /74 option and then it worked upon the next full reload.

nielsonm236 commented 1 month ago

/74 to completely erase the EEPROM might have had some effect ... but I really doubt it unless you did a lot of weird experimentation with several code types before that. But, I've been surprised more than once. :-)

Since you mention pin 16 I have to ask if you are using one of the 16-channel relay boards. I only ask because there are several designs for those boards and they almost all have 5 volt IO control logic, whereas the HW-584 board uses 3.3 volt IO control logic. This can lead to relays being stuck ON, and/or relay control being intermittent. I have a rather long section in the manual discussing the problems and "fix" ideas for the problems.

Glad you got yours working. Let me know if it becomes problematic again. Mike

dbaker1 commented 1 month ago

Hi Mike, thanks for the tips! And yes, I've read through your manual at great length. The board I have is the one with the power LED by the pins, and it seems to be fine as-is. It also works perfectly with the non-UPG browser code at all times.

So with further testing /74 does not seem to be needed to recover the operation of pin 16, simply reloading everything through /72 does the trick. The problem is that if I do a /72 and reload everything, as I mentioned, it works fine again...until a power cycle. Then pin 16 is not controllable at all until I go through /72 and reload everything again. This is 100% repeatable for me. Rebooting does not affect it.

I will try another board later to see if maybe the one I'm currently using is somehow screwed up, but it's very consistent.

nielsonm236 commented 1 month ago

Is DS18B20 disabled in the Configuration menu?

dbaker1 commented 1 month ago

Yes, all features are disabled. I am only using REST commands to control the relays in output mode. Again, everything works perfectly with the non-UPG Browser firmware. I have another identical system except with standard Browser firmware controlling all 16 relays in rapid fire automation (/51) without any issues. I have even tested it under load and it will do about 15 commands per second without any issues, tested to 10000 commands. I was quite impressed. However I only send a maximum of about one command per 2 seconds normally...

nielsonm236 commented 1 month ago

OK. I will attempt to replicate here. I might not be able to get to it until the end of the week.

dbaker1 commented 1 month ago

For what it's worth, I just set up a completely new test with a new HW-584, new relay board, and new I2C EEPROM. Here are the steps I took: I flashed the relay module with the code uploader via SWIM. I re-flashed the code uploader with the code uploader, just to be safe. I loaded the Strings file. I loaded the Browser UPG file. Everything worked correctly as before, and I could control all pins as normal using REST or the GUI. I rebooted (/91) the HW-584 several times, and each time it worked normally with all 14 output pins working. Next I power cycled, and as observed earlier, all control of pin 16 was lost at that point. However, the HW-584 thinks the pin is getting set, as the response to /99 or /51 always shows what was requested, even though pin 16 is not actually turned on. I then went to /72 and re-flashed just the Browser UPG file. Doing this fixed the problem again, until the next power cycle...

This narrows it down slightly. Hope it helps, and thank you for the great support.

dbaker1 commented 1 month ago

Hi Mike, I did a little more testing tonight and loaded every version of the firmware to see what the results were wrt controlling pin 16. All other pins seemed to work as expected as far as GUI/REST on/off goes. Here are my findings:

MQTT Home BME UPG - pin 16 works normally
MQTT Home     UPG - pin 16 didn't work at all
MQTT Domo BME UPG - pin 16 didn't work at all
MQTT Domo     UPG - pin 16 didn't work at all
Browser       UPG - pin 16 works until power cycle
Browser           - pin 16 works normally
MQTT Home         - pin 16 works normally
MQTT Domo         - pin 16 didn't work at all

These are all from the latest release: Code Revision 20231009 1022

nielsonm236 commented 1 month ago

Thank you - you spent some time on this one. Still, it is confusing to me. I've been reviewing the code and haven't found a common thread in the scenarios that fail. So, this is going to take some code tracing to figure out what is happening. I'm still a bit tied up with family matters but will start the debug process. Mike

nielsonm236 commented 1 month ago

FYI, I've reproduced using the Browser UPG build. I haven't tried the others as I believe if I find the bug in this build it will be common to the other builds. Do you have a PCF8574 attached? I've found that if IO17 (the least significant IO on the PCF8574) is turned ON, then it forces IO16 (pin 16 on the STM8) ON. If IO17 is OFF, then IO16 works normally. This is a significant clue and you would think it would make finding the bug easy, but so far code inspection is not telling me. So, more work to do. Mike

nielsonm236 commented 1 month ago

It appears that if I turn on ANY PCF8574 output it forces IO16 to the ON state. That should be easier to find.

nielsonm236 commented 1 month ago

Well, after some additional testing the above connection between PCF8574 pin states and the malfunction on IO16 no longer seems to be true. Now IO16 stays ON regardless of what I do. I'm beginning to think I have a buffer over-run somewhere. I will keep working on this.

dbaker1 commented 1 month ago

Hi Mike, sounds like you are making good progress. Sorry I couldn't respond sooner but I came down with covid on Friday evening so I'm feeling bad and sleeping a lot. Anyway, no PCF8574 here. If there's anything else I can do to help just let me know. Many thanks for your efforts on this!

dbaker1 commented 1 month ago

I also tried loading older versions of the Browser UPG load to see if there was an earlier one that worked, but after going back 3 or 4 I gave up as they all had the issue of pin 16 working until power cycle...

nielsonm236 commented 1 month ago

Thank you - that is useful as it means recent code changes did not cause the problem. Some thinking out loud: This is a difficult bug to catch. I have instrumented the code that does the actual write to the pin, and I can see that sometimes it works and sometimes it does not. I can get it to work by doing what you did: a) Load Browser non-UPG and it will start working, b) Load Browser UPG and it continues to work UNTIL I perform a write to a PCF8574, at which point IO16 is no longer writeable.

A key difference in these code loads is the use of IO14 and IO15 as bit-bang I2C pins. That is probably unrelated as there are no common registers used for those pins vs IO16, and the I2C bus is used for all EEPROM read/writes, which don't appear to trigger the problem.

The instrumentation displays the write to the pin Output Data Register (ODR), then immediately performs a read of the ODR and displays that. The bug is somehow making the ODR refuse to accept a change for that specific pin. I also check the Data Direction Register (DDR) thinking maybe something changed the pin to an Input, but that did not happen.

Your finding that a power cycle causes the problem, and my observation that a PCF8574 causes the problem, are trying to tell me something ... but I haven't figured it out yet. All other pins are unaffected, ONLY IO16. IO8 shares the same ODR and DDR, but it is unaffected. Nutty stuff. Usually I have an A-Ha moment after this much testing.

My current train of thought is "The write to the PCF8574 is done via the GUI, which requires a Save action, which in turn follows a code path through a long string of user input validation logic. A power cycle also passes through that logic. The bug must be in that path. But the counter to that thinking is that I can see the actual writes to the pin hardware, and they are all correct. What could possibly disable the pin?" There are some really obvious validation checks to manage writes to IO16 depending on whether DS18B20 is enabled, but the instrumentation is also checking that, and it does not appear to be interfering.

FYI I have some family time with the grandkids for the next 5 days so this effort will slow a bit. But I will keep my laptop with me in case I get some quiet time to do some more digging. I need to look at the chip spec some more to see if I have overlooked something ... perhaps in the "pin alternate definition" logic. Mike
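For readers following along, the write-then-read-back instrumentation described above might look roughly like the sketch below. This is not the project's code: `gpio_port_t`, `pin_write_report_t`, and `write_pin_checked` are hypothetical names, the port is a plain struct standing in for memory-mapped registers, and it assumes a DDR bit of 1 means the pin is configured as an output.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one GPIO port. On real hardware these would be
   the memory-mapped Output Data and Data Direction registers. */
typedef struct {
    volatile uint8_t odr;  /* Output Data Register */
    volatile uint8_t ddr;  /* Data Direction Register (assumed: 1 = output) */
} gpio_port_t;

/* What the instrumentation would display after each pin write. */
typedef struct {
    uint8_t requested;  /* ODR value we tried to write */
    uint8_t readback;   /* ODR value read immediately afterwards */
    int odr_ok;         /* non-zero if readback matches requested */
    int is_output;      /* non-zero if the DDR still marks the pin as output */
} pin_write_report_t;

/* Write one bit of the port, then read the ODR and DDR back and report. */
static pin_write_report_t write_pin_checked(gpio_port_t *port, uint8_t bit, int on)
{
    pin_write_report_t report;
    uint8_t mask;

    assert(bit < 8);  /* an 8-bit port has bits 0..7 only */
    mask = (uint8_t)(1u << bit);

    report.requested = on ? (uint8_t)(port->odr | mask)
                          : (uint8_t)(port->odr & (uint8_t)~mask);
    port->odr = report.requested;
    report.readback = port->odr;
    report.odr_ok = (report.readback == report.requested);
    report.is_output = (port->ddr & mask) != 0;
    return report;
}
```

On a host PC the read-back always matches, so this only illustrates the shape of the check. On the target, a mismatch in `odr_ok` while `is_output` is still set would correspond to the symptom described above.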

nielsonm236 commented 1 month ago

So ... while writing the above another thought occurred to me, and I think I may have fixed the bug, at least in my investigation test case. I'll send you some test builds and we can verify in parallel. The fix makes me wonder how IO16 ever worked properly. The issue was introduced when the ability to apply alternative pinout patterns was introduced. As an FYI, I have a bit of code that looks like this to create a mask: 1<<i. It works for 1 to 15, but doesn't always provide the correct mask for left shift of 16. Another case of C code shorthand bites me. The test builds may have to wait a few days. Mike
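A rough illustration of the `1<<i` pitfall, for anyone who hits something similar. This is a sketch, not the actual fix: it assumes `int` is 16 bits wide on the STM8 toolchain, and `mask_narrow` and `mask_wide` are hypothetical names. Because a desktop compiler has a wider `int`, `mask_narrow` emulates the 16-bit case by truncating the result. On the real target a shift by 16 on a 16-bit `int` is undefined behavior, which would fit the "doesn't always" symptom.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates `1 << i` where the result can only hold 16 bits: bit 16 does not
   fit, so the mask for i == 16 collapses to 0. */
static uint16_t mask_narrow(uint8_t i)
{
    assert(i < 32);  /* keep the 32-bit shift below well defined */
    return (uint16_t)((uint32_t)1 << i);
}

/* Widening the constant before shifting keeps bit 16 intact. */
static uint32_t mask_wide(uint8_t i)
{
    assert(i < 32);  /* shifting a 32-bit value by 32 or more is undefined */
    return (uint32_t)1 << i;
}
```

With these definitions, `mask_narrow(15)` is `0x8000` and `mask_narrow(16)` is `0`, while `mask_wide(16)` is `0x10000`. Writing the constant as `1UL` or casting it to a 32-bit type before the shift is the usual way to avoid the problem.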

nielsonm236 commented 1 month ago

dbaker1: I rushed this before starting the family activities. Hopefully I didn't hose it up. You can give it a try. These are builds of the non-upgradeable Browser, Browser UPG, Code Uploader, and Strings files. You can load and run the non-upgradeable Browser via the SWIM interface. If you use the Browser UPG you will need to follow the full upgradeable install process, i.e., a) Start the Code Uploader with /72, b) Install the new Code Uploader, c) Install the Strings file, d) Install the Browser UPG file. KEEP YOUR OLD FILES just in case I screwed this up. These are TEST files only. I'll do a formal release after I get a chance to test all variations.

Attachment: temp.zip

dbaker1 commented 1 month ago

Hi Mike, that's great news! I ran both loads through some tests here and everything looks to be working fine now, as far as GUI and REST goes anyway. Well done! I'll be looking forward to your next official release! Enjoy those grandkids! Thanks, David

dbaker1 commented 1 month ago

Hi Mike, I don't know if this is important, but I was upgrading another module with your test load and after I loaded the code uploader, I had to power cycle it before I could continue. No other issues were observed. The module was previously running the latest released browser UPG code, same as the other modules I have tested without any issues at all.

nielsonm236 commented 1 month ago

I'm going to have to run through a series of upgrade tests. By design a power cycle shouldn't be needed to perform an upgrade. So I'll have to track that down if I can replicate it. ALSO - I greatly appreciate the testing you are doing, and the thorough result reporting you do. This is a tremendous help to me in understanding how to replicate and gives me clues about where to look in the code. Mike

nielsonm236 commented 3 weeks ago

I've been testing and haven't replicated the "need to power cycle" problem. I've only been upgrading and downgrading between the 20231009 build and the test build, so maybe the problem has something to do with upgrading from an even older build. And so far I've only tested with the Browser UPG build.

A confession here is that I did a bunch of code work during the past few months for some requestors that stopped responding when it came time to test their requests. Thus that work was never completed. UNFORTUNATELY, I did a poor job of using the Github tools to properly check out / check in code for those builds. FORTUNATELY, the added code is all (or mostly) contained by compiler directives so that I can allow or disallow it as needed. But it is possible a bug has slipped in somewhere.

I'll expand my testing to other build types to see what happens. Over the history of the project when changing between build types (say, from Browser UPG to HA MQTT UPG to Domoticz UPG) sometimes there will be an issue. This is typically due to use of parameters in one build type that get stored and may confuse another build type. I have a section of code that runs at boot time that is supposed to validate various settings in these scenarios, but there is always the possibility I missed something.

Sorry for how long this is taking.

Mike

dbaker1 commented 3 weeks ago

Hi Mike, I understand. The only upgrade I tested was from the latest UPG release to the test load. If you would like me to test another load just let me know, although I'm only set up to test the REST/GUI Browser functionality. Regards, David

nielsonm236 commented 3 weeks ago

I've completed testing and will start the release process. No changes required from the test code you received, other than applying a release version number.

nielsonm236 commented 2 weeks ago

Addressed in release 20240612 0226