Closed johnny2678 closed 3 years ago
I’ll get some time this weekend to work on this. We clearly have a Mixing/Monitoring conflict here.
I noticed I also couldn't cancel the dosing today.
https://user-images.githubusercontent.com/1646840/120850682-9a20b600-c52c-11eb-9585-3511b10fadf7.MP4
Just did some work on making sure the promises return if we catch an error.
checked out the latest to see if anything had changed.
Still seeing two issues but guessing they are related.
Issue # 1: mixing cannot be cancelled, specifically the auto-mixing period after an acid dose.
to reproduce: click the "Stop Mixing" button on the chem controller in dashpanel while a mixing timer is counting down.
replay: replay (3).zip
Issue # 2: manual mixing period countdown timer starts bouncing back and forth between 2 countdown timers
to reproduce: while the chem controller is not mixing and not dosing, initiate a manual "Mix Acid", I did one hour. Timer will start counting down from one hour and then start bouncing back and forth between one hour and 10 minutes (my auto mix period is set for 10 minutes). Stopping the acid mixing after ~30 seconds cancels the 1 hour mix but leaves a 10 minute timer running that cannot be cancelled (see issue # 1 above).
replay: replay (4).zip
let me know if this makes any sense or if I can provide any more information.
I have not experienced either of these occurrences. I have added some checks to make sure any delay between the time the mix begins to be cancelled and the time all the parties are notified (including Hubitat) will not allow the mix timer to be recreated. The primary difference between your REM Chem and mine is that there is no Hubitat interface attached. Are you currently supporting REM Chem on Hubitat?
hmm, I'm only using Hubitat to turn the pool lights on and off. I might just blow away that interface and use MQTT but it's going to take some rejiggering on my end. Will report back.
No need to do that for this issue. Pull njsPC/REM/dashPanel. I made it so it knows that the mix cycle cannot create a new one if one already exists.
too late ;) I've been meaning to port some of my nighttime automation over to nodered anyways so this gave me the excuse.
Just updated everything - will report back once I can run some more test cases.
ok, first observations - I can start a manual mix period and it no longer looks like it's counting down two timers.... until I hit the cancel button. Then it starts bouncing back and forth between two timers again. Pressing cancel at this point does not cancel the timers. But then pressing cancel 5-10 more times and eventually both mixing timers are cleared.
I know this sounds wonky. I forgot to take a video during my replay so I will try and get all of this captured and post back.
ok, grabbed a recording and replay
steps:
Sorry for the clusterF. Let me know if you want me to re-capture with any other test parameters
edit: sequencing
Ok so I can clearly see what is happening. We have a dance going on where the mix timer is being triggered more than once. I believe this is occurring because the setup time of the initial timer exceeds the one second trigger time when the mix is first initialized. So what it was doing is kicking off another check in the middle of the initialization.
Depending on how tardy the process was this could cause the it to set up the process as many times as it took to finish all of the ones it started. The process is pretty resilient and would work find to simply count down since all the timers would eventually expire. However, if you tried to cancel, the process needed to cancel each one of them individually.
The manual cancel process sets the time remaining to 0, so eventually it would hit at the same time the code clears the mix time. The remaining timers would then pick up the mix time set on the controller and appear to begin an normal automated mix cycle. The reason it eventually appeared to cancel was because you ended up cancelling each timer process that was erroneously created.
What I have done is force njsPC to wait until it is done initializing the process before allowing another process to start. I can't seem to replicate this so while I believe I found all the conditions, I won't know until you pull njsPC and run through your process again.
Unfortunately, almost the same result. Except this time I can't cancel the mixing period at all. I gave up after 2 minutes. Still counting down.
Wow that is an enigma. I am going to solve this mystery.
Sorry if we went through any of this but I am not sure what would possibly allow more than one timer at a time at this point.
What node version are you running node -v
?
Is the connection to REM over wifi or is it ethernet wired?
Can I get you to go into config.json and set enabled=false
on MQTT, and InfluxDB then try again?
pi@tmpPiPool:~/nodejs-poolController $ node -v
v12.20.2
REM is on a Pi by the pool and on wifi. Signal is good.
Results after shutting down MQTT and Influx - (this pained me to lose 5 mins of data ;)
Oddly enough - this worked. Setting a mixing timer counted down without multiple timers and cancelling the timer worked on the first click.
So then I re-enabled Influx - left MQTT disabled - this worked too - replay (9).zip
Then, turn Influx back off and turn MQTT on - back to the bad timer behavior.
So we've narrowed it down to MQTT. But I'm stumped what to do next. MQTT is working fine. I am getting data and I can issue state commands to njspc over MQTT just fine. My brain hurts... 🤣
By chance are you setting anything back to njsPC using MQTT that could be triggering the mixing? For instance in Home Assistant.
Now that I know that the issue is with MQTT I have a path to potentially duplicate the issue but I suspect this is probably a consumer of MQTT that is reading the mixing changes and echoing the command back through the MQTT interface.
hmm... well I only send MQTT state commands to njspc through nodered and I only do so for three reasons:
REM should be controlling the acid dosing cycle completely autonomously (thank you very much!). I will continue to comb through my logic but I can't recall ever tinkering with the acid dosing/mixing cycles once I switched over to REM.
To test, I just shutdown nodered completely and tried setting/cancelling a mixing timer and still couldn't cancel it.
one last thing that might be interesting... I only run my pump for 12 hours a day and it's past quitting time. So after that last test I turned off the pump. The 1 hour test mixing period that I couldn't cancel was still set - I assume that is on purpose so the timer resumes when the pump is turned on tomorrow. However, with the pump off, I could cancel the mixing timer. It went away the first time I clicked it - this is with MQTT/Influx, etc enabled.
So cancelling the mixing timer isn't working with MQTT enabled and the pump running, but does work with MQTT enabled and the pump off... 🤷♂️
Pull njsPC I am trying a hunch.
hmm... didn't see any updates to njspc. When I try and pull, says that I am on the latest?
REM had an update but it was to the AtlasEZO.ts
edit: grammar
Was observing REM complete an acid dosing cycle and I think it's setting two mixing timers when dosing completes. This is just REM/njspc running on autopilot with the dosing/mixing settings set by the chemController, no manual intervention from me.
From the info
log:
1|poolController | [7/14/2021, 8:20:24 AM] info: Dose completed 37.737498mL 0 2.262501999999998
1|poolController | [7/14/2021, 8:20:25 AM] info: Chem Controller begin mixing acid for 10min of 10min
1|poolController | [7/14/2021, 8:20:25 AM] info: Chem Controller begin mixing acid for 10min of 10min
quickly switched to silly
. You can see it's reporting the countdown status of two timers each 1s logging cycle, and the 2nd timer has 2s more time than the first timer:
1|poolController | [7/14/2021, 8:21:09 AM] info: [8:21:09 AM] 192.168.5.146 PUT /app/logger/setOptions {"app":{"level":"silly"}}
1|poolController | [7/14/2021, 8:21:10 AM] verbose: Chem mixing acid remaining: 9min 17sec
1|poolController | [7/14/2021, 8:21:10 AM] verbose: Chem mixing acid remaining: 9min 19sec
from the next 1s logging cycle:
1|poolController | [7/14/2021, 8:21:11 AM] verbose: Chem mixing acid remaining: 9min 16sec
1|poolController | [7/14/2021, 8:21:11 AM] verbose: Chem mixing acid remaining: 9min 18sec
just more data points that I needed to jot down before my brain switches to work mode. It may or may not help.
edit: grammar
Thanks for that. I refactored the whole thing to work just a bit differently. Give njsPC a pull and let me know how it goes.
ugh... same result. Attempting to cancel a manually set timer causes another one to spin up.
@tagyoureit - you mentioned above that you were also unable to cancel mixing - is that still the case? or is my install just borked?
That is very weird that I am not getting the issue. Perhaps this has something to do with the EasyTouch code because I cannot get it to happen on IntelliCenter or on Nixie. I do see what is going on since I went through the log line by line. 10 seconds after the manual mix is started it hits the normal polling for REM Chem. At this point the mix timer is missing so it sets another one. Unfortunately, when you cancel the mix it is only cancelling the one that it knows about and the one in the background is awol.
I added some more messages so could I get you to pull and try to replicate this again.
[7/14/2021, 1:29:43 PM] info: [1:29:43 PM] 192.168.5.146 PUT /state/chemController/manualMix {"hours":1,"minutes":2,"id":50,"chemType":"ph"}
[7/14/2021, 1:29:43 PM] info: Chem Controller begin mixing acid for 1hr 2min of 1hr 2min
[7/14/2021, 1:29:43 PM] verbose: Set acid mix timer
[7/14/2021, 1:29:43 PM] verbose: Chem mixing acid remaining: 1hr 2min
[7/14/2021, 1:29:44 PM] verbose: Chem mixing acid remaining: 1hr 1min 59sec
[7/14/2021, 1:29:45 PM] verbose: Chem mixing acid remaining: 1hr 1min 58sec
[7/14/2021, 1:29:46 PM] verbose: Chem mixing acid remaining: 1hr 1min 57sec
[7/14/2021, 1:29:47 PM] verbose: Chem mixing acid remaining: 1hr 1min 56sec
[7/14/2021, 1:29:48 PM] verbose: Chem mixing acid remaining: 1hr 1min 55sec
[7/14/2021, 1:29:49 PM] verbose: Chem mixing acid remaining: 1hr 1min 54sec
[7/14/2021, 1:29:50 PM] verbose: Chem begin calculating demand: 7.698 setpoint: 7.8 body: 16000
-- WE ARE ALL GOOD TO HERE AND REM Chem is on its normal checks to calculate all of the chemistry
[7/14/2021, 1:29:50 PM] verbose: Chem mixing acid remaining: 1hr 1min 53sec
[7/14/2021, 1:29:51 PM] verbose: Chem mixing acid remaining: 1hr 1min 52sec
[7/14/2021, 1:29:52 PM] verbose: Chem mixing acid remaining: 1hr 1min 51sec
[7/14/2021, 1:29:53 PM] verbose: Chem begin calculating demand: 7.698 setpoint: 7.8 body: 16000
-- REM Chem hit another cycle to calculate all of the chemistry values however it appears that the previous mix data has
-- been cleared and is missing.
[7/14/2021, 1:29:53 PM] info: Chem Controller begin mixing acid for 1hr 1min 51sec of 1hr 1min 51sec
-- REM Chem starts another mix cycle on top of the existing one. Since we are manually mixing it thinks
-- that the missing mix just got interrupted so it continues with the existing mix data.
[7/14/2021, 1:29:53 PM] verbose: Set acid mix timer
-- Since the previous mix is lost it now has a new mix timer and the old one is still in the ether. The only way this can happen
-- is if the NixieChemical object has been destroyed somehow.
[7/14/2021, 1:29:53 PM] verbose: Chem mixing acid remaining: 1hr 1min 51sec
[7/14/2021, 1:29:53 PM] verbose: Chem mixing acid remaining: 1hr 1min 50sec
[7/14/2021, 1:29:54 PM] verbose: Chem mixing acid remaining: 1hr 1min 50sec
-- The above message is from the new timer but the one above that is from the original time.
[7/14/2021, 1:29:54 PM] verbose: Chem mixing acid remaining: 1hr 1min 49sec
[7/14/2021, 1:29:55 PM] verbose: Chem mixing acid remaining: 1hr 1min 49sec
[7/14/2021, 1:29:55 PM] verbose: Chem mixing acid remaining: 1hr 1min 48sec
[7/14/2021, 1:29:56 PM] verbose: Chem mixing acid remaining: 1hr 1min 48sec
[7/14/2021, 1:29:56 PM] verbose: Chem mixing acid remaining: 1hr 1min 47sec
[7/14/2021, 1:29:57 PM] verbose: Chem mixing acid remaining: 1hr 1min 47sec
I added some more messages so could I get you to pull and try to replicate this again.
The way you ask makes it sound like I'm the one inconvenienced here 🤣 Thanks for spending your valuable chasing down an issue that only appears to affect me.
One quick note on this recording. Every once in a while if I hammer the cancel mixing / confirm buttons I can get the timer(s) to clear completely. That happened in this test run. You can see it in the video.
I added even more messages to see why it starts the mix then shortly thereafter the mix object is destroyed. I do know why you are able to finally kill it by hammering the cancel. This has to do with the timers finally getting caught by the cancellation loop.
ok, I couldn't hammer it away this time.
New video/replay with addl messages: https://youtu.be/5heVbhwr_70 replay (13).zip
What do you have at ip address 192.168.5.140?
that's the MQTT server
edit: also influx, grafana, etc - my home lab
edit2: also nodered, which is doing a /state/all/ call every 15 seconds.
Hmmm. Something from that address is requesting the overall state. Is that also where dashPanel is running from?
EDIT: Your edit solved the mystery.
Hmmm. Something from that address is requesting the overall state. Is that also where dashPanel is running from?
it's node red, I request /state/all/
every 15 seconds to:
EDIT: Your edit solved the mystery.
the /state/all
call is interfering with the timer?
I suppose I could be more precise with my calls and call the pump
object for # 1 and # 2 above, and the chlorinator
object for # 3, but would those call interfere with the mixing timer as well?
You know you could use sockets and njsPC will tell you when solar has kicked on, the pump wattage changes, and when the ORP has changed.
Thanks for sticking with this. It is now a mission to find out what could possibly be clearing the current dose object and I am going through all the potential suspects. I added some more logging to pinpoint the exact moment in time that the currentMix object vaporizes.
EDIT: I don't think the state/all call is monkeying with the currentMix object. Perhaps there is something going on where njsPC aborts it in response to an MQTT item though. We WILL find this.
We WILL find this.
Your persistence is inspiring!
So, the sequence happened a little differently this time. I started the manual timer, waited 20 seconds, cancelled it, and it cleared! I was surprised because I thought you were just adding in debug messages. So then I decided to try again, unfortunately 2nd time around it got caught in the same cancellation cycle.
Just wanted to bring that to your attention.
https://youtu.be/ZiG1BWL222k replay (16).zip
edit:
Perhaps there is something going on where njsPC aborts it in response to an MQTT item though
yea, because if you remember I tested with NodeRed shutdown completely and was still getting the stuck timer. It only goes away when MQTT is shut down.
Ok so I made a change that is a supposition. I have all of the things going like you do MQTT, Grafana, Influx, and a couple of other interfaces. However, one thing that is fundamentally different is that I am running IntelliCenter. These panels initialize the Nixie controller differently so I think that we may have been dealing with 1 REM Chem configuration and two Nixie controllers for said configuration. This would not normally have any impact on dosing but the one place where it would have an impact is on mixing.
Pull njsPC and lets hope this was it.
Btw did you yet the socket client figured out for node-red?
This would not normally have any impact on dosing
Exactly, dosing always worked well. Dosed/mixed according to settings until the setpoint was found. It's only when I wanted to pause dosing for a longer period after loading chemicals that I noticed this.
I promise I'll give it a run first thing in the morning. I'm switching out some SATA cables on my unraid server so my docker environment is down at the moment.
Whaaaat? No NVME.
Let me know how it goes.
Whaaaat? No NVME.
I have 2 1TB NVMe drives in this server too - wish everything could just go to solid state. Kept getting CRC errors with one of my array drives. Hoping changing the SATA cords will work.
Let me know how it goes.
This run went pretty much the same as the last time. Setting a timer and clearing it works for one cycle, but then it gets stuck the 2nd time I set a timer.
I realize I'm moving the goal posts. Honestly I don't know when I would set a manual timer, cancel it and set another one. Just letting you know in case that triggers an easy fix.
video/replay:
I have 2 1TB NVMe drives in this server too - wish everything could just go to solid state
Everything can you just need a second mortgage.
I realize I'm moving the goal posts. Honestly I don't know when I would set a manual timer, cancel it and set another one. Just letting you know in case that triggers an easy fix.
This is now my Moby Dick. I need to cut my way out of the belly of this whale if you are still willing. @tagyoureit tried to replicate this issue on his EasyTouch and couldn't seem to make it happen there either. By chance are all of the app dependencies up to date?
By chance are all of the app dependencies up to date?
yep, I run npm i
after every git pull
however I have not updated the rPi in a while apt update && apt upgrade
- guess I could try that.
I keep going back to MQTT. Something about that interface is causing the mix cycle loop, because it doesn't happen with MQTT turned off. On the surface, MQTT working great - getting regular feeds from njspc that I can subscribe to from nodered and other apps.
I do see some socket errors in the MQTT logs that I would need some time to chase down. Might give that a shot.
This is now my Moby Dick. I need to cut my way out of the belly of this whale if you are still willing.
For sure! I just hope it's not something I did (or am doing) that's causing it to behave badly.
For sure! I just hope it's not something I did (or am doing) that's causing it to behave badly.
I am certain that it is something you did. However, I need to make sure we can handle that something and not have a runaway mixing time.
I added a few more checks to put one of the theories to bed. Could you pull njsPC and do what you did again.
Could you pull njsPC and do what you did again.
sure thing. This time, I couldn't get it to cancel.
well crap... I had a hunch that one of my python->MQTT scripts was causing the problem. Sure, enough when I dug into the socket errors in the MQTT server logs, the issue was with one of my python 'clients'.
I shut them down. So the only interfaces to MQTT are polished projects with better devs/socket programming (njspc, Blue Iris, nodered, weewx, etc). Sure enough, killing my python scripts got rid of the socket errors in the MQTT server logs, but unfortunately didn't change the behavior we're seeing with the mixed up mixing timers... even after a njspc restart.
Ok so I added even more stuff. This time I moved the timer definitions around.
Do another pull and lets see if that made a difference. If it did then I think I may have exposed an issue with node closures.
Do another pull and lets see if that made a difference.
Unfortunately no, still can't cancel and getting the 2nd timer.
are the videos helpful? or do you just need the replays?
Hey @johnny2678, We spent a bunch of time looking at it this morning. Unfortunately, nothing conclusive yet.
However, in your latest set of logs that you published we are already seeing two timers running even before you take any actions. We might be missing some thing that's not captured earlier in the logs.
Can you start the packet capture but then restart the app? The capture will still be running when you start up fresh. Then, see if you can re-create the issue and upload all of the logs beginning with the app starting up.
Hopefully there will be an error or another clue that will help us figure this out.
Can you start the packet capture but then restart the app?
sure thing. This should have everything from startup in it.
I skipped the dance of making the video this time, but I started the mixing timer at 6:56:53 PM and tried to cancel mixing at 6:57:20 PM log time.
Let me know if you want me to redo with video. replay (21).zip
It looks like we found Moby Dick! The problem is caused by MQTT sending information on the topic before njsPC initialized all the nixie stuff. The fix is pretty straight forward but I would like you to pull and make sure it goes away then we will be making a hardened fix this weekend.
yep!
I started and stopped a timer twice cleanly with no hiccups!
if you're following along in the logs: timer start 8:04:10 timer stop 8:04:37 timer # 2 start 8:04:45 timer # 2 stop 8:05:10
I have questions, in no particular order:
what is nixie?
What??? You don't know what Nixie is? Lol. njsPC has always been the gateway between the traditional OCP's and browser/mqtt/etc. But of course some people don't have an OCP and just want to control their pumps, chlorinators, heaters, etc. The logic to do that was shoehorned into njsPC but then @rstrouse came up with the brilliant idea to move the logic for controlling anything that the OCP shouldn't be controlling into it's own logic. njsPC is still the glue/broker for all the communications and Nixie has evolved over time to be it's own full fledged OCP. She's smart enough to know when another OCP is in control that she doesn't interfere with that (but she can also extend any OCP by having additional schedules, features, circuits, etc). Nixie is bundled in, and tightly integrated with, njsPC so you'll never know she's there but she is the master of all things beyond what your OCP is doing. https://github.com/tagyoureit/nodejs-poolController/wiki/Nixie-Controller-Setup
if it was a nixie/MQTT sequencing issue, then why did it only affect my setup and not Tags? (or others?)
Simply because you were sending an inbound MQTT to /config/chemController before Nixie was initialized (which happens after njsPC retrieves the full config from the OCP). Once we saw the last set of logs, I was able to easily replicate it with a URI PUT call to the same endpoint. You were the only one impatient enough to not wait for everything to settle down. :)
tell me where to send a tip
Really, it's not necessary... but always appreciated. If you'd like to send a tip (Paypal/Venmo) my email is at the top of most of the source code pages. I'll save you the shipping costs and buy some bourbon myself. ;)
So I am going to answer your questions in that un-particular order but @tagyoureit answered before me.
Q1: Nixie is the equipment manager that acts as the OCP for equipment not already controlled by an OCP. As we started adding more standalone equipment to njsPC somebody needed to be in control so Nixie was created to be in charge of it.
Q2: First @tagyoureit is practically immune to bugs. Hasn't had a mosquito bite in years (or that is the legend anyway). But the reason it appeared to be so random is that it only would show up when the MQTT server had changes for equipment under her purview. So in your case the MQTT server was holding onto information to deliver when njsPC stared up. More specifically it was this. [7/16/2021, 6:56:12 PM] debug: MQTT: Inbound pool/easytouch2-4/config/chemController: {"id":50,"alkalinity":55}
. This is a message from the MQTT server to set the alkalinity on REM Chem.
But at the time this message came through and it was routed to Nixie, she had not been initialized yet. So she happily picked up the message, recognized the missing equipment, and stood up its processes. The reason it came through was because of quality of service rules on MQTT held it in its queue until someone came along and subscribed to the topic. This happened when njsPC started up and subscribed the MQTT server long before Nixie was started up. When she was finally given the go ahead to start, she initialized her storage (which orphaned the equipment she already stood up) and got off to the races.
So the reason why you and not others is that, you have something that is probably sending MQTT messages to the broker while njsPC is turned off. Specifically, it was setting the poolSetpoint for the chlorinator, the state for feature id 15 and circuit 2, and the alkalinity for REM Chem with the latter being the only one that required Nixie to fire up her kingdom early.
Q3: I am a scotch guy so if you happen to have a Macallan 18 floating around raise a glass and take a little sip.
You were the only one impatient enough to not wait for everything to settle down. :)
Buddy, I've only been using your project for ~3-4 years, but we might as well be family if you already know me this well ;)
So in your case the MQTT server was holding onto information to deliver when njsPC stared up.
Interesting... so this will take some explaining about my weekly ritual that probably won't make a lot of sense, but here it goes. My water testing routine is once a week (usually Saturdays). I lug the test kit out to my pool pad and use the extra drain line on my flow cell to run through the various manual chem tests (fc, ph, alk, cya, salt, ch, etc). I use the PoolMath app on my phone to log everything. The app has a json export so I grab the data and push to influx. Been doing this for years. Earlier this year when I setup REM, I realized I could feed njspc these chem values to have it calculate lsi/csi, so I whipped up a flow in nodered to read the PoolMath json, extract the chem values, and send to MQTT so njspc can pick them up.
I set up the MQTT node like this:
with the topic being: pool/easytouch2-4/config/chemController
and the payload: {"id":50,"alkalinity":msg.chem_value}
I'll admit I have not spent too many cycles thinking about MQTT QoS and or Retain settings. My understanding is QoS of 0
would not guarantee delivery and QoS of 2
would, so I set a QoS of 2
. Would a QoS of 0
have avoided this problem?
Also, as I mentioned I only measure chemistry once a week and import results right after, so that ALK value of 55 was delivered, not sure why my MQTT server is still trying to send it every time njspc starts up.
I'll admit I thought the retain
setting just determined weather the latest value for a mqtt topic was stored in the server, so topics that are updated more infrequently (ex. blueiris updating a camera motion sensor) would still be there if I checked hours later using MQTT Explorer or another client.
Between this and the sockets questions on gitter, I feel like I'm abusing the open-source dev / non contributing-consumer relationship. Not your job to teach me tech, but if there's something obviously backwards about my MQTT settings, let me know.
I am a scotch guy so if you happen to have a Macallan 18 floating around
Only bourbon / tequila in my cabinets now. And a few drops from a bottle of Johnny Blue.
Thanks for indulging me on this. This wasn't a big problem but you guys worked it like it was. Hope the project is better for it. Donation incoming.
Describe the bug an (animated) picture is worth a 1000 words. Check out the mixing timer in this gif https://media.giphy.com/media/VKq3DtHdVJ9vFAzvFq/giphy.gif
To Reproduce Start a manual mix (i set for 60mins) watch the timer, it starts counting down from 60 minutes for 5-10 seconds, then starts bouncing back and forth between a 60 minute timer and a 5 minute timer coincidentally, 5 minutes is the mixing period I have set in chemistry settings after each acid dose
Packet Capture Follow the instructions to complete a packet capture and attach the resulting zip/log files.
replay (2).zip
Please let me know if I can provide anything else.