motoz / PellMon

Logging, monitoring and configuration for pellet burners.
27 stars 14 forks source link

give up and return fail #92

Open japlab opened 6 years ago

japlab commented 6 years ago

I just get my new usb2serial adapter installed but have the same fail as on the other two i try out. I don’t think it could be the adapter any more that cource the fail. Anyway, pellmon is working perfect, only this anoying fail in the log. Anything to try out.

I'm running, NBE 4.99F

pellmon11

motoz commented 6 years ago

Probably not the adapter then. Maybe the burner controller hardware or the particular firmware version 4.99F or possible the cable picks up electrical interference from something. When I was running a V6 controller I had a timeout maybe once of twice in a week at most, so the controller doesn't always respond. I have never seen or heard about that much timeouts so something is different in your setup, but I have no real clue about what it might be, unfortunately.

japlab commented 6 years ago

Hi Anders. Thanks for the suggestions. I now tried to use shielded cable (5meter) and connect shield to real ground in one end. No improvement. There is still at least 6 timeouts every 24h. I think the controller with firmware 4.99F is way overloaded. I noticed the 7 segments digits on front flicker when connected to serial interface. I'll try to dig into the python code and look at the NBE protocol manegement. I soon report back. BR Jan

motoz commented 6 years ago

Yes it might be just overloaded, I never had stokercontrol so I don't know how often they poll 'officially'.

The code is in the 'scotteprocol' module. Maybe you could try to add a small delay before the retry about here: https://github.com/motoz/PellMon/blob/3e2511ada9e2cd5715c5c9ad67f45a41ef8d82f3/src/Scotteprotocol/protocol.py#L299 just to see if that helps.

(And also note that that code was more or less my firt python code ever while trying to learn the language so I'm not very proud of it..)

The reads are triggered by polling the internal database from several different places, this one is quite fast (2s delay between polls): https://github.com/motoz/PellMon/blob/3e2511ada9e2cd5715c5c9ad67f45a41ef8d82f3/src/Pellmonsrv/pellmonsrv.py#L106

Thomsus commented 6 years ago

Hi, I have had the same issue for some time now. I tried replacing the serial to usb device, but the issue continues. Previously it only occurred once in a while, where I solved it by restarting my raspberry. That's not the case anymore :-( I have chip version 6.87. Being overloaded? Well maybe.

Can something be reconfigured in pellmon to lower the poll intervalls, or any other ideas?

motoz commented 6 years ago

Thomsus, do you mean that your controller stops responding completely?

Thomsus commented 6 years ago

Hi Motoz. Pellmon works, as it shows the graphs, updates the vector image, sends mails etc. But I get a lot of timeouts. image

motoz commented 6 years ago

Can you give a little more details on this: "where I solved it by restarting my raspberry". Did restarting the raspberry eliminate the timeouts for a while? For how long? Did you restart only the raspberry and not the controller? And now the timeouts occur more frequently and restarting the raspberry does not make a difference any more?

Anything else? Like upgraded raspbian, upgraded pellmon, installed something else on the raspberry?

Thomsus commented 6 years ago

I tried to make the story short, but understand why you ask for more :-) I had an old raspberry pi with Pellmon running for a few years, that suffered this occasional timeout issue. It was so seldom that I simply rebooted the raspberry for this to get fixed. Simply restarting the two Pellmon services didn't do it. One day the SD card suffered a crash. Yeah, should have made a backup, I know :-) So I brought a new raspberry pi 3, a new micro sd card, and made a clean install, not using any of the old stuff (DB etc). I don't remember if it had the timeout issue from the beginning, or if it started shortly after. But it's been there ever since, for a couple of months now.

I just upgraded and rebooted it. At first it detects some parameter changes that shouldn't have changed: Parameter 'feeder_capacity' changed from '70' to '870' (hasn't been changed for a looong time) Parameter 'boiler_temp_set' changed from '799' to '70' (799?? It should have exploded then :-) ) 'mode' changed from '651' to 'Running' 'alarm' state went from '18' to 'ok' (I have no alarm called 18)

Having been running for 11 minutes, the timeout issues started again.

motoz commented 6 years ago

I still don't understand what difference the reboot made, if you only had an occasional timeout every now and then, why reboot?

japlab commented 6 years ago

My two cent info: Last year I was installing pellmon on a RPI v3 and changed to a older RPI v2 and boot without GUI. Just by swapping the SD card. Easy. This setup has worked for several month except for the timeout issue. After trying tree different USB2Serial adapter, I decided to try changing back to RPI v3. I was forced to do a total reinstall of Rasperian and pellmon but with backup conf files. I noticed a longer run without timeout complains, but It began after a while again.... Please see the log screen photo. Im In doubt about the clock is 12 or 24h ! I can't find any other logs to look into. I thought of something with the SD card RW. timeout

Thomsus commented 6 years ago

@Motoz: The reboot made Pellmon work fine for months (no timeouts).

japlab commented 6 years ago

Was running for several day without any problem. The log was not giving any warnings, only the 7 lines showing above dated "2018-03-18 04-10-10.53.634 ...... " Probably because I use the Raspberry v3, the fastest one. !

Today the server was stopping and I can't restart it. I get this messages: Are the SD card weared out !

The server encountered an unexpected condition which prevented it from fulfilling the request.

Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond response.body = self.handler() File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in call self.body = self.oldhandler(*args, *kwargs) File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in call return self.callable(self.args, **self.kwargs) File "/usr/lib/python2.7/dist-packages/Pellmonweb/pellmonweb.py", line 661, in index parameterlist = dbus.getFullDB(['','','','','']) File "/usr/lib/python2.7/dist-packages/Pellmonweb/pellmonweb.py", line 189, in getFullDB raise DbusNotConnected("server not running") DbusNotConnected: server not running

motoz commented 6 years ago

I have always used a B+ model (not even a raspberry 2 that is), so a raspberry 3 shouldn't make a difference for pellmon itself, but maybe there is some other software running that causes trouble. Do you run a full raspbian or raspbian lite? I have always used raspbian lite since I have no use for the extra software that comes with the full version (which is gigantic these days...). And I'm still on rasbian jessie btw, are you on the latest (stretch I beleive it's called)?

Regarding the server (pellmonsrv) fail, if you start it in the forground according to https://github.com/motoz/PellMon/wiki/Contributing-to-PellMon you should at least see some crash info if it crashes.

motoz commented 6 years ago

I dug out my old V6.33 controller box yesterday and set it up (without a burner connected) with an old raspberry B with 512M ram running raspbian (lite) jessie. So far not a single timeout. Which raspbian version are you running (jessie or strech), full or lite?

Have you installed anything else on top of a default raspbian install apart from pellmon?

Thomsus commented 6 years ago

Anders, would you like me to start my own thread with my issue? I know I'm the one who interfered with jablab's issue, as it seemed similar to mine.

japlab commented 6 years ago

I struggle with installing Stretch Light, so I ones again, installed Stretch Full on a new SD card. Pellmon v0.7.1-alpha1. The time stamp show a god run for 18 hour, then it begin to do timeout. Btw, it dont't crash, so I live fine with it. I'm still very happy with the performance. To Thomsus, you are welcome to interfere here in this thread. More input mean better chance to solve the trouble. NOte: I have do a clean install of Rasperian full and done a "apt-get upgrade". No VNC, no SSH, just enabling Serial and OneWire.

2018-03-19 07:33:06,514 - INFO - Timeout again, give up and return fail 2018-03-19 07:11:02,053 - INFO - Timeout again, give up and return fail 2018-03-19 07:00:00,031 - INFO - Timeout again, give up and return fail 2018-03-19 06:37:55,728 - INFO - Timeout again, give up and return fail 2018-03-19 06:18:35,355 - INFO - Timeout again, give up and return fail 2018-03-18 12:44:40,308 - INFO - copied /var/lib/pellmon/nvrrd.db to /mnt/ramdisk/rrd.db 2018-03-18 12:44:40,211 - INFO - Activated plugins: ScotteCom, SiloLevel, Consumption, Cleaning, Onewire 2018-03-18 12:44:40,191 - INFO - chip version detected as: 4.99F 2018-03-18 12:44:40,116 - INFO - serial port ok 2018-03-18 12:44:39,894 - INFO - Looking for plugins in /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins 2018-03-18 12:44:39,893 - INFO - starting PellMon

motoz commented 6 years ago

Japlab, what exactly do you mean by 'enabling Serial'?

And Thomsus, are you also on full raspbian stretch or something else?

japlab commented 6 years ago

In "raspi-config" I chose OneWire and Serial. I now see that the Serial should't be needed since it run over USB, or !

motoz commented 6 years ago

Ok, the serial console used to be enabled by default, and having it enabled should not cause any problems anyway.

I suppose I have to download and install the full raspbian stretch at some point and check out if that makes any difference. I think the full raspbian nowadays starts the graphical desktop by default, that of course uses a lot of ram and probably also cpu resources, even if it's not used. I would recommend doing the pellmon installation and setup over ssh, then you never need the graphical desktop and the 'lite' version works just the same as the full.

motoz commented 6 years ago

Now it has been running for 16 hour on the full raspbian stretch without a single timeout. Too early to say anything certain but it doesn't look like it's related to the raspbian version.

Maybe you could both list exactly which usb-serial adapters you have tried to see if there is any semblance. Output of lsusb, any identification printed on the case, and such.

Mine is listed as ID 067b:2303 Prolific Technology, Inc. PL2303 Serial Port

Another possibility is too weak power supply. It's recommended to use a 2.5A power for a raspberry3.

Thomsus commented 6 years ago

Hi Motoz. My Raspberry pi is running "Raspbian GNU/Linux 9 (stretch)" USB says: "Bus 001 Device 004: ID 067b:2303 Prolific Technology, Inc. PL2303 Serial Port"

Power supply you say? Well, it was using an official 2,5 A raspberry PSU. Swapped it with another, but no difference. When I restarted the raspberry, before the log was filled with error messages, I noticed it said: INFO - invalid setting for plugin_dirs

So I enabled debugging, and checked the log. It seems something is missing: 2018-03-21 10:43:54,853 - DEBUG - error in [pollvalues]: d11 missing, written as undefi$ 2018-03-21 10:43:54,854 - DEBUG - error in [pollvalues]: d12 missing, written as undefi$ 2018-03-21 10:43:54,854 - DEBUG - error in [pollvalues]: d13 missing, written as undefi$ 2018-03-21 10:43:54,854 - DEBUG - error in [pollvalues]: d14 missing, written as undefi$ 2018-03-21 10:43:54,855 - DEBUG - error in [pollvalues]: d15 missing, written as undefi$ 2018-03-21 10:43:54,855 - DEBUG - Polling is on 2018-03-21 10:43:54,857 - INFO - invalid setting for plugin_dirs 2018-03-21 10:43:54,868 - INFO - starting PellMon 2018-03-21 10:43:54,869 - INFO - Looking for plugins in /usr/lib/python2.7/dist-package$ 2018-03-21 10:43:54,875 - DEBUG - PluginManager walks into directory: /usr/lib/python2.$ 2018-03-21 10:43:54,877 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/cleaning.pellmon-plugin 2018-03-21 10:43:54,879 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/scottecom.pellmon-plugin 2018-03-21 10:43:54,881 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/onewire.pellmon-plugin 2018-03-21 10:43:54,883 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/openweathermap.pellmon-plug$ 2018-03-21 10:43:54,885 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/raspberrygpio.pellmon-plugin 2018-03-21 10:43:54,888 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/owfs.pellmon-plugin 2018-03-21 10:43:54,890 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/calculate.pellmon-plugin 2018-03-21 10:43:54,892 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/consumption.pellmon-plugin 2018-03-21 10:43:54,895 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/customalarms.pellmon-plugin 2018-03-21 10:43:54,898 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/nbecom.pellmon-plugin 2018-03-21 10:43:54,900 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/pelletcalc.pellmon-plugin 2018-03-21 10:43:54,903 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/exec.pellmon-plugin 2018-03-21 10:43:54,905 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/silolevel.pellmon-plugin 2018-03-21 10:43:54,907 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/heatingcircuit.pellmon-plug$ 2018-03-21 10:43:54,910 - DEBUG - PluginManager found a candidate: /usr/lib/python2.7/dist-packages/Pellmonsrv/plugins/testplugin.pellmon-plugin

motoz commented 6 years ago

OK, so your adapter is either identical to mine that's working fine, or you have one of those chinese fakes... Your log is completely normal.

I still don't have a single timeout so I can't really do anything about this.

Thomsus commented 6 years ago

Normal log? I wouldn't have guessed so. Anyway, I found this in the log: 2018-03-20 21:49:34,191 - INFO - Timeout again, give up and return fail 2018-03-20 21:49:44,207 - INFO - error in retry [Errno 0] GetItem failed 2018-03-20 22:07:34,769 - INFO - Timeout again, give up and return fail

So what are my options? I can e-mail the log, if that's worth anything.

motoz commented 6 years ago

Yeah, the complaining about plugin_dirs in the log is my mistake, not having plugin_dirs in the conf at all shouldn't generate a warning as it's the normal case. The errors in debug mode for missing stuff in [pollvalues] is likewise not really an error, and since a couple of years the normal case as I added a couple of unused columns to the standard rrd definition to make it easier to add data later. DEBUG mode is anyway only useful when you are debugging, ie. following up what happenes by reading the code.

The log wouldn't give anything, I have tried to replicate your systems as far as I can without success, so there is nothing I can do. Have any of you tried to run pellmon in debug mode as described in the 'Contributing to pellmon' page? If you are able to make changes to the code and test them then I could make suggestions what to try.

rcapion commented 6 years ago

Hi Thomsus, Have you tried with another com-cable and checked for a good connection on the terminals on the print of the controller? I used to have timeouts from time to time but it is gone after I improved the connection. The wires in my COM-cable it extremely thin and to begin with I had only put the bare wires into the terminals on the controller. I then tinned the exposed parts of the wires which made it a bit thicker and stronger. I think that this helped to get a better connection.

Best regards Richard

Thomsus commented 6 years ago

I have checked the cable connections. All wires are tightly fitted in the terminals, on the print. My issue really exploded after re-installing the raspbian lite with pellmon. So I don't suspect the cable fitting. What a shame the old card suffered a data loss, otherwise it would probably still be running.

with no actions for me to try out, I will leave it running as is. It's a bit annoying seeing the many error messages, but I can live with it.

japlab commented 6 years ago

I clone the project, what to do next. I'm lost, and dumb.

pi@raspberrypi:~/PellMon $ ./autogen.sh ./autogen.sh: line 1: autoreconf: command not found

This is content of autogen.sh file:

autoreconf --install || exit 1

update 10 min. later ;-)

pi@raspberrypi:~/PellMon $ ./autogen.sh ./autogen.sh: line 1: autoreconf: command not found pi@raspberrypi:~/PellMon $ cd src pi@raspberrypi:~/PellMon/src $ ./pellmonsrv start error reading config file 2018-03-22 21:35:42,625 - INFO - invalid database definition, data polling not possible 2018-03-22 21:35:42,626 - INFO - Invalid poll_interval setting, using 10s 2018-03-22 21:35:42,627 - DEBUG - Polling is off noconf No section: 'plugin_settings'

motoz commented 6 years ago

I see the the 'contributing to' guide didn't mention installing the dependencies. As you already installed PellMon using the .deb package, your system package manager already installed them for you. However one is missing, namely the one listed in the Readme under 'Additional dependencies for building'. Install that with sudo apt-get install autoconf and you should get further. Don't forget to shut down the installed pellmon with sudo service pellmonsrv stop and sudo service pellmonweb stop before running from the source code.

I updated the 'contributing' guide. It's good to get feedback about where you get stuck so I can improve the guide.

motoz commented 6 years ago

And when you get that far, you should probably use pellmonsrv debug instead of pellmonsrv start to start it. That way it won't become a background process so you can see anything it 'prints' or if it crashes you can see the error information.

Since you are not debugging pellmonweb, you can start that in the background with ./pellmonweb -D before starting pellmonsrv, if you need it running.

japlab commented 6 years ago

Thanks. I now try and get this:

pi@raspberrypi:~/PellMon/src $ ./pellmonsrv debug
No option 'plugin_dirs' in section: 'plugin_settings'
/home/pi/PellMon/src/Pellmonsrv/plugins/onewire/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/nbecom/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/pelletcalc/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/scottecom/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/customalarms/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/exec/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/cleaning/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/silolevel/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/consumption/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/raspberrygpio/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/owfs/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/calculate/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/heatingcircuit/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/testplugin/__init__
/home/pi/PellMon/src/Pellmonsrv/plugins/openweathermap/__init__
^Z
[5]+  Stopped                 ./pellmonsrv debug
motoz commented 6 years ago

Great, then you are all set. Btw, don't use ctrl-Z to break. It doesn't quit, it only puts the process to the background. Use ctrl-C to quit.

Do you have the web server running in the debug setup also? That's strictly not necessary, you can view the log directly also from src/dev_var/log/pellmonsrv/pellmon.log

I suggested a couple of things you could try in the beginning of this thread, do you need more pointers? I don't know how familiar you are with python or programming in general.

japlab commented 6 years ago

Little experienced with delphi, bascom, arduino, java, but nothing in python. webserver is not running. I can see the log:

2018-03-23 12:21:33,121 - INFO - serial port ok
2018-03-23 12:21:33,200 - INFO - chip version detected as: 4.99F
2018-03-23 12:21:33,220 - INFO - Activated plugins: ScotteCom, SiloLevel, Consumption, Cleaning
2018-03-23 15:07:26,648 - INFO - invalid setting for plugin_dirs
2018-03-23 15:07:26,650 - INFO - starting PellMon
2018-03-23 15:07:26,650 - INFO - Looking for plugins in /home/pi/PellMon/src/Pellmonsrv/plugins
2018-03-23 15:07:26,895 - INFO - serial port ok
2018-03-23 15:07:26,972 - INFO - chip version detected as: 4.99F
2018-03-23 15:07:26,991 - INFO - Activated plugins: ScotteCom, SiloLevel, Consumption, Cleaning
2018-03-23 15:08:50,519 - INFO - Timeout again, give up and return fail
2018-03-23 15:10:24,919 - INFO - Timeout again, give up and return fail
2018-03-23 12:21:33,121 - INFO - serial port ok
2018-03-23 12:21:33,200 - INFO - chip version detected as: 4.99F
2018-03-23 12:21:33,220 - INFO - Activated plugins: ScotteCom, SiloLevel, Consumption, Cleaning
2018-03-23 15:07:26,648 - INFO - invalid setting for plugin_dirs
2018-03-23 15:07:26,650 - INFO - starting PellMon
2018-03-23 15:07:26,650 - INFO - Looking for plugins in /home/pi/PellMon/src/Pellmonsrv/plugins
2018-03-23 15:07:26,895 - INFO - serial port ok
2018-03-23 15:07:26,972 - INFO - chip version detected as: 4.99F
2018-03-23 15:07:26,991 - INFO - Activated plugins: ScotteCom, SiloLevel, Consumption, Cleaning
2018-03-23 15:08:50,519 - INFO - Timeout again, give up and return fail
2018-03-23 15:10:24,919 - INFO - Timeout again, give up and return fail

I dont get any debug information. Is there any print statement in the code, or do i need to enable this.

motoz commented 6 years ago

No it doesn't print anything, you need to add print statements yourself where you want the

But first you could try to increase the 2s poll interval to see if that helps (see above, assuming you still see timeouts in the log of course). Then for instance a time.sleep(1) before the retry as I suggested above, if the controller isn't ready to answer then maybe the retry is too quick.

japlab commented 6 years ago

I'm still trying to understand the semantic: When applying a logger.info() in Protocol.py, I get the value in the log file. But annoying to open and close the log file to get updated data. Is it possible to get it in the terminal, realtime.! Tryed with the logger.debug() but nothing happen. I'm losing my hair ;-)

motoz commented 6 years ago

You can use the 'print' statement to print directly to stdout, ie. the terminal, eg:

                    if not result:
                        print 'Timeout, retrying...'           
                        logger.debug('Retrying')
                        try:
                            self.ser.flushInput()
                            logger.debug('serial write')

The logger is for writing to the log file, by default only 'info' messages. if you want to see the debug messages you need to configure the logger for that, see pellmon.conf.

You can also have the end of the log file scrolling in another teminal using eg:

watch tail src/dev_var/log/pellmonsrv/pellmon.log

(And don't forget that python has 'semantic whitespace', ie make sure the indentations line up to form code blocks and don't mix spaces and tabs, pellmon uses spaces for indentation)

japlab commented 6 years ago

Great, thanks a lot. I use "watch tail". Works great. I'll try different things out, to see which medicine is working. BR Jan

motoz commented 6 years ago

Another possible way to search for the problem is to start smaller. Shut down pellmonsrv and pellmonweb and try something like this:

from Scotteprotocol import Protocol
import time

protocol = Protocol('/dev/ttyUSB0', 'auto')

while True:
        print protocol.getItem('boiler_temp'), protocol.getItem('time_minutes'), time.time()
        time.sleep(0.1)

This will print boiler_temp, time_minutes and the system time until it errors out on the first timeout.

Or like this to only print the timeouts:

from Scotteprotocol import Protocol
import time

protocol = Protocol('/dev/ttyUSB0', 'auto')

while True:
    try:
        a = protocol.getItem('boiler_temp')
        b = protocol.getItem('time_minutes')
    except IOError:
        print 'Timeout', time.time()