Errorcontrol: Send data only when certain conditions are met?

I've been using your collect script for a number of months now to track the temperatures in a ventilation system, and it works quite well, but there is one problem that I've not been able to solve so far: for whatever reason, the script sometimes receives faulty data from the Arduino (and by faulty data I don't just mean wrong values, but wrong field names).

Based on what I can tell from the log, what seems to happen is that one line of data is not completely transmitted (or received), so that the subsequent transmission is appended to the incomplete one. The subsequent transmission is also truncated, but at the beginning, and I have no idea what is causing this (I'll put an example at the end of this post to make it easier to understand, but I don't think we need to get into the details). Could it have something to do with the script's read timeout? Mine is set to 80 and the Arduino is transmitting two lines every 60 seconds (one right after the other)...

Anyway, I'm willing to accept that transmission sometimes fails and that I will lose those data points, so I just want those to be thrown away and not sent to Influxdb at all. How can I get the parser in the script to only send data to influx under certain conditions?

My idea would be that I include a code at the beginning of the line as well as at the end and the script should only send that line to influxdb if those codes match. If not, just discard the received data and wait for the next line. But I'm not sure whether and how this could be implemented...

Here is an example from the logs containing a faulty field name (CR0entstate), which the script "correctly" transmits to influxdb:

DEBUG:root:Received line 'FTX,status=testing elapsed=60001i,t0=-10.20,t1=14.30,t2=19.40,t3=-2.55,rh0=100.00,ah0=2.3,rh1=70.20,ah1=8.6,rh2=54.60,ah2=9.1,rh3=93.00,ah3=3.8,dew0=-10.2,dew1=8.9,dew2=10.0,dew3=-3.5,h-eff=92.8,t-eff=82.8,hum-gain=-7.88,CR0entstate=1,brightness=28i,desiredventstate=1,error0=0i,error1=0i,error2=0i,error3=0i,error4=0i,error5=0i,lederror=0i,timerup=2981332170i,timerdown=0i,timerhigh=0i,old_t_vals0=0i,old_t_vals1=0i,old_t_vals2=0i,old_t_vals3=0i,old_h_vals0=0i,old_h_vals1=0i,old_h_vals2=0i,old_h_vals3=0i,sensorerrors0=772i,sensorerrors1=1316i,sensorerrors2=1236i,sensorerrors3=1348i,h-index=5,end=2\r\n'
DEBUG:root:Sending lines: ['FTX,status=testing elapsed=60001i,t0=-10.20,t1=14.30,t2=19.40,t3=-2.55,rh0=100.00,ah0=2.3,rh1=70.20,ah1=8.6,rh2=54.60,ah2=9.1,rh3=93.00,ah3=3.8,dew0=-10.2,dew1=8.9,dew2=10.0,dew3=-3.5,h-eff=92.8,t-eff=82.8,hum-gain=-7.88,CR0entstate=1,brightness=28i,desiredventstate=1,error0=0i,error1=0i,error2=0i,error3=0i,error4=0i,error5=0i,lederror=0i,timerup=2981332170i,timerdown=0i,timerhigh=0i,old_t_vals0=0i,old_t_vals1=0i,old_t_vals2=0i,old_t_vals3=0i,old_h_vals0=0i,old_h_vals1=0i,old_h_vals2=0i,old_h_vals3=0i,sensorerrors0=772i,sensorerrors1=1316i,sensorerrors2=1236i,sensorerrors3=1348i,h-index=5,end=2 1613287276238050816']

Up until CR0, the data is part of the first line of data. What comes after that seems to be the end of the second line of data. Here is an example of what a correct transmission of both lines would look like:

DEBUG:root:Received line 'FTX,status=testing elapsed=60014i,t0=5.90,t1=18.60,t2=21.10,t3=9.90,rh0=100.00,ah0=7.2,rh1=54.70,ah1=8.7,rh2=49.50,ah2=9.1,rh3=91.00,ah3=8.5,dew0=5.9,dew1=9.3,dew2=10.1,dew3=8.5,h-eff=78.2,t-eff=83.6,hum-gain=-6.74,CR0-1=-1.00,CR0-2=-0.20,CR2-1=0.00,CR2-2=-0.00,end=1\r\n'
DEBUG:root:Sending lines: ['FTX,status=testing elapsed=60014i,t0=5.90,t1=18.60,t2=21.10,t3=9.90,rh0=100.00,ah0=7.2,rh1=54.70,ah1=8.7,rh2=49.50,ah2=9.1,rh3=91.00,ah3=8.5,dew0=5.9,dew1=9.3,dew2=10.1,dew3=8.5,h-eff=78.2,t-eff=83.6,hum-gain=-6.74,CR0-1=-1.00,CR0-2=-0.20,CR2-1=0.00,CR2-2=-0.00,end=1 1617568674850982912']
DEBUG:root:Received line 'FTX_log ventstate=1,ledventstate=1,brightness=26i,desiredventstate=1,error0=0i,error1=0i,error2=0i,error3=0i,error4=0i,error5=0i,lederror=0i,timerup=1911202681i,timerdown=0i,timerhigh=0i,old_t_vals0=0i,old_t_vals1=0i,old_t_vals2=0i,old_t_vals3=0i,old_h_vals0=0i,old_h_vals1=0i,old_h_vals2=0i,old_h_vals3=0i,sensorerrors0=840i,sensorerrors1=780i,sensorerrors2=828i,sensorerrors3=896i,h-index=0,end=2\r\n'
DEBUG:root:Sending lines: ['FTX_log ventstate=1,ledventstate=1,brightness=26i,desiredventstate=1,error0=0i,error1=0i,error2=0i,error3=0i,error4=0i,error5=0i,lederror=0i,timerup=1911202681i,timerdown=0i,timerhigh=0i,old_t_vals0=0i,old_t_vals1=0i,old_t_vals2=0i,old_t_vals3=0i,old_h_vals0=0i,old_h_vals1=0i,old_h_vals2=0i,old_h_vals3=0i,sensorerrors0=840i,sensorerrors1=780i,sensorerrors2=828i,sensorerrors3=896i,h-index=0,end=2 1617568675250469888']

This looks to me like a glitch on the serial line, but quite a long one (in the number of characters). It looks like the transmission is interrupted for a period of time somehow. What's your baud rate? And just to be sure, how long is your serial cable? (A longer cable is more prone to errors.)

I've been working on allowing a custom function for reading data, see the branch below. Perhaps you could use it to validate the data. https://github.com/ppetr/arduino-influxdb/compare/geiger

This looks to me like a glitch on the serial line, but quite a long one (in the number of characters).

I don't quite understand what you mean. By "glitch on the serial line" do you mean it's a transmission mistake on the Arduino? And by "quite a long one", are you suggesting I should keep transmissions shorter and send multiple lines instead? I guess I could do that, but it would mean that the measurements get different timestamps in influxdb, and I'm not sure what the downsides of that might be.

I've kept the baud rate at 9600 precisely because I wanted to avoid any problems with high baud rates. There is no hurry with sending these data.

The USB cable is a standard length, maybe 1 or 1.5 meters, not more.

I've been working on allowing a custom function for reading data, see the branch below. Perhaps you could use it to validate the data.

Are you referring to this?

https://github.com/ppetr/arduino-influxdb/blob/d3247b248dbb0626a002ce22cad9bc30f53bc067/collect.py#L123-L126

Could you explain a bit more how I would integrate such a custom function and what it would have to do? (I'm not a programmer and I don't know python.) Do I understand correctly that it would be that function's job to read the serial data, process it and then pass it on to collect.py? For my purposes, this seems a bit too far upstream. I'm happy to let your script read the lines and then process them.

But even for such a less demanding function, I would be unsure how to write it. Maybe if I had an existing function that does something similar, I might be able to adapt it, but writing it from scratch, I wouldn't know where to start.

Let's say I'd make the arduino start each transmission with a code (derived from millis or something) and then end the transmission with the same code. The function would then have to check whether the first and last x characters of the received data are the same. If they are, remove those characters and pass it on to collect.py, if not discard it.
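A check like this could be sketched in Python roughly as follows. This is only an illustration of the idea, not part of collect.py: the function name `strip_matching_code` and the code length of 4 characters are assumptions made up for the example.

```python
from typing import Optional

CODE_LENGTH = 4  # assumed number of characters in the start/end code


def strip_matching_code(line: str, code_length: int = CODE_LENGTH) -> Optional[str]:
    """Return the payload if the line starts and ends with the same code, else None."""
    line = line.rstrip("\r\n")
    # Too short to hold two codes plus a payload.
    if len(line) <= 2 * code_length:
        return None
    start, end = line[:code_length], line[-code_length:]
    if start != end:
        # Codes differ: the line was probably merged or truncated, so drop it.
        return None
    # Remove both codes and hand back only the data in between.
    return line[code_length:-code_length]
```

For this to catch two merged lines, the two lines of one measurement cycle would need different codes; otherwise the start of the first line and the end of the second would still match.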

Perhaps an easier way of doing it would be to let the Arduino stick to the line format and send two dummy fields, startID=... and endID=... The script would then just have to compare the values of startID and endID and, if they match, pass on the entire line.
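A minimal sketch of that comparison, assuming the simple line format shown in the logs above (measurement and optional tags, one space, then comma-separated key=value fields, with no spaces or commas inside values). The helper names `parse_fields` and `ids_match` are hypothetical and not part of collect.py.

```python
from typing import Dict


def parse_fields(line: str) -> Dict[str, str]:
    """Split the field section of a simple line-protocol line into a dict."""
    parts = line.strip().split(" ")
    if len(parts) < 2:
        return {}  # no field section at all
    fields = {}
    for pair in parts[1].split(","):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that are not key=value
            fields[key] = value
    return fields


def ids_match(line: str) -> bool:
    """True only if both startID and endID are present and equal."""
    fields = parse_fields(line)
    start = fields.get("startID")
    return start is not None and start == fields.get("endID")
```

A line for which `ids_match` returns False would simply be skipped. As with the start/end code, each line needs its own ID value, or a merge of two lines from the same cycle would go unnoticed.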

It just occurs to me that I could simply send each line of data twice... It seems as if this generic solution could be useful for error control more generally.

Would it be possible to add an option like --double-data that compares each line with the previously received line and calculates the Levenshtein distance between the two? By default, when this option is activated, the current line would be processed further only when the Levenshtein distance is 0, but a different tolerance value could be passed with the option. There is even a python module for this, python-Levenshtein, but if you want to keep it really simple, == would be less flexible but would probably do the job as well.
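To make the proposal concrete, here is a rough sketch of what such a filter might look like. It uses a small pure-Python Levenshtein function instead of the python-Levenshtein module, and the class name `DoubleDataFilter` and its `feed` method are invented for this example; nothing like it exists in collect.py.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class DoubleDataFilter:
    """Pass a line on only after a sufficiently similar copy has arrived."""

    def __init__(self, tolerance: int = 0) -> None:
        self.tolerance = tolerance
        self._pending: Optional[str] = None

    def feed(self, line: str) -> Optional[str]:
        """Return the line if it confirms the previous one, else None."""
        if (self._pending is not None
                and levenshtein(self._pending, line) <= self.tolerance):
            self._pending = None  # pair consumed, start fresh
            return line
        self._pending = line  # remember it and wait for its copy
        return None
```

With the default tolerance of 0 this behaves like a plain == comparison. With a larger tolerance the second copy is the one passed on, which may itself be the damaged one, so a nonzero tolerance trades safety for fewer dropped points.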

That baud rate and cable length should be quite resilient to any errors like that. I think we can safely rule this out.

I'm not in favor of doubling data or similar ad hoc solutions. Usually they're not 100% successful anyway, and only mask the actual issue. Let's try to figure out what the cause is.

Another option might be that your PC for some reason can't process data quickly enough. But that'd only occur under a very high load, or when the Python application gets swapped out by another process. Could this be happening?

You could also try to enable serial parity bit on both devices to make the communication more resilient.

Finally, I think option inter_byte_timeout would also help here: It should detect if bytes stop coming from the serial line when a line is being transmitted, signal an error and abort that line. While the line will be lost, it won't attempt to pass incorrect/incomplete data to InfluxDB. I can add support for this option, that should be pretty straightforward.
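As a rough illustration of the idea (not the actual change to collect.py): pyserial's `serial.Serial` accepts an `inter_byte_timeout` keyword, and a read that times out in the middle of a line hands back whatever bytes arrived so far, without the trailing newline. The sketch below assumes that behaviour and uses an in-memory stream as a stand-in for the port so that it runs without hardware. The device path, the 0.5 s value and the function name `read_complete_line` are assumptions.

```python
import io
from typing import BinaryIO, Optional

# Keyword arguments that pyserial's serial.Serial accepts; the device path
# and the 0.5 s inter-byte timeout are assumed values for illustration.
SERIAL_KWARGS = {
    "port": "/dev/ttyUSB0",
    "baudrate": 9600,
    "timeout": 80,
    "inter_byte_timeout": 0.5,
}


def read_complete_line(stream: BinaryIO) -> Optional[bytes]:
    """Read one line; return None if it was cut off before the newline."""
    raw = stream.readline()
    if not raw.endswith(b"\n"):
        return None  # timed out (or stream ended) mid-line: discard
    return raw


# Stand-in for the serial port: one complete line, then a truncated one.
fake_port = io.BytesIO(b"FTX t0=1.0\r\nFTX t0=2")
first = read_complete_line(fake_port)   # complete line is returned
second = read_complete_line(fake_port)  # truncated line is dropped
```

With pyserial installed, the real port would be opened with `serial.Serial(**SERIAL_KWARGS)` in place of `fake_port`. Note that this check only rejects the half that was cut off at the end; a fragment that starts mid-line but still ends with a newline would pass it.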

I'll add an example for such a serial function later, I'll need to test it to make sure it works.

Wow, you're amazing! I like your attitude of getting at the root problem. Countless hours on support hotlines or chats taught me to go for the minimal possible solution (aka workaround) because that's all you can usually get. Not satisfying from an engineering point of view, but it lets you get on with life. So it's really great to have you push in the other direction.

Another option might be that your PC for some reason can't process data quickly enough. But that'd only occur under a very high load, or when the Python application gets swapped out by another process. Could this be happening?

I'm not sure how exactly to investigate that properly, but here is an approximation:

Here is the graph with (some of) the measurements from the last 24 hours (as shown in Grafana, which gets its data from Influxdb):

The vertical black lines are missing data, likely caused by the data transfer issue we're discussing here. The most striking thing is, of course, that they appear at regular intervals of about 3.5 hrs (I hadn't noticed this before):

00:13 03:44 07:11 10:37 14:07 17:37 21:05
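To put numbers on the spacing, the gaps between those timestamps can be worked out with a few lines of Python (a quick sketch; the variable names are made up for illustration):

```python
from datetime import datetime

# Outage times read off the graph above (all on the same day).
outage_times = ["00:13", "03:44", "07:11", "10:37", "14:07", "17:37", "21:05"]

parsed = [datetime.strptime(t, "%H:%M") for t in outage_times]

# Minutes between each outage and the next one.
intervals_min = [int((later - earlier).total_seconds() // 60)
                 for earlier, later in zip(parsed, parsed[1:])]

mean_min = sum(intervals_min) / len(intervals_min)
```

The intervals come out as 211, 207, 206, 210, 210 and 208 minutes, averaging about 209 minutes, so consistently a little under 3.5 hours.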

Here is the load diagram from the server on which your script is running:

The graph is not as fine-grained, but it's enough to see that none of the outages coincides with high CPU load.

So let's look at network traffic (for no particular reason, just because I have it):

Nothing here either.

The machine has 24 GB of RAM, so that shouldn't be an issue either.

Not sure if this means your hunch was wrong, but to me it looks like the best clue we have is the 3.5 hour interval, though strangely it isn't exactly 3.5 hrs. I am not aware of any scheduled jobs running every 3.5 hrs, so it's more likely that the outages are not triggered by a timer but by something else that happens to recur about every 3.5 hrs, like some overflow or something.

The outage pattern is not as clean every day. Here is yesterday's graph:

The regular outages are still visible but there are also some much larger outages.

I don't have a CPU graph for yesterday but the one for the last week doesn't indicate that the big outage yesterday has anything to do with CPU:

You could also try to enable serial parity bit on both devices to make the communication more resilient.

How do I do that (on arduino and on debian)?

The inter_byte_timeout option sounds good to me.
