jmicrobe / odin

Repo for documentation related to the Optical Density Instrument at the Stahl lab
MIT License
1 stars 2 forks source link

Froze at "Sending data via Ethernet" #39

Open jmicrobe opened 5 years ago

jmicrobe commented 5 years ago

On 5/20 I came in an saw that the ODIn was frozen (platform was up to take a reading, was not shaking). The last reading recorded by the server was on 5/19/19 at 15:28. Until that point readings were looking as expected.

Here are the most recent lines from run.log, located in /var/www/ODIn_dataServer/

/var/www/ODIn_dataServer/dataServer: at 05/06/19 21:22:26 -0700 - 13000 total packets received httpd: parsing config form setting log output file to 'OD1_100331.txt' httpd: sending config form httpd_send_response() = 1 httpd: parsing config form setting log output file to 'OD1_100332.txt' httpd: sending config form httpd_send_response() = 1 httpd: parsing config form setting log output file to 'OD1_100331.txt.bak' httpd: sending config form httpd_send_response() = 1 httpd: parsing config form setting log output file to 'OD1_100332.txt.bak' httpd: sending config form httpd_send_response() =

Here is some of the data from the system log, located at /var/log/syslog

May 20 06:25:05 raspberrypi rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="381" x-info="http://www.rsyslog.com"] rsyslogd was HUPed May 20 06:25:06 raspberrypi CRON[20499]: (CRON) info (No MTA installed, discarding output) May 20 06:25:42 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:25:42 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:25:50 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:25:50 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:27:28 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:27:28 raspberrypi rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ] May 20 06:27:28 raspberrypi rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ] ... May 20 11:24:49 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 11:26:03 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 11:26:03 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 11:26:55 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 11:26:55 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 11:27:07 raspberrypi systemd[1]: Starting Session c12 of user muninn.

And I noted that there is a syslog.1 file. Here is some of the data from that:

May 19 15:14:00 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:14:00 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:15:30 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:14:43 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:14:43 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:16:18 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:16:18 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:17:48 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:17:01 raspberrypi CRON[20147]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) May 19 15:19:36 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:19:36 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:21:06 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:20:44 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:20:44 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:22:22 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:22:22 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:23:52 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:24:00 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:24:00 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:25:30 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:25:08 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:25:08 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:26:11 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:26:11 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:27:41 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:26:11 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:27:04 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:27:04 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:30:59 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:30:59 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:32:29 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:30:59 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:31:51 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:31:51 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:32:47 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:32:47 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:34:17 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:32:47 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:33:40 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:33:40 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:35:18 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:35:18 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:36:48 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:35:33 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:35:33 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:37:12 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:37:12 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:38:42 2019 [try http://www.rsyslog.com/e/2007 ] May 19 15:38:41 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:39:03 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 19 15:39:03 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Sun May 19 15:40:33 2019 [try http://www.rsyslog.com/e/2007 ] ... ...

May 20 06:24:50 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:24:50 raspberrypi avahi-daemon[379]: Invalid response packet from host 172.28.236.46. May 20 06:25:01 raspberrypi rsyslogd-2007: action 'action 17' suspended, next retry is Mon May 20 06:26:31 2019 [try http://www.rsyslog.com/e/2007 ] May 20 06:25:01 raspberrypi CRON[20506]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )) May 20 06:25:04 raspberrypi systemd[1]: Reloading LSB: Apache2 web server. May 20 06:25:05 raspberrypi apache2[20627]: Reloading web server: apache2. May 20 06:25:05 raspberrypi systemd[1]: Reloaded LSB: Apache2 web server. May 20 06:25:05 raspberrypi rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="381" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

On 5/20 I shut the ODIn down and started a completely new run. So far I have not encountered this error. It seems to me that the main issue is that the ODIn isn't skipping sending data to the server when it can't make the connection, and instead freezes.

jmicrobe commented 5 years ago

Here's the code from the arduino. It seems like there's no "else: write to SD and continue".

Design-wise, it seems like it would be more efficient to allow the ODIn to queue data to send to the server, in cases where the connection is down.

https://github.com/jmicrobe/odin/blob/85af6f162e55142636c5bf7a38eb75d5f6b0a808/System_2.ino#L612-L624

jsebrof commented 5 years ago

The reason there is no else branch to the if statement here is because the boolean variable that is being tested has been hard-coded to true at the top of this source code file.

I'll take a look through the Ethernet library that was used to see if we might have missed anything when we wrote the software.

dacb commented 5 years ago

Those syslog entries, e.g. the avahi don't seem to be related. I don't think that is the IP of minun, right?

One other thing I might recommend is to look to see if the Arduino libraries for the ethernet hardware have been updated since the Arduino software was last built. I wonder if the ethernet send failed and the device froze. It may also be possible to begin/endPacket in a way that is non-blocking.

jmicrobe commented 5 years ago

Those syslog entries, e.g. the avahi don't seem to be related. I don't think that is the IP of minun, right?

@dacb No, the ip is different, but these files came off of the muninn. Any idea why that is? It's on a static IP that the department assigned.

Also as for the Ethernet & libraries - the ODIn is still using the old ethernet sheild (as in, hasn't been replaced since I've started using it) but I did purchase a replacement shield. I just haven't installed it yet. It's an Arduino Ethernet Sheild 2

dacb commented 5 years ago

I think those avahi-daemon[379]: Invalid response packet from host errors are related to network discovery services running on the pi for connecting to things like printers. They can safely be ignored.

jsebrof commented 5 years ago

The Ethernet library that we use I believe comes with the Arduino IDE package of standard libraries, which they publish reference documentation for. I looked through the release notes for newer versions and nothing relevant to our situation popped out at me since the 1.0.6 release that we have been using.

What I suspect happened is that the ODIn software failed on line #621, where the data is being written into the packet to send to the server. The Ethernet library reference documentation for the write() function states that it must "be wrapped between beginPacket() and endPacket()." In the source code it is indeed placed between calls to those two functions, however I suspect that an uncaught error when calling beginPacket() sets up some undefined behavior when the write() function is called, causing the ODIn system to freeze.

A follow-up question then would be, what caused the call to the beginPacket() function to fail? I believe that the DHCP lease that the ODIn system negotiated with the network router expired, and while the ODIn system can use an expired DHCP lease successfully, it can only do so until the router leases the IP Address to another user, at which time I suspect the calls to the beginPacket() began to fail.

In the Ethernet library reference documentation I noticed the existence of a maintain() function, which we do not currently use in our source code. Its documentation warns that this function needs to be called on a sub-second interval in order to successfully renegotiate the DHCP lease before the network router leases the IP Address to someone else. Internally the software keeps track of when the lease expires so calls to the maintain() function do nothing unless it is time for DHCP renegotiation.

Unfortunately, the data sampling section of the loop() function contains multiple delays which can potentially provide enough of a gap for potentially bad network behavior to occur. I'm still devising the best path forward with regard to solving this issue, but I'll make some pull requests soon.

jsebrof commented 5 years ago

Pull request #41 contains changes that should help make the Ethernet connection more resilient. Note that these changes have not been tested, as I lack the hardware to do so. More specifically, these changes check for error returns on all Ethernet function calls where errors are returned, adds a call to the Ethernet.Maintain() function, and if that call is unsuccessful, attempts to reconfigure the Ethernet. This reconfiguration might cause the Arduino sketch to exceed the size limitation for the Arduino Mega, so watch for whether that becomes an issue. I also fixed some code formatting and inconsistency that seems to have crept in over time.

jmicrobe commented 5 years ago

Thanks @jsebrof. I'm running your new code right now. Is there a certain length of time I should run it in order to confirm it works, or any other thing I could do to test the changes?

jsebrof commented 5 years ago

It should be working pretty great if you are receiving information from the ODIn on the server. In order to truly test all of the new error-handling, you'd need to log into the network router, cancel the dhcp lease that the ODIn is using, and then connect new network devices until the IP Address that the ODIn had been using is given to one of them. This would force the ODIn to reconfigure its dhcp lease in order to continue using the Ethernet connection.

On Tue, Jul 2, 2019 at 11:34 AM Jessica Hardwicke notifications@github.com wrote:

Thanks @jsebrof https://github.com/jsebrof. I'm running your new code right now. Is there a certain length of time I should run it in order to confirm it works, or any other thing I could do to test the changes?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jmicrobe/odin/issues/39?email_source=notifications&email_token=AAWMQZX2FHVCBUQTW5U5NMTP5ONVHA5CNFSM4HONAVNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZCFUGA#issuecomment-507795992, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWMQZXBQNQJ3BLCYNIKRALP5ONVHANCNFSM4HONAVNA .

-- Jonathan Forbes

jmicrobe commented 5 years ago

@jsebrof the server is on a static IP address and the arduino is not (I believe) although it uses an ethernet connection. Would you happen to know how this affects the dhcp lease? On the raspberrypi server I entered the following:

sudo dhclient -v
[sudo] password for muninn: 
irs_resconf_load failed: 59.
Unable to set resolver from resolv.conf; startup continuing but DDNS support may be affected
Internet Systems Consortium DHCP Client 4.3.1
Copyright 2004-2014 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/b8:27:eb:65:cd:4b
Sending on   LPF/eth0/b8:27:eb:65:cd:4b
Sending on   Socket/fallback
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
DHCPREQUEST on eth0 to 255.255.255.255 port 67
DHCPOFFER from 128.208.236.102
DHCPACK from 128.208.236.102
bound to 128.208.236.222 -- renewal in 434 seconds.

I'm not sure if that accomplished what you suggested, but the data is still coming in from the device without issue. I'm also curious - if I assigned a static IP address to the ODIn as well would this circumvent the issue?

jsebrof commented 5 years ago

Yes, the server has a static IP address, and the ODIn is programmed to send data to that IP Address specifically. I don't think there is anything that you can do on the raspberry pi server in order to test this networking issue. What you would need to do is access the network router, which is an entirely separate device itself, and then place the ODIn into a position where it would need to reconfigure its Ethernet configuration. An alternative to this would be watching the server logs, and observing if the ODIn ever changes IP addresses during the course of a single run. This method however could take awhile, as you mentioned that the networking problem only occurred rarely.

The server needs to have a static IP address, because the ODIn needs to know where to send the data, and wouldn't know if we didn't hard-code a network location for it to send the data to. The ODIn on the other hand, doesn't need to have a static IP address, because all it does is send data to the server, and the server doesn't particularly care where the data came from. We could give the ODIn a static IP address, and that would circumvent this dhcp lease issue. This is also something that you would need access to the network router in order to accomplish. Let me know if you want to do this, and I'll change the Arduino code. The modifications that were just made should solve this issue though, provided the cause really was dhcp lease expiration.

jmicrobe commented 5 years ago

I asked the IT department if they could give me a new static IP for the ODIn. Even if the code fixes it, I think it would be good to have a failsafe. If it's easy you can tell me where this info would go and I'll edit the code when I hear back from IT. Otherwise I'll forward the email to you. Thanks!

jsebrof commented 5 years ago

As requested, pull request #42 contains modifications to have ODIn use a Static IP Address instead of DHCP.