helium / gateway-rs

The Helium Gateway
Apache License 2.0
280 stars 110 forks source link

Connection Reset #198

Closed skiplyfrance closed 2 years ago

skiplyfrance commented 2 years ago

Hi,

I installed gateway-rs on a Multitech Conduit connected with LTE and I'm losing some messages.

This mainly happens when no "valid" message has been sent for a while (after testing I would say 30 minutes).

I have this error message:

state channel error Service(Rpc(Status { code: Unknown, message: "error reading a body from connection: connection reset", source: Some(hyper::Error(Body, Error { kind: Io(Kind(ConnectionReset)) })) })), oui: 1, uri: http://52.8.80.146:8080/, pubkey: XXX, module: router

After that, the following messages sent after a short period of time are transmitted (in general, sometimes not).

The network level is good, and it never happened with Ethernet.

I'm using the same gateway with the same type of connection (LTE) at the same place with other NS, and they were all able to manage this kind of issues (mostly using Semtech PF).

Any idea on how to solve/mitigate that?

skiplyfrance commented 2 years ago

I can't be 100% sure but seems to be the consequence of a very recent update, as I didn't have this problem 3 days ago (or I didn't notice it).

madninja commented 2 years ago

I can't be 100% sure but seems to be the consequence of a very recent update, as I didn't have this problem 3 days ago (or I didn't notice it).

thing is.. there was no release done of gateway-rs in the last 3 days.

skiplyfrance commented 2 years ago

thing is.. there was no release done of gateway-rs in the last 3 days.

True.

But the problem remains. After 30 minutes, the next message is lost due to this error. Do you have an idea of what can be the reason and how to mitigate that?

madninja commented 2 years ago

No, because I really don't have enough data to help figure out what is going on. Especially if you say that other LTE units are working fine. The only guess I have is that your LTE data latency is very long somehow for that device

skiplyfrance commented 2 years ago

What I said is that the same gateway with LTE is working with Resiot and TTN, but not with gateway-rs.

The problem is regular, and happens exactly every 30 minutes, and that makes me feel this is not a question of latency.

I suspect there is a kind of asymetric UDP connection lost due to connection reset.

I will investigate.

madninja commented 2 years ago

ahh.. very interesting.. and very odd.. the "failure to read body" looks like the grpc connection got dropped somehow. Perhaps the carrier is not allowing long lived (mostly idle?) http2 connections and drops them after a while?

I can't see what the next set of messages would be since you didn't include them in the original issue text? Technically the grpc client should attempt reconnects to the router on failure, so if that's failing I'd like to know more details from the logs.

skiplyfrance commented 2 years ago

I'll be back at the office on monday and I will lead a couple of tests. I will let you know if I succeed and send you full log if not.

skiplyfrance commented 2 years ago

I tested it with an Ethernet connection. Yesterday, I received a message at 4pm. I let it run during the night. And today, nothing works. I can see the message in the logs of the gateway. There is no error. But I can't see anything in the Helium Console. Even restarting the gateway didn't solve the problem. I'm totally lost.

skiplyfrance commented 2 years ago

I have to investigate in deep to understand why gateway-rs is not stable on this GW.

skiplyfrance commented 2 years ago

Today I saw this in the console: image Which make me think I'm not totally crazy when I said that I can see the messages in the gateway logs, but nothing in the console. And the fnct "40" twice is really strange... The gateway is wired (1gb/s), so this delay does not come from the network.

skiplyfrance commented 2 years ago

This happens each time we send data now... image

lthiery commented 2 years ago

I'm sure you'll provide the answer in #201, but I'm interested in confirming which version of gateway-rs you are running for this ticket. Is it v1.0.0-alpha.22? Can you provide logs from these hours?

skiplyfrance commented 2 years ago

It is v1.0.0-alpha.21 (I'm unable to update to 22). Logs:

2022-03-01T08:34:46.919967+01:00 mtcdt helium_gateway[2034]: uplink @2181447732 us, 868.30 MHz, Ok(DataRate(SF12, BW125)), snr: 10, rssi: -33, len: 19 from MacAddress(00:80:00:00:A0:00:F1), module: gateway**
2022-03-01T08:34:56.852104+01:00 mtcdt helium_gateway[2034]: connected state channel, oui: 1, uri: http://52.8.80.146:8080/, pubkey: 112qB3YaH5bZkCnKA5uRH7tBtGNv2Y5B4smv1jsmvGUzgKT71QpE, module: router
2022-03-01T08:34:57.024335+01:00 mtcdt helium_gateway[2034]: received banner (ignored), oui: 1, uri: http://52.8.80.146:8080/, pubkey: 112qB3YaH5bZkCnKA5uRH7tBtGNv2Y5B4smv1jsmvGUzgKT71QpE, module: router, sc_id: XNCD5yAj7lGHA3LIqlPN9_LLvqtdE8zFoux2Q8cWXtw
2022-03-01T08:34:58.127141+01:00 mtcdt helium_gateway[2034]: accepting new state channel, oui: 1, uri: http://52.8.80.146:8080/, pubkey: 112qB3YaH5bZkCnKA5uRH7tBtGNv2Y5B4smv1jsmvGUzgKT71QpE, module: router, sc_id: XNCD5yAj7lGHA3LIqlPN9_LLvqtdE8zFoux2Q8cWXtw
2022-03-01T08:41:00.815981+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247826, module: dispatcher
2022-03-01T08:46:01.472244+01:00 mtcdt helium_gateway[2034]: uplink @2856002116 us, 867.30 MHz, Ok(DataRate(SF12, BW125)), snr: -17.8, rssi: -97, len: 18 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T08:47:47.924697+01:00 mtcdt helium_gateway[2034]: uplink @2962488507 us, 868.30 MHz, Ok(DataRate(SF7, BW125)), snr: 9.8, rssi: -49, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T08:50:04.130807+01:00 mtcdt helium_gateway[2034]: uplink @3098666292 us, 868.50 MHz, Ok(DataRate(SF12, BW125)), snr: 9, rssi: -73, len: 22 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T08:57:38.908680+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247841, module: dispatcher
2022-03-01T09:00:03.416329+01:00 mtcdt helium_gateway[2034]: uplink @3697967036 us, 867.90 MHz, Ok(DataRate(SF11, BW125)), snr: 9.5, rssi: -72, len: 24 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:02:47.881910+01:00 mtcdt helium_gateway[2034]: uplink @3862443476 us, 867.70 MHz, Ok(DataRate(SF7, BW125)), snr: 9, rssi: -45, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:11:19.655387+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247856, module: dispatcher
2022-03-01T09:17:47.832230+01:00 mtcdt helium_gateway[2034]: uplink @467430876 us, 867.90 MHz, Ok(DataRate(SF7, BW125)), snr: 8.8, rssi: -50, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:20:03.322618+01:00 mtcdt helium_gateway[2034]: uplink @602905612 us, 867.90 MHz, Ok(DataRate(SF11, BW125)), snr: 11, rssi: -77, len: 25 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:22:41.506853+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247869, module: dispatcher
2022-03-01T09:29:39.866859+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247873, module: dispatcher
2022-03-01T09:30:02.815599+01:00 mtcdt helium_gateway[2034]: uplink @1202413284 us, 867.30 MHz, Ok(DataRate(SF10, BW125)), snr: 11, rssi: -71, len: 24 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:32:47.786107+01:00 mtcdt helium_gateway[2034]: uplink @1367385475 us, 867.30 MHz, Ok(DataRate(SF7, BW125)), snr: 9.8, rssi: -48, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:43:43.188511+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247890, module: dispatcher
2022-03-01T09:46:01.344598+01:00 mtcdt helium_gateway[2034]: uplink @2160915708 us, 867.30 MHz, Ok(DataRate(SF12, BW125)), snr: -15.2, rssi: -95, len: 18 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:47:47.736907+01:00 mtcdt helium_gateway[2034]: uplink @2267339996 us, 867.50 MHz, Ok(DataRate(SF7, BW125)), snr: 7.5, rssi: -51, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T09:50:02.725202+01:00 mtcdt helium_gateway[2034]: uplink @2402319236 us, 867.70 MHz, Ok(DataRate(SF10, BW125)), snr: 11.2, rssi: -80, len: 22 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:00:02.679709+01:00 mtcdt helium_gateway[2034]: uplink @3002279548 us, 867.10 MHz, Ok(DataRate(SF10, BW125)), snr: 9.8, rssi: -77, len: 22 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:02:47.692034+01:00 mtcdt helium_gateway[2034]: uplink @3167294563 us, 867.10 MHz, Ok(DataRate(SF7, BW125)), snr: 10.5, rssi: -51, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:05:30.795114+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247912, module: dispatcher
2022-03-01T10:07:58.473346+01:00 mtcdt ntp_sync: Synchronizing with fr.pool.ntp.org
2022-03-01T10:08:14.358556+01:00 mtcdt ntpdate[25456]: step time server 37.187.104.44 offset 0.032862 sec
2022-03-01T10:08:14.385662+01:00 mtcdt ntp_sync: 
2022-03-01T10:17:40.893137+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247927, module: dispatcher
2022-03-01T10:17:47.684011+01:00 mtcdt helium_gateway[2034]: uplink @4067249147 us, 867.10 MHz, Ok(DataRate(SF7, BW125)), snr: 9, rssi: -50, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:20:02.618005+01:00 mtcdt helium_gateway[2034]: uplink @4202185484 us, 867.10 MHz, Ok(DataRate(SF10, BW125)), snr: 9.2, rssi: -77, len: 22 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:32:04.628920+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247942, module: dispatcher
2022-03-01T10:32:47.625130+01:00 mtcdt helium_gateway[2034]: uplink @672236419 us, 868.30 MHz, Ok(DataRate(SF7, BW125)), snr: 10.8, rssi: -53, len: 21 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:42:21.916260+01:00 mtcdt helium_gateway[2034]: uplink @1246510444 us, 867.50 MHz, Ok(DataRate(SF11, BW125)), snr: -15, rssi: -95, len: 16 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:42:27.931982+01:00 mtcdt helium_gateway[2034]: uplink @1252528636 us, 867.10 MHz, Ok(DataRate(SF11, BW125)), snr: -4, rssi: -92, len: 16 from MacAddress(00:80:00:00:A0:00:F1), module: gateway
2022-03-01T10:45:58.242251+01:00 mtcdt helium_gateway[2034]: updated routing to height 1247958, module: dispatcher
2022-03-01T10:46:36.846287+01:00 mtcdt helium_gateway[2034]: uplink @1501425084 us, 868.50 MHz, Ok(DataRate(SF12, BW125)), snr: 8.2, rssi: -36, len: 23 from MacAddress(00:80:00:00:A0:00:F1), module: gateway**
lthiery commented 2 years ago

Are you sending Confirmed Uplinks?

skiplyfrance commented 2 years ago

No

lthiery commented 2 years ago

Can you hover over the (i) on the "Uplink Dropped" rows and tell me what it says?

skiplyfrance commented 2 years ago

Late packet

lthiery commented 2 years ago

Is 2022-03-01T08:34:46 roughly the boot time or has the application been running for a long time?

This first packet seems to match roughly the 8:34 event on Console

2022-03-01T08:34:46.919967+01:00 mtcdt helium_gateway[2034]: uplink @2181447732 us, 868.30 MHz, Ok(DataRate(SF12, BW125)), snr: 10, rssi: -33, len: 19 from MacAddress(00:80:00:00:A0:00:F1), module: gateway**

However, you can see that 8:34:57 is approximately 10 seconds after the reception time. So it is quite late. You are connecting to a server in the US but that seems pretty excessive still , IMO.

The second packet matches roughly here:

2022-03-01T10:46:36.846287+01:00 mtcdt helium_gateway[2034]: uplink @1501425084 us, 868.50 MHz, Ok(DataRate(SF12, BW125)), snr: 8.2, rssi: -36, len: 23 from MacAddress(00:80:00:00:A0:00:F1), module: gateway**

This time, the seconds match almost exactly. Did you hover on this one too and confirm its a "late packet"?

No ideas yet, but just thinking out loud here...

skiplyfrance commented 2 years ago

Yes, both are late packets. And the gateway was running for more than 24 hours. Basically, it works during the first hours after a reboot, then it fails. I'm like you, I'm lost.

lthiery commented 2 years ago

It feels like something is getting backed up somewhere in a queue... but then again that second packet is right on time.

Can you confirm that the "uplink" events in the helium_gateway logs match what the packet forwarder is saying?

skiplyfrance commented 2 years ago

Yes, it matches.

skiplyfrance commented 2 years ago

Just an idea... After how much uplinks does the ADR send MAC commands?

lthiery commented 2 years ago

Just an idea... After how much uplinks does the ADR send MAC commands?

It looks like it is a minimum of 20 uplinks: https://docs.helium.com/use-the-network/console/profiles/#adr

Concerning your late packet issues, I have a few ideas:

skiplyfrance commented 2 years ago
  • we realized there was no ntp service running on the Router server, so it's possible we are inappropriately calling some packets "too late"; we will patch that as soon as we can (ntp service required router#637)

Probably a good idea. I was wondering myself if it could come from our NTP sync.

  • have you attempted to reconfigure RxDelay via the LoRaWAN profile? Perhaps try increasing it to 5 seconds and see if things are more consistent?

I will test that tomorrow in the factory, but I don't know the side effects of setting this up.

skiplyfrance commented 2 years ago

Today I set up rx Delay to 5s. I pushed an uplink, but same problem (ie this does not change the "perception" of dealy from the server perspective). I made a new join to synchronize this rx Delay between the server and the device, we will see if this change something (we have to wait 24 hours), but I don't see why it could have an impact.

skiplyfrance commented 2 years ago

2 questions (I'm exploring all possibilities):

lthiery commented 2 years ago

Does the Semtech PF need to send ping / statistics to gateway-rs or can I desactivate it? What is the recommended timeout at PF level?

The Semtech packet forwarder's have a keepalive_interval that generally is at 10 seconds and the stat_interval is every 30 seconds. These are both customizable in global_conf.json or local_conf.json if using their's.

The first "keep alive" (PULL_DATA) is what initializes the "connection" with GWMP such that the host (gateway-rs) is able to downlink after that.

To answer your question, it's not necessary to continue polling when packet forwarder and host (gateway-rs) are on the same device, but it certainly won't hurt and it should never really affect uplinks even.

Could it be a problem if the gateway is behind a firewall?

Perhaps. It might be causing something on the long-lived gRPC connection that we haven't seen before. If we can test the same gateway on a different network network, it certainly would be interesting if the problem went away.

skiplyfrance commented 2 years ago

Ok, for the moment, it works. I changed the NTP server and increase rxDelay to 5 seconds. I will wait until tomorrow and switch to LTE if it is ok.