Bug: Not all nodes registered to the same pattern get new updates from exchange

hanicornelia commented 1 year ago

Describe the bug.

We have multiple nodes registered to this pattern IOT-home-full. We currently have version 4.19.154 for service kitchen-controller in pattern IOT-home-full, and we published a new version 4.19.158 for this service, and update the pattern to use this new version. We expected for all nodes that are registered to this pattern will receive the new version, however, only some nodes received proposal message and agree with the new version, while the other nodes do not receive anything.

We verified that the version of the service in the pattern is the one we just recently published, and the timestamp of the lastUpdated is the same as the time we publish the pattern. (11 May, 4.43 UTC)

We used this command hzn exchange pattern ls <pattern> to check 2023-05-11_17-28

Next, we checked the eventlog of the nodes, where we found out some nodes are not receiving any agreement proposal.

For example, this is the eventlog for the good node IOT-NODE-1A that is successfully receiving the proposal, where we can see from the timestamp, the proposal arrives around 1 minute after we publish the pattern (11 May, 12.46 CST) 2023-05-11_17-32

But in the node that is not receiving the proposal IOT-NODE-B1, the eventlog do not show anything on 11 May. The last log is saying the node heartbeat is restored (10 May, 11.55 CST), so it should be able to communicate with the exchange. 2023-05-11_17-40

We also checked the followings to make sure the node is actually able to communicate with the exchange hub.

First in the agent container of the node IOT-NODE-B1 , the agent actually send http request to the exchange, and receives a response saying that there are no new changes from the hub. (11 May, 9.41 UTC)

2023-05-11_17-45

Then in the exchange hub, we check this node IOT-NODE-B1 and see the last heartbeat timestamp is also recent (11 May, 9.45 UTC) 2023-05-11_17-50

We also checked from inside exchange-api container, and see the http request is created for the node IOT-NODE-B1 (11 May, 7.23 UTC) 2023-05-11_17-55

So we are confident that the node is able to communicate with the exchange, though we are not sure why the node did not receive any new proposals from the exchange.
What we do to fix now is to unregister the node, and register it again. Now it able to get all the new updates, but we do not want to do this for all the nodes we have everytime we have a new updates.

Describe the steps to reproduce the behavior.

The nodes are online most of the time but they do lose connectivity to the exchange for 1 hour everyday.

Our steps to reproduce are:

Create a new exchange hub and register 10 nodes to the exchange on a single pattern
Periodically cut the network from each node at a random time of the day for a duration of 45 minutes, so that we get "node heartbeat failed" for the node
While the node is unable to communicate with the exchange, we publish a new version for one of the service for the pattern and publish the updated pattern to the exchange.
The nodes with good network gets the new proposal and deploys the service with the updated version
The node with interrupted network do not get anything, as expected
When the internet connection is restored, sometimes the node gets the new updates as expected, but sometimes they do not receive the updates, hence the bug is actually inconsistent.

Expected behavior.

All nodes registered to the same pattern should be receiving the new updates after internet connection is restored.

Screenshots.

No response

Operating Environment

Node details:

OS: Ubuntu 20
hzn cli and agent version (container): 2.30.0-1160

Exchange details:

OS: ubuntu 22
exchange version: 2.110.1
hzn cli version: 2.30.0-1177

Additional Information

No response

dlarson04 commented 1 year ago

@hanicornelia - Would it be possible to capture the agent logs on one of the agents that does not get upgraded? Capture the /var/log/syslog and gzip it up and post it here if you can.

dlarson04 commented 1 year ago

Closing https://github.com/open-horizon/anax/issues/3703 which sounds like same issue.

hanicornelia commented 1 year ago

These are all the logs/files for your reference.

An additional information is that we discovered that this bug happened to nodes that are unable to communicate for more than 2 hours to the exchange, but did not occur if its less than 1 hour.

Also these are the updated steps to reproduce:

Exchange hub: 1) Versions for all containers used:

agbot: 2.30.0-1177
exchange-api: 2.110.1-1003
css-api: 1.9.10-1177
sdo-owner-services: 1.11.16-913
vault: 1.1.1-641
postgres: 13
mongo: 5.0

2) To publish the service and pattern, untar the publish.tar.gz file and run sudo ./publish_service_pattern.sh -v 4.19.154 (you might need to change some env variables first according to your setup and the registry used for the images). publish.tar.gz

3) Service corn_org/kitchen-controller_4.19.154_amd64 and pattern corn_org/corn_org-pattern-IOT-home-full will be published to the exchange hub.

Next, on the node side: 1) Version used for hzn cli and agent container is 2.30.0-1160

2) Run register-node.sh and pass the node name and exchange ip, ie sudo ./register-node.sh -n node-1 -i <exchange-ip>

3) The node will be registered to the exchange and deploy the service in the pattern successfully.

4) Next, run these to cut the network connection to the exchange inside agent container docker exec -it horizon1 bash iptables -A OUTPUT -p tcp -m state --state NEW,RELATED,ESTABLISHED -m tcp -d <exchange-ip> -j DROP && iptables-save

5) You will see node heartbeat failed error in the eventlog.

6) Now run the publishing again with new version sudo ./publish_service_pattern.sh -v 4.19.158 to the exchange.

7) Wait for > 2 hours, then restore the connection to the node docker exec -it horizon1 bash iptables -D OUTPUT -p tcp -m state --state NEW,RELATED,ESTABLISHED -m tcp -d <exchange-ip> -j DROP && iptables-save

8) You will see node heartbeat restored, but no new agreements coming.

With that, below are the logs for your reference, notable timestamp are: I0518 10:23:40.144764 (when heartbeat is restored, it does gets the message for the new version) 2023-05-19_10-21

horizon1.log

hanicornelia commented 1 year ago

I built the master branch (commit #3807) and did the bug reproduction steps. However, bug was not solved with this pull request.

This is the screenshot. 2023-06-16_10-59

Attached is the agent logs. horizon1.log

Similar to before, we can see the node knows there are new version of the service, but somehow it didnt form an agreement with it, so the service is not updated after the heartbeat is restored.

open-horizon / anax