Open hanicornelia opened 1 year ago
@hanicornelia - Would it be possible to capture the agent logs on one of the agents that does not get upgraded? Capture the /var/log/syslog and gzip it up and post it here if you can.
Closing https://github.com/open-horizon/anax/issues/3703, which sounds like the same issue.
These are all the logs/files for your reference.
One additional piece of information: we discovered that this bug happens to nodes that are unable to communicate with the exchange for more than 2 hours, but it does not occur if the outage is less than 1 hour.
Also these are the updated steps to reproduce:
Exchange hub: 1) Versions for all containers used:
2) To publish the service and pattern, untar the publish.tar.gz file and run sudo ./publish_service_pattern.sh -v 4.19.154 (you might need to change some env variables first according to your setup and the registry used for the images).
publish.tar.gz
3) Service corn_org/kitchen-controller_4.19.154_amd64 and pattern corn_org/corn_org-pattern-IOT-home-full will be published to the exchange hub.
Next, on the node side:
1) Version used for hzn cli and agent container is 2.30.0-1160
2) Run register-node.sh and pass the node name and exchange IP, e.g. sudo ./register-node.sh -n node-1 -i <exchange-ip>
3) The node will be registered to the exchange and deploy the service in the pattern successfully.
4) Next, run these commands to cut the network connection to the exchange from inside the agent container:
docker exec -it horizon1 bash
iptables -A OUTPUT -p tcp -m state --state NEW,RELATED,ESTABLISHED -m tcp -d <exchange-ip> -j DROP && iptables-save
5) You will see a node heartbeat failed error in the eventlog.
6) Now publish a new version to the exchange by running sudo ./publish_service_pattern.sh -v 4.19.158
7) Wait for more than 2 hours, then restore the node's connection to the exchange:
docker exec -it horizon1 bash
iptables -D OUTPUT -p tcp -m state --state NEW,RELATED,ESTABLISHED -m tcp -d <exchange-ip> -j DROP && iptables-save
8) You will see the node heartbeat restored, but no new agreements are formed.
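The block and restore commands in steps 4 and 7 can be wrapped in a small helper so the rule that is deleted is guaranteed to match the one that was added. This is only a sketch under assumptions: the function names, the AGENT_CONTAINER variable, and the DRY_RUN switch are hypothetical, and the container name horizon1 is taken from the steps above. The iptables-save call from the original steps is left out because it only prints the current rule set.

```shell
# Sketch: toggle the agent container's connectivity to the exchange.
# Hypothetical helper names; DRY_RUN=1 prints the commands instead of running them.
set -eu

AGENT_CONTAINER="${AGENT_CONTAINER:-horizon1}"

# Build the iptables command. $1 is -A (add DROP rule) or -D (delete it),
# $2 is the exchange IP.
exchange_rule() {
  echo "iptables $1 OUTPUT -p tcp -m state --state NEW,RELATED,ESTABLISHED -m tcp -d $2 -j DROP"
}

# Run the given command inside the agent container, or just print it.
run_in_agent() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker exec ${AGENT_CONTAINER} $*"
  else
    docker exec "${AGENT_CONTAINER}" "$@"
  fi
}

# The command substitution is intentionally unquoted so that the rule
# splits into separate arguments for docker exec.
block_exchange() {
  run_in_agent $(exchange_rule -A "$1")
}

restore_exchange() {
  run_in_agent $(exchange_rule -D "$1")
}
```

After sourcing the file, block_exchange <exchange-ip> corresponds to step 4 and restore_exchange <exchange-ip> to step 7. Setting DRY_RUN=1 first shows the exact docker exec commands without touching the container.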
With that, below are the logs for your reference. A notable timestamp is I0518 10:23:40.144764 (when the heartbeat is restored, the node does get the message for the new version).
I built the master branch (commit #3807) and repeated the bug reproduction steps. However, the bug was not fixed by this pull request.
This is the screenshot.
Attached are the agent logs: horizon1.log
As before, we can see that the node knows there is a new version of the service, but somehow it did not form an agreement for it, so the service is not updated after the heartbeat is restored.
Describe the bug.
We have multiple nodes registered to the pattern IOT-home-full. We currently have version 4.19.154 of the service kitchen-controller in the pattern IOT-home-full. We published a new version, 4.19.158, of this service and updated the pattern to use it. We expected all nodes registered to this pattern to receive the new version. However, only some nodes received a proposal message and agreed to the new version, while the other nodes did not receive anything.

lastUpdated is the same as the time we published the pattern (11 May, 4.43 UTC). We used this command to check: hzn exchange pattern ls <pattern>

For example, this is the eventlog for the good node IOT-NODE-1A, which successfully received the proposal. From the timestamp we can see that the proposal arrived around 1 minute after we published the pattern (11 May, 12.46 CST).

But on the node that did not receive the proposal, IOT-NODE-B1, the eventlog does not show anything on 11 May. The last entry says the node heartbeat was restored (10 May, 11.55 CST), so the node should be able to communicate with the exchange.

First, in the agent container of the node IOT-NODE-B1, the agent actually sends an HTTP request to the exchange and receives a response saying that there are no new changes from the hub (11 May, 9.41 UTC).

Then, in the exchange hub, we checked the node IOT-NODE-B1 and saw that the last heartbeat timestamp is also recent (11 May, 9.45 UTC).

We also checked from inside the exchange-api container and saw the HTTP request for the node IOT-NODE-B1 (11 May, 7.23 UTC).

So we are confident that the node is able to communicate with the exchange, though we are not sure why the node did not receive any new proposals from the exchange.
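The lastUpdated check described above can be scripted so it is easy to repeat after each publish. A minimal sketch, assuming the command prints JSON containing a "lastUpdated" string field; the last_updated function name is hypothetical, and the filter works on text only, so it does not depend on the rest of the output structure.

```shell
# Sketch: print every lastUpdated value found in JSON read from stdin.
# Assumes the field appears as "lastUpdated": "<value>" in the output.
last_updated() {
  grep -o '"lastUpdated"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/^"lastUpdated"[[:space:]]*:[[:space:]]*"//; s/"$//'
}
```

Usage would look like hzn exchange pattern ls <pattern> | last_updated, with the printed value compared against the time the pattern was published.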
Our current workaround is to unregister the node and register it again. The node is then able to get all the new updates, but we do not want to do this for all of our nodes every time there is a new update.
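For reference, the workaround for a single node can be sketched as below. This assumes it is run on the node itself, that register-node.sh is the same script used in the reproduction steps (with its -n and -i flags), and that hzn unregister -f is acceptable because it skips the confirmation prompt. The run and reregister_node helpers and the DRY_RUN switch are hypothetical.

```shell
# Sketch of the unregister / re-register workaround for one node.
# DRY_RUN=1 only prints the commands instead of running them.
set -eu

# Run a command, or print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: node name, $2: exchange IP (both passed through to register-node.sh).
reregister_node() {
  node_name="$1"
  exchange_ip="$2"
  run hzn unregister -f
  run sudo ./register-node.sh -n "$node_name" -i "$exchange_ip"
}
```

Calling reregister_node node-1 <exchange-ip> with DRY_RUN=1 set prints the two commands, which makes it easy to review them before running against a real node.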
Describe the steps to reproduce the behavior.
The nodes are online most of the time, but they do lose connectivity to the exchange for 1 hour every day.
Our steps to reproduce are:
Expected behavior.
All nodes registered to the same pattern should receive the new updates after the internet connection is restored.
Screenshots.
No response
Operating Environment
Node details:
Exchange details:
Additional Information
No response