Issue - Worker misconfiguration

AlbertoSoutullo commented 1 month ago

I first noticed this during the nimlibp2p regression testing in November (Report), and it was also mentioned in the Discord thread.

Sometimes the timestamps reported by the script were weird.

The nodes logging these timestamps were the ones between W17 and W24.

Today, we were debugging another issue regarding the nWaku store protocol: Issue notes

After some painful debugging, we found why some messages were not being saved in the store despite being received by the waku node: they were being discarded because they were too old. In practice, if 100 messages were injected, maybe 85 would be saved in one store and the remaining 15 in another, depending on where the pod was created.
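To illustrate how clock skew alone can cause this, here is a minimal Python sketch of an age check on message timestamps. The 20-second tolerance, the function name `is_accepted`, and the 60-second skew are assumptions chosen for illustration, not values taken from nWaku or measured on the workers.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance; the real nWaku threshold may differ.
MAX_AGE = timedelta(seconds=20)


def is_accepted(sender_ts, receiver_now, max_age=MAX_AGE):
    """Return True if the message is not older than max_age by the receiver's clock."""
    return receiver_now - sender_ts <= max_age


# The sender stamps the message with a correct clock.
sender_ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Store pod on a synchronized worker: only 1 s of network delay.
synced_store_now = sender_ts + timedelta(seconds=1)

# Store pod on a worker whose clock runs 60 s fast: same delay plus the skew.
skewed_store_now = sender_ts + timedelta(seconds=61)

accepted_on_synced = is_accepted(sender_ts, synced_store_now)
accepted_on_skewed = is_accepted(sender_ts, skewed_store_now)
```

Under these assumptions the same message is stored by the first node and discarded as too old by the second, which would match the split we saw between stores.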

All of this led back, again, to the problem of these workers not being properly synchronized.

This should be fixed as it is a critical problem in the simulations and the results, and could also be affecting other functionalities in the lab.

Zorlin commented 1 month ago

Tagging @michatinkers

We believe we've fixed this by force-syncing all the nodes, thus fixing their times.

Please come up with a more long-term solution: essentially, make sure Chrony (chronyd) is configured on all nodes and is ACTUALLY syncing, and that we monitor this to make sure it remains true.
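As a starting point for the monitoring part, here is a rough Python sketch that parses the output of `chronyc tracking` and decides whether a node looks synchronized. It relies on the `Leap status` and `System time` fields that `chronyc tracking` prints. The helper names and the 0.1-second offset limit are assumptions for illustration, not something already deployed in the lab.

```python
import subprocess


def parse_tracking(text):
    """Turn the 'Key : value' lines printed by `chronyc tracking` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_synced(text, max_offset_s=0.1):
    """Return True if chrony reports a sync source and a small clock offset.

    max_offset_s is a hypothetical limit; tune it to what the simulations tolerate.
    """
    fields = parse_tracking(text)
    # chrony prints "Not synchronised" here when it has no usable source.
    if fields.get("Leap status") in (None, "Not synchronised"):
        return False
    # "System time" looks like "0.000006523 seconds slow of NTP time".
    parts = fields.get("System time", "").split()
    try:
        offset = abs(float(parts[0]))
    except (IndexError, ValueError):
        return False
    return offset <= max_offset_s


def check_local_node():
    """Run `chronyc tracking` on this node and evaluate its output."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return is_synced(result.stdout)
```

Something like `check_local_node()` could run periodically on each worker and raise an alert when it returns False. How the result gets exported (metric, log line, alert) is left open here.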

michatinkers commented 1 month ago

Chrony has now been installed and set up on nodes W17 - W24.

vacp2p / vaclab

Issue - Worker misconfiguration #44