orcfax / Incidents

A repository to triage and report issues in Orcfax network operations

INCIDENT 029 | Heartbeat publication not completed by COOP #32

Open Christian-MK opened 6 months ago

Christian-MK commented 6 months ago

Trigger

Date

2024-04-14

Summary

COOP did not complete publication of the 23:00 (UTC) heartbeat on 14 April. The issue self-corrected at the next interval, i.e. 00:00 (UTC) on 15 April.

Status

Under Review

Assessment

It is still unclear why the heartbeat was missed. The Orcfax team continues to work towards converting the COOP component coop-sock to systemd. Until this conversion is complete, logging remains incomplete.

Additional Notes

Recent similar failures of COOP to complete on-chain publication have been caused by the sync state of the Plutus Chain Index, which is to be replaced in COOP v2 as part of important reliability changes.

Persistent logging of COOP issues will be added as part of the completed COOP work, as the project seeks to address a number of concerns in a holistic upgrade of the Orcfax network.

Technical improvements

We are investigating:

  1. Completing the transition from coop-sock to systemd (currently active in preprod).
  2. Implementing improved logging to better understand these issues.
  3. Arranging coverage by colleagues monitoring the network during weekend periods, so that the datum can be published manually when the issue arises.
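
As an illustration of the monitoring mentioned above, the following is a minimal sketch of how a missed heartbeat slot could be detected from a list of publication timestamps. It is not part of COOP or the Orcfax tooling. The hourly interval is inferred from the 23:00 and 00:00 slots in this report, and the function name `missed_heartbeats`, the 10 minute tolerance, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: heartbeats are expected hourly, inferred from the
# 23:00 -> 00:00 (UTC) slots described in this incident.
HEARTBEAT_INTERVAL = timedelta(hours=1)
# Assumption: hypothetical tolerance for a publication to count for a slot.
TOLERANCE = timedelta(minutes=10)


def missed_heartbeats(published, start, end,
                      interval=HEARTBEAT_INTERVAL, tolerance=TOLERANCE):
    """Return expected slots in [start, end] with no publication within tolerance."""
    missed = []
    slot = start
    while slot <= end:
        # A slot is covered if any publication landed close enough to it.
        if not any(abs(ts - slot) <= tolerance for ts in published):
            missed.append(slot)
        slot += interval
    return missed


# Hypothetical publication times modelled on this incident:
# 22:00 published, 23:00 absent, 00:00 published.
published = [
    datetime(2024, 4, 14, 22, 0, tzinfo=timezone.utc),
    datetime(2024, 4, 15, 0, 0, tzinfo=timezone.utc),
]
window_start = datetime(2024, 4, 14, 22, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 4, 15, 0, 0, tzinfo=timezone.utc)

gaps = missed_heartbeats(published, window_start, window_end)
for gap in gaps:
    print(f"Missed heartbeat slot: {gap.isoformat()}")
```

A check of this kind could run on a schedule and notify whoever is covering the weekend, so that manual publication can begin soon after a slot is missed.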

Documentation improvements

  1. The issue will be added to the devops documentation to assist future team members with triaging similar incidents.