orcfax / Incidents

A repository to triage and report issues in Orcfax network operations

INCIDENT 035 | Failure of Plutus Chain Indexer #38

Open Christian-MK opened 3 months ago

Christian-MK commented 3 months ago

Trigger

Date

2024-05-15

Summary

At 16:01 UTC on 15 May 2024, plutus-chain-index stopped syncing with mainnet, which prevented publication of the Orcfax ADA-USD feed. Despite extensive investigation, the Orcfax team was unable to revive the software or the mainnet feed.

Status

Resolved

Assessment

Plutus-chain-index is part of the plutus-apps repository hosted by Intersect: https://github.com/IntersectMBO/plutus-apps. The component was deeply embedded in the v0 Orcfax architecture as part of COOP. Shortly after COOP's release, IOG abandoned the chain-index component. The Orcfax team was aware of this and identified the component as one of the system's biggest risks at the mainnet release of v0. In response, the team began work on a v1 architecture that removes the component, and was within weeks of transitioning to v1 when the component failed.

Orcfax attempted to triage the component for the first 48 hours immediately after the failure but was unable to revive it. During that time a decision was made to contact known integrators to ask that they move to their fall-back oracle services until more progress could be made by Orcfax.

The team continued to work on the component for the following week in an effort to better understand the failure, but reached no conclusive result. While the chain-index is still syncing from COOP's genesis point in September 2024 at the time of writing, a decision was made to drop support for Orcfax's v0 oracle and continue with efforts to bring a v1 oracle to the Cardano chain.

Additional Notes

Specific errors

After starting plutus-chain-index with the --verbose parameter, it executes a few queries, then it hangs at this query:

[chain-index:Debug:40] [2024-05-25 07:56:33.62 UTC] {"contents":"UPDATE \"unspent_outputs\" SET \"output_row_tip__row_slot\"=? WHERE (\"output_row_tip__row_slot\")<(?);\n-- With values: [SQLInteger 124179884,SQLInteger 124179884]","tag":"BeamLogItem"}

Manually running the following query returns an error:

sqlite> UPDATE unspent_outputs SET output_row_tip__row_slot = 124179884 WHERE output_row_tip__row_slot < 124179884;
Error: stepping, UNIQUE constraint failed: unspent_outputs.output_row_tip__row_slot, unspent_outputs.output_row_out_ref (19)
sqlite> 

It appears that the plutus-chain-index component is hanging because it gets caught in a loop whereby it retries the same query over and over again. The Orcfax team does not have the skills internally to perform more detailed analysis of plutus-chain-index.
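One way to read the error above is that two rows sharing an out_ref sit at different slots below the target, so moving both to the same slot violates the UNIQUE pair named in the message. The following is a minimal sketch of that reading, not a confirmed root cause. The schema is a hypothetical simplification that models only the two columns in the error text, and the helper names `find_conflicts` and `try_update` and the sample rows are illustrative assumptions.

```python
import sqlite3

# Hypothetical, simplified schema: the real plutus-chain-index table has more
# columns. Only the UNIQUE pair named in the error message is modeled here.
SCHEMA = """
CREATE TABLE unspent_outputs (
    output_row_tip__row_slot INTEGER NOT NULL,
    output_row_out_ref       BLOB    NOT NULL,
    UNIQUE (output_row_tip__row_slot, output_row_out_ref)
);
"""

TARGET_SLOT = 124179884  # slot value taken from the log line in the report


def find_conflicts(conn, target_slot):
    """List out_refs that appear more than once at or below target_slot.

    Each such out_ref would collide once its rows share the same slot.
    """
    return conn.execute(
        """
        SELECT output_row_out_ref, COUNT(*)
        FROM unspent_outputs
        WHERE output_row_tip__row_slot <= ?
        GROUP BY output_row_out_ref
        HAVING COUNT(*) > 1
        ORDER BY output_row_out_ref
        """,
        (target_slot,),
    ).fetchall()


def try_update(conn, target_slot):
    """Run the UPDATE from the log; return the error text, or None on success."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE unspent_outputs SET output_row_tip__row_slot = ? "
                "WHERE output_row_tip__row_slot < ?",
                (target_slot, target_slot),
            )
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO unspent_outputs VALUES (?, ?)",
    [
        (124179880, b"ref-a"),
        (124179882, b"ref-a"),  # same out_ref at a second, earlier slot
        (124179883, b"ref-b"),
    ],
)
conn.commit()  # keep the sample rows even if the UPDATE is rolled back

conflicts = find_conflicts(conn, TARGET_SLOT)
error = try_update(conn, TARGET_SLOT)
print("conflicting out_refs:", conflicts)
print("update error:", error)
```

Because the statement fails identically on every attempt, a caller that retries it without changing the data never makes progress, which would be consistent with the hang described above. Running a query like the one in `find_conflicts` against a copy of the real database would show whether duplicate out_refs are actually present.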

Specific issues

We are unaware of any Cardano ecosystem events which may have precipitated the failure of this component.

Rebuilding the database was not an option, as the initial build had required more than two weeks to complete from 55% sync.

Orcfax Mistakes

Impact mitigation

The impact of this event has been mitigated by key factors:

  1. Since late 2023, prospective integrators had been informed of the forthcoming v1 architecture and encouraged to wait for its release, as a necessary optimization of the datum schema would result in breaking changes for smart contracts.
  2. Monitoring of Orcfax UTxO usage was deployed prior to this issue, which allowed the team to assess the impact of this outage on the community; those who were using the Orcfax feed were notified promptly and back-up solutions were activated.

Technical improvements

The Orcfax v0 solution which utilized this component has been retired. Work on the v1 architecture continues and the team continues to engage in dialogue with integrators as to when they will begin their integrations of the new datum.

Improvements in progress or under investigation:

  1. In the v1 solution we are using off-the-shelf components with proven support in the Cardano community, such as Kupo and Ogmios.
  2. Backup procedures will be investigated to ensure that any index of significant size has a secondary copy available to us.
  3. Recovery procedures will be investigated. Time to launch will also be reduced by a change of approach: the Oracle dApp will mostly look at the tip of the Cardano chain rather than its entire history, as the COOP solution did.
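As an illustration of the backup item above, here is a minimal sketch of taking a secondary copy of an index, assuming the index is stored in a single SQLite file as the plutus-chain-index database was. It uses the online backup support in Python's standard `sqlite3` module. The function name `backup_index`, the file names, and the demo table are hypothetical and are not part of any Orcfax procedure.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_index(src_path, dest_path, pages_per_step=1024):
    """Copy a SQLite database to dest_path and verify the copy.

    Returns True when PRAGMA integrity_check reports "ok" for the copy.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copying in steps lets SQLite release its read lock on the source
        # between steps, so the source can stay in use during the backup.
        src.backup(dest, pages=pages_per_step)
        result = dest.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dest.close()
        src.close()
    return result == "ok"


# Demo against a throwaway database (file and table names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    src_file = str(Path(tmp) / "index.sqlite")
    dest_file = str(Path(tmp) / "index.backup.sqlite")

    seed = sqlite3.connect(src_file)
    seed.execute("CREATE TABLE demo (slot INTEGER)")
    seed.execute("INSERT INTO demo VALUES (124179884)")
    seed.commit()
    seed.close()

    backup_ok = backup_index(src_file, dest_file)

    check = sqlite3.connect(dest_file)
    copied_rows = check.execute("SELECT slot FROM demo").fetchall()
    check.close()

print("backup verified:", backup_ok)
```

A verified copy taken on a schedule would bound the recovery time to a restore plus a catch-up sync from the backup's slot, rather than a full rebuild.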

Documentation improvements

  1. Historical policy data will be maintained by Orcfax, and instructions will be provided on how to access historical archival packages on Arweave.
  2. With the deprecation of the v0 protocol, this repository will be closed and the lessons learned will be gathered and fed into the v1 project. A new incidents repository will be opened with more transparent access to issues via the Orcfax Explorer and documentation pages.