icgc-argo / workflow-relay

Collecting information from workflows to report
GNU Affero General Public License v3.0

🐛 Raccoon service does not clean up stuck RDPC runs #145

Closed UmmulkiramR closed 1 year ago

UmmulkiramR commented 1 year ago

When collab crashes, in-progress jobs get stuck in the Running state while their respective pods are in an error state. Running the raccoon service should correct the status of these jobs, but it does not. Investigation traced this to a problem with the serialization of the OffsetDateTime field in the workflow metadata.

UmmulkiramR commented 1 year ago

The raccoon service now cleans up runs after the fix. However, there is another bug in the service that replaces all the parameter values with null. Keeping this ticket open until that's fixed.

Buwujiu commented 1 year ago

Not reproducible now. The original issue happened when the cluster crashed. Moving it to the icebox.

UmmulkiramR commented 1 year ago

Analysis of the issue after the collab crash showed that the relay overwrites the parameter map and the engine params with the values it receives from raccoon. Since raccoon doesn't send any of these, they all end up null when the ES document is updated. With the fix, the relay checks for the presence of these values in the incoming event before updating them in ES.
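
For illustration only, the presence check described above could look roughly like the sketch below. It is not the relay's actual code: the field names `parameters` and `engineParameters`, the function `merge_event_into_document`, and the sample run values are all hypothetical, and the real ES document schema is not shown in this thread.

```python
from typing import Any, Dict

# Hypothetical field names (assumption): the fields that must not be
# overwritten when the incoming event carries no value for them.
GUARDED_FIELDS = ("parameters", "engineParameters")


def merge_event_into_document(
    existing: Dict[str, Any], event: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of `existing` updated from `event`.

    Guarded fields are only replaced when the event actually provides
    a value; a missing key or an explicit None leaves the stored value alone.
    """
    merged = dict(existing)
    for key, value in event.items():
        if key in GUARDED_FIELDS and value is None:
            # The event has nothing for this field: keep what is stored.
            continue
        merged[key] = value
    return merged


# Illustrative data (assumption): a stored run and a status-only event
# of the kind a cleanup service might send.
stored = {
    "runId": "wes-123",
    "state": "RUNNING",
    "parameters": {"study_id": "TEST-CA"},
    "engineParameters": {"revision": "1.0"},
}
event = {"runId": "wes-123", "state": "EXECUTOR_ERROR", "parameters": None}

updated = merge_event_into_document(stored, event)
```

With this guard, a status-only event changes `state` but leaves the stored parameter map and engine params untouched, which matches the behaviour the fix is meant to restore.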

UmmulkiramR commented 1 year ago

Cannot be tested on dev or QA (since we don't have any runs stuck there), so queuing it up for release.

Buwujiu commented 1 year ago

Will plan for a prod release today.

UmmulkiramR commented 1 year ago

Released to prod and tested.