Closed UmmulkiramR closed 1 year ago
The raccoon service now cleans up runs after the fix. However, there is another bug in the service that replaces all the parameter values with null. Keeping this ticket open until that's fixed.
Not reproducible now, the original issue happens when cluster crashed, move it to icebox.
Analyzing the issue after the collab crash it was found that the relay overwrites the parameter map and the engine params with the values it receives from the raccoon. Since raccoon isn't sending any of these they all end up being null when the ES document is updated. Now, with the fix, the relay will check for the presence of these values in the incoming event before updating them in the ES.
Cannot be tested on dev or QA (since we don't have any runs stuck there) so queuing it up for release.
Will Plan for a prod release today.
released to prod and tested.
When collab crashes, the jobs in progress get stuck in the Running state while their respective pods are in error. Running the racoon service should fix the status of these jobs. However, that does not happen. After investigation it was found to be a problem with the serialization of the OffsetDateTime field in the workflow metadata.