iQMedia / tracker

Repo to hold all customer facing platform and data issues.

Some ADDB Recurring stopped running: Innovid and LiveRamp #21

Closed sergi0aranda closed 3 months ago

sergi0aranda commented 4 months ago

I have noticed that some recurring LiveRamp and Innovid jobs have stopped running. On further review, these jobs show the following behavior:

1) Did not change status to "FAILED"
2) Stayed in "QUEUED" status, as opposed to "READY_FOR_QUEUE" in order to continue running daily.

I have already changed the few LiveRamp jobs that presented this behavior back to "READY_FOR_QUEUE", but I left the Innovid ones in "QUEUED" so we can analyze what happened and see if we can find the issue. These jobs are:

3503 | Farmers-US_Daily_CompIntel
3514 | Nutrisystem-US_Daily_CompIntel
3556 | Expedia-US_Daily_CompIntel
3557 | Hotels-US_Daily_CompIntel
3558 | Booking-US_Daily_CompIntel
3559 | Kayak-US_Daily_CompIntel
3560 | Agoda-US_Daily_CompIntel
3561 | Trivago-US_Daily_CompIntel
3562 | AirBNB-US_Daily_CompIntel
3588 | Jenny_Craig-US_Daily_CompIntel
3590 | Allstate-US_Daily_CompIntel
3691 | Bud Light_Daily_AB Brands_CompIntel
3692 | Michelob Ultra-US_Daily_AB Brands_CompIntel
3693 | Budweiser-US_Daily_AB Brands_CompIntel

Please advise, thanks!
sergi0aranda commented 3 months ago

I had to update most jobs to "READY_FOR_QUEUE" so that they start running daily again, but I left this one as is so you can continue to analyze it: 3560 Agoda-US_Daily_CompIntel

jmylet commented 3 months ago

This problem was caused by one or more intermittent and unpredictable issues in the file dispatch service, which serializes a job definition as XML and pushes it to RabbitMQ for the file delivery service to process. Occasionally the job XML pushed to RabbitMQ was malformed. When the file delivery service failed to de-serialize the malformed XML, it had no way of knowing which job the message belonged to, because the only job identifier was contained within that XML. As a result, the job status could not be set to "FAILED", nor could it be set back to the "READY_FOR_QUEUE" status required to run the next day, so the job became "stuck" and stopped running daily.
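To illustrate the failure mode described above, here is a minimal sketch in Python. It is not the actual service code: the XML layout (`<job><id>...</id></job>`) and the function name `extract_job_id` are assumptions made for illustration. It shows that when the only copy of the job ID lives inside the XML body, a malformed body leaves the consumer with no ID to update.

```python
import xml.etree.ElementTree as ET


def extract_job_id(body):
    """Return the job ID found inside the XML body, or None if the body cannot be parsed.

    Hypothetical helper: the <job><id> layout is assumed, not taken from the real job XML.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # The body is malformed, so the ID embedded in it is unreachable.
        return None
    return root.findtext("id")


well_formed = "<job><id>3560</id><name>Agoda-US_Daily_CompIntel</name></job>"
malformed = "<job><id>3560</id><name>Agoda-US_Daily_CompIntel</na"  # truncated message

print(extract_job_id(well_formed))  # 3560
print(extract_job_id(malformed))    # None
```

With no ID available in the second case, there is nothing to mark as "FAILED", which matches the "stuck in QUEUED" behavior reported in this ticket.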

We deployed updates that attach the job ID directly as a header on the job XML message pushed to RabbitMQ, so that the file delivery service always has access to the job ID even if the message XML is malformed. When a job ID is provided in a RabbitMQ message header, the file delivery service now marks the job as "FAILED" when it fails to de-serialize the job XML from RMQ. Because these jobs are now correctly marked as "FAILED" when their message fails de-serialization, the file dispatch service re-queues them for processing if they failed less than 24 hours ago.
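The fix described above can be sketched roughly as follows. This is a simplified, in-memory Python model and not the deployed code: the header name `job-id`, the `jobs` dictionary, the function names, and the timestamps are all hypothetical, and a plain dict stands in for the RabbitMQ message headers. It also assumes that a successfully processed job returns to "READY_FOR_QUEUE", based on the statuses mentioned in this thread.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

JOB_ID_HEADER = "job-id"              # hypothetical header name
REQUEUE_WINDOW = timedelta(hours=24)  # re-queue only failures newer than this


def handle_message(headers, body, jobs, now):
    """Delivery side: mark the job FAILED if its XML body cannot be parsed."""
    job_id = headers.get(JOB_ID_HEADER)  # available even when the body is malformed
    try:
        ET.fromstring(body)
    except ET.ParseError:
        if job_id in jobs:
            jobs[job_id].update(status="FAILED", failed_at=now)
        return False
    # Real delivery work would happen here; afterwards the job is ready for its next run
    # (assumed behavior).
    if job_id in jobs:
        jobs[job_id].update(status="READY_FOR_QUEUE", failed_at=None)
    return True


def requeue_recent_failures(jobs, now):
    """Dispatch side: re-queue jobs that failed less than 24 hours ago."""
    requeued = []
    for job_id, job in jobs.items():
        if job["status"] == "FAILED" and now - job["failed_at"] < REQUEUE_WINDOW:
            job["status"] = "QUEUED"
            requeued.append(job_id)
    return requeued


now = datetime(2024, 1, 1, 6, 0)  # arbitrary example timestamp
jobs = {3560: {"status": "QUEUED", "failed_at": None}}

handle_message({JOB_ID_HEADER: 3560}, "<job><id>3560</id></jo", jobs, now)
print(jobs[3560]["status"])                                        # FAILED
print(requeue_recent_failures(jobs, now + timedelta(minutes=5)))   # [3560]
```

The point of the design is that the job ID travels outside the payload, so a parse failure can still be attributed to a specific job and recovered on the next dispatch pass.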

Since deploying these fixes, we have observed a few recurring ADDB jobs fail due to malformed job XML, have their status correctly set to "FAILED", get re-queued almost immediately, and then be successfully processed and delivered within 5-10 minutes of the initial failure.

sergi0aranda commented 3 months ago

Excellent, Jake. I have reset job 3560 to "READY_FOR_QUEUE". Closing the ticket.