tapis-project / tapis-jobs

Texas Advanced Computing Center APIs
BSD 3-Clause "New" or "Revised" License

Jobs: Error JOBS_STATE_NO_TRANSITION #83

Open scblack321 opened 2 months ago

scblack321 commented 2 months ago

Jobs fail reporting error JOBS_STATE_NO_TRANSITION. Jobs impacted:

  - c50fa371-34fb-467d-bd07-d07fe2c774f8-007
  - bbd8ea6f-b7e1-4e72-aed5-3c7b8492c1d1-007
  - b22c81ab-6a66-4584-94c0-1812bb359ec1-007
  - 0dec9b46-451d-4884-9ce9-a3d7b0887dd2-007

scblack321 commented 2 months ago

Reported by Joshua Urrutia on 4/3/2024 in the Slack channel Tacc Team - tapis-v3-transition. Jobs log file saved at tapisprod@cic02:/home/tapisprod/jobs_no_transition_error/job_worker_from_pod1.log

Analysis so far (line numbers refer to the file job_worker_from_pod1.log).

Example timeline including line numbers and some timestamps:

Worker thread with id AGX6 is monitoring job b22c81ab-6a66-4584-94c0-1812bb359ec1-007

9006 2024-04-01 17:01:04.342 DEBUG [wkr-DefaultQueue-44/AGX6] e.u.t.t.jobs.monitors.SlurmMonitor

Some recovery process begins, possibly related to rabbitmq

27210 2024-04-03 09:23:15.876 DEBUG [AMQP Connection 10.98.141.219:5672/] c.r.client.impl.ConsumerWorkService:40 - Creating executor service with 8 thread(s) for consumer work service
27211 2024-04-03 09:23:15.886 DEBUG [AMQP Connection 10.98.141.219:5672/] c.r.c.i.r.AutorecoveringConnection:593 - Connection amqp://jobs@10.98.141.219:5672/JobsHost has recovered
27212 2024-04-03 09:23:15.892 DEBUG [AMQP Connection 10.98.141.219:5672/] c.r.c.i.r.AutorecoveringConnection:653 - Channel AMQChannel(amqp://jobs@10.98.141.219:5672/JobsHost,1) has recovered

New worker thread with id pyoj is now also handling the job. Previous worker (AGX6) is still running.

27628 2024-04-03 09:23:16.438 DEBUG [wkr-DefaultQueue-4/pyoj] e.u.t.t.j.worker.JobQueueProcessor:128 -

Worker pyoj encounters an interesting exception:

28113 2024-04-03 09:23:16.471 ERROR [wkr-DefaultQueue-4/pyoj] e.u.t.tapis.shared.utils.TapisUtils:750 - TAPIS_REFLECTION_ERROR Unable to process the getConstructor reflective method on class edu.utexas.tacc.tapis.shared.exceptions.TapisImplException: TAPIS_APPLOAD_INTERNAL_ERROR Applications service internal error when accessing application urrutia-aggregator-0.6 for user urrutia@a2cps.
28114 java.lang.NoSuchMethodException: edu.utexas.tacc.tapis.shared.exceptions.TapisImplException.<init>(java.lang.String,java.lang.Throwable)
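
The reflective error-wrapping code looks up a (String, Throwable) constructor that TapisImplException evidently does not define. Below is a minimal standalone repro of that lookup failure; the stand-in exception class and its (String, int) constructor are hypothetical, and only the failing (String, Throwable) lookup is taken from the log:

```java
import java.lang.reflect.Constructor;

public class ReflectionRepro {
    // Stand-in for TapisImplException; the real class defines other
    // constructors, but apparently none with the (String, Throwable) signature.
    public static class TapisImplException extends Exception {
        public TapisImplException(String msg, int condition) { super(msg); }
    }

    public static void main(String[] args) {
        try {
            // This is the lookup the log shows failing at TapisUtils:750.
            Constructor<TapisImplException> ctor =
                TapisImplException.class.getConstructor(String.class, Throwable.class);
        } catch (NoSuchMethodException e) {
            // Prints: java.lang.NoSuchMethodException:
            //   ...TapisImplException.<init>(java.lang.String,java.lang.Throwable)
            System.out.println(e);
        }
    }
}
```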

Worker pyoj updates the job status to FAILED:

28428 2024-04-03 09:23:16.487 INFO  [wkr-DefaultQueue-4/pyoj] c.p.engine.spy.appender.Slf4JLogger:60 - jdbc:postgresql://jobs-postgres:5432/tapisjobsdb|12571|statement: INSERT INTO job_events (event, created, job_uuid, event_detail, oth_uuid, description, tenant) VALUES ('JOB_ERROR_MESSAGE'::job_event_enum, '2024-04-03T09:23:16.482+0000', 'b22c81ab-6a66-4584-94c0-1812bb359ec1-007', 'FAILED', NULL, '{"jobStatus":"FAILED","jobName":"aggregate","jobUuid":"b22c81ab-6a66-4584-94c0-1812bb359ec1-007","jobOwner":"urrutia","message":"JOBS_WORKER_PROCESSING_ERROR An exception occurred on JobWorker wkr-DefaultQueue-4 on queue tapis.jobq.submit.DefaultQueue running wkr-DefaultQueue-JobQueueProcessor while processing job b22c81ab-6a66-4584-94c0-1812bb359ec1-007: TAPIS_APPLOAD_INTERNAL_ERROR Applications service internal error when accessing application urrutia-aggregator-0.6 for user urrutia@a2cps."}', 'a2cps')

Worker AGX6 attempts to update the job to RUNNING, possibly because Slurm has just reported that the job started:

45829 2024-04-03 20:17:45.327 ERROR [wkr-DefaultQueue-44/AGX6] e.u.t.t.j.worker.JobQueueProcessor:186 - JOBS_WORKER_PROCESSING_ERROR An exception occurred on JobWorker wkr-DefaultQueue-44 on queue tapis.jobq.submit.DefaultQueue running wkr-DefaultQueue-JobQueueProcessor while processing job b22c81ab-6a66-4584-94c0-1812bb359ec1-007: JOBS_JOB_SELECT_UUID_ERROR Unable to select job b22c81ab-6a66-4584-94c0-1812bb359ec1-007 in tenant a2cps for user urrutia: JOBS_STATE_NO_TRANSITION Job b22c81ab-6a66-4584-94c0-1812bb359ec1-007 attempted an illegal state transition from FAILED to RUNNING.
richcar58 commented 2 months ago

The log information above is not complete, but it does point to a feasible explanation of what happened. Here is a summary of what appears to have happened:

  1. Communication with RabbitMQ failed and was automatically recovered.
  2. RabbitMQ delivered duplicate messages for each of the 4 jobs listed above.
  3. These duplicate messages caused jobs that were already executing on a worker thread to be assigned to a second worker thread. All bets are off at this point; a job cannot properly run in two threads at the same time.
  4. The new worker thread detected some type of error and tried to report that error using a reflective routine.
  5. Reflection failed to create a new exception because TapisImplException does not have a constructor with the required signature, so the new thread FAILED the job.
  6. The original thread logged a JOBS_STATE_NO_TRANSITION error when the state machine detected an attempted illegal state transition from FAILED to RUNNING (see the sketch after this list).
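
For context, here is a minimal sketch of the terminal-state guard implied by step 6. The enum values and method name are illustrative, not the actual Tapis Jobs classes, and a real implementation would consult a full per-state transition table:

```java
import java.util.EnumSet;
import java.util.Set;

public class TransitionDemo {
    // Illustrative status set; the real Tapis Jobs state machine has more
    // states and a complete transition table.
    enum JobStatus {
        PENDING, STAGING, QUEUED, RUNNING, FINISHED, FAILED, CANCELLED;

        private static final Set<JobStatus> TERMINAL =
            EnumSet.of(FINISHED, FAILED, CANCELLED);

        boolean canTransitionTo(JobStatus next) {
            // Terminal states accept no further transitions, so once worker
            // pyoj moved the job to FAILED, AGX6's FAILED -> RUNNING update
            // was rejected with JOBS_STATE_NO_TRANSITION.
            return !TERMINAL.contains(this);
        }
    }

    public static void main(String[] args) {
        System.out.println(JobStatus.FAILED.canTransitionTo(JobStatus.RUNNING)); // false
        System.out.println(JobStatus.QUEUED.canTransitionTo(JobStatus.RUNNING)); // true
    }
}
```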

Since this is the first time we've seen a JOBS_STATE_NO_TRANSITION error, we can assume it is a rare event triggered by a communication problem between RabbitMQ and its client code. RabbitMQ guarantees at-least-once delivery of messages, but duplicates can occur. Duplication does not normally happen because messages aren't acked until their job is in a terminal state. Evidently, RabbitMQ connection recovery can cause un-acked messages to be redelivered.
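
A minimal sketch of that ack policy with the RabbitMQ Java client follows. The job runner is a placeholder and the broker host is assumed; the queue name is taken from the log above. Automatic connection recovery (on by default in amqp-client 5.x) redelivers any message that was never acked, which is how a second worker thread can receive a job the first worker is still running:

```java
import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;

public class AckAfterTerminalState {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");  // assumed broker location
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            // The redelivered flag is the only hint the client gets that this
            // message may already be in flight on another worker thread.
            if (delivery.getEnvelope().isRedeliver()) {
                System.out.println("Possible duplicate delivery, tag=" + tag);
            }
            runJob(new String(delivery.getBody(), StandardCharsets.UTF_8));
            // Ack only after the job reaches a terminal state (the policy
            // described above); an un-acked message survives recovery.
            channel.basicAck(tag, false);
        };
        // autoAck=false so the broker keeps the message until we ack it.
        channel.basicConsume("tapis.jobq.submit.DefaultQueue", false,
                             onDeliver, consumerTag -> {});
    }

    static void runJob(String jobSpec) { /* placeholder for job processing */ }
}
```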

For now, the solution is simply to fix the secondary reflection failure. Since RabbitMQ automatically recovers its client/server connections without the application's knowledge, detecting duplicates would require Jobs to track the jobs currently assigned to all worker threads. Doing this would entail avoiding various race conditions while making sure the tracking doesn't indicate that a job is still assigned to a thread after that thread dies. In short, building a complex detection mechanism for an extremely rare event is probably not worth it.
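
To illustrate the complexity being declined, here is a hypothetical sketch of the simplest possible assignment tracker (not part of the Jobs codebase). A worker would call claim() when a message is delivered and release() in a finally block when it finishes; even this version has the cleanup problem noted above, since a thread that dies before calling release() leaves its job claimed forever:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical duplicate-assignment guard; illustrative only.
public class JobAssignmentTracker {
    private final ConcurrentMap<String, String> jobToThread = new ConcurrentHashMap<>();

    /** Atomically claim a job; returns false if another thread already owns it. */
    public boolean claim(String jobUuid) {
        return jobToThread.putIfAbsent(jobUuid, Thread.currentThread().getName()) == null;
    }

    /** Must run in a finally block, or a dying thread orphans its claim. */
    public void release(String jobUuid) {
        jobToThread.remove(jobUuid);
    }
}
```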