Open scblack321 opened 2 months ago
Reported by Joshua Urrutia on 4/3/2024 in Slack (Tacc Team - tapis-v3-transition). Jobs log file saved at tapisprod@cic02:/home/tapisprod/jobs_no_transition_error/job_worker_from_pod1.log
Analysis so far (line numbers are from file job_worker_from_pod1.log).
Example timeline including line numbers and some timestamps:
Worker thread with id AGX6 is monitoring job b22c81ab-6a66-4584-94c0-1812bb359ec1-007
9006 2024-04-01 17:01:04.342 DEBUG [wkr-DefaultQueue-44/AGX6] e.u.t.t.jobs.monitors.SlurmMonitor
Some recovery process begins, possibly related to rabbitmq
27210 2024-04-03 09:23:15.876 DEBUG [AMQP Connection 10.98.141.219:5672/] c.r.client.impl.ConsumerWorkService:40 - Creating executor service with 8 thread(s) for consumer work service
27211 2024-04-03 09:23:15.886 DEBUG [AMQP Connection 10.98.141.219:5672/] c.r.c.i.r.AutorecoveringConnection:593 - Connection amqp://jobs@10.98.141.219:5672/JobsHost has recovered
27212 2024-04-03 09:23:15.892 DEBUG [AMQP Connection 10.98.141.219:5672/] c.r.c.i.r.AutorecoveringConnection:653 - Channel AMQChannel(amqp://jobs@10.98.141.219:5672/JobsHost,1) has recovered
New worker thread with id pyoj is now also handling the job. Previous worker (AGX6) is still running.
27628 2024-04-03 09:23:16.438 DEBUG [wkr-DefaultQueue-4/pyoj] e.u.t.t.j.worker.JobQueueProcessor:128 -
Worker pyoj encounters an interesting exception:
28113 2024-04-03 09:23:16.471 ERROR [wkr-DefaultQueue-4/pyoj] e.u.t.tapis.shared.utils.TapisUtils:750 - TAPIS_REFLECTION_ERROR Unable to process the getConstructor reflective method on class edu.utexas.tacc.tapis.shared.exceptions.TapisImplException: TAPIS_APPLOAD_INTERNAL_ERROR Applications service internal error when accessing application urrutia-aggregator-0.6 for user urrutia@a2cps.
28114 java.lang.NoSuchMethodException: edu.utexas.tacc.tapis.shared.exceptions.TapisImplException.<init>(java.lang.String,java.lang.Throwable)
Worker pyoj updates job status to failed
28428 2024-04-03 09:23:16.487 INFO [wkr-DefaultQueue-4/pyoj] c.p.engine.spy.appender.Slf4JLogger:60 - jdbc:postgresql://jobs-postgres:5432/tapisjobsdb|12571|statement: INSERT INTO job_events (event, created, job_uuid, event_detail, oth_uuid, description, tenant) VALUES ('JOB_ERROR_MESSAGE'::job_event_enum, '2024-04-03T09:23:16.482+0000', 'b22c81ab-6a66-4584-94c0-1812bb359ec1-007', 'FAILED', NULL, '{"jobStatus":"FAILED","jobName":"aggregate","jobUuid":"b22c81ab-6a66-4584-94c0-1812bb359ec1-007","jobOwner":"urrutia","message":"JOBS_WORKER_PROCESSING_ERROR An exception occurred on JobWorker wkr-DefaultQueue-4 on queue tapis.jobq.submit.DefaultQueue running wkr-DefaultQueue-JobQueueProcessor while processing job b22c81ab-6a66-4584-94c0-1812bb359ec1-007: TAPIS_APPLOAD_INTERNAL_ERROR Applications service internal error when accessing application urrutia-aggregator-0.6 for user urrutia@a2cps."}', 'a2cps')
Worker AGX6 attempts to update the job to RUNNING, possibly because Slurm has just reported that the job started. The job is already FAILED at this point, so the update is rejected.
45829 2024-04-03 20:17:45.327 ERROR [wkr-DefaultQueue-44/AGX6] e.u.t.t.j.worker.JobQueueProcessor:186 - JOBS_WORKER_PROCESSING_ERROR An exception occurred on JobWorker wkr-DefaultQueue-44 on queue tapis.jobq.submit.DefaultQueue running wkr-DefaultQueue-JobQueueProcessor while processing job b22c81ab-6a66-4584-94c0-1812bb359ec1-007: JOBS_JOB_SELECT_UUID_ERROR Unable to select job b22c81ab-6a66-4584-94c0-1812bb359ec1-007 in tenant a2cps for user urrutia: JOBS_STATE_NO_TRANSITION Job b22c81ab-6a66-4584-94c0-1812bb359ec1-007 attempted an illegal state transition from FAILED to RUNNING.
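To make the final error concrete, here is a minimal sketch of a transition check that rejects FAILED to RUNNING. The status list is a simplified, hypothetical subset and the names `JobStatus`, `TransitionGuard` and `hasTransition` are illustrative assumptions, not the actual Jobs service code.

```java
// Hypothetical, simplified status set; the real Jobs service defines more states.
enum JobStatus {
    PENDING, QUEUED, RUNNING, FINISHED, CANCELLED, FAILED;

    // Terminal states admit no further transitions.
    boolean isTerminal() {
        return this == FINISHED || this == CANCELLED || this == FAILED;
    }
}

final class TransitionGuard {
    private TransitionGuard() {}

    // Returns true if a job may move from 'from' to 'to' in this simplified model.
    static boolean hasTransition(JobStatus from, JobStatus to) {
        if (from.isTerminal()) return false;   // e.g. FAILED -> RUNNING is rejected
        if (to.isTerminal()) return true;      // any active state may end
        return to.ordinal() > from.ordinal();  // active states only move forward
    }

    public static void main(String[] args) {
        System.out.println(hasTransition(JobStatus.QUEUED, JobStatus.RUNNING));
        System.out.println(hasTransition(JobStatus.FAILED, JobStatus.RUNNING));
    }
}
```

In this model, once worker pyoj has written FAILED, any later update from worker AGX6 fails the check regardless of what Slurm reports.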
The log information above is incomplete, but it does point to a feasible explanation. This is a summary of what appears to have happened:
Since this is the first time we've seen a JOBS_STATE_NO_TRANSITION error, we can assume it is a rare event triggered by a communication problem between RabbitMQ and its client code. RabbitMQ guarantees at-least-once delivery of messages, which means duplicates can occur. Duplication does not normally happen, even though messages aren't acked until their job is in a terminal state. Ostensibly, RabbitMQ communication recovery can cause un-acked messages to be resent, which would explain why worker pyoj picked up a job that worker AGX6 was still monitoring.
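The suspected sequence can be illustrated with a toy in-memory queue. This is only a sketch of at-least-once semantics under the assumption described above (un-acked messages are requeued after a connection recovery). `ToyQueue` and its methods are hypothetical and are not the RabbitMQ client API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy stand-in for a broker queue; NOT the RabbitMQ client API.
final class ToyQueue {
    // One delivery of a job message to a consumer.
    static final class Delivery {
        final long tag;
        final String jobUuid;
        final boolean redelivered;

        Delivery(long tag, String jobUuid, boolean redelivered) {
            this.tag = tag;
            this.jobUuid = jobUuid;
            this.redelivered = redelivered;
        }
    }

    private final Deque<Delivery> ready = new ArrayDeque<>();  // waiting for a consumer
    private final List<Delivery> unacked = new ArrayList<>();  // delivered, not yet acked
    private long nextTag = 1;

    void publish(String jobUuid) {
        ready.addLast(new Delivery(nextTag++, jobUuid, false));
    }

    // Hands the next message to a consumer; it stays un-acked until ack() is called.
    Delivery deliver() {
        Delivery d = ready.pollFirst();
        if (d != null) unacked.add(d);
        return d;
    }

    void ack(long tag) {
        unacked.removeIf(d -> d.tag == tag);
    }

    // Models connection loss and recovery: un-acked messages are requeued and flagged.
    void recoverConnection() {
        for (Delivery d : unacked) {
            ready.addLast(new Delivery(nextTag++, d.jobUuid, true));
        }
        unacked.clear();
    }

    public static void main(String[] args) {
        ToyQueue q = new ToyQueue();
        q.publish("job-1");
        Delivery toFirstWorker = q.deliver();   // first worker holds the job, no ack yet
        q.recoverConnection();                  // connection drops and recovers
        Delivery toSecondWorker = q.deliver();  // same job goes to a second worker
        System.out.println(toFirstWorker.jobUuid + " / " + toSecondWorker.jobUuid
                + " redelivered=" + toSecondWorker.redelivered);
    }
}
```

Because the first worker only acks when its job reaches a terminal state, a long-running job leaves its message un-acked for hours or days, which is the window in which a recovery could hand the same job to a second worker.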
For now, the solution is to simply fix the secondary reflection failure. Detecting duplicate deliveries would be much harder: since RabbitMQ automatically recovers its client/server connections without application knowledge, Jobs would have to track the jobs currently assigned to all threads. Doing this would entail avoiding various race conditions while making sure our tracking doesn't indicate that a job is assigned to a thread after that thread dies. In short, building a complex detection mechanism for an extremely rare event is probably not worth it.
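One possible shape for the reflection fix is a constructor lookup that falls back instead of failing with NoSuchMethodException. This is a sketch only: `ExceptionRebuilder`, `rebuild` and `StatusException` are hypothetical names, and the actual TapisUtils code and TapisImplException constructors may differ.

```java
// Hypothetical exception type whose only constructor takes an extra int code,
// so it has no (String, Throwable) constructor.
class StatusException extends Exception {
    final int code;

    public StatusException(String msg, Throwable cause, int code) {
        super(msg, cause);
        this.code = code;
    }
}

final class ExceptionRebuilder {
    private ExceptionRebuilder() {}

    // Rebuilds an exception of 'type' with a new message, keeping 'cause' attached.
    static Exception rebuild(Class<? extends Exception> type, String msg, Throwable cause) {
        try {
            // Preferred shape: (String, Throwable).
            return type.getConstructor(String.class, Throwable.class).newInstance(msg, cause);
        } catch (ReflectiveOperationException e1) {
            try {
                // Fallback shape: (String) only, with the cause attached afterwards.
                Exception ex = type.getConstructor(String.class).newInstance(msg);
                ex.initCause(cause);
                return ex;
            } catch (ReflectiveOperationException | IllegalStateException e2) {
                // Last resort: a plain wrapper, so the original error is never lost.
                return new Exception(msg, cause);
            }
        }
    }

    public static void main(String[] args) {
        Throwable root = new RuntimeException("root cause");
        Exception e = rebuild(StatusException.class, "wrapped message", root);
        System.out.println(e.getClass().getName() + ": " + e.getMessage());
    }
}
```

With a fallback like this, the worker would still report the underlying application-load error rather than a TAPIS_REFLECTION_ERROR, whichever constructors the exception class exposes.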
Jobs fail reporting error JOBS_STATE_NO_TRANSITION. Jobs impacted: c50fa371-34fb-467d-bd07-d07fe2c774f8-007 bbd8ea6f-b7e1-4e72-aed5-3c7b8492c1d1-007 b22c81ab-6a66-4584-94c0-1812bb359ec1-007 0dec9b46-451d-4884-9ce9-a3d7b0887dd2-007