pgadmin-org / pgagent

pgAgent - A job scheduler for PostgreSQL
https://www.pgadmin.org/
Other
103 stars 41 forks source link

pgAgent "Job no steps to execute!" problem #25

Open kocersema opened 2 years ago

kocersema commented 2 years ago

Hello, I use pgagent_14-4.2.1. Jobs run every minute. Sometimes this error occurs but there are steps in pgagent.pga_jobstep.

Problem name: [2 / requestloginsert_minute] Job no steps to execute!

pgagent.pga_joblog :

pgagent_1

In the screenshot, 3 jobs start and end at the same time.

pgagent_2

How can I solve this problem?

Thank you.

HalacliSeda commented 1 year ago

Hello, I am also getting the same error. I use pgagent_14-4.2.1 and postgresql14-14.5.

I have a job that runs every minute. Sometimes it works fine, but sometimes it gives an error that there is no step.

Here is my job step:

[postgres] # select * from pgagent.pga_jobstep where jstjobid='1'; jstid | jstjobid | jstname | jstdesc | jstenabled | jstkind | jstcode | jstconnstr | jstdbname | jstonerror | jscnextrun -------+----------+---------+---------+------------+---------+--------------------------------------+------------+-----------+------------+------------ 1 | 1 | steps | | t | s | CALL public.myfunction() | | postgres | f | (1 row)

Here is job log:

[postgres] # select * from pgagent.pga_joblog order by jlgid desc limit 20; jlgid | jlgjobid | jlgstatus | jlgstart | jlgduration
-------+----------+-----------+-------------------------------+----------------- 40069 | 1 | r | 2022-11-16 13:28:05.793086+02 | 40068 | 1 | s | 2022-11-16 13:27:03.447569+02 | 00:00:00.996283 40067 | 1 | s | 2022-11-16 13:26:04.197752+02 | 00:00:01.021981 40066 | 1 | s | 2022-11-16 13:25:05.454255+02 | 00:00:02.326982 40065 | 1 | s | 2022-11-16 13:24:04.219011+02 | 00:00:00.132577 40064 | 1 | s | 2022-11-16 13:23:02.819775+02 | 00:00:03.620179 40063 | 1 | s | 2022-11-16 13:22:04.076578+02 | 00:00:02.671583 40062 | 1 | s | 2022-11-16 13:21:04.036276+02 | 00:00:03.272034 40061 | 1 | s | 2022-11-16 13:20:02.968261+02 | 00:00:04.061977 40060 | 1 | s | 2022-11-16 13:19:04.412659+02 | 00:00:00.87664 40059 | 1 | s | 2022-11-16 13:18:03.230314+02 | 00:00:00.065278 40058 | 1 | s | 2022-11-16 13:17:04.748409+02 | 00:00:12.235386 40057 | 1 | i | 2022-11-16 13:16:08.327694+02 | 40056 | 1 | s | 2022-11-16 13:16:08.325891+02 | 00:00:00.1198 40055 | 1 | s | 2022-11-16 13:15:01.499926+02 | 00:00:00.087013 40054 | 1 | s | 2022-11-16 13:14:02.947203+02 | 00:00:04.059835 40053 | 1 | i | 2022-11-16 13:13:21.300041+02 | 40052 | 1 | i | 2022-11-16 13:13:21.19951+02 | 40051 | 1 | s | 2022-11-16 13:13:12.603287+02 | 00:00:16.804197 40050 | 1 | s | 2022-11-16 13:12:02.3888+02 | 00:00:01.504216

Could this be a bug? Has anyone had the same problem and found a solution?

Thanks, Seda

akshay-joshi commented 1 year ago

Hi @kocersema and @HalacliSeda

I am not able to reproduce the issue on Windows 10, I have created a job that runs every minute and called the procedure as a part of the job step. I have tested two separate code First: BEGIN PERFORM 1; END;

Second: BEGIN PERFORM pg_sleep(80); //Sleep for 80 seconds END;

In both the above cases, the issue was not reproducible. I have raised the PR 29 which contains some Debug/Warning logs to identify the issue.

akshay-joshi commented 1 year ago

Hi @kocersema and @HalacliSeda

On which platform you are running pgAgent?

HalacliSeda commented 1 year ago

Hi @akshay-joshi , I got this error on Rocky 9. In my case, I run a procedure which is run 5-10 seconds max, and I run this job every minute.

akshay-joshi commented 1 year ago

Hi @HalacliSeda

I have observed one more thing in your pga_joblog table. It seems that for some reason the jobs throwing an error 'No steps to execute' are scheduled at the same time or before a minute. It could be the reason pgAgent ignoring those steps. Please refer to the below screenshot:

pgagent

We have added the warning logs as part of the PR 29 to identify the exact issue.

HalacliSeda commented 1 year ago

Hi @akshay-joshi , yes, you're right. Success message came every time before giving this error. So is it possible to prevent this error that comes after the success message?

Thanks.

bbourgier commented 3 months ago

Hi Reporting same behavior. Success on scheduled time BUT successive logs just after with 'i' status. This is not really a problem but after looking at the code I think something could be done to avoid those Job Run attempts falling down to 'i' exit status.

What I see is that there is a Job running loop at pgagent.cpp line 102 which relies on "jobnextrun <= now()" to try and run the Job. But when the Job is run at the scheduled time, during the Job runtime (which can be long), the 'jobnextrun' column is updated ONLY when the Job run is finished. Which means that during the Job runtime, the condition is still fullfilled to try and run the Job in the loop. For instance I have: image As you can see, 'joblastrun' is updated with the scheduled time (this is ok) but the 'jobnextrun' is still set to a time in the past.

And this is why you'll get in the job log table: 1 line with status 'r' while the Job is running several lines with status 'i' created while the Job is running And at last the line with status 'r' is updated with status 's' (on success of course)

@akshay-joshi Do you think it would be possible to avoid those 'i' lines by NOT trying and run the Job while it is already running. Either by updating the 'jobnextrun' column at the same moment you update the 'joblastrun' column, or by checking the Job status before trying to run it again?

Thanks in advance and sorry if not very clear in my explanations.