Open hadijaveed opened 3 months ago
Logs suggest the root cause is that the pyscopg driver failed to obtain a connection and timed out.
psycopg_pool.PoolTimeout: couldn't get a connection after 30.00 sec
The procrastinate worker will give up when the main coroutine throws an error, which is what happened here.
Maybe the library could add more resiliency by retrying non critical errors.
In the meantime, you might be able to achieve some of that by wrapping the app run in a loop that catches errors and restarts the worker. That would prevent crashing the whole application.
@onlyann thank you for your reply. how can I run a loop to catch these errors? Database connection is available, I was able to validate other Psycog library connections, not sure why single_worker runs into this error, also looks like we need to improve the logs
so I added the loop and re-try mechanism, but seems like following SQL error breaks the full flow
db | 2024-07-10 18:00:31.609 UTC [37] STATEMENT: SELECT id, status, task_name, priority, lock, queueing_lock, args, scheduled_at, queue_name, attempts
db | FROM procrastinate_fetch_job($1);
db | 2024-07-10 18:00:31.615 UTC [36] ERROR: canceling statement due to user request
db | 2024-07-10 18:00:31.615 UTC [36] CONTEXT: SQL statement "INSERT
db | INTO procrastinate_periodic_defers (task_name, periodic_id, defer_timestamp)
db | VALUES (_task_name, _periodic_id, _defer_timestamp)
db | ON CONFLICT DO NOTHING
db | RETURNING id"
db | PL/pgSQL function procrastinate_defer_periodic_job(character varying,character varying,character varying,character varying,character varying,bigint,jsonb) line 7 at SQL statement
db | 2024-07-10 18:00:31.615 UTC [36] STATEMENT: SELECT procrastinate_defer_periodic_job($1, $2, $3, $4, $5, $6, $7) AS id;
db | 2024-07-10 18:00:36.635 UTC [39] ERROR: column "priority" does not exist at character 31
db | 2024-07-10 18:00:36.635 UTC [39] STATEMENT: SELECT id, status, task_name, priority, lock, queueing_lock, args, scheduled_at, queue_name, attempts
db | FROM procrastinate_fetch_job($1);
db | 2024-07-10 18:00:36.647 UTC [38] ERROR: canceling statement due to user request
db | 2024-07-10 18:00:36.647 UTC [38] CONTEXT: SQL statement "INSERT
db | INTO procrastinate_periodic_defers (task_name, periodic_id, defer_timestamp)
db | VALUES (_task_name, _periodic_id, _defer_timestamp)
db | ON CONFLICT DO NOTHING
db | RETURNING id"
db | PL/pgSQL function procrastinate_defer_periodic_job(character varying,character varying,character varying,character varying,character varying,bigint,jsonb) line 7 at SQL statement
db | 2024-07-10 18:00:36.647 UTC [38] STATEMENT: SELECT procrastinate_defer_periodic_job($1, $2, $3, $4, $5, $6, $7) AS id;
almost all of the errors I run into is because of the above SQL exception
What version of Procrastinate are you using? Have you recently upgraded?
db | 2024-07-10 18:00:36.635 UTC [39] ERROR: column "priority" does not exist at character 31
priority
is a recently added feature. Have you applied the SQL migrations after the package update? (But I am still wondering why this then is a sporadic crash.)
@medihack I am using 2.7.0 version, since I did not lock the package version
and yes it's very frustrating whole worker would crash again and again due to missing column SQL exception
if there are schema changes, package should detect new changes, give warnings or there should be some kind of backward compatibility, regardless I think worker should not crash on such small exceptions
I am unsure if this is related to the new package release 2.7.0, but the worker crashed repeatedly on a small SQL exception. and the error log produced by procrastinate are not helpful at all, unless you look into Postgres DB error logs
Error produced by procrastinate:
app | Main coroutine error, initiating remaining coroutines stop. Cause: ConnectorException('\n Database error.\n ')
app | single_worker error: ConnectorException('\n Database error.\n ')
app | NoneType: None
Error produced by Postgres
Db | 2024-07-11 04:33:31.386 UTC [36] ERROR: Job was not found or not in "doing" status (job id: 75)
Db | 2024-07-11 04:33:31.386 UTC [36] CONTEXT: PL/pgSQL function procrastinate_retry_job(bigint,timestamp with time zone) line 12 at RAISE
Db | 2024-07-11 04:33:31.386 UTC [36] STATEMENT: SELECT procrastinate_retry_job($1, $2);
I am un-clear why would worker crash on such small issues?
if there are schema changes, package should detect new changes, give warnings or there should be some kind of backward compatibility
Yes, it would be great to improve the experience on the migration aspect. This is tracked by #1040. Would you like to contribute?
In the meantime, I invite you to read the documentation on how migrations are handled for this library: https://procrastinate.readthedocs.io/en/stable/howto/production/migrations.html
the error log produced by procrastinate are not helpful at all, unless you look into Postgres DB error logs
The Database Error." happens to be the default message of ConnectorException
. It should however include the inner error in its __cause__
.
This is typically output when logging the entire exception and not just the exception message.
That said, it is possible there is an issue within the library that doesn't output enough details. What do you say @ewjoachim ?
I am un-clear why would worker crash on such small issues?
It may look like a "small" issue, but there is not much the worker could do here.
The schema change introduced in version 2.7.0 is not backward compatible.
The newly introduced priority argument is part of the SQL query sent by Procrastinate 2.7.0 but because the migration has not run, the procrastinate_defer_periodic_job
function does not have the correct list of arguments.
if there are schema changes, package should detect new changes, give warnings or there should be some kind of backward compatibility, regardless I think worker should not crash on such small exceptions
There's backwards compatibility, the other way around (or forward compatibility if you want, it depends how you see it): you can run older versions of the code with newer versions of the schema, this is required for creating a no-downtime migration path.
Having both forward & backward compatibility (being able to run the old code with the new schema or the old schema with the new code) seems like a very complicated thing to do, I think, especially since we don't know how many versions people will skip and if they'll actually end up running the migrations. I'd rather find a way to ensure you can't run procrastinate with an old schema.
Maybe we should also add a more obvious hint on the documentation's start page or the quickstart page that a manual migration step may be required after a package upgrade for non-Django users. And maybe also that the version should be locked to a minor version when using tools like poetry, pipenv, rye, ...
thank you for the help. and makes sense. I think for now locking a version would help
still unclear on why the worker crashes every time on small minor errors like following, which are not related to schema but more related to state of a task
Db | 2024-07-11 04:33:31.386 UTC [36] ERROR: Job was not found or not in "doing" status (job id: 75)
Db | 2024-07-11 04:33:31.386 UTC [36] CONTEXT: PL/pgSQL function procrastinate_retry_job(bigint,timestamp with time zone) line 12 at RAISE
Db | 2024-07-11 04:33:31.386 UTC [36] STATEMENT: SELECT procrastinate_retry_job($1, $2);
the above is a minor state that should not crash the full worke
Have you tried upgrading the package and applying the migrations? Does the problem still exist? What's the retry strategy? How does the task look like?
@medihack I have upgraded the package, it has fixed the schema inconsistency error but I still run into the following error consistently from time to time, even re-starting the worker won't help, the only way is to clear out jobs that are not successful. the worker crash entirely on minor SQL exceptions, following is the exception
Db | 2024-07-11 04:33:31.386 UTC [36] ERROR: Job was not found or not in "doing" status (job id: 75)
Db | 2024-07-11 04:33:31.386 UTC [36] CONTEXT: PL/pgSQL function procrastinate_retry_job(bigint,timestamp with time zone) line 12 at RAISE
Db | 2024-07-11 04:33:31.386 UTC [36] STATEMENT: SELECT procrastinate_retry_job($1, $2);
This is how worker is initialized in FastAPI
task_queue = App(connector=PsycopgConnector(
conninfo=os.getenv("PROCRASTINATE_DB_URL")
))
@asynccontextmanager
async def lifespan(app: FastAPI):
async with task_queue.open_async():
worker = asyncio.create_task(
task_queue.run_worker_async(install_signal_handlers=False)
)
# Set to 100 to test the ungraceful shutdown
await sleep.defer_async(length=5)
print("STARTUP")
yield
print("SHUTDOWN")
worker.cancel()
try:
await asyncio.wait_for(worker, timeout=10)
except asyncio.TimeoutError:
print("Ungraceful shutdown")
except asyncio.CancelledError:
print("Graceful shutdown")
app = FastAPI(lifespan=lifespan)
example jobs with re-try strategy
@task_queue.task(retry=RetryStrategy(
max_attempts=5,
wait=5,
exponential_wait=5
))
async def process_file(file_id: str):
…
@task_queue.periodic(cron="*/10 * * * *")
@task_queue.task(queueing_lock="retry_stalled_jobs", pass_context=True)
async def retry_stalled_jobs(context, timestamp):
…
I wouldn't call it minor. It's something that should not happen. From looking at the worker code, the retry happens before the status of the job is changed and the job may be deleted. But I wonder if your retry_stalled_jobs
somehow interferes with your process_file
task. It's unclear to me which procrastinate_retry_job
leads to that error, that of the process_file
task or the retry you probably have in the retry_stalled_jobs
task. Does the Python traceback give more information on that? Do these sporadic errors still happen without the periodic `retry_stalled_jobs´ task?
Running into an issue where the worker crashes now and then and never restarts. I keep getting the following error on periodic runs
single_worker error: ConnectorException('\n Database error.\n ')
And when I looked into Postgres we simultaneously got the following exception
Following is my job setup
I am running the worker in async mode through FastAPI asynccontextmaanger
few times we get the following error. which is not very descriptive either
Any idea what's going on?