meeb / tubesync

Syncs YouTube channels and playlists to a locally hosted media server
GNU Affero General Public License v3.0

locked tasks from crashed process_tasks process not unlocked #252

Open networkjanitor opened 2 years ago

networkjanitor commented 2 years ago

Situation:

Sometimes the process_tasks process gets killed or exits (for whatever reason) and is then restarted by s6. However, if the original process had locked tasks in the db, these do not get unlocked, neither on restart nor after they expire. If the number of locked 'dead' tasks reaches BACKGROUND_TASK_ASYNC_THREADS, the whole task queue is deadlocked until the tasks are reset.

As far as I can tell, this is because of a bug in django-background-tasks, related to: https://github.com/arteria/django-background-tasks/issues/239

When using BACKGROUND_TASK_ASYNC_THREADS, tasks never expire, as per the linked issue. So even if there are locked tasks in the db (held by processes which no longer exist), they won't get unlocked. The same applies to tasks running longer than MAX_RUN_TIME (that's what the linked issue is actually about).
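
To illustrate the stuck state, here is a small diagnostic sketch (my assumption on how to check this from the Django shell inside the container, e.g. via python3 /app/manage.py shell) which lists the locked tasks and whether the PID holding each lock is still alive:

from django.utils import timezone
from background_task.models import Task

# List all currently locked tasks; if none of the lock-holding PIDs are
# still alive, the queue is in the deadlocked state described above.
for task in Task.objects.locked(timezone.now()):
    alive = task.locked_by_pid_running()
    print(f'{task} locked_by={task.locked_by} pid_alive={alive}')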

Ideas to solve this or work around this:

networkjanitor commented 2 years ago

I think this is also what https://github.com/arteria/django-background-tasks/issues/151 is about, but that issue does not have any description and is four years old by now.

meeb commented 2 years ago

Hi, thanks for the detailed issue. Yes, there are a number of issues with django-background-tasks which cause knock-on problems. There is a (long-running) effort to replace django-background-tasks with celery; the hold-up is that it's a lot of work to switch without breaking anyone's existing tasks and install, as I basically need to port all existing tasks over to celery and add a heartbeat. This is also holding up moving a load of the heavy operations (deleting a big source causing a 503/timeout etc.) to background tasks. For an immediate work-around, you can use the reset-tasks command line option, which deletes all the existing (including stale) django-background-tasks tasks and re-creates any that are needed as new tasks.
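
For reference, running reset-tasks is typically something like docker exec -ti tubesync python3 /app/manage.py reset-tasks (the container name may differ in your setup). Invoking the same thing from code is just a call_command away; a rough sketch, assuming it is run inside the container:

from django.core.management import call_command

# Deletes all existing (including stale/locked) background tasks and lets
# tubesync re-create the ones it needs as fresh tasks.
call_command('reset-tasks')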

networkjanitor commented 2 years ago

For an immediate work-around, you can use the reset-tasks command line option, which deletes all the existing (including stale) django-background-tasks tasks and re-creates any that are needed as new tasks.

That's exactly what I have been using (from crontab by now), but it's barely working. Right now the task queue usually deadlocks while reset-tasks is still running, if not shortly after reset-tasks finishes, leaving behind ~2500 unprocessed tasks. That means that for every handful of newly processed tasks (i.e. metadata/data downloads), it needs to re-index all sources again and again.

meeb commented 2 years ago

Ah, fair enough. Unfortunately it's unlikely I'm going to be able to support an unmaintained upstream package in order to resolve this bug directly. I'd not heard of this particular issue before, nor of one where reset-tasks didn't at least temporarily fix it. Relying on it via cron should obviously be overkill, but it seems to be required for some installations, sadly. I'll update this issue with details on the migration to celery, though, if you like.

networkjanitor commented 2 years ago

For now I'm using the following new command (placed at /app/sync/management/commands/unlock-tasks.py) to unlock (by deleting and recreating) any tasks locked by dead processes:

from django.core.management.base import BaseCommand
from django.utils.translation import gettext_lazy as _
from django.utils import timezone
from background_task.models import Task

from common.logger import log

class Command(BaseCommand):

    help = 'Unlocks jobs, which are locked by dead processes.'

    def handle(self, *args, **options):
        log.info('Fetching locked tasks...')
        # Iterate over all locked tasks
        locked_tasks = Task.objects.locked(timezone.now())
        for locked_task in locked_tasks:
            if locked_task.locked_by_pid_running():
                log.info(f'Task is still fine: {locked_task}')
            else:
                log.info(f'Task needs unlocking: {locked_task}')
                if locked_task.is_repeating_task():
                    # Deleting repeating tasks is bad (would result in no more source
                    # index tasks), so we schedule them before deletion.
                    # This matches the flow in the bg_runner (w/o signals).
                    log.info(f'Re-schedule repeating task: {locked_task}')
                    locked_task.create_repetition()

                # Deleting a locked, non-repeating task should not be too bad,
                # as it should be re-created on the source index task again. 
                log.info(f'Deleting task: {locked_task}')
                locked_task.delete()

                ## Unlocking by setting locked_by and locked_at to None
                ## results in source index tasks being repeated over and over,
                ## since those seem to be the ones crashing process_tasks for me.
                ## Re-execution of the same source index task results in duplicate
                ## download tasks being scheduled, without them ever being executed
                #locked = locked_tasks.filter(pk=locked_task.pk)
                #locked.update(locked_by=None, locked_at=None)

        log.info('Done')
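
Usage note: since it is a normal management command, it can be run the same way as reset-tasks, presumably python3 /app/manage.py unlock-tasks from inside the container (or via docker exec), given the file placement above.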

This really is a last-resort workaround, as I'm deleting the non-repeating tasks instead of rescheduling them. It seems to work so far, and it also seems like only the (repeating) source index tasks were crashing the process_tasks process. I'll observe this for a few days and note anything else that comes to mind, just in case someone else stumbles across this issue (and can't solve it by setting TZ correctly, switching away from sqlite or using reset-tasks to fix it), at least until the celery replacement is ready :)

meeb commented 2 years ago

Thanks for posting the command for anyone else who stumbles over these sorts of issues. Most helpful!

Imanagement commented 6 months ago

@networkjanitor Good job, thank you!

wsdookadr commented 3 months ago

@meeb should @networkjanitor 's command be added to the repo?