meeb / tubesync

Syncs YouTube channels and playlists to a locally hosted media server
GNU Affero General Public License v3.0

Not deleting media older than "days to keep" #206

Open rrediske opened 2 years ago

rrediske commented 2 years ago

First, thank you for this wonderful way to download videos automatically!

The problem I am having is that tubesync isn't removing videos from any of the four channels I have it watching, so I will eventually run out of disk space if I can't find a way to delete old videos. I used your basic 13 line docker-compose.yml, so I have nothing special for configuration. When adding each channel, I set the "download cap" to 1 week and the "days to keep" to 10 days, but tubesync still has every video it's ever downloaded going back to December 13th of last year (I'm now at 63 videos).

In a previous install on another machine, I tried deleting videos manually from the mounted volume "tubesync-downloads", but that seemed to cause tubesync to stop downloading anything else, so I wound up moving to a new machine to start over. It takes almost 2 full days for tubesync to index all the videos of the 4 channels I watch, so I really want to avoid having to do that.

Any ideas? Am I doing something wrong?

meeb commented 2 years ago

Thanks for the comments! That could well be a bug. I'll leave this open and investigate. I'm assuming you're on normal-ish Linux and not running it with weird WSL paths on Windows or anything?

rrediske commented 2 years ago

openSUSE Leap 15.3, so... weird enough, but normal-ish :) I'm running Home Assistant and Nextcloud Docker images on the same machine, and they have been fine for a few months. It's a VM on a Dell R710 running ESXi, and the VM has 8 GB RAM and a 120 GB disk.

meeb commented 2 years ago

Thanks for the details. If you can easily search the container logs, are there any errors? If for some reason it's attempting to clear up files that don't exist due to an invalid path or similar issue, there should be a log of it.

rrediske commented 2 years ago

I did docker logs 8703 >& abc then did a grep ERROR on that and I got 39 errors in the last 24 hours of the form:

2022-01-19 19:46:49,576 [tubesync/ERROR] ERROR: [youtube] 7wLxM7oNN1s: This video is unavailable on this device.

39 sounds like the number of videos that should be getting deleted, though that might be a coincidence.

Here's more context:

Rescheduling task Downloading metadata for "21e4dd9c-6e1e-4bff-ac85-df44d526a059" for 5:45:41 later at 2022-01-19 11:35:05.305812+00:00
2022-01-18 23:49:28,602 [tubesync/DEBUG] [youtube] 7wLxM7oNN1s: Downloading webpage
2022-01-18 23:49:29,099 [tubesync/DEBUG] [youtube] 7wLxM7oNN1s: Downloading android player API JSON
2022-01-18 23:49:29,312 [tubesync/ERROR] ERROR: [youtube] 7wLxM7oNN1s: This video is unavailable on this device.
Rescheduling Downloading metadata for "672d9be1-840c-49ed-825a-ae1e12a43fc4"
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/background_task/tasks.py", line 43, in bg_runner
    func(*args, **kwargs)
  File "/app/sync/tasks.py", line 227, in download_media_metadata
    metadata = media.index_metadata()
  File "/app/sync/models.py", line 1235, in index_metadata
    return indexer(self.url)
  File "/app/sync/youtube.py", line 50, in get_media_info
    raise YouTubeError(f'Failed to extract_info for "{url}": No metadata was '
sync.youtube.YouTubeError: Failed to extract_info for "https://www.youtube.com/watch?v=7wLxM7oNN1s": No metadata was returned by youtube-dl, check for error messages in the logs above. This task will be retried later with an exponential backoff.
Rescheduling task Downloading metadata for "672d9be1-840c-49ed-825a-ae1e12a43fc4" for 5:45:41 later at 2022-01-19 11:35:10.319907+00:00

meeb commented 2 years ago

That's a "normal" error, that video ( https://www.youtube.com/watch?v=7wLxM7oNN1s ) does seem to be actually unavailable so TubeSync is correct there. Thanks, I'll check the old media cleanup code.

rrediske commented 2 years ago

I could share my screen via something like Discord or Jitsi if that helps, type whatever so you can see output. That's the only ERROR line type. The log file abc is 24,700 lines long just for the last day, lol.

meeb commented 2 years ago

Thanks for the offer, but it's probably not too useful right now. If there was an attempt to delete the wrong path in a cleanup it would have left a note in the error log. It should be relatively easy to trace why the clean-up isn't firing.

rrediske commented 2 years ago

grep -i clean shows 13 lines with the word clean, but they're all in video titles.

rrediske commented 2 years ago

I don't know if this helps, but if I hit skipped media, it shows 55 pages of 144 videos each, so that's almost 8000 videos for it to index. Maybe it's too many for it to handle? One channel by itself is somewhere around 5000 videos.

meeb commented 2 years ago

That amount of media should be fine; it's likely an issue detecting whether the file exists, or some other sort of path issue. If the media has been downloaded already and exists on your local disk, it must have been indexed properly already, so the max-age deletion should pick it up.

rrediske commented 2 years ago

I decided to remove some of the downloads manually to recover disk space yesterday around 10:30 AM server time. I did a docker exec into the container and then rm 2021*, rm 2022-01-0* and rm 2022-01-1* in each directory inside /downloads (removing anything older than 11 days; all the sources are configured in tubesync to remove items older than 10 days).

Here's the log file for the last two days: https://docs.rediske.org/2022-02-01.txt

mcinj commented 2 years ago

@rrediske, in the UI on the media tab, do your episodes that you expect to be deleted show a "downloading" text?

rrediske commented 2 years ago

I'll have to give tubesync some time and look tomorrow, the oldest media there is dated 1/22/22, 10.5 days ago. Nothing shows downloading right now.

image

rrediske commented 2 years ago

The dashboard no longer shows videos older than 1/23:

image

But the list of files still goes back to 1/20 (I manually deleted everything older than 1/19):

image

MatthK commented 1 year ago

I just discovered that my TubeSync also fails to delete older content. I had it set for 10 days, but I still have all the content from the last half year. I changed the days to keep, but that didn't trigger a delete either. I went through the log file, but searching for "clean" only brought up hits in the names of the videos. There are a lot of lines with "error"; however, the few dozen I checked were all related to videos it couldn't download. I opened a terminal session and tried an rm filename, and the file got deleted immediately, so I would assume it is not an issue with file permissions (my download directory is on a mounted directory). TubeSync also runs in a Docker container. I updated the container to the latest version this week, but that seems not to have fixed the issue. It's not urgent and the disk still has space remaining, but should I just delete the old videos manually or keep them for "testing"?

meeb commented 1 year ago

I'm not aware of this not functioning in the current release; however, the logic might not be entirely clear. As per:

https://github.com/meeb/tubesync/blob/main/tubesync/sync/tasks.py#L134

The cleanup_old_media() function is called every time a source is indexed. Media is deleted with the following log message (which will be in the container logs):

                log.info(f'Deleting expired media: {media.source} / {media} '
                         f'(now older than {media.source.days_to_keep} days / '
                         f'download_date before {delta})')

The media deletion should be triggered if the following conditions are met:

  1. The media is downloaded
  2. The media download date is not null
  3. The source of the media has "delete old media" enabled
  4. The source of the media has "days to keep" set to an integer
  5. The media download date is older than the current date minus the days to keep

The clean-up code is relatively simple and I can't obviously see any issues with it. If you can confirm the above 5 prerequisites are met and your media still isn't being deleted let me know and I'll pop a bunch of debug logging into the tasks to work out what isn't firing on your installation.
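
For anyone checking their own installation, the five conditions above can be sketched roughly as follows. This is a minimal illustration, not the actual cleanup_old_media() code: the SourceSketch and MediaSketch classes and the delete_old_media and downloaded field names are assumptions made for the example (only days_to_keep and download_date appear in the log message quoted above).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SourceSketch:
    # Hypothetical stand-in for a TubeSync source; field names are assumptions.
    delete_old_media: bool
    days_to_keep: Optional[int]


@dataclass
class MediaSketch:
    # Hypothetical stand-in for a TubeSync media item.
    source: SourceSketch
    downloaded: bool
    download_date: Optional[datetime]


def is_expired(media: MediaSketch, now: datetime) -> bool:
    """Return True only when all five conditions listed above hold."""
    source = media.source
    if not media.downloaded:                      # 1. media is downloaded
        return False
    if media.download_date is None:               # 2. download date is not null
        return False
    if not source.delete_old_media:               # 3. "delete old media" enabled
        return False
    if not isinstance(source.days_to_keep, int):  # 4. "days to keep" is an integer
        return False
    # 5. download date is older than the current date minus the days to keep
    delta = now - timedelta(days=source.days_to_keep)
    return media.download_date < delta
```

If any one of the five checks fails for a media item, it would be skipped silently, which is why confirming each prerequisite separately is useful before adding debug logging.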

MatthK commented 1 year ago

I checked the log again for "Deleting expired" and found the following entry (among others):

2023-01-25T01:38:33.709029102Z 2023-01-25 09:38:33,708 [tubesync/INFO] Deleting expired media: Formel 1 / wycpxkxIWk0 (now older than 14 days / download_date before 2023-01-11 01:38:33.708724+00:00)
2023-01-25T01:38:33.709975984Z 2023-01-25 09:38:33,709 [tubesync/INFO] Deleting tasks for media: Als Ayrton Senna fast für Ferrari gefahren wäre!
2023-01-25T01:38:33.712439282Z 2023-01-25 09:38:33,712 [tubesync/INFO] Scheduling media server updates
2023-01-25T01:38:33.716321202Z 2023-01-25 09:38:33,716 [tubesync/INFO] Deleting expired media: Formel 1 / b3QtosB64Jg (now older than 14 days / download_date before 2023-01-11 01:38:33.716211+00:00)
2023-01-25T01:38:33.717626928Z 2023-01-25 09:38:33,717 [tubesync/INFO] Deleting tasks for media: 10 F1-Rekorde, die "Schumi" 2023 verlieren könnte
2023-01-25T01:38:33.719752753Z 2023-01-25 09:38:33,719 [tubesync/INFO] Scheduling media server updates
...
2023-01-26T01:38:29.186077595Z 2023-01-26 09:38:29,185 [tubesync/INFO] Deleting completed tasks older than 7 days (run_at before 2023-01-19 01:38:29.185964+00:00)
2023-01-26T01:38:30.710721535Z 2023-01-26 09:38:30,710 [tubesync/INFO] Deleting expired media: Formel 1 / DjpAsYab0n0 (now older than 14 days / download_date before 2023-01-12 01:38:30.710454+00:00)
2023-01-26T01:38:30.711765598Z 2023-01-26 09:38:30,711 [tubesync/INFO] Deleting tasks for media: „Du kommst hier nicht rein“: warum die F1 Andretti blockiert!
2023-01-26T01:38:30.714079051Z 2023-01-26 09:38:30,713 [tubesync/INFO] Scheduling media server updates

Now while it seems that TubeSync is deleting the files, they still exist on the disk. When I "View media linked to this source" however, I can only see three episodes, while all previous ones appear now as "Skipped".
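
A quick way to confirm this mismatch is to pull the media keys out of the "Deleting expired media" log lines and look for files that still carry those keys. The sketch below assumes the log format shown above and that the media key appears in the file name; expired_keys and leftover_files are hypothetical helper names, not part of TubeSync.

```python
import re
from pathlib import Path

# Matches the key in lines such as:
# "... Deleting expired media: Formel 1 / wycpxkxIWk0 (now older than 14 days ..."
EXPIRED_RE = re.compile(r"Deleting expired media: .* / (\S+) \(now older than")


def expired_keys(log_lines):
    """Collect the media keys from 'Deleting expired media' log lines."""
    keys = []
    for line in log_lines:
        match = EXPIRED_RE.search(line)
        if match:
            keys.append(match.group(1))
    return keys


def leftover_files(keys, download_dir):
    """Map each key to the files under download_dir whose name contains it."""
    root = Path(download_dir)
    found = {}
    for key in keys:
        matches = sorted(
            p for p in root.rglob("*") if p.is_file() and key in p.name
        )
        if matches:
            found[key] = matches
    return found
```

Feeding it the saved container log and the mounted downloads directory would list every file that was logged as deleted but is still on disk. It only reads; nothing is removed.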

onedayfishsale commented 1 year ago

I'm seeing this as well on 0.12.0. Running in Docker and using a host-mounted NFS share for /downloads.

tubesync  | 2023-02-08 08:30:13,091 [tubesync/INFO] Deleting expired media: Munro Live / Ehnjhj8WFG4 (now older than 14 days / download_date before 2023-01-25 13:30:13.091025+00:00)
$ ls *Ehnjhj8WFG4*
2023-01-24_munro-live_100-million-lines-of-code-the-state-of-automotive-software-ces-2023_Ehnjhj8WFG4_1080p-vp9-opus.info.json
2023-01-24_munro-live_100-million-lines-of-code-the-state-of-automotive-software-ces-2023_Ehnjhj8WFG4_1080p-vp9-opus.jpg
2023-01-24_munro-live_100-million-lines-of-code-the-state-of-automotive-software-ces-2023_Ehnjhj8WFG4_1080p-vp9-opus.mkv
2023-01-24_munro-live_100-million-lines-of-code-the-state-of-automotive-software-ces-2023_Ehnjhj8WFG4_1080p-vp9-opus.nfo
$

eocx commented 1 year ago

I am facing the same issue in TubeSync version 0.12.1 running in a Docker container with media files located on the host system.

@meeb, you explained

As per:

https://github.com/meeb/tubesync/blob/main/tubesync/sync/tasks.py#L134

The cleanup_old_media() function is called every time a source is indexed. Media is deleted with the following log message (which will be in the container logs):

                log.info(f'Deleting expired media: {media.source} / {media} '
                         f'(now older than {media.source.days_to_keep} days / '
                         f'download_date before {delta})')

With commit 410906ad8eeec03c34723cda18eba21f8c742cab, the media file deletion was removed from the media_pre_delete() function of tubesync/sync/signals.py. Since this change, the deletion of associated media files takes place in various functions defined in tubesync/sync/views.py for the user-interactive scenarios.

Presumably, the file deletion for the thumbnail, media, NFO and JSON files should now also be explicitly triggered by cleanup_old_media()?
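
To illustrate what such an explicit step could look like, here is a minimal sketch that removes a media file together with its sidecar files. It is not TubeSync's code: delete_media_files is a hypothetical helper, and the suffix list is an assumption taken from the directory listing earlier in this thread.

```python
from pathlib import Path

# Sidecar suffixes taken from the directory listing earlier in this thread;
# the real set depends on the source's settings (assumption).
SIDECAR_SUFFIXES = (".info.json", ".jpg", ".nfo")


def delete_media_files(media_path):
    """Delete a media file and its sidecars; return the paths removed."""
    media_path = Path(media_path)
    base = media_path.stem  # file name without the final extension, e.g. ".mkv"
    candidates = [media_path]
    candidates += [
        media_path.with_name(base + suffix) for suffix in SIDECAR_SUFFIXES
    ]
    removed = []
    for path in candidates:
        # Skip anything already gone so a repeated run is harmless.
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Checking is_file() before unlinking means missing files are not treated as errors, which matters if a user has already cleaned up part of the directory by hand.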

meeb commented 1 year ago

Yeah, file deletion can probably be put back into signals now. If I recall, some logic was moved because some users reported it was erroneously deleting media and causing issues. There are also issues with the tasks firing reliably, which is a much larger ongoing issue tied to attempting to replace the entire tasks system.

meeb commented 1 month ago

This should have been resolved for quite some time. I'll close this for now. Please create a new issue if you still experience this.