meeb / tubesync

Syncs YouTube channels and playlists to a locally hosted media server
GNU Affero General Public License v3.0
1.99k stars, 130 forks

Large channel issue #344

Open amuilyas opened 1 year ago

amuilyas commented 1 year ago

Firstly thanks for such a great program. I am going to be as detailed and succinct as possible and I did trawl through issues and Google to find a solution.

This might just turn into a best practices guide / initial setup for large channels.

  1. Installed MariaDB with the utf8 character set, and TS is happy with hooks
  2. Tested with small 6-video channels and all works as expected
  3. Tried it with Linus Tech Tips (LTT), which attempts to index ~5,000 items
  4. Set it to download 1080p videos within 7 days with a 24-hour index
  5. Left it overnight and no videos downloaded
  6. Going to Source and Tasks in the GUI leads to repeated 502 errors
  7. Updated docker with TUBESYNC_WORKERS=1
  8. Restarted TS and it still leads to 502 errors
  9. Decided to try from scratch: deleted LTT from the CLI via Putty, and after 1 hour it seems to have had no impact
  10. Using adminer, the sync_media table size is still large at 1,911,029,760 (assume 1.9GB?). Seems a lot for essentially no video downloads and just index files?

Attached are the docker compose file, the log, and the CLI command used to delete the channel. In the end I deleted the containers and started from scratch.

tubesync_docker_tube.txt tubesync_logs.txt tubesync_docker_mariadb.txt

meeb commented 1 year ago

What's in your tasks tab in the UI? I'm betting a lot of outstanding tasks. In order of preference tubesync will:

  1. Index a channel to just get the media IDs
  2. Get each media item's metadata
  3. Get each media item's thumbnails
  4. Download any matching and permitted media items

The 1.9 GB database table is probably a bit big but not that excessive: the metadata alone is a 100-300 KB JSON document per media item, so multiply that by 5,000 and you get close to your reported size.
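As a rough check of that estimate, here is a minimal sketch of the arithmetic. It assumes decimal kilobytes and counts only the per-item JSON documents; the function name `metadata_size_range` is made up for illustration.

```python
# Back-of-envelope estimate of metadata storage, using the figures quoted above.
KB = 1000  # assumption: decimal kilobytes; the comment does not say KB vs KiB


def metadata_size_range(items, low_kb=100, high_kb=300):
    """Return (low, high) estimated metadata size in bytes for `items` media items."""
    return items * low_kb * KB, items * high_kb * KB


low, high = metadata_size_range(5000)
print(f"{low / 1e9:.1f} GB to {high / 1e9:.1f} GB")
```

With these assumptions the range comes out at 0.5 GB to 1.5 GB, so the reported 1,911,029,760 bytes sits somewhat above the upper end, in line with "probably a bit big". Table and index overhead may explain part of the gap, but that is a guess.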

As for why you have no media items downloaded yet I'd assume it's because your worker is still slowly chomping through 5000 metadata requests and 5000 thumbnails. The worker(s) are intentionally slow so as not to annoy YouTube. Once the initial index is complete it does (usually) work fine with just adding new content as it comes out, the initial index on massive channels can take a long time though.

davidkylenz commented 1 year ago

I have a ~16GB MySQL database (on a fast NVMe M.2 drive) with over 40 sources including LTT. I needed to allocate 20GB of RAM (or else it would run out and crash without logging) and set TUBESYNC_WORKERS=1, since RAM usage increases per worker. I also schedule a script to run "manage.py reset-tasks" every night, which is probably the part you are missing. I set all sources to only include the last 7 days, but I think it scans everything anyway. I set it to delete after 30 days, but that no longer works, so I run another custom script. I also can't delete or edit large sources any longer due to 502 timeouts, and I've restarted from scratch a few times to get it right.

meeb commented 1 year ago

@davidkylenz that sounds ... pretty awful to work with! There are goals to make tubesync more suitable for larger deployments so hopefully your situation will improve. It obviously wasn't originally tested at that sort of scale and it was more of an itch-scratching project that got popular so it's groaning a bit when used at scale. There's no technical reason it should use anything like that amount of resources, other than the metadata storage size on disk.

amuilyas commented 1 year ago

@davidkylenz thanks for sharing your setup. I am pretty similar in only wanting recent uploads from large channels like LTT, and likewise the GUI kicks out 502 errors when I try to access the channel. I think I need to change my usage: TubeSync for whole-channel archives (of channels with fewer videos), and YouTube-DL to grab videos from large channels where I am only interested in the latest content. That is pretty likely how they are both intended to be used.

I am using 10 year old tech with a Synology 1512+ with 3GB RAM, so highly likely a more current setup would help with the background indexing and heavy lifting. I find it helps to stop all other non-related dockers so TubeSync & YouTube-DL have all the resources on first adding a new channel.

@meeb I have read in the issues about the potential to limit TubeSync indexing on large channels. That will be nice to have, but I think the way forward for now is using both tools for separate tasks. Unless you have other suggestions, using 1 worker and MariaDB is the best tuning I can do aside from better hardware. As always, your work is much appreciated and thanks for your replies.

davidkylenz commented 1 year ago

It works really well once it has been set up, especially with some tweaking in Plex to get the sessions showing correctly, but yeah, I have pushed it well past its limits. It would be even better if the worker/page timeouts were increased to something like 10 minutes, so that I could edit and delete sources again :-)

amuilyas commented 1 year ago

Good to know there is hope for me! I am playing with small video channels now to finalise setup. Plex sorting is what I am playing with now.

I trawled through the previous comments and followed them as best I can, but right now it seems to lump a lot of the videos into the SPECIALS folder and duplicate episode details across incorrect episodes, and there is no season poster (only a nice-to-have).

What guides or plugins are you using for Plex? What naming convention are you using in TubeSync for importing?

Apologies for the questions, but I am literally starting out and trying to solve

Vandekieft commented 1 year ago

I am at the point where I can't delete my one and only source, I am assuming because the channel is too large. I set it to only download within 7 days and delete after 14 days, but it got stuck and never downloaded anything. Then when I try to remove the source, I get a 502 error.

Vandekieft commented 1 year ago

I recommend YouTube-Agent for Plex. It works really well as long as you set up the right naming: CHANNELNAME[CHANNEL_ID] > AnyVideoNameOrINFO[video_ID].ext, and have it not show seasons. It will then pull and display everything perfectly.
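For illustration only, here is a small sketch of that naming pattern. It assumes the ">" means "folder containing file"; the helper `media_path` and the sample values are hypothetical, and no character sanitising is attempted.

```python
from pathlib import PurePosixPath


def media_path(channel_name, channel_id, title, video_id, ext):
    """Build CHANNELNAME[CHANNEL_ID]/Title[video_ID].ext as a relative path."""
    folder = f"{channel_name}[{channel_id}]"      # one folder per channel
    filename = f"{title}[{video_id}].{ext}"       # video ID kept in brackets
    return PurePosixPath(folder) / filename


# Placeholder values, not real channel or video IDs.
print(media_path("ExampleChannel", "UC_example", "Some video title", "abc123XYZ_0", "mkv"))
```

The point of the pattern is that both IDs survive in the path, which is presumably what lets the agent match each file to its channel and video.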

meeb commented 1 year ago

@Vandekieft you can delete large channels using a shell command. See: https://github.com/meeb/tubesync/issues/250

captainnapalm commented 1 year ago

I'm having similar issues with LTT, but the problem is just pulling up the channel source page. I can bring up other channels like MKBHD, which has ~1500 media items, but just trying to edit the LTT source results in a 502 Bad Gateway.

tubesync           | 2023-05-26T14:50:35.829059003Z 192.168.1.253 - - [26/May/2023:08:50:35 -0600] "GET /static/styles/tubesync.css HTTP/1.1" 200 34554 "http://192.168.1.234:4848/tasks" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:36.090708642Z 192.168.1.253 - - [26/May/2023:08:50:36 -0600] "GET /static/fonts/roboto/roboto-regular.woff HTTP/1.1" 304 0 "http://192.168.1.234:4848/static/styles/tubesync.css" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:36.093333208Z 192.168.1.253 - - [26/May/2023:08:50:36 -0600] "GET /static/fonts/fontawesome/fa-solid-900.woff2 HTTP/1.1" 304 0 "http://192.168.1.234:4848/static/styles/tubesync.css" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:36.095714096Z 192.168.1.253 - - [26/May/2023:08:50:36 -0600] "GET /static/fonts/roboto/roboto-bold.woff HTTP/1.1" 304 0 "http://192.168.1.234:4848/static/styles/tubesync.css" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:36.098544950Z 192.168.1.253 - - [26/May/2023:08:50:36 -0600] "GET /static/fonts/fontawesome/fa-regular-400.woff2 HTTP/1.1" 304 0 "http://192.168.1.234:4848/static/styles/tubesync.css" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:36.101151382Z 192.168.1.253 - - [26/May/2023:08:50:36 -0600] "GET /static/fonts/fontawesome/fa-brands-400.woff2 HTTP/1.1" 304 0 "http://192.168.1.234:4848/static/styles/tubesync.css" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:44.869405829Z [2023-05-26 08:50:44 -0600] [314] [CRITICAL] WORKER TIMEOUT (pid:5419)
tubesync           | 2023-05-26T14:50:45.601304574Z 2023/05/26 08:50:45 [error] 339#339: *219 upstream prematurely closed connection while reading response header from upstream, client: 192.168.1.253, server: _, request: "GET /source/d06dc35c-c699-4003-bbad-00ec47f24478 HTTP/1.1", upstream: "http://127.0.0.1:8080/source/d06dc35c-c699-4003-bbad-00ec47f24478", host: "192.168.1.234:4848", referrer: "http://192.168.1.234:4848/sources"
tubesync           | 2023-05-26T14:50:45.601369196Z 192.168.1.253 - - [26/May/2023:08:50:45 -0600] "GET /source/d06dc35c-c699-4003-bbad-00ec47f24478 HTTP/1.1" 502 150 "http://192.168.1.234:4848/sources" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:45.601585042Z [2023-05-26 08:50:45 -0600] [5419] [INFO] Worker exiting (pid: 5419)
tubesync           | 2023-05-26T14:50:45.752811310Z 192.168.1.253 - - [26/May/2023:08:50:45 -0600] "GET /favicon.ico HTTP/1.1" 499 0 "http://192.168.1.234:4848/source/d06dc35c-c699-4003-bbad-00ec47f24478" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:50:45.763155756Z [2023-05-26 08:50:45 -0600] [5443] [INFO] Booting worker with pid: 5443
tubesync           | 2023-05-26T14:51:09.646757902Z 192.168.1.253 - - [26/May/2023:08:51:09 -0600] "GET /sources HTTP/1.1" 200 3280 "http://192.168.1.234:4848/tasks" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:51:46.835988086Z [2023-05-26 08:51:46 -0600] [314] [CRITICAL] WORKER TIMEOUT (pid:5416)
tubesync           | 2023-05-26T14:51:47.439612505Z [2023-05-26 08:51:47 -0600] [5416] [INFO] Worker exiting (pid: 5416)
tubesync           | 2023-05-26T14:51:47.442819597Z 2023/05/26 08:51:47 [error] 339#339: *219 upstream prematurely closed connection while reading response header from upstream, client: 192.168.1.253, server: _, request: "GET /source/d06dc35c-c699-4003-bbad-00ec47f24478 HTTP/1.1", upstream: "http://127.0.0.1:8080/source/d06dc35c-c699-4003-bbad-00ec47f24478", host: "192.168.1.234:4848", referrer: "http://192.168.1.234:4848/sources"
tubesync           | 2023-05-26T14:51:47.442955503Z 192.168.1.253 - - [26/May/2023:08:51:47 -0600] "GET /source/d06dc35c-c699-4003-bbad-00ec47f24478 HTTP/1.1" 502 150 "http://192.168.1.234:4848/sources" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
tubesync           | 2023-05-26T14:51:47.571274801Z [2023-05-26 08:51:47 -0600] [5453] [INFO] Booting worker with pid: 5453

perry-mitchell commented 1 year ago

Sorry to hijack this, but I'm getting similar issues. I don't understand why it has to index the entire channel (like LTT for instance) before downloading. If a cut-off is set for how old videos can be, can't it get this information from the YT API and skip indexing older items altogether? :)

meeb commented 1 year ago

TubeSync doesn't use the YouTube API. It uses yt-dlp which itself basically scrapes the front end of YouTube. The way TubeSync works is it "flat" indexes just the video IDs, so every video on a channel or playlist. This information is literally just the list of media item IDs. Then it has to, one at a time, get the metadata for each media item by looking it up by its ID. It's the metadata that contains the publish date. There is no easy channel/search?from=X&to=Y style interface.

Therefore, you need to a) get a list of every single media item and b) index each media item to get its metadata before you can determine what is in and what is not within a date range. You could do hacks (arbitrarily stop indexing after X items) but as the items are not guaranteed to be in order the only way to actually index media items within a date range is to get the metadata for each item.

This method is already the "low number of requests to YouTube" method and reduces error rates, but it has been a year or so since I last checked whether any additional information is now available that might help optimise this.
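The two steps described above can be sketched as a small simulation. This is not TubeSync's or yt-dlp's actual code: `FAKE_CHANNEL`, `flat_index`, `fetch_metadata` and `ids_within` are hypothetical stand-ins, and the in-memory dictionary takes the place of requests to YouTube.

```python
from datetime import date, timedelta

# Hypothetical stand-in for a channel: media ID -> publish date.
# In the real flow the date is only reachable one item at a time.
FAKE_CHANNEL = {
    "vid_c": date(2023, 5, 25),
    "vid_a": date(2019, 1, 10),
    "vid_b": date(2023, 5, 20),
}


def flat_index(channel):
    """Step a: return only the media IDs, with no dates and no guaranteed order."""
    return list(channel)


def fetch_metadata(channel, media_id):
    """Step b: one lookup per media ID; only this exposes the publish date."""
    return {"id": media_id, "published": channel[media_id]}


def ids_within(channel, today, days):
    """Select IDs published within `days` of `today`, counting lookups made."""
    cutoff = today - timedelta(days=days)
    selected, lookups = [], 0
    for media_id in flat_index(channel):
        meta = fetch_metadata(channel, media_id)
        lookups += 1
        if meta["published"] >= cutoff:
            selected.append(media_id)
    return selected, lookups


selected, lookups = ids_within(FAKE_CHANNEL, today=date(2023, 5, 26), days=7)
print(selected, lookups)
```

Only two of the three items fall inside the 7-day window, yet all three needed a metadata lookup, because the flat list alone carries no dates and the old item sits between the two recent ones.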

perry-mitchell commented 1 year ago

Got it, thanks for the explanation @meeb. So as I understand it, every time it does an index, it has to scrape the whole channel again?

In the common case mentioned here, LTT, that's almost 6K videos. I'm just wondering if there's something I'm missing here or it's just a limitation of yt-dlp that we have to deal with until a better way is found, if ever..

meeb commented 1 year ago

No, a full index only occurs once when you add a playlist or channel. Each time the indexer runs again it does a "flat" index again and indexes new media item IDs until it hits an ID which it has already indexed, so just new media items basically. It then looks up the new media item metadata only.
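A minimal sketch of that "stop at the first known ID" behaviour, assuming the flat index lists the newest items first. The function `new_ids` and the sample IDs are hypothetical, not TubeSync's implementation.

```python
def new_ids(flat_ids, known_ids):
    """Walk the flat index from the top and stop at the first already-indexed ID."""
    fresh = []
    for media_id in flat_ids:
        if media_id in known_ids:
            break  # everything from here on is assumed to be indexed already
        fresh.append(media_id)
    return fresh


known = {"vid_3", "vid_2", "vid_1"}                           # indexed on earlier runs
latest_flat = ["vid_5", "vid_4", "vid_3", "vid_2", "vid_1"]   # assumed newest first
print(new_ids(latest_flat, known))
```

Only the IDs returned here would then need a metadata lookup, which is why later index runs are much cheaper than the first one.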