hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Lag and downloader death introduced between 448 and 452 #971

Open alidan opened 3 years ago

alidan commented 3 years ago

Hydrus version

started on 447 with issues cropping up on 451 and 452

Operating system

Windows other (specify in comments)

Install method

Installer

Install and OS comments

Windows 7 64-bit, 64GB of RAM

Bug description and reproduction

  1. I started at 447 with a 42 million weight session
  2. I updated straight to 451
  3. downloading wasn't working at all there
  4. 452 pre-hotfix was still broken, with a bit of lag
  5. 452 post-hotfix added so much lag the program was hard to use, along with downloaders jamming even after resetting 20+ times
  6. I removed the session (saved it off) and tested with a full new set of downloaders, fully loaded; they did not fail. So my guess is something in the update may have borked the old downloaders, or possibly something in the client - 2021-08.log here causes downloaders to fail with heavy-weight sessions.

Log output

Ok, a few things I want to point out: there was a bug during 447 which deleted files instead of moving them to trash. It also noticed one of the repository files was missing and moved to recheck them all; these files are ones I removed manually a long time ago because I needed the space and no longer needed the files.

between the move to 452 and the trimmed session, I managed to cull the session from 42 million down to roughly 27 million weight; this did nothing for the lag

not sure how useful the log will be, but I hope it helps narrow down the issue.

alidan commented 3 years ago

Ok, the problem with downloaders jamming up has happened again. I cut the session, restarted building it, and tried to load-test it, essentially adding every single 4chan thread I had a passing interest in. Initially everything added fine and was a non-issue, but once the session crept up to around 1.5-2 million weight, hydrus needs constant restarts. I removed the old downloader I was using and tried again in 453, but the issue seems to persist even without it.

alidan commented 3 years ago

version 454 still has downloaders jamming; however, instead of pending, they now seem to jam at initializing. client_2021-09-08_23-21-15

my current session weight is 4.4 million and I have loaded it as hard as I can with downloaders, short of creating a need to remove everything they downloaded. A good chunk of this is load for load's sake, to test; another chunk is because I don't like getting rid of watchers, since having them grouped by theme is too useful for parsing.

there are in total 1594 watchers: 972 watchers are "active" (threads that may still be going or just haven't 404'd yet) and 599 are dead threads.

I think a good chunk of these were pending when I updated, seeing as I woke up, updated, and went back to sleep, so everything may have just failed due to being hammered with requests. It may have gotten stuck elsewhere, not sure; I'm going to restart-cycle this until everything is no longer pending or downloading and see if it still gets stuck.

alidan commented 3 years ago

Ok, fully rechecked. That was far less painful than it was with the prior version, where it could stop almost immediately. I have had a few downloader hangs, but now that everything is clear, this will be the fairest test.

alidan commented 3 years ago

5 hours since: none of the watchers have stopped checking; however, some things are stuck on initializing. client_2021-09-09_06-54-38 Overall this is far better than it was before 454. Going to stress test it a bit in a few days, when I can load up a lot of threads, and see how it handles that.

alidan commented 3 years ago

ok, as of about 40 minutes ago, things were still moving; however, these two hangs have now caused watchers to start hanging on pending. Restarting and seeing when this happens next.

alidan commented 3 years ago

ok, after adding a few watchers, I think I may have just been incredibly lucky with how smooth things were. It still feels marginally easier to recover pending watchers, and they may not fall into a pending loop as easily; I'm unsure.

alidan commented 3 years ago

ok, now, after a few days of not getting a watcher hang or a download getting stuck, I added around 200-250 watchers as a stress test, and nothing hanged. I'm not 100% sure what happened, because I was getting hangs before, but now it seems to be going fine.

hydrusnetwork commented 3 years ago

Hey, I apologise for not responding to this thread earlier. I have used your report while working on parts of this problem for several weeks now. I hope tomorrow's release will have another improvement.

This 'getting stuck on "initialising..."' while also having a position of 'running' is an odd one. It seems that these jobs are being scheduled and are getting through bandwidth and login checks without any delays, but once they try to actually make a connection and start sending bytes, nothing happens. Could be a thread deadlock, could be some OS level network problem causing huge delays, could be something else.

Although your latest posts say things are better, I will keep working in and around this code. Some of these downloader scheduling problems are due to a legacy bottleneck. I do have a plan to completely remove this, and allow hundreds of downloaders to operate in parallel, but it will be a slightly larger rewrite of the core import job.
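The parallel-scheduling plan described above could be sketched roughly as a semaphore-bounded worker pool: many import jobs exist, but only a limited number hold a network "slot" at once, so a slow job cannot serialize everything behind it. This is a generic illustration in Python (Hydrus's language); none of the names below come from the Hydrus codebase.

```python
import threading

# Hypothetical sketch: bounded-concurrency downloader scheduling.
# Only MAX_PARALLEL_DOWNLOADS jobs may hold a network slot at once.
MAX_PARALLEL_DOWNLOADS = 4
download_slots = threading.Semaphore(MAX_PARALLEL_DOWNLOADS)

results = []
results_lock = threading.Lock()

def run_import_job(job_id):
    with download_slots:  # block until a slot is free, then hold it
        # ...a real job would connect and stream bytes here...
        with results_lock:
            results.append(job_id)

# Launch 20 jobs; at most 4 run their network phase concurrently.
threads = [threading.Thread(target=run_import_job, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # prints 20: every job eventually completes
```

The point of the semaphore is that adding hundreds of watchers only lengthens the queue; it never deadlocks the pool, because each job releases its slot when it exits the `with` block.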

alidan commented 3 years ago

Ok, like I said in one of my responses, I have absolutely no idea why, when I first moved to 454, I was still getting issues, but after I think 2 sticking points (about 12 restarts in total) I haven't had a single issue in about 19 days, and this is after dumping enough files into downloaders to pull ~50GB of files. I have since stopped stressing it, but it seems the problem no longer exists. My only real thought was that, if this issue did not go away, a fallback of 'if x takes y long, assume dead and free up/add additional slots' might help, but it seems that's not needed anymore.
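The fallback suggested above ('if x takes y long, assume dead and free up the slot') is essentially a watchdog timeout around each network job. A minimal sketch, assuming nothing about how Hydrus actually structures its jobs (the function names and timeout value here are made up for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

JOB_TIMEOUT_SECONDS = 0.2  # hypothetical "y": how long before we assume a job is dead

def fetch(url, delay):
    # stand-in for a network call that may hang indefinitely
    time.sleep(delay)
    return f"done: {url}"

def run_with_watchdog(url, delay):
    # Run the job, but give up (freeing the slot for other work)
    # if it exceeds the timeout instead of waiting forever.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, url, delay)
        try:
            return future.result(timeout=JOB_TIMEOUT_SECONDS)
        except TimeoutError:
            future.cancel()
            return f"assumed dead: {url}"

print(run_with_watchdog("thread/123", delay=0.01))  # prints "done: thread/123"
print(run_with_watchdog("thread/456", delay=1.0))   # prints "assumed dead: thread/456"
```

One caveat this sketch shares with the real problem: a truly wedged thread cannot be forcibly killed in Python, so "assume dead" only frees the scheduling slot; the stuck thread itself lingers until its blocking call returns, which is presumably why a proper fix in the import pipeline is preferable to a watchdog.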