johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License
923 stars 133 forks

Slows down after 20,000+ posts #361

Open KajinStyle opened 5 years ago

KajinStyle commented 5 years ago

I have noticed that with larger blogs that have upwards of 20k or more posts/images to download, it begins to drag its feet. You can especially see it around the 50k mark.

At first I thought it was my internet, or just Tumblr itself. However, I kept noticing that 20k was the magic number no matter the blog. I could stop any blog that is around the 50k mark, start up a fresh new blog, and it would rapidly download everything up to a point.

I am wondering if this may have something to do with Windows itself and how many files it can store in one folder, or just the program chugging along because it has to compare so many files at once.

johanneszab commented 5 years ago

Hmm, interesting. I've no idea; I could never have seen this, because my connection at home is rather crappy (16 Mbit/s), so it always downloads at full speed for me, even for blogs with 300,000+ posts.

Maybe someone else can look into this and see if it's related to the folder lookup or some internal data structure.

KajinStyle commented 5 years ago

My internet can reach around 100 Mbps on good days. With TumblThree it could hit up to 50 Mbps, which is great, but it could also mean I might be hitting some cap.

My settings are:

- Concurrent connections: 100
- Concurrent video connections: 5
- Concurrent blogs: 5
- Scan connections: 100
- Timeout: 60
- Limit Tumblr API connections: 100 connections per 60 seconds

At these settings I rarely see it give me a too-many-connections error, but it can happen once or twice within a day of running it nonstop. So it could be a connection issue, or maybe I am trying to download too much too fast and Tumblr is capping me?

The only thing I am certain of is that it downloads the first 20k, 30k, even 40k without effort. Afterward it starts to slow down and continues to worsen over time. An 80k blog can take 5-6 hours, while a 20k-30k blog is done in 20-30 minutes.

KajinStyle commented 5 years ago

So I figured it out....

I was having it download image metadata. I don't know why that option was checked, but it was, and the metadata was being written to a text file. That text file got massive, reaching almost a million lines for just one blog that had already downloaded around 190k files.

Once I edited the file down to a single entry, the download speed picked up dramatically, but after a while the speed would drop again. Disabling image metadata removed the slowdown entirely.
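
(Editor's note: a minimal sketch of why a single ever-growing metadata file could produce exactly this pattern. This is not TumblThree's actual code; it only illustrates that if the whole metadata collection is re-serialized to disk after every downloaded post, each write costs roughly as much as the current file size, so total I/O grows quadratically with post count. Appending only the new entry keeps each write constant-size, which also matches why trimming the file to one entry temporarily restored speed.)

```csharp
// Hypothetical illustration only -- names and structure are assumptions,
// not taken from the TumblThree source.
using System.Collections.Generic;
using System.IO;

class MetadataWriterSketch
{
    private readonly List<string> entries = new List<string>();

    // Slow pattern: rewrite the entire file on every new entry.
    // Write cost grows with the number of entries already stored,
    // so n posts cost O(n^2) total disk I/O.
    public void AddAndRewriteAll(string path, string entry)
    {
        entries.Add(entry);
        File.WriteAllLines(path, entries);
    }

    // Cheaper pattern: append only the new entry.
    // Each write is constant-size regardless of how large the file is.
    public void AddAndAppend(string path, string entry)
    {
        entries.Add(entry);
        File.AppendAllText(path, entry + System.Environment.NewLine);
    }
}
```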

johanneszab commented 5 years ago

Thanks a lot @KajinStyle for sharing your results and investigating this behavior!

I can imagine that this happens and that it is the correct conclusion/cause. How would you like to see it fixed?

Maybe we could split the huge file into smaller separate files and put them in corresponding folders for, say, conversations/quotes/.., or use some kind of SQLite-like database and put everything in there. Or we could write text file changes in larger batches, which would reduce the number of disk operations while increasing the throughput per write.
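
(Editor's note: a minimal sketch of the "write in larger batches" idea, assuming metadata lines can simply be buffered in memory and appended to the existing text file. The class and member names below are illustrative, not part of TumblThree's API.)

```csharp
// Sketch: buffer metadata lines and flush them to disk every BatchSize
// entries (and once at the end), so disk operations drop from one per
// post to one per batch.
using System;
using System.Collections.Generic;
using System.IO;

class BatchedMetadataWriter : IDisposable
{
    private const int BatchSize = 500;          // tuning knob, chosen arbitrarily here
    private readonly string path;
    private readonly List<string> buffer = new List<string>();

    public BatchedMetadataWriter(string path) => this.path = path;

    public void Add(string line)
    {
        buffer.Add(line);
        if (buffer.Count >= BatchSize)
            Flush();
    }

    public void Flush()
    {
        if (buffer.Count == 0) return;
        File.AppendAllLines(path, buffer);      // one disk write per batch
        buffer.Clear();
    }

    public void Dispose() => Flush();           // make sure the last partial batch lands on disk
}
```

With this kind of batching, the number of writes is bounded at roughly one per BatchSize posts, at the cost of losing at most one unflushed batch if the crawl is killed before Dispose/Flush runs.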