johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

Tumblr now limits access to its version 1 api. #26

keokitsune closed this issue 7 years ago

keokitsune commented 7 years ago

A few days ago I started noticing that after beginning a crawl, it runs for a few minutes before a bunch of the blogs light up red as offline and the crawl is aborted without notice. If I close the program and reopen it, I can resume crawling for a few more minutes before it happens again. When I click on an offline blog and select "go to website", the blog is still up. I also noticed that a bunch of my "number of downloads" and "downloaded" counts are massively out of sync: it will say I have 1000 images downloaded, which is correct, but the "number of downloads" says there are only 100 available, which is incorrect. I've tried reverting back a few releases, but it still gives the same issue. Is there a way to quickly refresh the entire blog list other than removing and re-adding every blog manually?

Linktydraa commented 7 years ago

I am having the same issue as well.

Taranchuk commented 7 years ago

I can confirm this too. It is impossible to download hundreds of blogs in the queue, and this also applies to the mode that downloads only the meta-data without images. It seems that Tumblr's network algorithms have changed. In addition, the program also fails to download several images from the blogs; it happens randomly with different skipped files.

johanneszab commented 7 years ago

Looks like they are rate limiting the older api now too. Maybe too many people are downloading with TumblThree now and they noticed an increased load on their api servers. Maybe it's just temporary, or maybe they are starting to discontinue the older api version and its infrastructure. Who knows ..

After a while (roughly ~300 connections), all further connections are closed with:

The remote server returned an error: (429) Limit Exceeded.

Linktydraa commented 7 years ago

Is there a way for this issue to be resolved?

johanneszab commented 7 years ago

Sure

No quick fix from me though. It's probably also worth waiting to see if it's just temporary.

Linktydraa commented 7 years ago

Sorry, but I don't really understand the technical jargon of bullets two and three. What do you mean by them?

Linktydraa commented 7 years ago

for instance, what is "api"?

Linktydraa commented 7 years ago

Also, I have a problem in the queue section. I'll have the setting to download, let's say, 5 blogs at a time, and when it first starts, it does well downloading 5 blogs, but then as blogs finish, it doesn't keep going and download the next ones. I'll have 70 blogs in the queue, but it stops downloading after maybe 10 blogs finish.

keokitsune commented 7 years ago

There's no current way to reset all my blogs without having to remove and re-add every single one manually? I have around 800 and I'd rather not do it manually if possible.

Taranchuk commented 7 years ago

keokitsune, copy all the filenames of the .tumblr files in the index folder and paste them into Notepad++ or another text editor. Go to "find and replace", enable regular-expression mode, put "^" in the find field and "http://" in the replace field and perform a replace, then put "$" in the find field and ".com" in the replace field and perform another replace. Then remove all the blogs in the program, copy all the text from the text editor, and click the "copy from the clipboard" button; the program will add them all to the blog manager automatically.
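
With hundreds of blogs, the same procedure can also be scripted. Here is a minimal C# sketch of the idea; the index folder path is a placeholder you would adjust to your installation, and the <blogname>.tumblr file naming is taken from Taranchuk's description:

using System;
using System.IO;
using System.Linq;

class IndexToUrls
{
    static void Main()
    {
        // Placeholder path; point this at TumblThree's index folder.
        var indexFolder = @"C:\TumblThree\Index";

        // "example.tumblr" -> "http://example.tumblr.com", mirroring the
        // ^ -> "http://" and $ -> ".com" regex replacements described above.
        var urls = Directory.GetFiles(indexFolder, "*.tumblr")
                            .Select(Path.GetFileName)
                            .Select(name => "http://" + name + ".com");

        // Paste the printed list into TumblThree via "copy from the clipboard".
        Console.WriteLine(string.Join(Environment.NewLine, urls));
    }
}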

johanneszab commented 7 years ago

There's no current way to reset all my blogs without having to remove and re-add every single one manually? I have around 800 and I'd rather not do it manually if possible.

Don't delete them! I've already added code that stores the blog settings in plain text. You can now open the files in your favorite text editor and change anything.

There is also a "check blogs at startup" option in the settings window which checks each blog during startup.

But none of this matters if the download is broken, which it currently is, until someone fixes the code to stay under the rate limit of the Tumblr servers.

keokitsune commented 7 years ago

So this is something on Tumblr's server end and not something wrong with my blog list?

johanneszab commented 7 years ago

So this is something on Tumblr's server end and not something wrong with my blog list?

Right. They seem to have added a request limit. We are simply hitting the servers too hard during the "evaluation" (the grabbing of links), with too many connections per time period, so they block all further incoming connections once a threshold (connections per unit of time) is exceeded. But there is nothing in the code right now to handle this ..
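
A minimal sketch of what handling that 429 could look like, assuming plain HttpWebRequest access to the api; the backoff delays and names are illustrative, not TumblThree's actual code:

using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

static class RateLimitedFetch
{
    // Fetch a url, retrying with exponential backoff whenever the server
    // answers 429. After maxRetries the WebException propagates to the caller.
    public static async Task<string> GetWithBackoffAsync(string url, int maxRetries = 5)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                using (var response = (HttpWebResponse)await request.GetResponseAsync())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (WebException ex) when (
                (ex.Response as HttpWebResponse)?.StatusCode == (HttpStatusCode)429
                && attempt < maxRetries)
            {
                // Back off 2s, 4s, 8s, ... before the next try.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt + 1)));
            }
        }
    }
}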

johanneszab commented 7 years ago

Interesting. Accessing the api from my virtual server, with a static public ip, dns record and so on, isn't rate limited at all.

When I use ab (Apache Bench) and hit the api with 20,000 connections in total, 200 in parallel, nothing happens and the server happily answers everything. Just like it has been for normal people before mid-February.

$ ab -n 20000 -c 200 -v 4 http://demo.tumblr.com/api/read

Maybe there is something else going on ..

kingbode commented 7 years ago

Hi Zab, "am kingbode who commented on your website yesterday , am here trying to help and learn") I started with TumleOne as I mentioned earlier, and then found your app, actually tumble One still working smoothly with me , although it download one file at a time, and not in parallel downloading like what you may achieved, for me your app did not work even for once !!! I don't know why?, and showed me offline under for the blog, I joined github and I will be happy to participate, although you are using WPF and it looked annoying for me as I 'm not used to it + loading your code in my VS gives me many errors like crl-namespace, <some Strings "!! I don't remember as I'm not on my home PC " not within "....TumbleThree.properites !! may be reference issue.

the other point of limits if any , could be solved // and I'm not sure , just guessing// by using proxy IPs to change the IP of the client to deceive the Tumblr Server, it works for me with another website limits downloading images per day!!!

as you mentioned , that may be there is no limiting at all !! which may be true, as I downloaded using TumbleOne ( one file at a time ) what may exceed 20GB in one day form one blog

so the problem is may be when the application try to overwhelm Tumblr Server by parallel downloading with multiple blogs at a time, which may looks like DOS ( denial of service attack ) so the IP may be blocked !! temprarily, please correct if I'm wrong
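
A minimal sketch of the proxy idea kingbode mentions above, assuming plain HttpWebRequest access; the proxy address is a placeholder, not a real endpoint:

using System;
using System.IO;
using System.Net;

class ProxyExample
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://demo.tumblr.com/api/read");

        // Route the request through an HTTP proxy so Tumblr sees a
        // different client IP. Substitute a proxy you actually control.
        request.Proxy = new WebProxy("http://127.0.0.1:8888");

        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

Whether proxying actually helps depends on how Tumblr keys its limit; if it limits per IP, rotating proxies would sidestep it, but that is speculation on top of kingbode's guess.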

johanneszab commented 7 years ago

I've told you already that it's working for me and for many other people. Tell me more about it: your operating system version, .NET version, and any additional information you can get. Since you've apparently installed Visual Studio, it should be easy for you to gather more information on why it's not working. Just use the debugger.

As a programmer or someone who wants to become one, you should know that:

did not work even once for me!

is nothing you want to hear as a programmer. I cannot do anything useful with that information. I'm sorry if this sounds harsh, but that's how it is. If I cannot reproduce your error, I cannot fix it. There is no need for you to run around and complain everywhere. I would help you, but with your given information, I am simply unable to do so.

Since TumblOne uses basically the same method to access the website, there must be some difference in your system.

As for the limit: there obviously is one, as many people here have already reported. And I wouldn't implement a limiter if there were no need for one. 20GB in one day is nothing; it equals about 0.25 MB per second. Whether you call that fast is up to you.

kingbode commented 7 years ago

Hi Zab, sorry if my silly comment bothered you, but this is truly what happened for me. It surprised me too, since I know you reconstructed Helena's app and added features. When I tried it, it didn't work properly; I tried the exe file, not the code, and it showed the blog as offline. Even today I tried another version (TumblThree v1.0.5.1): the app tries to scan, but no files are downloaded, and it sometimes doesn't work properly. My system is Windows 7 with .NET 4.5. What kept me from debugging is that I was working on a TumblOne reconstruction. Anyway, I'm here to help if I can, not to bother anyone; I'm just describing what happened to me.

And now I'm ready to debug your code, so forgive my previous aggressive start. Starting from v1.0.7:

For the TumblTwo 1.0.7 forms application: it loads fine in VS and runs with no errors, but it shows the blog as offline. When I click "start crawl", it checks the blog, the blog becomes online, and then nothing happens. I broke into the debugger and found the code sitting at Task.Delay(4000, ct).Wait(); debugging further, I found that it's stuck in the while (true) loop and "bin" is 0, which is why nothing gets downloaded.

So why is the bin list of blogs empty and not loaded with any blog?

And lblProcess.Text also does not contain an activeBlog, although:

lblProcess.Text = "Crawling Blogs -- " + String.Join(" - ", activeCrawlList.Select(activeBlog => activeBlog.Name).ToArray());

for (int i = 0; i < Properties.Settings.Default.configSimultaneousDownloads; i++)
    taskList[i] = Task.Run(() => RunCrawler(bin, cancellation.Token, pause.Token));

Regarding the 20GB: I didn't mean speed, I meant that the download ran flawlessly and nothing limited downloading 18,000+ files from one blog.

I will work more on debugging to discover why bin is not loaded with the list of blogs.
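
For readers following along, here is a guess at the shape of the loop kingbode describes, assuming "bin" is the queue of blogs handed to each crawler task. The names come from his comment; the structure is an assumption, not TumblTwo's actual code:

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class CrawlerLoopSketch
{
    // If bin is never filled (no blogs added to the queue), this loop
    // polls forever at Task.Delay(4000, ct).Wait() and nothing downloads:
    // exactly the state described above.
    static void RunCrawler(BlockingCollection<string> bin, CancellationToken ct)
    {
        while (true)
        {
            ct.ThrowIfCancellationRequested();

            string blog;
            if (!bin.TryTake(out blog))
            {
                Task.Delay(4000, ct).Wait();
                continue;
            }

            // ... crawl `blog` here ...
        }
    }
}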

kingbode commented 7 years ago

Follow-up: it's my mistake! I found that I have to add the blog and then add it to the queue; now the crawl is working ....

You should guide us by disabling the crawl until the queue is filled (a sketch of this follows below) :octocat:

Now jumping to TumblThree: it's working too, great!! But where is the image preview? It comes and goes.

But I paused and stopped, and images are still downloading ... it may be because they are already in the download list!
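
A minimal WPF sketch of the "disable crawl until the queue is filled" suggestion above. CrawlCommand and its delegates are illustrative names, not TumblThree's actual members:

using System;
using System.Windows.Input;

class CrawlCommand : ICommand
{
    private readonly Func<int> getQueueCount;
    private readonly Action startCrawl;

    public CrawlCommand(Func<int> getQueueCount, Action startCrawl)
    {
        this.getQueueCount = getQueueCount;
        this.startCrawl = startCrawl;
    }

    public event EventHandler CanExecuteChanged;

    // WPF greys out any button bound to this command while the queue is empty.
    public bool CanExecute(object parameter) => getQueueCount() > 0;

    public void Execute(object parameter) => startCrawl();

    // Call this whenever blogs are added to or removed from the queue.
    public void RaiseCanExecuteChanged() =>
        CanExecuteChanged?.Invoke(this, EventArgs.Empty);
}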