johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

Some hidden Tumblr blog posts cannot be parsed #217

Open thornate opened 6 years ago

thornate commented 6 years ago

I have two hidden tumblrs added to TumblThree. One of them loads fine, so authentication isn't an issue. The other sits in the queue with a status message 'Evaluated n tumblr blog sites' where 'n' counts from 1-4. I can see that another message appears very quickly then disappears after the 4th count. Is that logged anywhere?

I saw on a different issue that the 'Download reblogged posts' checkbox must be checked for hidden posts, so that's not an issue.

I have upgraded to version 1.0.8.45. I'm running Windows 7.

Can you please give suggestions on how to fix this, or debug it further?

johanneszab commented 6 years ago

Sounds like you did everything right.

Could you clear your cookies once, do the authentication again, and then try the non-working blog again? It is a workaround for people where the authentication seems to have no effect the first time (#180, #210).

Could you also post the blog's URL or send it to me by email? Then I'll check it later today and see if I can reproduce the error. Maybe there is a special character or something that cannot be correctly parsed.

I saw on a different issue that the 'Download reblogged posts' checkbox must be checked for hidden posts, so that's not an issue.

That should already be fixed since v1.0.8.41+.

thornate commented 6 years ago

I tried clearing the cookies and re-authenticating but it didn't work. I'll email you the blog url.

johanneszab commented 6 years ago

Thanks for the blog.

At least one of the posts (I think there are two in total within the 44 posts) cannot be parsed, and the error message in Visual Studio is simply:

"Expecting state 'Element'.. Encountered 'Text'  with name '', namespace ''. "

which is quite useless. I'll eventually fix the problem, I think, but I don't have time to investigate this further right now. It might be a weird emoji, as I've noticed there are some, or just some text that messes up the JSON, so that it cannot be parsed correctly anymore.

As a workaround, in the Details tab you can set the Posts per page down to 1 instead of 50 for this specific blog. TumblThree will then crawl only one post per page (request), and hence parse one post at a time and discard only the two posts that cannot be processed correctly. This way the majority of the blog is still downloadable. The only downside is slower crawling, but at least it will work.
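The effect of this setting can be sketched as follows (a Python sketch, not TumblThree's actual C# code; `parse` is a hypothetical stand-in for the post parser): a parse failure discards the whole page it occurs in, so with one post per page only the broken posts themselves are lost.

```python
def parse(post):
    # Hypothetical stand-in for the real post parser: raises on the
    # posts the JSON deserializer cannot handle.
    if post.get("broken"):
        raise ValueError("cannot parse post")
    return post["id"]

def crawl(posts, posts_per_page):
    """Parse posts page by page; a parse error discards the whole page."""
    ok, lost = [], []
    for i in range(0, len(posts), posts_per_page):
        page = posts[i:i + posts_per_page]
        try:
            parsed = [parse(p) for p in page]   # all-or-nothing per page
        except ValueError:
            lost.extend(p["id"] for p in page)  # whole page discarded
        else:
            ok.extend(parsed)
    return ok, lost

# 44 posts, two of which (ids 17 and 31) cannot be parsed:
posts = [{"id": n, "broken": n in (17, 31)} for n in range(44)]
```

With `posts_per_page=50` all 44 posts land in one failing page and nothing is downloaded; with `posts_per_page=1` only the two broken posts are lost.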

thornate commented 6 years ago

Looks like it worked. Thanks!

johanneszab commented 6 years ago

I'll add a notification when this happens, so that one can react by decreasing the number of posts per page.

Kvothe1970 commented 6 years ago

The workaround works for now. Excellent. I have to check something, because I've now noticed that some blogs get crawled in their entirety every single time, despite the settings not saying "Force Rescan", which does not make sense. This is noticeable now, with the crazy time crawling takes at 1/50 speed for these blogs.

johanneszab commented 6 years ago

I have to check something, because I've now noticed that some blogs get crawled in their entirety every single time, despite the settings not saying "Force Rescan", which does not make sense.

It's not implemented in the hidden Tumblr crawler. It's probably possible though.

If the API limit is reached in the normal, non-hidden Tumblr blogs, it doesn't save the last crawled post ID either. It might need an update now: if a download fails because of a timeout (which I'm still not 100% satisfied with), it maybe also doesn't save the last crawled post ID and hence recrawls everything.
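The resume logic described above can be sketched like this (a hypothetical Python sketch, not TumblThree's actual C# crawler; names and the `None`-page failure marker are made up): the crawler walks pages newest-first, stops when it sees the post ID saved by the previous run, and only advances that saved ID when no request failed.

```python
def crawl(pages, state):
    """Pages arrive newest-first. Stop once the post ID saved by the
    previous run is seen; persist a new resume ID only if no request
    failed (API limit, timeout), otherwise the next run rescans
    everything from the top."""
    last_id = state.get("last_id")
    newest_id, failed, done = None, False, False
    for page in pages:
        if page is None:                 # stand-in for a failed request
            failed = True
            continue
        for post_id in page:
            if post_id == last_id:       # reached already-crawled posts
                done = True
                break
            if newest_id is None:
                newest_id = post_id      # remember the newest post seen
        if done:
            break
    if newest_id is not None and not failed:
        state["last_id"] = newest_id
    return state
```

Without the final guard on `failed`, a partially failed crawl would advance `last_id` past posts that were never downloaded; without persisting `last_id` at all (the hidden-crawler case), every run starts from the top.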

Kvothe1970 commented 6 years ago

Is there anything I can do to help confirm this? It might also be that some of these were changed from normal to hidden, or vice versa, and were added again. Let me know if you need me to run specific tests, extract information from the files, etc. I would love to help.

EC-O-DE commented 6 years ago

The parsing error occurs not only with protected blogs...

EC-O-DE commented 6 years ago

Hey, by the way, this https://github.com/ScriptSmith/reaper was updated a few hours ago and they fixed some Tumblr stuff...

johanneszab commented 6 years ago

I think I might have fixed this now for good. Since I don't download that much personally, let me know if it still happens; I'd like to get another (small!) example blog in that case.

Thanks!

Kvothe1970 commented 6 years ago

Looking better now. I still have an issue with multiple blogs reporting that I need to be signed in, while I am signed in. I will determine the factors as soon as I have time to run a few tests. But the blogs that used to stop after a few posts now work, thank you!

thornate commented 6 years ago

It looks like it downloads some of the files, but not all. I'm getting 120 downloaded files out of 202 for the Tumblr I emailed you, and it drops from the queue with the progress bar only partway through.

johanneszab commented 6 years ago

Thanks @thornate. Then I think it's back in the state from before Tumblr changed its APIs/website, which is okay for me.

To fix all the parsing errors and to be prepared for future changes, it might be an idea to use an external JSON library like Json.NET. It's way easier to code with and can handle unknown data by just ignoring it. The DataContractJsonSerializer we currently use, which is part of the .NET Framework, cannot, and simply fails if it detects unknown structures. I think that's what happens here: some posts probably have a JSON-like string embedded, and then the parsing doesn't work correctly anymore.
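The difference can be sketched in Python (the real code is C#; the field names and the exact failure behavior modeled here are illustrative assumptions, not the actual serializer implementations): a strict, schema-bound parser treats anything outside the known schema as fatal, while a tolerant one keeps the known fields and ignores the rest.

```python
import json

EXPECTED_FIELDS = {"id", "type", "body"}

def strict_parse(raw):
    """Mimics the strict, schema-bound behavior described above:
    any field outside the known schema is fatal."""
    post = json.loads(raw)
    unknown = set(post) - EXPECTED_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return post

def lenient_parse(raw):
    """Mimics a tolerant library: keep the known fields and
    silently ignore everything else."""
    post = json.loads(raw)
    return {k: v for k, v in post.items() if k in EXPECTED_FIELDS}

# A post carrying a field the schema doesn't know about:
raw = '{"id": 1, "type": "text", "body": "hi", "new_field": {"x": 1}}'
```

Here `strict_parse(raw)` raises while `lenient_parse(raw)` still returns the usable `id`/`type`/`body` data, which is why the tolerant approach survives API changes and odd embedded content.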

Since I want to keep TumblThree as small as possible, I might have a look again at some point later before doing the switch. Thanks for the email with the blog name.