johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License
922 stars, 133 forks

Parse the Tumblr search, Tumblr tag search and Tumblr like/by results into HTML. #175

Open johanneszab opened 6 years ago

johanneszab commented 6 years ago

It should be possible to enhance the current implementations by parsing the results from the crawler into proper HTML. Right now the crawler only loads each whole page into one large string and extracts photos and videos from it based on their suffix. Parsing the string into HTML would allow us to grab tags, post dates and post ids per post, and probably also to expand the crawler to get text posts.

The good thing is that all the infrastructure is already in TumblThree and I've already figured out the pagination mechanism for all three pages (the search, the tagged search and the like/by pages). I also think that none of them can be user-customized, so the HTML should be the same for all blogs (unlike the real blogs, which can be themed or scripted with JavaScript).

So, all that's missing is to use something like HTML Agility Pack and iterate over the right elements to grab the desired information.
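
To make the idea concrete, here is a rough sketch of per-post extraction. It is written in Python with the standard-library `html.parser` as a stand-in for HTML Agility Pack, so it is not TumblThree code. The markup it expects is purely an assumption for illustration: the `<article>` wrapper, the `data-post-id` and `data-timestamp` attributes, and the `tag` class are hypothetical and would have to be replaced with whatever the real search, tagged and like/by pages use.

```python
from html.parser import HTMLParser

# Suffix check kept from the current approach, but applied per post.
MEDIA_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".mp4")


class PostParser(HTMLParser):
    """Collects id, date, tags and media URLs for each post.

    Hypothetical markup (an assumption, not Tumblr's real structure):
    every post is an <article data-post-id=... data-timestamp=...>,
    tags are <a class="tag">, media are <img src> or <source src>.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._post = None      # post currently being read, if any
        self._in_tag = False   # True while inside an <a class="tag">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article" and "data-post-id" in attrs:
            self._post = {
                "id": attrs["data-post-id"],
                "date": attrs.get("data-timestamp"),
                "tags": [],
                "media": [],
            }
        elif self._post is None:
            return  # ignore everything outside a post
        elif tag == "a" and "tag" in (attrs.get("class") or "").split():
            self._in_tag = True
        elif tag in ("img", "source"):
            src = attrs.get("src") or ""
            if src.lower().endswith(MEDIA_SUFFIXES):
                self._post["media"].append(src)

    def handle_data(self, data):
        if self._in_tag and self._post is not None:
            text = data.strip().lstrip("#")
            if text:
                self._post["tags"].append(text)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_tag = False
        elif tag == "article" and self._post is not None:
            self.posts.append(self._post)
            self._post = None


def parse_posts(html):
    """Return one dict per post found in the page string."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up sample page using the hypothetical markup described above.
SAMPLE = """
<article data-post-id="101" data-timestamp="2018-01-05">
  <img src="https://example.invalid/a.jpg">
  <a class="tag" href="/tagged/cats">#cats</a>
  <a class="tag" href="/tagged/pets">#pets</a>
</article>
<article data-post-id="102" data-timestamp="2018-01-06">
  <video><source src="https://example.invalid/b.mp4"></video>
  <a href="/other">not a tag</a>
</article>
"""

if __name__ == "__main__":
    for post in parse_posts(SAMPLE):
        print(post)
```

The point of the sketch is only that each media URL stays attached to its post id, date and tags, which the plain suffix scan over one big string cannot provide.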

Taranchuk commented 6 years ago

Parsing the string into HTML would allow us to grab tags, post dates and post ids per post, and probably also to expand the crawler to get text posts.

If so, will it be possible to add an option to download metadata files for these web pages that contain this data along with the photo caption, reblog name and URL with slug, just like the metadata files from regular blogs? That would be great news for me. Thank you for the hard work!