hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

I just want only the 1st cache of a URL #51

Closed · MKDan closed this issue 8 years ago

MKDan commented 8 years ago

It's taking so much time (in my case, weeks!). My website has around 1 lakh (100,000) pages, and each page has many cached versions, so the downloader reports something like (1/234539) for the whole job. I can't just reduce the time frame, since that would eliminate many URLs.

hartator commented 8 years ago

What about --exclude REGEX?

MKDan commented 8 years ago

Will that rule help to download only the 1st cached version of each webpage? I haven't tried it.

hartator commented 8 years ago

Yeah, it's not the easiest to use, but if the URLs you want to download follow a pattern, it's doable.

hartator commented 8 years ago

Play here: http://rubular.com/ to get a sense of regexes.

MKDan commented 8 years ago

So do I just have to run this?

wayback_machine_downloader http://www.mysite.com --exclude REGEX --only category --exclude "/\.(gif|jpg|jpeg|png)$/i" --to 20150601100159

hartator commented 8 years ago

Nah, it will look more like this; no need for the literal --exclude REGEX (REGEX was just a placeholder for your pattern):

wayback_machine_downloader http://www.mysite.com --only category --exclude "/page[0-9]+/i" --to 20150601100159
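
For what it's worth, the --only/--exclude filters take Ruby-style regexes (hence the Rubular suggestion), so a pattern can be sanity-checked in plain Ruby before kicking off a week-long download. A tiny sketch, with made-up sample URLs:

    # /page[0-9]+/i excludes any file URL containing "page" followed by digits
    exclude = /page[0-9]+/i
    puts exclude.match?('http://www.mysite.com/category/page2/')  # => true  (excluded)
    puts exclude.match?('http://www.mysite.com/category/a-post/') # => false (kept)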

MKDan commented 8 years ago

My site's URL pattern is like http://mysite.com/pet, so I would use: --only pet --exclude "/\.(gif|jpg|jpeg|png)$/i" --to 20150601100159

MKDan commented 8 years ago

wayback_machine_downloader http://www.mysite.com --only category --exclude "/page[0-9]+/i" --to 20150601100159

I don't want to download images, since they are huge.

MKDan commented 8 years ago

Hartator, do I just have to use the rule --exclude "/page[0-9]+/i", or what exactly?

hartator commented 8 years ago

You have to learn regex unfortunately. Try regexes here: http://rubular.com/


MKDan commented 8 years ago

Hartator, all the cached versions of the URL follow the same pattern:

  1. http://web.archive.org/web/20110618010734/http://www.mysite.com/fun/do-you-like-roller-coasters/question-1887981/
  2. http://web.archive.org/web/20110815000000/http://www.mysite.com/fun/do-you-like-roller-coasters/question-1887981/
  3. http://web.archive.org/web/20121101000000/http://www.mysite.com/fun/do-you-like-roller-coasters/question-1887981/
  4. http://web.archive.org/web/20130501000000/http://www.mysite.com/fun/do-you-like-roller-coasters/question-1887981/

The only change is the capture date in the URL. I just want the 1st capture.
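
For context, the 14-digit segment in those archive URLs (e.g. 20110618010734) is the capture timestamp in YYYYMMDDhhmmss form, and the downloader's --to (and --from) flags filter on exactly that value. As a rough illustration only, using the first snapshot's date from the list above as the cutoff, something like this would skip every later capture of that page:

    wayback_machine_downloader http://www.mysite.com --to 20110618010734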

MKDan commented 8 years ago

The regex (?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4}) checks for a date where the month is 1 or 2 digits, the day is 1 or 2 digits, and the year is 4 digits.
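
In Ruby, that pattern and its named captures behave like this (the sample date string is made up). Note that the archive URLs above carry a single 14-digit timestamp, not a slash-separated date, so this regex would not match them:

    date_re = /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/
    m = date_re.match('8/6/2016')
    puts m[:month] # => 8
    puts m[:year]  # => 2016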

MKDan commented 8 years ago

@hartator How can I apply the regex when all the URLs follow exactly the same pattern?

Is there any possible way? :( waybackpack (https://github.com/jsvine/waybackpack) downloads all the cached versions of a given URL, but I need to download the 1st capture of every URL of a website.

Thanks in advance.

hartator commented 8 years ago

It shouldn't download more than one URL per file... Can you send me the exact command you are typing?


MKDan commented 8 years ago

@hartator No, it doesn't end up with more than 1 file per URL after downloading, but it checks every snapshot before downloading. There are so many URLs to check, something like 234539, before it reaches the final cached version of each page.

hartator commented 8 years ago

@MKDan weird. I can't help without more information. Can you send me the full command you are typing + the result of wayback_machine_downloader --version? Either here or by email (my username + gmail.com) if you don't want to show your website.

MKDan commented 8 years ago

@hartator

Now I have spotted the problematic URL paths. The URLs below download the same page multiple times, and about half of my URLs are like this:

http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225%2C9164087/3699293
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225%2C9164087/502983
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225%2C9164087/808515
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225,9164085
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225/9164087
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3699293
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3699293/3463987
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/502983
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/808515

Other URLs, which download the page only once, correctly look like this:

http://www.mysite.com/fun/url-string-goes-here-in-this-part/question-8695725/index.html
http://www.mysite.com/fun/something-here-path/question-4695225/index.html
http://www.mysite.com/fun/example-path-here/question-13695822/index.html

Now I want only one page per URL to be downloaded. So I just want to download the first page of the URLs that get downloaded multiple times; if that's not possible, then just block the URLs that download the page multiple times.

So what rule should I use now?

MKDan commented 8 years ago

Actually the URL just looks like http://www.mysite.com/living/which-language-course-would-you-rather-take/question-4219181/ but the wayback downloader saves it as http://www.mysite.com/living/which-language-course-would-you-rather-take/question-4219181/index.html

MKDan commented 8 years ago

When I check a URL in the Wayback Machine, it gives the following:

http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/?page=2
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/http%3A%2F%2Fwww.mysite.com%2Fliving%2Fare-you-a-leader-or-a-follower%2Fquestion-1323065%2F
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/?page=2
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/http%3A%2F%2Fwww.mysite.com%2Fliving%2Fare-you-a-leader-or-a-follower%2Fquestion-1323065%2F
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/?page=3
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/adserver.adtechus.com/addyn/3.0/5353.1/2328581/0/16/ADTECH
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/invite/


But I just want to limit the download to http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065 for all questions. Nothing should come after "question-123456789"; I don't want the second and third pages of the same URL.

hartator commented 8 years ago

So add --only "/question-[0-9]+$/" or something similar. Play with it on Rubular if it doesn't work. It's not easy to help you without knowing your website.
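
Putting the thread together, the full command would look something like the sketch below. It is untested against the real site; mysite.com and the timestamp are the placeholders used earlier, and the \/? is an added allowance for the trailing-slash variants the archive also lists:

    wayback_machine_downloader http://www.mysite.com --only "/question-[0-9]+\/?$/" --to 20150601100159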

MKDan commented 8 years ago

This is off-topic, but I'm so stressed about it. If you could help me, I'd be really thankful.

I have around 1000 HTML pages on my PC, downloaded from the cache. I want to copy selected data from those pages into my database in bulk.

The data I need from those HTML webpages:

  1. The question title, which is between the <title> and </title> tags (it is also present between the <h1> and </h1> tags).
  2. The question description, which is inside <div id="summaryDescription">...</div>.
  3. All the answer descriptions, which are inside <div class="postContent">...</div> elements.

I want to move these to my question answer website.

What advice I got:

I have to write a small PHP parser to extract the data. I can use Notepad++ to change data in many files at once. Then I need to turn the extracted data into SQL queries, and use a shell to run the SQL queries in bulk.

But how do I parse the code? I've tried many things, but I'm not sure what I'm doing.

MKDan commented 8 years ago

@hartator Can you please provide step-by-step instructions for this dumb person?

hartator commented 8 years ago

@MKDan, it's a bit of work, but you should try to learn regex, or try a gem like Nokogiri to make sense of the HTML.
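
For what it's worth, here is a minimal Nokogiri sketch along those lines. It assumes the selectors MKDan described (<title>, <div id="summaryDescription">, <div class="postContent">) and that the downloaded pages sit under ./websites/www.mysite.com/, the downloader's default output directory; questions.csv is a hypothetical output file:

    # gem install nokogiri
    require 'nokogiri'
    require 'csv'

    # Walk every downloaded HTML page, pull out the three fields,
    # and write them to a CSV for bulk-loading into a database.
    CSV.open('questions.csv', 'w') do |csv|
      csv << ['file', 'title', 'description', 'answers']
      Dir.glob('websites/www.mysite.com/**/*.html') do |path|
        doc = Nokogiri::HTML(File.read(path))
        title       = doc.at_css('title')&.text.to_s.strip
        description = doc.at_css('div#summaryDescription')&.text.to_s.strip
        answers     = doc.css('div.postContent').map { |n| n.text.strip }
        csv << [path, title, description, answers.join(' ||| ')]
      end
    end

A CSV like this can then be imported in one shot (for example with MySQL's LOAD DATA INFILE or a phpMyAdmin import) instead of hand-writing one SQL query per page.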