What about --exclude REGEX?
Will that rule help to download only the first cached version of the webpage? I haven't tried it.
Yeah, it's a bit tricky to use, but if the URLs you want to download follow a pattern, it's doable.
Play here: http://rubular.com/ to get a sense of regexes.
So do I just have to run this? wayback_machine_downloader http://www.mysite.com --exclude REGEX --only category --exclude "/\.(gif|jpg|jpeg|png)$/i" --to 20150601100159
Nah, it will look more like this; there's no need for --exclude REGEX:
wayback_machine_downloader http://www.mysite.com --only category --exclude "/page[0-9]+/i" --to 20150601100159
My site's URL pattern is like http://mysite.com/pet, so: --only pet --exclude "/\.(gif|jpg|jpeg|png)$/i" --to 20150601100159
wayback_machine_downloader http://www.mysite.com --only category --exclude "/page[0-9]+/i" --to 20150601100159 (I don't want to download images, since they are huge.)
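A quick way to sanity-check an exclude pattern like that outside of Rubular is plain Ruby; the sample URLs below are made up:

```ruby
# Same pattern as the --exclude argument; /i makes it case-insensitive.
# Regexp#match? needs Ruby 2.4+; use `url =~ pattern` on older versions.
pattern = /\.(gif|jpg|jpeg|png)$/i

puts pattern.match?("http://mysite.com/pet/photo.JPG") # => true  (would be excluded)
puts pattern.match?("http://mysite.com/pet/page2")     # => false (would be kept)
```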
Hartator, do I just have to use the rule --exclude "/page[0-9]+/i", or what exactly?
You have to learn regex, unfortunately. Try regexes here: http://rubular.com/
Hartator, all the cached versions of the URL follow the same pattern; the only change is the date of the cache in the URL. I just want the first cache.
@hartator How can I apply the regex when all the URLs are just the same?
Is there any possible way? :( waybackpack (https://github.com/jsvine/waybackpack) downloads all the cached versions of a given URL, but what I need is to download only the first cache of every URL of a website.
Thanks in advance.
It shouldn't have more than one URL per file... Can you send me the exact command you are typing?
@hartator No, it doesn't have more than one URL after downloading, but it checks all the URLs before downloading. There are a huge number of URLs to check, like 234539, to reach the final cached version of the page.
@MKDan Weird. I can't help without more information. Can you send me the full command you are typing, plus the result of wayback_machine_downloader --version? Either here or by email (my username + gmail.com) if you don't want to show your website.
@hartator
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225%2C9164087/3699293
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225%2C9164087/502983
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225%2C9164087/808515
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225,9164085
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3695225/9164087
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3699293
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/3699293/3463987
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/502983
http://www.mysite.com/fun/atari-breakout-game-can-be-found-in-google-image-search-awesome-or-too-old/question-3695225/808515

http://www.mysite.com/fun/url-string-goes-here-in-this-part/question-8695725/index.html
http://www.mysite.com/fun/something-here-path/question-4695225/index.html
http://www.mysite.com/fun/example-path-here/question-13695822/index.html
Now, I want only one page of every URL to get downloaded. So I just want to download the first page of the URLs that get downloaded multiple times; if that's not possible, then just block the URLs that download the page multiple times.
So what rule should I use now?
Actually the URL just looks like http://www.mysite.com/living/which-language-course-would-you-rather-take/question-4219181/ but the wayback downloader downloads it as http://www.mysite.com/living/which-language-course-would-you-rather-take/question-4219181/index.html
When I check a URL in the Wayback Machine, it gives the following:
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/?page=2
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/http%3A%2F%2Fwww.mysite.com%2Fliving%2Fare-you-a-leader-or-a-follower%2Fquestion-1323065%2F
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/?page=2
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/http%3A%2F%2Fwww.mysite.com%2Fliving%2Fare-you-a-leader-or-a-follower%2Fquestion-1323065%2F
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/?page=3
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/adserver.adtechus.com/addyn/3.0/5353.1/2328581/0/16/ADTECH
http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065/invite/
But I just want to limit the URL to http://www.mysite.com:80/living/are-you-a-leader-or-a-follower/question-1323065 for all questions. Nothing should come after "question-123456789". I don't want the second and third pages of the same URL.
So, add '--only "/question-[0-9]+$/"' or something similar. Play on Rubular if it doesn't work. It's not easy to help you without knowing your website.
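For what it's worth, a full command along those lines might look like this (host and --to timestamp carried over from the earlier examples, so adjust them to your site):

wayback_machine_downloader http://www.mysite.com --only "/question-[0-9]+$/" --to 20150601100159

The $ anchors the match at the end of the URL, so anything after the question ID (?page=2, /invite/, and so on) won't match the --only filter.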
This is off topic, but I'm so stressed about it. If you could help me, I'd be really thankful.
I have like 1000 HTML pages on my PC downloaded from the cache. I want to copy selected data from those pages into my database in bulk.
The data I need from those HTML pages:
The question title, which is between the < title> and < /title> tags (also present between the < h1> and < /h1> tags).
The question description, which is inside < div id="summaryDescription">< /div>.
All the answer descriptions, which are inside < div class="postContent">< /div> elements.
I want to move these to my question-and-answer website.
The advice I got:
I have to write a small PHP parser and parse the data out. I can use Notepad++ to change data in many files at once. Then I need to turn the results into SQL queries and use a shell to run the SQL queries in bulk.
But how do I parse the code? I've tried many things, but I'm not sure what I'm doing.
@hartator Can you please provide step-by-step instructions for this dumb person?
@MKDan, it's a bit of work, but you should try to learn regex or use a gem like Nokogiri to make sense of the HTML.
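To make that concrete, here is a minimal Ruby sketch of the Nokogiri approach, assuming the markup described above (title, div#summaryDescription, div.postContent). The directory layout, questions table, and column names are guesses, so adjust them to your setup:

```ruby
# extract.rb -- a rough sketch, not a drop-in solution.
# Assumes the downloaded pages live under ./websites/ and that the target
# database has a hypothetical `questions` table with title, description,
# and answers columns.
require 'nokogiri'

# Double up single quotes so the value is safe inside a SQL string literal.
def sql_escape(str)
  str.to_s.gsub("'", "''")
end

File.open('import.sql', 'w') do |out|
  Dir.glob('websites/**/*.html').each do |path|
    doc = Nokogiri::HTML(File.read(path))

    title       = doc.at_css('title')&.text&.strip
    description = doc.at_css('div#summaryDescription')&.text&.strip
    answers     = doc.css('div.postContent').map { |node| node.text.strip }

    next if title.nil? || title.empty? # skip pages without a question title

    out.puts "INSERT INTO questions (title, description, answers) " \
             "VALUES ('#{sql_escape(title)}', '#{sql_escape(description)}', " \
             "'#{sql_escape(answers.join("\n---\n"))}');"
  end
end
```

After gem install nokogiri, running ruby extract.rb produces an import.sql you can feed to your database (for example mysql your_db < import.sql). Real pages will need extra cleanup (encoding, duplicates, boilerplate), so treat this as a starting point only.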
It's taking so much time (in my case, weeks!). My website has around 1 lakh (100,000) pages, and each page has cache versions, like 234539 of them, so it starts at (1/234539) for each URL. I just can't reduce the time frame; that would eliminate many URLs.
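Two flags in the gem may help here; check wayback_machine_downloader --help to confirm your version has them. --list prints the matched file URLs as JSON without downloading anything, so you can verify your --only/--exclude filters cheaply, and --concurrency NUMBER downloads several files at a time:

wayback_machine_downloader http://www.mysite.com --only "/question-[0-9]+$/" --list
wayback_machine_downloader http://www.mysite.com --only "/question-[0-9]+$/" --concurrency 20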