rohanbk / Mountain-Project-Scraper

Python scrapy-based repository for mining information associated with MountainProject

URLs and selectors are outdated #10

Open endolith opened 1 year ago

endolith commented 1 year ago
    domain = 'https://www.mountainproject.com'

    # URL should be preceded by a /
    # e.g. /destinations or /v/STATENAME/ID
    relativeURL = '/v/hawaii/106316122'

    start_urls = [domain + relativeURL]
    allowed_domains = ['mountainproject.com']
    rules = [
        Rule(
            LinkExtractor(allow='v/(.+)'),
            callback='parse',
            follow=True
        )
    ]

The /v/ URLs redirect to a new scheme:

2023-01-15 11:19:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.mountainproject.com/viewer-old/106316122> from <GET https://www.mountainproject.com/v/hawaii/106316122>
2023-01-15 11:19:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.mountainproject.com/area/106316122/hawaii> from <GET https://www.mountainproject.com/viewer-old/106316122>
        if self.relativeURL != '/destinations':
            # use the following links variable if testing from an individual state page (e.g. WA states routes)
            links = response.css('#viewerLeftNavColContent a[target="_top"] ::attr(href)').extract()

<div id="viewerLeftNavColContent" class="rspCollapsedContent"> was present in old pages (e.g. https://web.archive.org/web/20161122233413/http://www.mountainproject.com/v/alabama/105905173), but is no longer there.

        else:
            # use the following links variable if testing from the homepage
            links = response.css('span.destArea a::attr(href)').extract()

<span class="destArea"> was present on the old homepage (https://web.archive.org/web/20171016232313/https://www.mountainproject.com/), but is no longer there.

endolith commented 1 year ago

New URLs probably need something like this:

    relativeURL = '/area/106316122/hawaii'

    start_urls = [domain + relativeURL]
    allowed_domains = ['mountainproject.com']
    rules = [
        Rule(
            LinkExtractor(allow='area/(.+)'),
            callback='parse',
            follow=True
        )
    ]
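
Since LinkExtractor treats allow as a regular expression searched against each URL, the pattern can be tried out on its own before running the spider. A minimal sketch, assuming a tighter pattern than the one above; AREA_PATTERN and area_id are hypothetical names, not part of the repository:

```python
import re

# Hypothetical, tighter pattern: "/area/" followed by a numeric ID.
AREA_PATTERN = re.compile(r'/area/(\d+)')


def area_id(url):
    """Return the numeric area ID from a new-style URL, or None if absent."""
    match = AREA_PATTERN.search(url)
    return match.group(1) if match else None


old_url = 'https://www.mountainproject.com/v/hawaii/106316122'
new_url = 'https://www.mountainproject.com/area/106316122/hawaii'

print(area_id(old_url))  # None
print(area_id(new_url))  # 106316122
```

The same pattern string could then be passed as allow, though whether it catches every page type worth following is untested here.
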

endolith commented 1 year ago

New state pages have

<div class="col-md-3 left-nav float-md-left mb-2">
                <div class="mp-sidebar">

So probably links = response.css('.left-nav a::attr(href)').extract()?

And on the main page it has

<div class="col-xs-12">
        <div class="title-with-border-bottom mb-2">
            <h2 class="inline-block mr-half">Rock Climbing Guide</h2>
        </div>
        <div class="row" id="route-guide">

So probably links = response.css('div#route-guide a::attr(href)').extract()?

Still doesn't work, though.
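
To check offline whether a saved page actually has links inside a given container, something like the following might help. It is only a rough standard-library approximation of the '.left-nav a::attr(href)' idea, not Scrapy's selector engine, and it assumes reasonably well-formed HTML. ScopedLinkCollector and the sample anchors are hypothetical; only the two div lines come from the real page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}


class ScopedLinkCollector(HTMLParser):
    """Collect href values of <a> tags nested inside a matching container."""

    def __init__(self, css_class=None, element_id=None):
        super().__init__()
        self.css_class = css_class
        self.element_id = element_id
        self.depth = 0   # greater than 0 while inside the container
        self.links = []

    def _matches(self, attrs):
        classes = (attrs.get('class') or '').split()
        if self.css_class and self.css_class in classes:
            return True
        return bool(self.element_id) and attrs.get('id') == self.element_id

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
            if tag == 'a' and attrs.get('href'):
                self.links.append(attrs['href'])
        elif tag not in VOID_TAGS and self._matches(attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


# Made-up sample; the anchors are placeholders for illustration.
sample = '''
<div class="col-md-3 left-nav float-md-left mb-2">
  <div class="mp-sidebar">
    <a href="/area/1/example-a">A</a>
  </div>
</div>
<a href="/outside">outside</a>
'''

collector = ScopedLinkCollector(css_class='left-nav')
collector.feed(sample)
print(collector.links)  # ['/area/1/example-a']
```

If this finds links in a saved copy of the page but the spider still gets none, the problem is probably elsewhere (for example in how the URLs are built afterwards).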

endolith commented 1 year ago

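The request for the map page gets filtered as offsite because its URL is malformed (the domain appears twice):

DEBUG: Filtered offsite request to 'www.mountainproject.comhttps': <GET https://www.mountainproject.comhttps//www.mountainproject.com/map/106316122/hawaii>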

yield scrapy.Request(url, callback=self.parse_coordinates)
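
The doubled domain in the log suggests the href is already absolute and the domain is being prepended anyway. A minimal sketch of resolving hrefs so that both relative and absolute values work, using the standard library's urljoin; inside a spider, response.urljoin(href) resolves against the response URL in the same way. BASE and absolute_url are hypothetical names for illustration:

```python
from urllib.parse import urljoin

# Page the links were extracted from (taken from the redirect log above).
BASE = 'https://www.mountainproject.com/area/106316122/hawaii'


def absolute_url(href, base=BASE):
    """Resolve href against base; absolute hrefs are returned unchanged."""
    return urljoin(base, href)


relative = absolute_url('/map/106316122/hawaii')
absolute = absolute_url('https://www.mountainproject.com/map/106316122/hawaii')

# Both resolve to https://www.mountainproject.com/map/106316122/hawaii
print(relative)
print(absolute)
```
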

endolith commented 1 year ago

I'm not sure why the original code says this:

        if 'Location' not in response.css('#rspCol800 div.rspCol table tr:nth-child(2) td ::text').extract()[0]:
            return response.css('#rspCol800 div.rspCol table tr:nth-child(3) td ::text').extract()[1].strip()
        else:
            return response.css('#rspCol800 div.rspCol table tr:nth-child(2) td ::text').extract()[1].strip()

When the second row doesn't list Location:, what is it getting instead? For example, on this page:

https://web.archive.org/web/20161115082401/http://www.mountainproject.com/v/central-pillar-of-frenzy/105862930

(Now in the new layout it's "GPS:", though.)
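
One way to stop depending on row positions would be to look the value up by its label. A minimal sketch, assuming the table rows have already been extracted into (label, value) pairs; find_labeled_value and the sample rows are hypothetical and not taken from the repository or the site:

```python
def find_labeled_value(rows, labels=('Location', 'GPS')):
    """Return the stripped value of the first row whose label
    (ignoring whitespace and a trailing colon) is one of `labels`,
    or None if no row matches."""
    for label, value in rows:
        if label.strip().rstrip(':') in labels:
            return value.strip()
    return None


# Made-up rows for illustration; real pages may be laid out differently.
rows = [
    ('Type:', ' placeholder type '),
    ('GPS:', ' placeholder coordinates '),
]

print(find_labeled_value(rows))  # placeholder coordinates
```

This accepts either the old "Location" label or the new "GPS" one, so the row order no longer matters.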

endolith commented 1 year ago

(I've got it working, but I made a bunch of clunky changes with the help of ChatGPT that I don't fully understand)