kjam / wswp

Code for the second edition Web Scraping with Python book by Packt Publishing

I may be doing something wrong, but I'm not certain what. BTW, thanks very much for the book. #5

Closed: mdrich closed this issue 6 years ago

mdrich commented 6 years ago

I couldn't get the example on pages 57-58 to work before I changed anything, and again, I assume I am doing something incorrectly or have something set up incorrectly. I am running on a Windows 10 system using Python 3.6.2.

Thanks for any help you can provide...

I made one change in "advanced_link_crawler.py": match() doesn't work because the pattern isn't at the beginning of the link string, so I changed it to search():

if re.search(link_regex, link):
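To illustrate the difference (a minimal sketch; the link path just follows the example site's URL scheme):

import re

link_regex = '/(index|view)'
link = '/places/default/index/1'   # a typical link on the example site

# re.match() only matches at the very start of the string, so it fails here:
print(re.match(link_regex, link))    # None
# re.search() scans the whole string, so it finds '/index':
print(re.search(link_regex, link))   # matches '/index'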

I made some debug modifications to "csv_callback.py" because nothing was being written to the CSV file when the exception was thrown, as follows:

import sys  # DEBUG ONLY
import csv
import re
from lxml.html import fromstring

class CsvCallback:
    def __init__(self):
        self.filename = r'..\..\data\countries.csv' # CHANGE: to get something written to CSV file before exception thrown
        self.handle = open(self.filename, 'w', newline='')  # CHANGE (newline='' avoids blank rows from csv on Windows)
        try:
            self.writer = csv.writer(self.handle)   # CHANGE
            self.fields = ('area', 'population', 'iso', 'country', 'capital',
                           'continent', 'tld', 'currency_code', 'currency_name',
                           'phone', 'postal_code_format', 'postal_code_regex',
                           'languages', 'neighbours')
            self.writer.writerow(self.fields)
            self.handle.close() # CHANGE: to get something written to CSV file before exception thrown
        except csv.Error as e:
            sys.exit('file {}, line {}: {}'.format(self.filename, self.writer.line_num, e)) # DEBUG ONLY
        except:
            print("Unexpected error:", sys.exc_info()[0])   # DEBUG ONLY  
            raise

    def __call__(self, url, html):
        if re.search('/view/', url):
            print(url)  # DEBUG ONLY
            tree = fromstring(html)
            try:
                all_rows = [
                    tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)[0].text_content()
                    for field in self.fields]
            except:
                print(self.fields)  # DEBUG ONLY
                print("Unexpected error:", sys.exc_info()[0])   # DEBUG ONLY
                raise               # DEBUG ONLY
            self.handle = open(self.filename, 'a', newline='')  # CHANGE: to get something written to CSV file before exception thrown
            self.writer = csv.writer(self.handle)   # CHANGE
            self.writer.writerow(all_rows)
            self.handle.close()                     # CHANGE

This is my main routine calling everything:

import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)))

from chp2.advanced_link_crawler import link_crawler
from chp2.csv_callback import CsvCallback

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())

I get these exceptions:

  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp1\ScrapedToCSV.py", line 9, in <module>
    link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\advanced_link_crawler.py", line 110, in link_crawler
    data.extend(scrape_callback(url, html) or [])
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in __call__
    for field in self.fields]
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in <listcomp>
    for field in self.fields]

builtins.IndexError: list index out of range

I get this output:

Downloading: http://example.webscraping.com/
Downloading: http://example.webscraping.com/places/default/index/1
Downloading: http://example.webscraping.com/places/default/index/2
Downloading: http://example.webscraping.com/places/default/index/3
Downloading: http://example.webscraping.com/places/default/index/4
Downloading: http://example.webscraping.com/places/default/index/5
Downloading: http://example.webscraping.com/places/default/index/6
Downloading: http://example.webscraping.com/places/default/index/7
Downloading: http://example.webscraping.com/places/default/index/8
Downloading: http://example.webscraping.com/places/default/index/9
Downloading: http://example.webscraping.com/places/default/index/10
Downloading: http://example.webscraping.com/places/default/index/11
Downloading: http://example.webscraping.com/places/default/index/12
Downloading: http://example.webscraping.com/places/default/index/13
Downloading: http://example.webscraping.com/places/default/index/14
Downloading: http://example.webscraping.com/places/default/index/15
Downloading: http://example.webscraping.com/places/default/index/16
Downloading: http://example.webscraping.com/places/default/index/17
Downloading: http://example.webscraping.com/places/default/index/18
Downloading: http://example.webscraping.com/places/default/index/19
Downloading: http://example.webscraping.com/places/default/index/20
Downloading: http://example.webscraping.com/places/default/index/21
Downloading: http://example.webscraping.com/places/default/index/22
Downloading: http://example.webscraping.com/places/default/index/23
Downloading: http://example.webscraping.com/places/default/index/24
Downloading: http://example.webscraping.com/places/default/index/25
Downloading: http://example.webscraping.com/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/view/Zimbabwe-252
Downloading: http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
Unexpected error: <class 'IndexError'>

I get 2 rows added to my CSV file: a header and 1 row of data.

kjam commented 6 years ago

Ah, okay! I see what's going on. Do you notice the URL change right before it fails? Since you switched from "match" to "search" the URL still matches, but you are scraping a login page.

There are a few ways to fix this:

  • Rewrite the regex so you can keep using "match". Something like this should work: (/places/default/index|/places/default/view). (recommended fix!)
  • Test if 'login' is in the url and if so skip it (in the advanced link crawler).

There are probably many other fixes, but these are the first that come to mind.
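To make the difference concrete, here is a quick sketch using the login link from your output above:

import re

login_link = '/places/default/user/login?_next=/places/default/view/Zimbabwe-252'

# With search(), the '/view' inside the _next parameter still matches, so the
# crawler scrapes the login page. That page has none of the places_..._row
# cells, which is why tree.xpath(...)[0] raises the IndexError you saw.
print(re.search('/(index|view)', login_link))   # matches '/view'

# With match() and the rewritten regex, the pattern must match from the very
# start of the link, so the login URL is skipped:
fixed_regex = '(/places/default/index|/places/default/view)'
print(re.match(fixed_regex, login_link))        # None
print(re.match(fixed_regex, '/places/default/view/Zimbabwe-252'))  # matches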

Let me know how it goes! :smile:

mdrich commented 6 years ago

Thanks so much for the answer. I've been on holiday for the last few days so I didn't respond sooner. I will try the suggestions this morning. I should have known to look where I changed stuff...

Regards, Mike


mdrich commented 6 years ago

By changing this line in "advanced_link_crawler.py" back as suggested:

if re.match(link_regex, link):

and calling link_crawler with the 2nd parameter as suggested:

r'/(places/default/index|places/default/view)'

it works as expected. Thanks!

One note: if I leave the leading "/" before "places" (as in the suggested regex), it fails. Thanks again for your help.
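For anyone else who hits this, the combination that works for me looks like this (just the relevant pieces):

# in advanced_link_crawler.py, reverted back to match():
#     if re.match(link_regex, link):

# and the crawler call, with the leading '/' outside the group:
link_crawler('http://example.webscraping.com/',
             r'/(places/default/index|places/default/view)',
             max_depth=-1, scrape_callback=CsvCallback())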