Closed: mdrich closed this issue 6 years ago
Ah, okay! I see what's going on. Do you notice the URL change right before it fails? Since you switched from "match" to "search" the URL still matches, but you are scraping a login page.
There are a few ways to fix this:
- Rewrite the regex so you can keep using "match". Something like this should work: (/places/default/index|/places/default/view). (recommended fix!)
- Test whether 'login' is in the URL and, if so, skip it (in the advanced link crawler).
There are probably many other fixes, but these are the first that come to mind.
Let me know how it goes! :smile:
Thanks so much for the answer. I've been on holiday for the last few days, so I didn't respond sooner. I will try the suggestions this morning. I should have known to look where I changed stuff...
Regards, Mike
On Thu, Nov 23, 2017 at 6:20 AM, Katharine notifications@github.com wrote the reply quoted above.
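To make the failure concrete, here is a minimal sketch of why the original second parameter '/(index|view)' misbehaves both ways (the URLs are taken from the output later in the thread): re.match anchors at the start of the string, so it never matches these site-rooted links, while re.search scans the whole string and therefore also matches the login redirect, because '/view' appears inside its '_next' query parameter.

```python
import re

old_pattern = r'/(index|view)'  # second argument originally passed to link_crawler
index_link = '/places/default/index/1'
login_link = '/places/default/user/login?_next=/places/default/view/Zimbabwe-252'

# re.match anchors at position 0: '/places/...' does not begin with
# '/index' or '/view', so no link is ever matched.
print(re.match(old_pattern, index_link))         # None

# re.search scans the whole string, so the country pages match...
print(bool(re.search(old_pattern, index_link)))  # True

# ...but so does the login redirect, because '/view' occurs inside
# the '_next' query parameter.
print(bool(re.search(old_pattern, login_link)))  # True
```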
By changing this line in "advanced_link_crawler.py" back as suggested:
if re.match(link_regex, link):
and calling the link crawler with the second parameter as suggested:
r'/(places/default/index|places/default/view)'
It works as expected.
Thanks
If I instead leave the leading "/" before "places" (inside the group), it fails. Thanks again for your help.
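For reference, a quick check of the working combination reported above (slash outside the group, used with re.match), assuming the links are site-rooted paths as shown in the output later in the thread:

```python
import re

link_regex = r'/(places/default/index|places/default/view)'
login = '/places/default/user/login?_next=/places/default/view/Zimbabwe-252'

# Country index and detail pages match from the start of the path...
assert re.match(link_regex, '/places/default/index/1')
assert re.match(link_regex, '/places/default/view/Zimbabwe-252')

# ...while the login redirect does not: after '/places/default/' it
# continues with 'user', which matches neither alternative, and
# re.match will not skip ahead to the '_next' query parameter.
assert re.match(link_regex, login) is None
```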
I couldn't get the example on pages 57-58 to work before I changed anything, and again, I assume I am doing something incorrectly or have something set up incorrectly. I am running on a Windows 10 system using Python 3.6.2.
Thanks for any help you can provide...
I made one change in "advanced_link_crawler.py" as follows:
if re.search(link_regex, link):
match() doesn't work, so I changed it to search(), since the pattern does not occur at the beginning of the string. I also made some debug modifications to "csv_callback.py", because nothing was being written to the CSV file when the exception was thrown, as follows:
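The difference between the two calls in a nutshell: re.match only reports a match that begins at position 0 of the string, while re.search scans the entire string. A minimal illustration:

```python
import re

link = '/places/default/index/1'

# re.match only succeeds when the pattern matches at position 0:
print(re.match(r'index', link))          # None
# re.search scans the whole string:
print(bool(re.search(r'index', link)))   # True
# From the start of the string, re.match works as expected:
print(bool(re.match(r'/places', link)))  # True
```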
This is my main routine calling everything:
I get these exceptions:
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp1\ScrapedToCSV.py", line 9, in <module>
    link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\advanced_link_crawler.py", line 110, in link_crawler
    data.extend(scrape_callback(url, html) or [])
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in __call__
    for field in self.fields]
  File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in <listcomp>
    for field in self.fields]
builtins.IndexError: list index out of range
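The IndexError itself is a downstream symptom: the login page has no country table, so the callback finds fewer table cells than there are field names and indexes past the end of the list. A hypothetical, simplified version of the extraction (extract_row and cell_values are illustrative names, not the book's actual csv_callback.py code) shows the failure mode and one defensive fix:

```python
fields = ('area', 'population', 'iso', 'country')

def extract_row(cell_values, fields):
    """cell_values: the text of each table cell found on the page."""
    if len(cell_values) < len(fields):
        # e.g. a login page with no country table: skip instead of crashing
        return None
    return [cell_values[i] for i in range(len(fields))]

# A country page yields one value per field:
print(extract_row(['a1', 'p1', 'i1', 'c1'], fields))  # ['a1', 'p1', 'i1', 'c1']

# A login page yields no cells, which previously raised
# "IndexError: list index out of range"; with the guard it is skipped:
print(extract_row([], fields))  # None
```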
I get this output:
Downloading: http://example.webscraping.com/
Downloading: http://example.webscraping.com/places/default/index/1
Downloading: http://example.webscraping.com/places/default/index/2
Downloading: http://example.webscraping.com/places/default/index/3
Downloading: http://example.webscraping.com/places/default/index/4
Downloading: http://example.webscraping.com/places/default/index/5
Downloading: http://example.webscraping.com/places/default/index/6
Downloading: http://example.webscraping.com/places/default/index/7
Downloading: http://example.webscraping.com/places/default/index/8
Downloading: http://example.webscraping.com/places/default/index/9
Downloading: http://example.webscraping.com/places/default/index/10
Downloading: http://example.webscraping.com/places/default/index/11
Downloading: http://example.webscraping.com/places/default/index/12
Downloading: http://example.webscraping.com/places/default/index/13
Downloading: http://example.webscraping.com/places/default/index/14
Downloading: http://example.webscraping.com/places/default/index/15
Downloading: http://example.webscraping.com/places/default/index/16
Downloading: http://example.webscraping.com/places/default/index/17
Downloading: http://example.webscraping.com/places/default/index/18
Downloading: http://example.webscraping.com/places/default/index/19
Downloading: http://example.webscraping.com/places/default/index/20
Downloading: http://example.webscraping.com/places/default/index/21
Downloading: http://example.webscraping.com/places/default/index/22
Downloading: http://example.webscraping.com/places/default/index/23
Downloading: http://example.webscraping.com/places/default/index/24
Downloading: http://example.webscraping.com/places/default/index/25
Downloading: http://example.webscraping.com/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/view/Zimbabwe-252
Downloading: http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
Unexpected error: <class 'IndexError'>
I get 2 rows added to my CSV file: a header and 1 row of data.