palewire / first-web-scraper

A step-by-step guide to writing a web scraper with Python
https://palewi.re/docs/first-web-scraper/
GNU General Public License v3.0

Beautifulsoup and differences for python3x #4

ghost closed this issue 9 years ago

ghost commented 9 years ago

Thank you for the tutorial! I worked through it with Python 3, so a few things differed. A change in how BeautifulSoup is packaged also affected my install, at least.

Here's what I found. Below is the final code, with notes on the key differences I ran into. Thanks again!

import csv
import requests

# Installing 'beautifulsoup' failed on my Mac. The BeautifulSoup page recommends
# installing 'BeautifulSoup4' instead, which worked. When importing, use bs4 as
# shown below.

from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
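# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')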
table = soup.find('table', attrs={'class': 'resultsTable'})

list_of_rows = []
for row in table.find_all('tr')[1:]:
    list_of_cells = []
    for cell in row.find_all('td'):
        # &nbsp; wasn't a problem on the page, but \xa0 was. It was simple enough to
        # swap out the two, but I wonder: how would I make text.replace work for
        # more than one problem character?
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

# Here's the key difference for Python 3.x: the file is opened in text mode ('w')
# with newline=''. Found this with a quick search on Stack Overflow, of course.

filename = 'inmates.csv'
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
    writer.writerows(list_of_rows)
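
On the question in the code comment about handling more than one problem character: one possible approach in Python 3 is `str.translate` with a table that maps each unwanted character to `None`. This is only a sketch. The helper name `clean_cell` and the particular characters in `UNWANTED` are made up for illustration and are not part of the tutorial.

```python
# Hypothetical helper, not from the tutorial: delete several unwanted
# characters from a cell in a single pass (Python 3).
UNWANTED = {ord('\xa0'): None, ord('\t'): None, ord('\r'): None}


def clean_cell(text, table=UNWANTED):
    """Delete every character listed in the table, then trim outer whitespace."""
    return text.translate(table).strip()


sample = '\xa0SMITH\t\r'      # made-up cell text
cleaned = clean_cell(sample)
print(repr(cleaned))
```

Chaining calls, as in `text.replace('\xa0', '').replace('\t', '')`, would also work; the table just keeps the list of characters in one place.
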
andylolz commented 9 years ago

@aabroder: I’ve sent a PR to upgrade the tutorial from bs3 to bs4.

The \xa0 characters are still non-breaking spaces: \xa0 is the Unicode character (U+00A0) that the HTML entity &nbsp; decodes to. I’ve fixed it a different way (using strip instead of replace) because that should work in both python 2.x and python 3.x.
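
To illustrate the strip-versus-replace point, here is a small sketch using made-up cell text. It is not the exact change from the PR. It assumes the cell text is a Unicode string, which is what bs4 returns. `strip()` with no arguments treats \xa0 as whitespace, but it only trims the ends of the string.

```python
# Made-up cell values for illustration only.
cell_text = '\xa0DOE\xa0'
interior = 'JOHN\xa0Q'

# strip() removes leading and trailing whitespace, including \xa0.
stripped = cell_text.strip()

# A non-breaking space in the middle of the text is left alone by strip() ...
still_has_nbsp = interior.strip()

# ... so replace() is still needed if interior ones matter.
replaced = interior.replace('\xa0', ' ')

print(repr(stripped), repr(still_has_nbsp), repr(replaced))
```

For table cells that contain nothing but a non-breaking space, `strip()` alone leaves an empty string, which is probably all this scraper needs.
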

I much prefer using with open for file handling – that’s a good suggestion, and will work in python 2.x as well.
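
A minimal sketch of the `with open` pattern for the CSV step, using made-up rows and a temporary file so it runs without the scraper. One caveat: the `with` statement itself works in Python 2.x, but the `newline=''` argument shown here is Python 3 only (Python 2's built-in `open` does not accept it, and its csv module expects binary mode instead).

```python
import csv
import os
import tempfile

# Made-up sample data standing in for the scraped rows.
header = ['Last', 'First', 'Middle']
rows = [['DOE', 'JOHN', 'Q'], ['ROE', 'JANE', '']]

# Write to a temporary location rather than the tutorial's inmates.csv.
path = os.path.join(tempfile.mkdtemp(), 'inmates_sample.csv')

# The file is closed automatically when the block exits, even on error.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# Read the file back to confirm what was written.
with open(path, newline='') as f:
    read_back = list(csv.reader(f))

print(read_back)
```
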