Issue with bs4 - Githubissues

scraperwiki / code-scraper-in-browser-tool

Just like on ScraperWiki Classic; now a part of QuickCode.

https://quickcode.io

Other


Issue with bs4 #114

Closed dankeemahill closed 10 years ago

dankeemahill commented 10 years ago

I'm using the scraper-in-browser tool to write a scraper for a plain, 100-row table for a workshop, and BeautifulSoup() from bs4 isn't soupifying (parsing) the entire page. The old version of BeautifulSoup soupifies the page properly in the same tool.

Examples:

https://gist.github.com/danhillreports/6152491

bs4: from bs4 import BeautifulSoup
Old BeautifulSoup: from BeautifulSoup import BeautifulSoup

frabcus commented 10 years ago

Is it this bug?

http://stackoverflow.com/questions/11650700/beautifulsoup-does-not-work-for-some-web-sites/11651200#11651200

If so, add this to the line that makes the soup:

 soup = BeautifulSoup(html.content, "html.parser")

Also, if so, it has now affected a couple of people, so I need to look at which versions of Python, bs4 and lxml we use. Help finding a bug report for this in either lxml or bs4 would be really useful!
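
For anyone who wants to check whether they are hitting the same thing, here is a minimal sketch of a sanity check. It names the parser explicitly, as in the line above, and compares the number of rows bs4 finds with the number of `<tr>` tags in the raw source. The `count_rows` helper and the sample table are made up for illustration; they are not from the gist or from the scraper-in-browser tool.

```python
from bs4 import BeautifulSoup


def count_rows(html, parser="html.parser"):
    """Parse html with an explicitly named parser and count <tr> elements."""
    soup = BeautifulSoup(html, parser)
    return len(soup.find_all("tr"))


# Hypothetical stand-in for the workshop page: a plain 100-row table.
rows = "".join("<tr><td>row %d</td></tr>" % i for i in range(100))
sample_html = "<html><body><table>%s</table></body></html>" % rows

# Count the opening <tr> tags in the raw text, then compare with the soup.
expected = sample_html.count("<tr>")
found = count_rows(sample_html)
print("%d of %d rows parsed" % (found, expected))
```

If the two numbers differ on a real page, the parser is probably truncating the document, and passing a different parser name as the second argument (for example "html.parser" instead of the default) is worth a try.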

dankeemahill commented 10 years ago

Looks like that's it! Thanks, I didn't run into that article before opening the issue.

aaja-scraper

frabcus commented 10 years ago

Leaving this open as it's affected two people now. If anyone can find the upstream bugs that'd be great!

frabcus commented 10 years ago

Don't think this is an issue any more.