hmcuesta / PDA_Book

Code Examples Data Science using Python

Chapter 2, WebScraping.py #3

Open nverwer opened 10 years ago

nverwer commented 10 years ago

The structure of the HTML on gold.org has changed. This illustrates the danger of web page scraping, but it also breaks the example given in WebScraping.py. To make things more difficult, the HTML elements containing the prices no longer have 'id' attributes.

The result is an error: IndexError: list index out of range.

Changing the line where price is determined to:

price= scraping.findAll("dd",attrs={"class":"value"})[0].text

seems to work.
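As a hedged illustration of why the `[0]` lookup raises `IndexError` when nothing matches, here is a minimal sketch that uses `find` and checks for `None` instead. The sample markup and the function name `first_value_text` are assumptions made for the example; they are not taken from WebScraping.py or from the live gold.org page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the structure described in this issue.
SAMPLE_HTML = """
<div class="asset-inner">
  <dl><dt>Ask</dt><dd class="value">1,234.50</dd></dl>
</div>
"""


def first_value_text(html):
    """Return the text of the first <dd class="value">, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None when nothing matches, so there is no list to index into.
    node = soup.find("dd", attrs={"class": "value"})
    return node.get_text(strip=True) if node is not None else None


print(first_value_text(SAMPLE_HTML))                    # price text from the sample
print(first_value_text("<p>layout changed again</p>"))  # None instead of IndexError
```

Returning `None` lets the caller decide whether to skip the sample, retry, or stop, rather than crashing on the first layout change.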

It might be useful to add that the output file is buffered, so it will take some time before something appears in it.
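A minimal sketch of the buffering point, assuming the script keeps one output file open while it runs: calling `flush()` after each write makes the line visible to other readers right away. The file name `prices.out` and the helper `write_price` are hypothetical and only serve the illustration.

```python
import os
import tempfile


def write_price(out, price):
    """Write one price line and push it out of Python's buffer immediately."""
    out.write(f"{price}\n")
    out.flush()


# Hypothetical output location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "prices.out")

with open(path, "w") as out:
    write_price(out, "1234.50")
    # The writer is still open here; a second handle can already see the line.
    with open(path) as reader:
        visible_before_close = reader.read()

print(repr(visible_before_close))
```

Opening a text file with `buffering=1` (line buffering) is another option if explicit `flush()` calls are not wanted.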

AngelAlvarado commented 9 years ago

Hi nverwer,

Since there are many elements with the class 'value', how did you know that this code would get the right price? Is it because of the [0]?

I found this solution:

    scraping = BeautifulSoup(page)
    assets = scraping.find_all("div", "asset-inner", limit=1)
    ask_asset = BeautifulSoup(str(assets[0]))
    price_value = ask_asset.find_all("dd", "value")[0].get_text()
    return price_value

But of course, yours is more accurate.
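The scoped lookup in the snippet above can probably be written without re-parsing the container through `str()`, because a BeautifulSoup tag supports `find` directly. The sketch below assumes a page with several `asset-inner` blocks; the markup, the "Ask"/"Bid" labels and the name `ask_price_text` are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two containers; only the first one is of interest.
SAMPLE_HTML = """
<div class="asset-inner"><dl><dt>Ask</dt><dd class="value">1,234.50</dd></dl></div>
<div class="asset-inner"><dl><dt>Bid</dt><dd class="value">1,233.90</dd></dl></div>
"""


def ask_price_text(html):
    """Return the first 'value' text inside the first 'asset-inner' container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="asset-inner")
    if container is None:
        return None  # the container itself is gone: the layout has changed
    # Search only inside the container instead of the whole document.
    value = container.find("dd", class_="value")
    return value.get_text(strip=True) if value is not None else None


print(ask_price_text(SAMPLE_HTML))
```

Narrowing the search to one container first reduces the chance that an unrelated `dd.value` elsewhere on the page is picked up by accident.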

nverwer commented 9 years ago

Hi AngelAlvarado,

It is indeed because of the [0]. This is certain to break again in the future, but since there are no 'id' attributes on the webpage anymore (not when I looked at it, anyway), it was the best solution I could come up with. I think your solution is also good, but of course web scraping is a dangerous way to get information. The Bad Data Handbook (published by O'Reilly) has an interesting chapter on this.
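One way to notice such a break early, sketched here under the assumption that the scraped text should look like a plain number with optional thousands separators, is to validate the text before using it. The pattern and the function name `parse_price` are illustrative choices, not part of the book's code.

```python
import re

# Accepts "1,234.50" or "1234.5"; anything else is treated as a layout change.
PRICE_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?")


def parse_price(text):
    """Convert scraped text to a float, or raise ValueError if it is not a price."""
    cleaned = text.strip()
    if not PRICE_PATTERN.fullmatch(cleaned):
        raise ValueError(f"scraped text does not look like a price: {cleaned!r}")
    return float(cleaned.replace(",", ""))


print(parse_price(" 1,234.50 "))

try:
    parse_price("Sign up for alerts")
except ValueError as err:
    print("rejected:", err)
```

A check like this does not prevent breakage, but it turns a silently wrong value into an explicit error.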

AngelAlvarado commented 9 years ago

Gotcha,

Thanks for letting me know.

Definitely a dangerous way. After finishing this book, I'll take a look at the Bad Data Handbook. Thanks for the recommendation.