ruippeixotog / scala-scraper

A Scala library for scraping content from HTML pages
MIT License

Inconsistent behaviour of text("#xxx") #55

Closed: wiradikusuma closed this issue 7 years ago

wiradikusuma commented 7 years ago

I'm trying to scrape an Amazon page (e.g. https://www.amazon.com/dp/B0756FN69M), but I can't extract the price, even though it's in the page source:

 <span id="priceblock_ourprice" class="a-size-medium a-color-price">$15.99</span>

Here's the code:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.text

val browser = new JsoupBrowser
val url = "https://www.amazon.com/dp/B0756FN69M"
val doc = browser get url

println(doc >?> text("#productTitle"))
println(doc >?> text("#priceblock_ourprice"))

Output:

Some(TaoTronics Bluetooth Receiver, Portable Wireless Car Aux Adapter 3.5mm Stereo Car Kits, Bluetooth 4.0 Hands-free Bluetooth Audio Adapter for Home /Car Stereo Music Streaming Sound System)  
None

The title is extracted just fine, but the price comes back as None.

ruippeixotog commented 7 years ago

Unfortunately, I can't reproduce this: the document I get has no HTML element with the ID priceblock_ourprice. On my computer, in the Amazon page downloaded by JsoupBrowser, the price is in this element:

scala> println(doc >?> text("#color_name_1_price"))
Some($15.99)

This may be a difference in localization or user-agent. It may help to change the user agent used by JsoupBrowser (by passing it in its constructor) to make it match your browser.
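As a rough sketch of the user-agent suggestion above: the user-agent string below is only an example value (an assumption, not taken from this thread), and `browserLikeUserAgent` and `customBrowser` are illustrative names. Substitute the string your own browser actually sends.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

// Example user-agent string (assumed value); replace it with the one your browser sends.
val browserLikeUserAgent =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

// The user agent is passed as the first constructor argument of JsoupBrowser.
val customBrowser = new JsoupBrowser(browserLikeUserAgent)

// Fetching the page would then use this user agent (performs a network request):
// val doc = customBrowser.get("https://www.amazon.com/dp/B0756FN69M")
```

Whether this changes the markup Amazon returns is not guaranteed; it only removes one difference between the two requests.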

If you are sure that an HTML document has a valid element which scala-scraper can't find by its id, please provide a static HTML page with which I can reproduce the problem (Document#toHtml can be used for that in scala-scraper).
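A minimal sketch of producing such a static page with `toHtml`. To keep it runnable offline, the document here is built with `parseString` from a one-line stand-in for the real page; the file name `snapshot.html` is arbitrary.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.text

val snapshotBrowser = new JsoupBrowser

// Stand-in for `snapshotBrowser.get(url)` so that the sketch does not need network access.
val fetched = snapshotBrowser.parseString(
  "<html><body><span id=\"priceblock_ourprice\">$15.99</span></body></html>")

// Write the document exactly as scala-scraper sees it, encoded as UTF-8.
val snapshotPath = Paths.get("snapshot.html")
Files.write(snapshotPath, fetched.toHtml.getBytes(StandardCharsets.UTF_8))

// Reload the snapshot from disk and run the same query against it.
val reloaded = snapshotBrowser.parseFile(snapshotPath.toFile, "UTF-8")
val reloadedPrice = reloaded >?> text("#priceblock_ourprice")
```

Attaching a file produced this way lets someone else run the same query against exactly the same markup.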

wiradikusuma commented 7 years ago

I checked the HTML from both Chrome and Jsoup; both contain #priceblock_ourprice. Please see the attachment.

attachment.zip

Could it be A/B testing from Amazon? (same URL, but we receive different content)

ruippeixotog commented 7 years ago

Oh, I see; it happens because that item does not ship to Portugal, and so I am not shown the bigger priceblock_ourprice price label.

However, I'm able to successfully extract the price both with from_chrome.html and with from_jsoup.html:

scala> val browser = new JsoupBrowser
browser: net.ruippeixotog.scalascraper.browser.JsoupBrowser = net.ruippeixotog.scalascraper.browser.JsoupBrowser@44de88e4

scala> val doc = browser parseFile "from_chrome.html"
doc: browser.DocumentType = (...)

scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)

scala> val doc = browser parseFile "from_jsoup.html"
doc: browser.DocumentType = (...)

scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)

Does this also happen to you when you load the HTML files from disk?

wiradikusuma commented 7 years ago

I've found the culprit: encoding. I need to explicitly tell Jsoup to use UTF-8. This works:

import java.net.URL

val doc = browser.parseInputStream(new URL(url).openStream, "UTF-8")

Reading from my HTML files works because I saved them as UTF-8.

Thanks for taking the time to investigate my issue. If you know a better way, feel free to add it; otherwise, just close this. Thanks!
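The workaround can be wrapped in a small helper that takes any `InputStream`, which makes it possible to try it on in-memory bytes before pointing it at the live page. `priceFrom` is a hypothetical name introduced for this sketch, not part of scala-scraper.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.text

// Hypothetical helper: parse a stream with an explicit charset and extract the price.
def priceFrom(in: InputStream, charset: String): Option[String] = {
  val doc = new JsoupBrowser().parseInputStream(in, charset)
  doc >?> text("#priceblock_ourprice")
}

// Offline usage with a stand-in for the real response body.
val sampleBytes = "<span id=\"priceblock_ourprice\">$15.99</span>"
  .getBytes(StandardCharsets.UTF_8)
val samplePrice = priceFrom(new ByteArrayInputStream(sampleBytes), "UTF-8")
```

For the live page, the stream would come from `new URL(url).openStream`, as in the line above.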

ruippeixotog commented 7 years ago

I'm glad that you found the problem :) It's a strange issue nonetheless, since JsoupBrowser explicitly requests UTF-8 in HTTP requests. This may be a problem with how jsoup handles content encodings or with some missing headers in the request or response; it's hard to tell.
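One way to narrow this down, sketched here with plain JDK classes only, is to look at what the server declares in its `Content-Type` and `Content-Encoding` response headers. `declaredCharset` and `encodingHeaders` are hypothetical helper names, and this is a diagnostic aid rather than a confirmed explanation of the behaviour in this issue.

```scala
import java.net.{HttpURLConnection, URL}

// Extracts the charset parameter from a Content-Type header value, if one is present.
def declaredCharset(contentType: String): Option[String] =
  Option(contentType).toList
    .flatMap(_.split(";").toList.drop(1)) // skip the media type, keep the parameters
    .map(_.trim)
    .collectFirst {
      case p if p.toLowerCase.startsWith("charset=") =>
        p.substring("charset=".length).trim.stripPrefix("\"").stripSuffix("\"")
    }

// Returns the (Content-Type, Content-Encoding) headers of a response.
// Performs a network request; the server may answer differently depending on request headers.
def encodingHeaders(url: String): (Option[String], Option[String]) = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  try (Option(conn.getContentType), Option(conn.getContentEncoding))
  finally conn.disconnect()
}
```

For example, `encodingHeaders(url)._1.flatMap(declaredCharset)` yields the charset the server claims to use, if any, which can then be compared with the one passed to `parseInputStream`.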

I'll close this for now, but please update this if you find out anything else about it.