Closed wiradikusuma closed 7 years ago
Unfortunately, I can't seem to produce a document that has an HTML element with ID #priceblock_ourprice. On my computer, in the Amazon page downloaded by JsoupBrowser, the price is in this element:
scala> println(doc >?> text("#color_name_1_price"))
Some($15.99)
This may be a difference in localization or user-agent. It may help to change the user agent used by JsoupBrowser (by passing it in its constructor) to make it match your browser.
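A minimal sketch of that suggestion, assuming the scala-scraper version in use accepts the user agent as the first constructor argument of JsoupBrowser. The user agent string below is only a placeholder; copy the real one from your own browser.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object UserAgentExample {
  // Placeholder value (an assumption): replace it with your browser's actual user agent.
  val userAgent: String =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"

  def main(args: Array[String]): Unit = {
    // Pass the user agent to the constructor so requests resemble the browser's.
    val browser = new JsoupBrowser(userAgent)
    val doc = browser.get("https://www.amazon.com/dp/B0756FN69M")

    // >?> yields an Option, so a missing element prints None instead of throwing.
    println(doc >?> text("#priceblock_ourprice"))
  }
}
```

Amazon may still serve different markup depending on region, so matching the user agent is not guaranteed to give the same page as the browser.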
If you are sure that an HTML document has a valid element which scala-scraper can't find by its id, please provide a static HTML page with which I can reproduce the problem (Document#toHtml can be used for that in scala-scraper).
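For example, a page could be dumped to disk roughly like this. It is only a sketch: the object name, the helper saveHtml and the file name page_dump.html are made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

import net.ruippeixotog.scalascraper.browser.JsoupBrowser

object SavePageExample {
  // Hypothetical helper: writes an HTML string to the given path as UTF-8.
  def saveHtml(html: String, path: Path): Path =
    Files.write(path, html.getBytes(StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    val browser = new JsoupBrowser
    val doc = browser.get("https://www.amazon.com/dp/B0756FN69M")

    // toHtml returns the document as parsed by the browser, serialized back to HTML.
    saveHtml(doc.toHtml, Paths.get("page_dump.html"))
  }
}
```

The saved file can then be attached to the issue and loaded again with parseFile.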
I checked with both Chrome and Jsoup; both contain #priceblock_ourprice. Please see the attachment.
Could it be A/B testing from Amazon? (same URL, but we receive different content)
Oh, I see; it happens because that item does not ship to Portugal, and so I am not shown the bigger priceblock_ourprice price label.
However, I'm able to successfully extract the price both with from_chrome.html and with from_jsoup.html:
scala> val browser = new JsoupBrowser
browser: net.ruippeixotog.scalascraper.browser.JsoupBrowser = net.ruippeixotog.scalascraper.browser.JsoupBrowser@44de88e4
scala> val doc = browser parseFile "from_chrome.html"
doc: browser.DocumentType = (...)
scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)
scala> val doc = browser parseFile "from_jsoup.html"
doc: browser.DocumentType = (...)
scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)
Does this also happen to you when you load the HTML files from disk?
I've found the culprit: encoding. I need to explicitly tell Jsoup to use UTF-8. This works:
val doc = browser.parseInputStream(new URL(url).openStream, "UTF-8")
Reading from my HTML files works because I saved them as UTF-8.
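As a general illustration of why the declared charset matters (not a diagnosis of what jsoup did here), the sketch below decodes the same UTF-8 bytes with two different charsets using only the JVM standard library. The sample string is made up.

```scala
import java.nio.charset.StandardCharsets

object CharsetMismatchDemo {
  // Made-up sample text; \u20ac is the euro sign, which takes three bytes in UTF-8.
  val original: String = "Price: \u20ac15.99"

  // The bytes as a server sending UTF-8 would put them on the wire.
  val bytes: Array[Byte] = original.getBytes(StandardCharsets.UTF_8)

  // Decoding with the matching charset restores the text.
  val decodedAsUtf8: String = new String(bytes, StandardCharsets.UTF_8)

  // Decoding with a single-byte charset turns the euro sign into three unrelated characters.
  val decodedAsLatin1: String = new String(bytes, StandardCharsets.ISO_8859_1)

  def main(args: Array[String]): Unit = {
    println(s"UTF-8 round trip matches: ${decodedAsUtf8 == original}")
    println(s"ISO-8859-1 decoding matches: ${decodedAsLatin1 == original}")
  }
}
```

Passing the charset explicitly, as in the parseInputStream call above, avoids relying on whatever charset would otherwise be guessed.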
Thanks for taking the time to investigate my issue. If you know a better way, feel free to add, otherwise just close this. Thanks!
I'm glad that you found the problem :) It's a strange issue nonetheless, since JsoupBrowser explicitly requests UTF-8 in HTTP requests. This may be a problem with how jsoup handles content encodings or with some missing headers in the request or response; it's hard to tell.
I'll close this for now, but please update this if you find out anything else about it.
I'm trying to scrape an Amazon page, e.g.
https://www.amazon.com/dp/B0756FN69M
but I can't extract the price, even though it's in the source code:
Here's the code:
Output:
The "title" is extracted just fine.