Closed QAutomatron closed 2 years ago
Hey, is the HTML example taken from an actual browser, or is this what you get after rendering with BrowserFetcher?
If this is what the site returns, it seems to be correct behavior for the parser to do nothing with the CDATA block: by definition, all characters enclosed in a CDATA block are interpreted as character data, not as markup or entity references. Every character is therefore taken literally and will not be rendered.
If you want to parse it anyway, this should be possible by selecting the script element and extracting its inner text (which basically means getting the CDATA block as a String). Then remove the CDATA markers surrounding the markup inside that string and parse the result separately using htmlDocument(yourCdataBlockValue).
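The "remove the CDATA markers" step could look roughly like the sketch below. It is written in Python only to keep it short; the same steps apply in Kotlin. The helper name strip_cdata and the regular expression are assumptions made for illustration and are not part of skrape{it}.

```python
import re

# Hypothetical helper, not part of skrape{it}: matches a string that consists
# of a single <![CDATA[ ... ]]> wrapper, optionally padded with whitespace.
CDATA_WRAPPER = re.compile(r"^\s*<!\[CDATA\[(.*)\]\]>\s*$", re.DOTALL)


def strip_cdata(script_text: str) -> str:
    """Return the text inside a CDATA wrapper, or the trimmed input if there is none."""
    match = CDATA_WRAPPER.match(script_text)
    if match is None:
        # No wrapper found: leave the content alone apart from trimming.
        return script_text.strip()
    return match.group(1).strip()
```

The returned string is what would then be handed to htmlDocument(...) on the skrape{it} side.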
Please let me know if it works. Otherwise I will check tomorrow, or on Monday at the latest, when I am back at my keyboard.
PS: since I am sending this message from my phone and not sitting in front of a PC, it's hard to provide ad-hoc code samples right now.
Thanks for such a quick response. The provided example is what BrowserFetcher returns.
In a real browser it looks like this:
And in the page source it looks like this (or if I use HttpFetcher):
<div class="archive-products">
<ul class="products products-container skeleton-loading list pcols-lg-3 pcols-md-3 pcols-xs-2 pcols-ls-2 pwidth-lg-3 pwidth-md-3 pwidth-xs-2 pwidth-ls-1">
<script type="text/template">
"\t\t\n<li class=\"product-col product type-product post-19102 status-publish first instock product_cat-green-stuff-world product_cat-terrain product_tag-green-stuff-world has-post-thumbnail taxable shipping-taxable purchasable product-type-simple\">\n ...
</ul></div>
But as a workaround I will try to get the CDATA block and parse it.
FYI: I created an upstream issue/question at the HtmlUnit project, which skrape{it} uses internally to render pages. I broke it down to the reason we are getting these CDATA blocks.
Thanks.
I also tried the workaround you suggested. I can easily get the content of <script type="text/template"> and use it as htmlDocument (I don't even need to render anything), but then I need to get rid of useless data like \t\t\n and the escaped quotes \". It's still better than nothing.
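One possible way to deal with the \t, \n and \" sequences, assuming the template body is a complete JSON-style string literal (it starts and ends with a double quote, as the page source above suggests), is to let a JSON parser decode it instead of replacing escapes by hand. This is a sketch in Python; unescape_template is a hypothetical helper name, and the JSON assumption may not hold for every WordPress theme.

```python
import json


def unescape_template(raw: str) -> str:
    """Decode a JSON-style string literal into plain markup.

    Assumption: `raw` is one complete quoted literal such as
    "\\t\\t\\n<li class=\\"product-col\\">...</li>\\n".
    """
    # strict=False also tolerates literal control characters inside the string.
    decoded = json.loads(raw.strip(), strict=False)
    if not isinstance(decoded, str):
        raise ValueError("expected a JSON string literal")
    # The escapes are now real tabs and newlines; trim the leading/trailing ones.
    return decoded.strip()
```

If the content is not a valid literal, json.loads raises json.JSONDecodeError, which makes it obvious when the assumption is wrong rather than silently producing broken markup.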
Yeah, I noticed this as well. The text inside the script block is not valid HTML and includes escaped line indentation (tabs \t and newlines \n). Who knows what WordPress is doing there 😄
But I'm glad to hear the workaround was possible.
Hey. I want to scrape some WordPress-based sites with dynamic content using BrowserFetcher. The elements that I want are located in a <script type="text/template"> block, but the fetcher does not render them correctly and marks them as CDATA.
Is there a way to render them? Or maybe I missed something in the configuration.
Example