skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[QUESTION] Scrape elements from text/template block #181

Closed: QAutomatron closed this 2 years ago

QAutomatron commented 2 years ago

Hey. I want to scrape some WordPress-based sites with dynamic content using BrowserFetcher. The elements I want are located in a <script type="text/template"> block, but the fetcher does not render them correctly and marks them as CDATA. Is there a way to render them? Or maybe I missed something in the configuration.

Example

<ul class="products products-container skeleton-loading list pcols-lg-3 pcols-md-3 pcols-xs-2 pcols-ls-2 pwidth-lg-3 pwidth-md-3 pwidth-xs-2 pwidth-ls-1"> 
    <script type="text/template">
//<![CDATA[
"\t\t\n<li class=\"product-col product type-product post-19102 status-publish first instock product_cat-green-stuff-world product_cat-terrain product_tag-green-stuff-world has-post-thumbnail taxable shipping-taxable purchasable product-type-simple\">\n<div class=\"product-inner\">\n\t\n\t<div class=\"product-image\">\n\n\t\t...
//]]>
               </script> 
   </ul> 

christian-draeger commented 2 years ago

Hey, is the HTML example taken from an actual browser, or is this what you get after rendering with BrowserFetcher?

If this is what the site returns, it seems to be correct behavior for the parser to do nothing with the CDATA block, because all characters enclosed in a CDATA block are by definition interpreted as characters, not as markup or entity references. Every character is therefore taken literally and will not be rendered.

If you want to parse it anyway, this should be possible by selecting the script element and extracting its inner text (which basically means getting the CDATA block as a String). Then remove the CDATA sequences that surround the markup inside that string and render the string separately using htmlDocument(yourCdataBlockValue)

Please let me know if it works. Otherwise I will check tomorrow, or on Monday at the latest, when I am back at a keyboard

PS: since I am sending this message from my phone and not sitting in front of a PC it's hard to provide ad-hoc code samples right now

QAutomatron commented 2 years ago

Thanks for such a quick response. The provided example is what BrowserFetcher returns.

In a real browser it looks like this:

[Image: Screenshot 2022-02-12 at 01 32 07]

And in the page source (or when I use HttpFetcher) it looks like this:

                                <div class="archive-products">
                                    <ul class="products products-container skeleton-loading list pcols-lg-3 pcols-md-3 pcols-xs-2 pcols-ls-2 pwidth-lg-3 pwidth-md-3 pwidth-xs-2 pwidth-ls-1">
                                        <script type="text/template">
                                        "\t\t\n<li class=\"product-col product type-product post-19102 status-publish first instock product_cat-green-stuff-world product_cat-terrain product_tag-green-stuff-world has-post-thumbnail taxable shipping-taxable purchasable product-type-simple\">\n ...
</ul></div>

But as a workaround I will try to get the CDATA block and parse it.
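
A minimal sketch of the CDATA-stripping step of that workaround, assuming the inner text of the script element has already been extracted as a String. The helper name stripCdataMarkers and the shortened sample markup are made up for illustration and are not part of skrape{it}; only htmlDocument comes from this thread.

```kotlin
// Hypothetical helper, not part of skrape{it}: removes the commented-out
// CDATA markers that wrap the script content in the BrowserFetcher output.
fun stripCdataMarkers(scriptContent: String): String =
    scriptContent
        .replace("//<![CDATA[", "")
        .replace("//]]>", "")
        .trim()

fun main() {
    // Shortened stand-in for the inner text of <script type="text/template">.
    val scriptContent = """
        //<![CDATA[
        <li class="product-col"><div class="product-inner"></div></li>
        //]]>
    """.trimIndent()

    val markup = stripCdataMarkers(scriptContent)
    println(markup)
    // The cleaned string can then be handed to htmlDocument(markup).
}
```

Plain string replacement is used here instead of a regex because the two marker sequences are fixed; if a site emits the markers without the leading //, the patterns would need adjusting.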

christian-draeger commented 2 years ago

FYI: I created an upstream issue/question at the HtmlUnit project, which skrape{it} uses internally to render pages. I narrowed it down to the reason why we are getting these CDATA blocks.

QAutomatron commented 2 years ago

Thanks.

I also tried the workaround you suggested. I can easily get the content of <script type="text/template"> and use it as an htmlDocument (I don't even need to render anything), but then I need to get rid of useless data like \t\t\n or the escaped quotes \". Still, it's better than nothing.
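
The cleanup could look roughly like the sketch below. It assumes the template content is a JSON-style string literal (wrapped in double quotes, with \t, \n and \" escapes), which is what the example above suggests but is not confirmed. The helper name unescapeTemplate and the shortened sample are hypothetical, and only the most common escapes are handled (no \uXXXX sequences).

```kotlin
// Hypothetical helper: decodes the escape sequences seen in the template text.
// Assumption: the content is a JSON-style string literal in double quotes.
fun unescapeTemplate(raw: String): String {
    // Drop the surrounding double quotes, if both are present.
    val text = raw.trim().removeSurrounding("\"")
    val out = StringBuilder()
    var i = 0
    while (i < text.length) {
        val c = text[i]
        if (c == '\\' && i + 1 < text.length) {
            when (val next = text[i + 1]) {
                't' -> out.append('\t')
                'n' -> out.append('\n')
                'r' -> out.append('\r')
                '"', '\\', '/' -> out.append(next)
                else -> out.append(c).append(next) // keep unknown escapes as-is
            }
            i += 2
        } else {
            out.append(c)
            i++
        }
    }
    return out.toString()
}

fun main() {
    // Raw string: the backslashes are literal characters, as in the fetched page.
    val raw = """
        "\t\t\n<li class=\"product-col\">\n</li>"
    """.trim()

    // Decode the escapes, then trim the leading tabs and newline.
    val markup = unescapeTemplate(raw).trim()
    println(markup)
}
```

If a JSON library is already on the classpath, decoding the literal with it would likely be more robust than a hand-written loop.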

christian-draeger commented 2 years ago

Yeah, I noticed this as well. The text inside the script block is not valid HTML and includes indentation characters (tabs \t and newlines \n). Who knows what WordPress is doing there 😄

But glad to hear the work-around was possible