[QUESTION] Can I access the contents of HTML elements without it being trimmed?

Hey, I would probably do something like this:

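// imports as found in skrape.it 1.x; package names may differ in other versions
import it.skrape.core.htmlDocument
import it.skrape.selects.eachHref
import it.skrape.selects.html5.a
import org.intellij.lang.annotations.Language

@Language("HTML")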
val someMarkupWithLinks: String = """
            <div class="foo">
                <ul>
                    <li>
                        <span>abc</span>
                        <a href="http://www.valid.com">a valid link</a>
                    </li>
                    <li>
                        <span>def</span>
                        <a href="http://www . invalid . com">an invalid link</a>
                    </li>
                    <li>
                        <span>ghi</span>
                        <a href="http://valid.com/rocks">another valid link</a>
                    </li>
                    <li>
                        <span>jkl</span>
                        <a href="http://www.invalid.com/ whitespaced/path">another invalid link</a>
                    </li>
                </ul>
            </div>
"""

fun main() {
    // get the href attributes of all a-tags in the document
    val hrefValues = htmlDocument(someMarkupWithLinks) {
        a {
            findAll {
                eachHref
            }
        }
    }

    // keep only the hrefs that contain a space
    val hrefsWithWhiteSpace = hrefValues.filter { it.contains(" ") }

    // prints: [http://www . invalid . com, http://www.invalid.com/ whitespaced/path]
    println(hrefsWithWhiteSpace)
}

If you need the hrefs together with their corresponding link texts, you can call eachLink instead of eachHref. It returns a Map<String, String> with the link text as key and the href value as value. Here is a one-liner using eachLink:

htmlDocument(someMarkupWithLinks) { a { findAll { eachLink } } }.filter { it.value.contains(" ") }.also { println(it) }
// prints: {an invalid link=http://www . invalid . com, another invalid link=http://www.invalid.com/ whitespaced/path}

If you want to know whether the link text contains whitespace, you can filter on the keys of the eachLink result instead of the values.
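
A minimal sketch of filtering on the keys rather than the values. To keep it runnable without skrape.it, it uses a plain map as a stand-in for the eachLink result (link text as key, href as value). The function name linksWithWhitespaceInText and the "home" entry are made up for illustration and are not part of the library or the markup above.

```kotlin
// Stand-in for the Map<String, String> returned by `a { findAll { eachLink } }`:
// link text as key, href as value. The first two entries are taken from the sample
// markup; "home" is a hypothetical single-word link added for illustration.
val links: Map<String, String> = mapOf(
    "a valid link" to "http://www.valid.com",
    "an invalid link" to "http://www . invalid . com",
    "home" to "http://valid.com/home"
)

// Keep only the entries whose link text (the map key) contains a space.
fun linksWithWhitespaceInText(links: Map<String, String>): Map<String, String> =
    links.filterKeys { it.contains(" ") }

fun main() {
    // prints: {a valid link=http://www.valid.com, an invalid link=http://www . invalid . com}
    println(linksWithWhitespaceInText(links))
}
```

With the real document you would probably apply the same filterKeys call directly to the map returned by eachLink.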

Hope this helps. Just let me know if you have more questions :) I'm using version 1.0.0-alpha8 in the example.

skrapeit / skrape.it, issue #128