skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[QUESTION] Can I access the contents of HTML elements without it being trimmed? #128

Closed. timfennis closed this 3 years ago

timfennis commented 3 years ago

Describe what you want to achieve

I want to use this library to parse HTML documents and find links whose text (incorrectly) contains leading or trailing spaces. For example:

Please click <a href="www.example.com">link </a>to visit my website.

But it appears that the content is always trimmed. Is there any way I can access the contents of these links without trim being invoked?

Code Sample

val text: List<String> = ....
val links = text.flatMap {
    try {
        htmlDocument(it).findAll("a")
    } catch (e: ElementNotFoundException) {
        emptyList()
    }
}

val linksWithSpaces = links.filter {
    it.text != it.text.trim()
}

christian-draeger commented 3 years ago

Hey, I think I would probably do something like this:

@Language("HTML")
val someMarkupWithLinks: String = """
            <div class="foo">
                <ul>
                    <li>
                        <span>abc</span>
                        <a href="http://www.valid.com">a valid link</a>
                    </li>
                    <li>
                        <span>def</span>
                        <a href="http://www . invalid . com">an invalid link</a>
                    </li>
                    <li>
                        <span>ghi</span>
                        <a href="http://valid.com/rocks">another valid link</a>
                    </li>
                    <li>
                        <span>jkl</span>
                        <a href="http://www.invalid.com/ whitespaced/path">another invalid link</a>
                    </li>
                </ul>
            </div>
"""

fun main() {
    // get the href attributes of all a-tags in the document
    val hrefValues = htmlDocument(someMarkupWithLinks) {
        a {
            findAll {
                eachHref
            }
        }
    }

    // filter for the ones that contain a whitespace
    val hrefsWithWhiteSpace = hrefValues.filter { it.contains(" ") }

    // prints: [http://www . invalid . com, http://www.invalid.com/ whitespaced/path]
    println(hrefsWithWhiteSpace)
}
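
The filter above only looks for the space character. If tabs or line breaks inside an href should also count, a slightly broader check could look like the sketch below. It uses only the Kotlin standard library: `hasWhitespace` is a hypothetical helper (not part of skrape.it), and the hard-coded list stands in for the values collected via `eachHref`, with one made-up entry containing a tab.

```kotlin
// Hypothetical helper, not part of skrape.it:
// true if the string contains any whitespace character (space, tab, newline, ...).
fun hasWhitespace(value: String): Boolean = value.any { it.isWhitespace() }

fun main() {
    // Stand-in for the list returned by eachHref; the last entry is invented for illustration.
    val hrefValues = listOf(
        "http://www.valid.com",
        "http://www . invalid . com",
        "http://valid.com/rocks",
        "http://www.invalid.com/ whitespaced/path",
        "http://www.invalid.com/tab\tpath"
    )

    // Keeps the three entries containing a space or a tab.
    val hrefsWithWhiteSpace = hrefValues.filter(::hasWhitespace)
    hrefsWithWhiteSpace.forEach { println(it) }
}
```

Whether a tab or newline in an href should be treated as an error depends on the use case, so `contains(" ")` may well be the better fit if only literal spaces matter.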

If you need the hrefs together with their corresponding link text, you can call eachLink instead of eachHref. It returns a Map<String, String> with the link text as key and the href as value. Here is a one-liner using eachLink:

htmlDocument(someMarkupWithLinks) { a { findAll { eachLink } } }.filter { it.value.contains(" ") }.also { println(it) }
// prints: {an invalid link=http://www . invalid . com, another invalid link=http://www.invalid.com/ whitespaced/path}

If you want to know whether a link's text contains whitespace, you can filter on the eachLink keys instead of the values.
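
As a rough sketch of filtering on keys versus values, the snippet below works on a hand-written map that stands in for the Map<String, String> described for eachLink. The entries and the `hasOuterWhitespace` helper are assumptions made for illustration. The padding check mirrors the `text != text.trim()` comparison from the question, so it can only match if the link text reaches the map untrimmed.

```kotlin
// Hand-written stand-in for the link text -> href map; entries are made up for illustration.
val links: Map<String, String> = mapOf(
    "a valid link" to "http://www.valid.com",
    "an invalid link" to "http://www . invalid . com",
    "padded " to "http://valid.com/rocks"
)

// Hypothetical helper: true when the string starts or ends with whitespace.
fun hasOuterWhitespace(s: String): Boolean = s != s.trim()

fun main() {
    // Filter on values: hrefs that contain a space.
    val badHrefs = links.filterValues { it.contains(" ") }
    // Filter on keys: link texts with leading or trailing whitespace.
    val paddedTexts = links.filterKeys(::hasOuterWhitespace)

    println(badHrefs.keys)    // prints: [an invalid link]
    println(paddedTexts.keys) // prints: [padded ]
}
```

One thing to keep in mind with a map keyed by link text: keys are unique, so two links with identical text would end up as a single entry.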

Hope this helps. Just let me know if you have more questions :) I'm using version 1.0.0-alpha8 in the example.