Closed timfennis closed 3 years ago
Hey, I think I would probably do something like this:
import it.skrape.core.htmlDocument
import it.skrape.selects.eachHref
import it.skrape.selects.html5.a
import org.intellij.lang.annotations.Language

@Language("HTML")
val someMarkupWithLinks: String = """
    <div class="foo">
        <ul>
            <li>
                <span>abc</span>
                <a href="http://www.valid.com">a valid link</a>
            </li>
            <li>
                <span>def</span>
                <a href="http://www . invalid . com">an invalid link</a>
            </li>
            <li>
                <span>ghi</span>
                <a href="http://valid.com/rocks">another valid link</a>
            </li>
            <li>
                <span>jkl</span>
                <a href="http://www.invalid.com/ whitespaced/path">another invalid link</a>
            </li>
        </ul>
    </div>
"""

fun main() {
    // get the href attributes of all a-tags in the document
    val hrefValues = htmlDocument(someMarkupWithLinks) {
        a {
            findAll {
                eachHref
            }
        }
    }
    // keep only the hrefs that contain a space
    val hrefsWithWhiteSpace = hrefValues.filter { it.contains(" ") }
    println(hrefsWithWhiteSpace) // will print following list --> '[http://www . invalid . com, http://www.invalid.com/ whitespaced/path]'
}
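As a side note, `contains(" ")` only matches the plain space character. If tabs or line breaks inside an href should count as well, the filter step could be sketched like this. This is only a stdlib sketch: `sampleHrefs` is a hand-written stand-in for the list returned by `eachHref`, and `filterHrefsWithWhitespace` is a hypothetical helper name, not part of the library.

```kotlin
// Hand-written stand-in for the List<String> returned by `eachHref` (assumption).
val sampleHrefs = listOf(
    "http://www.valid.com",
    "http://www . invalid . com",
    "http://valid.com/rocks",
    "http://www.invalid.com/\twhitespaced/path" // contains a tab instead of a space
)

// Hypothetical helper: keeps every href that contains any whitespace character
// (space, tab, line break, ...), not only the plain space.
fun filterHrefsWithWhitespace(hrefs: List<String>): List<String> =
    hrefs.filter { href -> href.any { it.isWhitespace() } }
```

Calling `filterHrefsWithWhitespace(sampleHrefs)` keeps the second and fourth entries, including the one that only contains a tab.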
If you need the hrefs together with their corresponding link text, you can call eachLink instead of eachHref. It returns a Map<String, String> with the link text as the key and the href value as the value.
Here is a one-liner using eachLink:
htmlDocument(someMarkupWithLinks) { a { findAll { eachLink } } }.filter { it.value.contains(" ") }.also { println(it) }
// will print --> '{an invalid link=http://www . invalid . com, another invalid link=http://www.invalid.com/ whitespaced/path}'
If you want to know whether the link text contains whitespace, you can filter on the eachLink keys instead of the values.
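Filtering on the keys could look roughly like the sketch below. It uses only the Kotlin standard library: `sampleLinks` is a hand-written stand-in for the Map<String, String> that eachLink returns, and `linksWithWhitespaceInText` is a hypothetical helper name.

```kotlin
// Hand-written stand-in for the Map<String, String> returned by `eachLink`
// (link text -> href value); the entries are made up for illustration.
val sampleLinks = mapOf(
    "home" to "http://www.valid.com",
    "a valid link" to "http://valid.com/rocks",
    "about" to "http://www . invalid . com"
)

// Hypothetical helper: keeps the entries whose link text (the map key)
// contains a space, regardless of what the href value looks like.
fun linksWithWhitespaceInText(links: Map<String, String>): Map<String, String> =
    links.filterKeys { text -> " " in text }
```

With the sample data only the "a valid link" entry remains. Keep in mind that a Map cannot hold duplicate keys, so links that share the same text end up as a single entry.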
Hope this helps. Just let me know if you have more questions :) I'm using version 1.0.0-alpha8 in the example.
Describe what you want to achieve
I want to use this library to parse HTML documents and find links that (incorrectly) contain spaces inside links. For example
But it appears that the content is always trimmed. Is there any way I can access the contents of these links without trim being invoked?
Code Sample