skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[QUESTION] Find Elements using Regex #135

Closed NusretOzates closed 3 years ago

NusretOzates commented 3 years ago

Describe what you want to achieve

Hello! I would like to find elements using a regex, like in this Beautiful Soup example :)

Code Sample

self.get_element(r'\s*id=\"([a-z0-9A-Z\"\'_\-\s]*footer[a-z0-9A-Z\"\'_\-\s]*)\"', decoded_object)

christian-draeger commented 3 years ago

It's currently not supported, but it should be possible to add as a feature. I will have a look at how it could be integrated into the DSL in a user-friendly way. If you have any suggestions or wishes, we can discuss them if you want :)

christian-draeger commented 3 years ago

After thinking about it, I have decided to decline the feature request, because this can already be done with a plain filter:

aListOfDocElement.filter { it.toCssSelector.matches("some regex".toRegex()) }

What I can imagine is providing a helper function that does the filtering.

EDIT: This can be useful for catching multiple elements at once, or for matching on partial class, id, ... attribute values. I have added it to the DSL.
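
The filter-plus-helper idea can be sketched in plain Kotlin without the library. `Element`, `cssSelector` and `filterBySelector` below are hypothetical names for illustration, not skrape.it's API. The sketch also shows a detail worth keeping in mind: `String.matches(Regex)` only succeeds when the whole string matches, whereas `Regex.containsMatchIn` accepts a match anywhere.

```kotlin
// Hypothetical stand-in for a parsed element; not skrape.it's DocElement.
data class Element(val tag: String, val cssSelector: String)

// Hypothetical helper: keeps elements whose selector contains a match for the regex.
fun List<Element>.filterBySelector(regex: Regex): List<Element> =
    filter { regex.containsMatchIn(it.cssSelector) }

fun main() {
    val elements = listOf(
        Element("ol", "html > body > header > nav > ol.ordered-navigation"),
        Element("ul", "html > body > header > nav > ul.unordered-navigation"),
        Element("h1", "html > body > header > h1")
    )
    val regex = "(ol|ul)\\.[a-z-]*navigation$".toRegex()

    // matches() requires the entire selector to match, so nothing is kept here.
    val fullMatches = elements.filter { it.cssSelector.matches(regex) }
    // containsMatchIn() accepts a match anywhere in the selector.
    val partialMatches = elements.filterBySelector(regex)

    println(fullMatches.map { it.tag })    // []
    println(partialMatches.map { it.tag }) // [ol, ul]
}
```

Which of the two semantics is wanted depends on whether the regex is written against the full selector path or only its last segment.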

christian-draeger commented 3 years ago

example:

<body>
    i'm the body
    <header>
        <h1>i'm the headline</h1>
        <nav>
            <ol class='ordered-navigation'>
                <li>1st nav item</li>
                <li>2nd nav item</li>
                <li>3rd nav item</li>
                <li>last nav item</li>
            </ol>
            <ul class='unordered-navigation'>
                <li>1st nav item</li>
                <li>2nd nav item</li>
                <li>3rd nav item</li>
                <li>last nav item</li>
            </ul>
        </nav>
    </header>
</body>

Assuming aValidDocument parses the example HTML snippet above:

@Test
fun `can pick element by css selector matching regex`() {
    val someRegex = "^(ol|ul).*navigation$".toRegex()

    aValidDocument {
        findBySelectorMatching(someRegex) {
            expectThat(map { it.toCssSelector }).containsExactly(
                "html > body > header > nav > ol.ordered-navigation",
                "html > body > header > nav > ul.unordered-navigation"
            )
        }
    }
}

@Test
fun `can pick element by css selector matching regex DSL invoke`() {
    val someRegex = "^(ol|ul).*navigation$".toRegex()

    aValidDocument {
        someRegex {
            expectThat(map { it.toCssSelector }).containsExactly(
                "html > body > header > nav > ol.ordered-navigation",
                "html > body > header > nav > ul.unordered-navigation"
            )
        }
    }
}
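
For readers wondering how `someRegex { ... }` can be valid Kotlin: one common way to get this syntax is a member extension `operator fun Regex.invoke` declared inside the DSL scope class. The following is a minimal sketch of that technique only. `Node`, `DocScope`, `document` and `ownSelector` are hypothetical names, and skrape.it's actual implementation may differ.

```kotlin
// Hypothetical stand-ins; not skrape.it's real types.
data class Node(val ownSelector: String)

class DocScope(private val nodes: List<Node>) {
    // Named variant: keep nodes whose selector fully matches the regex,
    // then run the block with the matches as receiver.
    fun <T> findBySelectorMatching(regex: Regex, block: List<Node>.() -> T): T =
        nodes.filter { it.ownSelector.matches(regex) }.block()

    // Member extension: inside a DocScope, `someRegex { ... }` resolves to this.
    operator fun <T> Regex.invoke(block: List<Node>.() -> T): T =
        findBySelectorMatching(this, block)
}

// Entry point that runs a block with a DocScope as receiver.
fun <T> document(nodes: List<Node>, block: DocScope.() -> T): T =
    DocScope(nodes).block()

fun main() {
    val someRegex = "^(ol|ul).*navigation$".toRegex()
    val nodes = listOf(
        Node("ol.ordered-navigation"),
        Node("ul.unordered-navigation"),
        Node("h1")
    )
    val picked = document(nodes) {
        someRegex { map { it.ownSelector } }
    }
    println(picked) // [ol.ordered-navigation, ul.unordered-navigation]
}
```

Because the operator is a member of the scope class, the invoke syntax is only available inside the DSL block and does not leak onto every `Regex` in the codebase.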

NusretOzates commented 3 years ago

The filter idea actually looks great! Thanks a lot for that commit, too! So for CSS selectors I can use it as in the test examples, and for other attributes (like id) I can use filters. Very nice! Like this:

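extract {
    htmlDocument {
        html {
            findAll {
                // Raw string, so the regex backslashes need no extra escaping.
                // The attribute value has no surrounding quotes, so the trailing \" is dropped.
                filter {
                    it.attribute("id").matches("""[a-z0-9A-Z"'_\-\s]*footer[a-z0-9A-Z"'_\-\s]*""".toRegex())
                }
            }
        }
    }
}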
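
As a quick check of the id pattern on its own, here is a small sketch that runs it against a few made-up id values. It assumes the pattern is applied to the bare attribute value (no surrounding quotes, so the quote characters are left out of the character class) and writes it as a raw string, since `\-` and `\s` are not valid escapes in an ordinary Kotlin string literal. The sample ids are hypothetical.

```kotlin
// Raw string: backslashes reach the regex engine unchanged.
val footerId = """[a-zA-Z0-9_\-\s]*footer[a-zA-Z0-9_\-\s]*""".toRegex()

fun main() {
    // Hypothetical id attribute values; a missing id is modeled as an empty string.
    val ids = listOf("main-footer", "footer", "page_footer_links", "header", "")
    // matches() requires the whole value to match the pattern.
    val matching = ids.filter { it.matches(footerId) }
    println(matching) // [main-footer, footer, page_footer_links]
}
```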

I have one more request, too, but I am not sure if I should open a new issue for it: can you add an example of how to import the library when using the SNAPSHOT version?

christian-draeger commented 3 years ago

The snapshot release to JitPack seems to be broken currently. I will have a look within the next few days. Version 1.0.0 of the skrapeit artifact has been published; it includes all features discussed here or mentioned in the README :)
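
For the released version, a Gradle Kotlin DSL dependency could look roughly like the fragment below. It assumes a project that already applies the Kotlin JVM plugin and resolves from Maven Central. The thread only names the artifact (`skrapeit`) and the version (`1.0.0`), so the group id `it.skrape` is an assumption to verify against the README. No snapshot coordinates are shown, since the snapshot release is reported as broken above.

```kotlin
// build.gradle.kts (fragment)
repositories {
    mavenCentral()
}

dependencies {
    // Assumption: group id `it.skrape`; artifact and version are taken from the thread.
    implementation("it.skrape:skrapeit:1.0.0")
}
```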