skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

Example for Scraping a Website with Login #133

Closed masiqbal closed 3 years ago

masiqbal commented 3 years ago

describe what you want to achieve

Hi,

I have read the existing documentation, but I did not find an example of retrieving data from a website that requires logging in first and following several links to get to the data we really want.

For example, for a scenario like this:

1. Open the login form page
2. Enter the username and password
3. Submit the form
4. Open the redirected page after successful login
5. Open the link provided
6. Retrieve data

I really appreciate it if anyone can help set an example to understand better how skrape{it} works for those scenarios. Thank you.

christian-draeger commented 3 years ago

hey @masiqbal :) there is a proposal / experimental feature that describes how authentication could work as part of the fetcher here: https://github.com/skrapeit/skrape.it/blob/master/integrationtests/src/test/kotlin/ExperimentalDslTest.kt#L131

How it works internally: the idea is to implement an Authentication interface that can then (as seen in the example test) be passed as the authentication parameter. the authorization header value that has been calculated by the corresponding Authentication implementation will then be added as the Authorization request header by the fetchers.
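the general idea can be sketched roughly like this (the interface and class names here are illustrative only, not skrape{it}'s actual API — see the linked test for the real experimental DSL):

```kotlin
import java.util.Base64

// Sketch of the mechanism described above: an Authentication implementation
// computes the value that a fetcher would place into the Authorization header.
interface Authentication {
    fun authorizationHeader(): String
}

class BasicAuth(private val username: String, private val password: String) : Authentication {
    // Standard HTTP Basic auth: "Basic " + base64("username:password")
    override fun authorizationHeader(): String =
        "Basic " + Base64.getEncoder().encodeToString("$username:$password".toByteArray())
}

// Conceptually, a fetcher merges the computed value into the request headers:
fun withAuth(headers: Map<String, String>, auth: Authentication): Map<String, String> =
    headers + ("Authorization" to auth.authorizationHeader())
```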

conclusion

✅ if you need basic auth you are already good to go
💡 OAuth2 or other authentication flows are currently not supported, BUT they "just" need to be implemented, either as a feature that becomes part of the skrape{it} library or by the user
🔁 workaround: not as smooth as it should be, but you could also do the authentication flow "by hand". for OAuth2-secured domains, afaik this would mean calling the auth endpoint with basic auth (user: clientID, pw: clientSecret) to receive the access_token. the access_token could then be added as a Bearer token to the Authorization header.

masiqbal commented 3 years ago

Hi @christian-draeger Thank you for your response. I really appreciate it. But I meant authentication by posting credentials via an HTML login form, not basic auth as in the example. Besides, I still don't understand how to run subsequent requests after a successful login, for example clicking a link while still holding the required session/cookie.

christian-draeger commented 3 years ago

@masiqbal skrape{it} is "just" an http client and html parser, so your described scenario is possible, but you have to think of it as if there were no UI and reverse-engineer the website you want to scrape a little.

step 1 - login: probably the easiest way would be to log in once via browser and have a look at the cookies. then do a request via skrape{it}'s HttpFetcher or BrowserFetcher with the exact same cookies, or at least with the login-relevant ones. it really depends on the login mechanism of the page though and cannot be answered uniformly.

step 2 - follow a link: since neither an http client nor an html parser is able to do clicks, you would need to extract the link you are interested in and call it again. here is a basic example: let's assume we have the following website markup.

<!DOCTYPE html>
<html lang="en">
    <body>
        <a href="http://some.url">first link</a>
        <a href="http://some-other.url">second link</a>
        <a href="/relative-link">relative link</a>
    </body>
</html>

we could do the following:

// extract the link we are interested in:
val interestingLink = skrape(HttpFetcher) {
    request {
        url = "http://my-fancy.page"
        cookies = mapOf(
            "xxx" to "yyy",
            "foo" to "bar"
        )
    }
    extract {
        htmlDocument {
           // try to find a good selector or just have a look for links that match a text, e.g.:
            a {
                findAll { first { it.ownText == "second link" } }.attribute("href")
            }
        }
    }
}
// call the extracted link
skrape(BrowserFetcher) {
    request {
        url = interestingLink
        cookies = mapOf(
            "xxx" to "yyy",
            "foo" to "bar"
        )
    }
    expect {
        htmlDocument {
            // do your stuff...
        }
    }
}

masiqbal commented 3 years ago

@christian-draeger Thank you very much for the explanation.

Since my goal is to retrieve data automatically every certain time, it would be a hassle to log in and retrieve cookies manually.

Is there a way to request by posting form values, for example the username and password fields of an HTML login form?

Then, is there a way to get cookies from the result of the previous request?

christian-draeger commented 3 years ago

sending a POST request with a body is currently not supported. i will have a look at how it could be integrated and think about possible solutions.

until then you could do all the request stuff with another http client, grab the data you need (cookies etc.) and add it to a skrape{it} request, and just parse the response body of your target website via skrape{it}. for instance, using kohttp it would look like this:

val response: Response = httpPost {
    host = "postman-echo.com"
    path = "/post"

    // form body implicitly sets Content-type header to application/x-www-form-urlencoded - you may need to add the header yourself if using another http client
    body {
        form {
            "login" to "user"
            "email" to "john.doe@gmail.com"
        }
    }
}
// do something with the response (usually deserialize the json response body with jackson or kotlinx.serialization)
response.body...
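if the login sets a session cookie, you can grab it from the response (afaik kohttp returns a plain OkHttp Response, so `response.headers("Set-Cookie")` gives you the raw header values) and pass it on to a skrape{it} request. a small helper to turn those headers into the cookie map — a simplification that keeps only the leading name=value pair, not a full cookie jar:

```kotlin
// Turn raw Set-Cookie header values into a simple name -> value map.
// Attributes like Path, Expires and HttpOnly are dropped; this is a
// deliberate simplification, not proper cookie handling.
fun cookieMap(setCookieHeaders: List<String>): Map<String, String> =
    setCookieHeaders.mapNotNull { header ->
        val nameValue = header.substringBefore(';').split('=', limit = 2)
        if (nameValue.size == 2) nameValue[0].trim() to nameValue[1].trim() else null
    }.toMap()

// usage sketch (response being the login POST's response from above):
// val sessionCookies = cookieMap(response.headers("Set-Cookie"))
// skrape(HttpFetcher) {
//     request {
//         url = "http://my-fancy.page/protected"
//         cookies = sessionCookies
//     }
//     extract { htmlDocument { /* ... */ } }
// }
```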

or, if we assume an OAuth2 flow and have insights like the client_id and client_secret of the identity provider, you would probably do something like this (again using kohttp):

@JsonIgnoreProperties(ignoreUnknown = true)
data class IdpResponse(
    val token_type: String,
    val access_token: String,
)

val (token_type, access_token) = httpPost {
    url("https://login.microsoftonline.com/xxxxxxxxx/oauth2/token") // i tested it with a service that uses Azure Active Directory as an identity provider
    body {
        form {
            "grant_type" to "client_credentials"
            "client_id" to "xxxxxx"
            "client_secret" to "xxxxxx"
        }
    }
}.toType<IdpResponse>() ?: throw Exception("could not deserialize...")

skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy-oauth2-secured.page"
        headers = mapOf(
            "Authorization" to "$token_type $access_token"
        )
    }
    extract {
        htmlDocument {
            // ....
        }
    }
}

christian-draeger commented 3 years ago

i hope i could clarify as much as possible. i will close this for now. please let me know if you have further questions.