spekulatius / PHPScraper

A universal web-util for PHP.
https://phpscraper.de
GNU General Public License v3.0

Provide example with authentication #83


tacman commented 2 years ago

How can I scrape a website that requires authentication?

That is, I want to start at https://jardinado.herokuapp.com/login, fill in my credentials, and THEN start scraping the site.

That is, I want the $goutteClient to execute something like this first, then scrape:

    if ($username) {
        $crawler = $goutteClient->request('GET', $url = $baseUrl . "/login");

        // select the form and fill in some values
        $form = $crawler->selectButton('login-btn')->form();
        $form['_username'] = 'user';
        $form['_password'] = 'pass';

        // submit that form
        $crawler = $goutteClient->submit($form);
        $response = $goutteClient->getResponse();
    }

Now that the cookies are set, when I fetch a URL that requires login I should get the page instead of a 302 (redirect to login).

I'm not sure how to implement this within the context of PHPScraper. One idea would be to expose the underlying Goutte client.
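For reference, here is the whole login-then-scrape flow as one standalone Goutte script — a sketch only, assuming `goutte/goutte` is installed via Composer; the URLs, button name, and field names are taken from the snippet above:

```php
<?php
// Sketch: standalone Goutte login flow (requires goutte/goutte via Composer).
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client(); // the client keeps cookies between requests

// 1. load the login page
$crawler = $client->request('GET', 'https://jardinado.herokuapp.com/login');

// 2. locate the form via its submit button and fill in the credentials
$form = $crawler->selectButton('login-btn')->form([
    '_username' => 'user',
    '_password' => 'pass',
]);

// 3. submit; the session cookie from the response is stored on the client
$client->submit($form);

// 4. later requests reuse that cookie, so a protected page returns 200
//    instead of a 302 back to /login ("/protected" is a placeholder path)
$crawler = $client->request('GET', 'https://jardinado.herokuapp.com/protected');
```

The key point is that a single `Client` instance carries its cookie jar across requests, so the authentication only needs to happen once per session.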

spekulatius commented 2 years ago

Hmmm, while this should work, it's quite a bit of work to debug with the site being down (it doesn't load for me). Can you bring it back up, @tacman?

tacman commented 2 years ago

Try now. It's a slow site, at least initially, because it's running on a free Heroku dyno. It can take up to 30 seconds to "wake up" if it's been inactive for a while.

I set up a login for you -- spekulatius@jardinado.com, password: spekulatius

spekulatius commented 2 years ago

Hey @tacman

Can you share some more code on how you add this to PHPScraper?

    if ($username) {
        $crawler = $goutteClient->request('GET', $url = $baseUrl . "/login");

        // select the form and fill in some values
        $form = $crawler->selectButton('login-btn')->form();
        $form['_username'] = 'user';
        $form['_password'] = 'pass';

        // submit that form
        $crawler = $goutteClient->submit($form);
        $response = $goutteClient->getResponse();
    }

Thanks :)

tacman commented 2 years ago

Well, that's kind of the point of this issue -- I don't know how to do that. I only see how to click links with PHPScraper:

https://github.com/spekulatius/PHPScraper/blob/master/src/phpscraper.php#L918

I was hoping there was a way to submit a form, which would keep the cookies for that session. So instead of ->clickLink(), a method like ->submitForm(), where I could send in the credentials, and then load a page and follow links that require authentication.
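For illustration, such a wrapper could mirror the existing ->clickLink(). This is purely hypothetical — `submitForm()` is not part of PHPScraper's API, and the internal property names (`$this->client`, `$this->currentPage`) are assumptions, not the library's actual internals:

```php
/**
 * Hypothetical sketch of a submitForm() method on PHPScraper's core class.
 * Assumes the class holds a Goutte client and the current page's crawler,
 * analogous to how clickLink() is implemented.
 */
public function submitForm(string $buttonSelector, array $fields): self
{
    // Locate the form via its submit button and fill in the fields,
    // e.g. ['_username' => 'user', '_password' => 'pass'].
    $form = $this->currentPage->selectButton($buttonSelector)->form($fields);

    // Submitting through the client preserves the session cookies,
    // so subsequent navigation is authenticated.
    $this->currentPage = $this->client->submit($form);

    return $this;
}
```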

spekulatius commented 2 years ago

Ah okay, now we are getting a bit closer. I've wondered how you did it. Did you get it working with Goutte only?

tacman commented 2 years ago

I have a Symfony bundle that crawls a website: https://github.com/survos/SurvosCrawlerBundle

The idea is that if it can collect the set of links that are visible (based on different logins), those links can then be used in a simple PHPUnit test. It basically does what almost all testers do in the beginning -- log in, and click blindly on every link. It's amazing how often someone finds a broken page that way.
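The "log in and click every link" idea sketches out roughly like this with Goutte and PHPUnit — a naive single-level crawl under assumptions (the start URL is a placeholder, there is no deduplication or depth limit, and authentication would happen first via the login flow discussed in this thread):

```php
use Goutte\Client;
use PHPUnit\Framework\TestCase;

class LinkSmokeTest extends TestCase
{
    public function testEveryVisibleLinkLoads(): void
    {
        // One client instance, so session cookies from a prior login persist.
        $client = new Client();

        // Placeholder start URL; in practice this would be the app's base URL.
        $crawler = $client->request('GET', 'https://example.com/');

        // Click every link on the page and assert it doesn't 404/500.
        foreach ($crawler->filter('a')->links() as $link) {
            $client->click($link);
            $status = $client->getResponse()->getStatusCode();
            $this->assertSame(200, $status, 'Broken page: ' . $link->getUri());
        }
    }
}
```

A real crawler would also track visited URLs and restrict itself to same-host links, but even this blind version catches a surprising number of broken pages.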

So I was trying to use PHPScraper to do that. In the end, I couldn't, so I just used the other tools I had available:

    public function authenticateClient(?string $username = null, ?string $plainPassword = null): void
    {
        // might be worth checking out: https://github.com/liip/LiipTestFixturesBundle/pull/62#issuecomment-622191412
        static $clients = [];
        if (!array_key_exists($username, $clients)) {
            $goutteClient = new Client();
            $goutteClient->setMaxRedirects(0);
            $this->username = $username;
            $baseUrl = $this->baseUrl;
            $clients[$username] = $goutteClient;
            if ($username) {
                $crawler = $goutteClient->request('GET', $url = $baseUrl . trim($this->loginPath, '/'), [
                    'proxy' => '127.0.0.1:7080',
                ]);

                $response = $goutteClient->getResponse();
                assert($response->getStatusCode() === 200, "Invalid route: " . $url);

                // select the form and fill in some values
                try {
                    $form = $crawler->selectButton($this->submitButtonSelector)->form();
                } catch (\Exception $exception) {
                    throw new \Exception($this->submitButtonSelector . ' does not find a form on ' . $this->loginPath);
                }
                $form['_username'] = $username;
                $form['_password'] = $plainPassword;

                // submit that form
                $crawler = $goutteClient->submit($form);
                $response = $goutteClient->getResponse();
                assert($response->getStatusCode() === 200, substr($response->getContent(), 0, 512) . "\n\n" . $url);

https://github.com/survos/SurvosCrawlerBundle/blob/main/src/Services/CrawlerService.php#L108

I don't love the code, though it's functional. If I could drop it all and replace it with PHPScraper, I would. Of course, if there's anything of value you can grab from my bundle, please do so!