symfony / panther

A browser testing and web crawling library for PHP and Symfony

Crawler html is empty when retrieved from a function or class method #418

Open monkeyArms opened 3 years ago

monkeyArms commented 3 years ago

I ran into what I consider a 'bizarre' issue the other day:

When a Symfony\Component\Panther\Client instance is created and its Symfony\Component\Panther\DomCrawler\Crawler instance is retrieved and used within the same function/method, everything works.

However, if the Crawler is retrieved from another function/method, Crawler::html() provides an empty string.

Example class:

<?php

namespace App\Test;

use Symfony\Component\Panther\Client;
use Symfony\Component\Panther\DomCrawler\Crawler;

class PantherTest
{
    /**
     * @var string
     */
    protected $url;

    public function __construct()
    {
        $this->url = 'https://example.com/';
    }

    /**
     * @return Crawler
     */
    protected function fetchUrlGetCrawler(): Crawler
    {
        $client = Client::createChromeClient();

        $client->request( 'GET', $this->url );

        return $client->getCrawler();
    }

    public function test1()
    {
        $client = Client::createChromeClient();

        $client->request( 'GET', $this->url );

        $crawler = $client->getCrawler();

        dump( $crawler->html() );
    }

    public function test2()
    {
        $crawler = $this->fetchUrlGetCrawler();

        dump( $crawler->html() );
    }
}

The PantherTest::test1 method works as expected:

$test = new PantherTest();
$test->test1();

but the PantherTest::test2 method does not, even though the exact same code is duplicated inside another method:

$test = new PantherTest();
$test->test2();

I've tried this on both my local dev server and a remote Debian/Apache server, with the same results.

dunglas commented 3 years ago

It's probably because in the second test, the destructor of Client is called. Maybe we should store a reference to the client in the crawler to prevent this bug.
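
The object-lifetime effect described here can be reproduced without a browser. The following is only a minimal sketch in plain PHP: `FakeSession`, `FakeClient`, `FakeCrawler`, and `crawlerFromLocalClient()` are hypothetical stand-ins, not Panther classes, and the sketch assumes the real Client shuts its browser down when it is destroyed. It also illustrates how a back-reference from the crawler to the client would postpone that destructor.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for illustration; none of these are Panther classes.
final class FakeSession
{
    public bool $open = true;
}

final class FakeCrawler
{
    private FakeSession $session;

    // Optional back-reference; holding it keeps the client alive.
    private ?FakeClient $owner;

    public function __construct(FakeSession $session, ?FakeClient $owner = null)
    {
        $this->session = $session;
        $this->owner = $owner;
    }

    public function html(): string
    {
        // A closed session yields nothing, like the empty string in the report.
        return $this->session->open ? '<html>page</html>' : '';
    }
}

final class FakeClient
{
    private FakeSession $session;

    public function __construct()
    {
        $this->session = new FakeSession();
    }

    public function getCrawler(bool $keepClientAlive = false): FakeCrawler
    {
        return new FakeCrawler($this->session, $keepClientAlive ? $this : null);
    }

    public function __destruct()
    {
        // Assumption: mimics a client that closes its browser when destroyed.
        $this->session->open = false;
    }
}

function crawlerFromLocalClient(bool $keepClientAlive): FakeCrawler
{
    $client = new FakeClient();

    // The last local reference to $client disappears when this function returns.
    return $client->getCrawler($keepClientAlive);
}

// Without a back-reference the client is destroyed first: empty string.
var_dump(crawlerFromLocalClient(false)->html());

// With a back-reference the client outlives the function: non-empty string.
var_dump(crawlerFromLocalClient(true)->html());
```

PHP runs a destructor as soon as the last reference to an object is released, which is why the only difference between the two calls is whether the returned crawler still points at the client.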

monkeyArms commented 3 years ago

Ah...that makes complete sense.

My use case was a method that would accept a URL argument and could return a Crawler instance regardless of how the Crawler was created (via BrowserKit, Panther, or populated via a Symfony\Component\HttpClient\HttpClient response).

I ultimately decided to discard the 3rd (HttpClient) option, and return a Symfony\Component\BrowserKit\AbstractBrowser instance instead from the method, as the Client is sometimes needed anyway.

I don't know whether the Crawler should know about the Client or not. I just know I spent more time than I would have liked tracking down where things went wrong in this instance. Perhaps the Client destructor could at least set a flag on the Crawler, so that an Exception with an explanation is thrown if the Client is no longer available?
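
The flag-and-exception idea could look roughly like the sketch below. It is plain PHP with hypothetical names (`SessionState`, `GuardedClient`, `GuardedCrawler`, `guardedCrawlerFromLocalClient()`) and does not reflect how Panther is actually implemented; it only shows a destructor flipping a shared flag that the crawler checks before returning HTML.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for illustration; none of these are Panther classes.
final class SessionState
{
    public bool $clientAlive = true;
}

final class GuardedCrawler
{
    private SessionState $state;

    public function __construct(SessionState $state)
    {
        $this->state = $state;
    }

    public function html(): string
    {
        // Fail loudly instead of silently returning an empty string.
        if (!$this->state->clientAlive) {
            throw new LogicException(
                'The client that created this crawler has been destroyed; '
                . 'keep a reference to the client while the crawler is in use.'
            );
        }

        return '<html>page</html>';
    }
}

final class GuardedClient
{
    private SessionState $state;

    public function __construct()
    {
        $this->state = new SessionState();
    }

    public function getCrawler(): GuardedCrawler
    {
        // The crawler shares the state object, not the client itself.
        return new GuardedCrawler($this->state);
    }

    public function __destruct()
    {
        // The destructor sets the flag that the crawler checks later.
        $this->state->clientAlive = false;
    }
}

function guardedCrawlerFromLocalClient(): GuardedCrawler
{
    $client = new GuardedClient();

    // $client is destroyed when this function returns.
    return $client->getCrawler();
}

try {
    guardedCrawlerFromLocalClient()->html();
} catch (LogicException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Because the crawler holds only the shared state object, the client can still be destroyed as before, but the failure surfaces as an explained exception rather than an empty string.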

Either way, thank you for your response and the great library.