symfony / panther

A browser testing and web crawling library for PHP and Symfony
MIT License
2.92k stars 216 forks

Keep the chromeClient alive? #184

Open Gobmichet opened 5 years ago

Gobmichet commented 5 years ago

Hi, I'm currently developing a web scraper using symfony/panther that will be driven by someone through a web interface. Since I'll have several scraping methods that fire on the user's clicks, and since the scraped site requires authentication, I have to "keep the $client alive". Illustrating code:

    // Start Chrome through Panther and load the page
    $client = \Symfony\Component\Panther\Client::createChromeClient();
    $crawler = $client->request('GET', 'http://www.wikipedia.org/');
    // Fill in and submit the search form
    $form = $crawler->filter('#search-form')->form(['search' => 'web scraping']);
    $crawler = $client->submit($form);
    // Capture the result page, then print its first paragraph
    $client->takeScreenshot('pantherScreen.png');
    echo "> Wikipedia :: ".$crawler->filter('.mw-parser-output p')->first()->text();

Taking the previous code as an example, I'd like to separate the different actions, such as running a search and taking a screenshot, into two different methods that would be triggered by the user's clicks.

Assuming the scraped site also has a login form to pass before doing anything, I obviously have to keep my $client object "alive" to avoid logging in again on every scraping method call.

I'm developing a Symfony 4 project. Is there a (good?) way to achieve this using just Symfony?

Thanks in advance ;)

Note: I tried to keep the $client in the session, but even though the object is there (I get the object structure when I dump it via the Symfony debug bar), something is missing, because the rest of the code fails with an error I can't even catch...

navitronic commented 5 years ago

I'm pretty sure that what you're trying to achieve is outside of the scope of this library, as is.

What you'd ideally want is a chromedriver process that is constantly running, with Panther connecting to it rather than firing up its own instance of the process.

You can learn more about how Panther initiates the chromedriver process in the start method of this class: https://github.com/symfony/panther/blob/master/src/ProcessManager/ChromeManager.php#L43-L63

Your first step toward achieving what you want would be to create a new class that implements the BrowserManagerInterface, and then figure out the necessary steps to get that into your project.
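
A minimal sketch of what such a class might look like, assuming chromedriver was started by hand (for example `chromedriver --port=9515`) and Composer autoloading is in place. The class name `ExistingChromeDriverManager` and the default URL are hypothetical, not part of Panther. It only shows the "connect instead of spawn" part. Panther's `Client` may still end the browser session when the client object is destroyed, so this alone is probably not enough to keep a login across HTTP requests.

```php
<?php

use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriver;
use Symfony\Component\Panther\ProcessManager\BrowserManagerInterface;

/**
 * Hypothetical manager: attaches to a chromedriver that is already running
 * instead of starting a new process the way ChromeManager does.
 */
final class ExistingChromeDriverManager implements BrowserManagerInterface
{
    /** @var string URL of the externally started chromedriver (assumed) */
    private $driverUrl;

    public function __construct(string $driverUrl = 'http://127.0.0.1:9515')
    {
        $this->driverUrl = $driverUrl;
    }

    public function start(): WebDriver
    {
        // Open a new browser session on the running chromedriver.
        return RemoteWebDriver::create($this->driverUrl, DesiredCapabilities::chrome());
    }

    public function quit(): void
    {
        // Intentionally empty: the chromedriver process is managed
        // outside of PHP and should outlive this script.
    }
}
```

The manager would then be passed to the client's constructor, along the lines of `new \Symfony\Component\Panther\Client(new ExistingChromeDriverManager())`.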

Gobmichet commented 5 years ago

Thanks for the advice, I'll have a look.

What about a new process/thread that is always running and listening for requests? I thought about the project during my holidays, and it seemed like a good idea to create a second thread holding all the scraping logic in order to keep the client alive (an "infinite" listening loop until a kill message arrives), which would only respond to the Symfony app when asked?
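
The listening loop described above could be sketched roughly as follows, as a separate long-running PHP process rather than a thread. Everything here is an assumption made for illustration: the function name `runScraperWorker`, the one-command-per-line text protocol, and the `quit` kill message. The streams could be STDIN/STDOUT or a socket. In a real worker the handlers would close over a single Panther client created once before the loop starts, which is what would keep the login alive between commands.

```php
<?php

/**
 * Reads one command per line from $input, runs the matching handler and
 * writes its result to $output, until "quit" is received or input ends.
 *
 * @param resource                $input    readable stream
 * @param resource                $output   writable stream
 * @param array<string, callable> $handlers command name => function (string $argument): string
 */
function runScraperWorker($input, $output, array $handlers): void
{
    while (false !== ($line = fgets($input))) {
        $line = trim($line);
        if ('' === $line) {
            continue; // ignore blank lines
        }

        // "search web scraping" => command "search", argument "web scraping"
        [$command, $argument] = array_pad(explode(' ', $line, 2), 2, '');

        if ('quit' === $command) {
            break; // kill message: leave the loop
        }

        if (!isset($handlers[$command])) {
            fwrite($output, "error unknown command: $command\n");
            continue;
        }

        fwrite($output, 'ok '.$handlers[$command]($argument)."\n");
    }
}
```

The Symfony app would then only send a command and read the reply, while the browser session stays inside the worker process for as long as it runs.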

Gobmichet commented 5 years ago

Another thing: what about the "detach" option I've seen in several Stack Overflow posts? I don't know how it works, but from what I've understood, it could solve my problem!?

Any info about this "detach" thing?

Gobmichet commented 5 years ago

I also came across this: https://github.com/symfony/panther/blob/master/CHANGELOG.md

Add a PHPUnit extension to keep alive the webserver and the client between tests

Allow keeping the webserver and client active even after test teardown

But here too, I don't know how to use it since I'm not writing any tests... I'm not extending PantherTestCase... but if it's possible for tests, it should be available for scraping, right?

Nevertheless, I can't find illustrating code for those features :(

Gobmichet commented 5 years ago

@dunglas any advice, please?