Panther is way too long (30 sec+?)

Gobmichet commented 5 years ago

Hi, I'm developing a WebScraper and chose symfony/panther because the target website is using full JS (Goutte can't handle it). The problem is that just a simple request on "wikipedia.org" takes at least 25sec+ ! So, if i add a little treatment, it easily exceeds the 30sec limit and throw an exception.

I also have this error message each time even when the request succeeds and i have a correct scraping result... PantherBug I tried to manually change the port in the "ChromeManager.php"

private function getDefaultOptions(): array
    {
        return [
            'scheme' => 'http',
            'host' => '127.0.0.1',
            'port' => 57850, // default is 9515
            'path' => '/status',
        ];
    }

but no matter the port, i always get this issue and i do think that it's the reason why it takes me 30 sec for a little treatment (just request wikipedia and screenshot takes 26 sec+ ! )

Please tell me that it's not normal to take such time just for a request and a little treatment ? Any idea on how to get things better ?

Thanks in advance !

Gobmichet commented 5 years ago

Any help please ?

stof commented 5 years ago

Well, first question, why are you doing some web scraping with WebDriver in the context of a web request ? That's kind of insane from a performance point of view. Web requests are limited in time, and you plan to spend that time starting the webdriver server, starting chrome, loading another website and making a bunch of WebDriver calls to perform actions. That should clearly be handled by some kind of background worker instead.

But for your case, the error says that the server explicitly rejected the connection on port 57850, which is the port where the chrome driver should be listening. So either it did not start properly (but the part of the logs you provided in the screenshot does not allow to answer that question) or something on your server forbids the connection (some antivirus software maybe ?).

On a side note, copy-pasting the logs as text is generally much more readable than taking a screenshot of them.

Gobmichet commented 5 years ago

Thanks for your answer ;)

Indeed i'm new to webScraping and found some tuto on the web using Symfony/Goutte (does not support JS) and symfony/Panther (does support JS) What's more i have to handle

connexion (login/pass to access next page)
make a research with keywords on 2nd page
access 3rd page, scrap the results of the research
and for the different results, allow to access to a "detail" page (4th one)

the whole thing manipulated via a web interface, somebody using it and clicking through the different steps. So having a background service doing the job won't change anything since the person will have to wait for the response anyway...

Therefore, i thought it was a good idea to use it... I started with Goutte that took 3sec instead of 16+ for panther just scraping a wikipedia research...

for the stackTrace i only have that, i'm accessing the app via my browser and a @route (symfony 4 project). The Controller then launches the different scraping methods according to the user clicks.

Here is a "sample" code, a test i did on wikipedia.org & example.com ::


  /**
    *   Just a simple & basic example of web Scraping
    *       using Symfony/Panther ::
    */
    public function doSimpleScrapUsingPanther()
    {
        $client =\Symfony\Component\Panther\Client::createChromeClient();
        $crawler = $client->request('GET', 'http://example.com');
        $fullPageHtml = $crawler->html();
        $pageH1 = $crawler->filter('h1')->text();
        echo '> h1 :: '.$pageH1."\n";
        $firstParag = $crawler->filter('p')->first()->text();
        echo '> 1st parag :: '.$firstParag."\n";
        echo "<br />";
        // More complex operations (with panther specific ones) ::
        $crawler2 = $client->request('GET', 'http://www.wikipedia.org/');
        $form = $crawler2->filter('#search-form')->form(['search' => 'web scraping']);
        $crawler2 = $client->submit($form);
        // specific to panther ::
        $client->takeScreenshot('pantherScreen.png');
        $client-> waitFor('.firstHeading');
        echo "> Wikipedia :: ".$crawler2->filter('.mw-parser-output p')->first()->text();
        echo "<br />";
    }

for info, the bad perfs really bother me, and i was watching for an alternative... i was looking for CasperJS for example but it merely does the same thing... create a browser and act like a human in order to scrap and pass forms or links clicks...

would you have some recomandations having in mind that the project is a symfony 4 one ?

And for panther, any idea on what i am doing "wrong" with the linked code ? You agree i should be able to reduce the treatment time with panther ?

PS :: should i start the WebDriver manually from the console cmd for example ? or maybe directly into the code before doing anything via Panther ?

stof commented 5 years ago

the whole thing manipulated via a web interface, somebody using it and clicking through the different steps. So having a background service doing the job won't change anything since the person will have to wait for the response anyway...

but you will be able to send them a response with an in progress page, and get the scrapping result asynchronously. This way, you won't get the 30s limit.

but anyway, anything relying on starting a browser and loading pages in it will of course be much slower than only doing a single HTTP request with Goutte (which will not need to start anything, will never download the CSS, JS and images and will never execute JS).

Doing live web scraping is simply not what WebDriver is meant to do (live web scraping by itself is not meant to be done anyway from a performance point of view).

Gobmichet commented 5 years ago

oh yeah i do agree with you ! and i started with Goutte! but the website i'm scraping is fully JS and therefore, I can't scrap anything on it with Goutte :(

So indeed, i don't know any other alternatives than panther or casperJS. Do you ?

but you will be able to send them a response with an in progress page, and get the scrapping result asynchronously. This way, you won't get the 30s limit.

i think i don't understand, when doing the scrap with Panther, i'll show a loader to the user in my own interface so he does understand that the treatment is in progress...

but Panther itself will still have this limit of 30s for doing the requested scrap.

So my question was, is there a mean to reduce this time that panther takes to scrap? 'cause yeah, for the very same treatment, Goutte takes 3 seconds where Panther takes 15+...

if you have any alternative to panther for scraping a webSite in full JS, i'm interested ! ;)

and finally, you didn't answer my question about webdriver and the stackTrace, was it stupid ? :p you told me :

Web requests are limited in time, and you plan to spend that time starting the webdriver server, starting chrome, loading another website and making a bunch of WebDriver calls to perform actions.

Would several seconds possibly be taken by the webdriver service launch ? Could i therefore gain some time launching it in advance by any command/code ?

PS: i'm really newbie in scrapping, i didn't even know the word two weeks ago, indeed i was asked to scrap LinkedIn profiles to tell you the truth, since they make their API not free anymore from today! I do know it is legal 'cause a judge even told them not to put "anti-scrap" measures so...

Any help/advice will be appreciated ^^

symfony / panther

Panther is way too long (30 sec+?) #183