Option to split the process in get results and parsing

thebennos commented 7 years ago

Currently it is one process (get the data and parse it).

Google can change the HTML at any time, and then the whole process fails or outputs wrong results. An option to split the process into two parts would be nice, like this:

Get data -> output it as JSON, so I can cache it or save it in a database or S3 storage
Load JSON -> parse it

Two independent processes. If Google changes the HTML, no problem: we have time to adjust the parsing and can parse the stored data later.

gsouf commented 7 years ago

Hi @thebennos

If I understood your question correctly, you want to extract the HTML (the DOM) and be able to parse it later. Right?

If this is your question then it's already possible and it's actually very simple!

The HTML is not parsed until you call getNaturalResults(). Instead of parsing the result, you can extract the DOM and the URL and use them later. See:

    // $googleClient is assumed to be an already configured GoogleClient instance
    $googleUrl = new GoogleUrl();
    $googleUrl->setSearchTerm('simpsons');

    $response = $googleClient->query($googleUrl);

    // Instead of parsing the result now, take the raw data from the response

    $html = $response->getDom();
    // $html is a DOMDocument instance (see the PHP documentation for further details)

    $url = $response->getUrl();
    // $url is a URL object; you can cast it to a string

Now you can store this URL and this HTML wherever is convenient, and later you can parse them:

    $url = ....; // url stored previously
    $html = ....; // html stored previously

    $serp = new GoogleSerp($html, $url);

    $serp->getNaturalResults();
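
To store the DOM and URL as JSON between the two steps, something like the following could work. It is only a minimal sketch using plain PHP: `packSnapshot()` and `unpackSnapshot()` are hypothetical helper names, not part of the library, and it assumes the URL has already been cast to a string and that PHP 7.3+ is available for `JSON_THROW_ON_ERROR`.

```php
<?php
// Sketch only: packSnapshot() and unpackSnapshot() are hypothetical helpers,
// not functions provided by the library.

/** Encode a fetched page (DOM + URL string) as a JSON string for storage. */
function packSnapshot(DOMDocument $dom, string $url): string
{
    return json_encode([
        'url'  => $url,
        'html' => $dom->saveHTML(), // serialize the DOM back to an HTML string
    ], JSON_THROW_ON_ERROR);
}

/** Decode a stored snapshot back into ['url' => ..., 'html' => ...]. */
function unpackSnapshot(string $json): array
{
    return json_decode($json, true, 512, JSON_THROW_ON_ERROR);
}

// Stand-in document; in the real flow this would be $response->getDom()
// and (string) $response->getUrl().
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>hello</p></body></html>');

$json = packSnapshot($dom, 'https://www.google.com/search?q=simpsons');

// ... save $json to a file, database, S3, or a message queue ...

$snapshot = unpackSnapshot($json);
// $snapshot['html'] and $snapshot['url'] are the values to hand to the parser later.
```

Whether GoogleSerp accepts the stored URL as a plain string or needs a URL object rebuilt from it is not checked here, so treat that step as an assumption to verify.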

Does that answer your question?

thebennos commented 7 years ago

Oh, I had not noticed the getDom() function yet.

"If I understood your question correctly, you want to extract the HTML (the DOM) and be able to parse it later. Right?" Yes.

That's cool, so I can integrate RabbitMQ as the message transport and split the work into separate worker jobs. Great, thanks.

gsouf commented 7 years ago

I'm closing the issue because all looks good now.

serp-spider / search-engine-google

Option to split the process in get results and parsing #60