serp-spider / search-engine-google

:spider: Google client for SERPS
https://serp-spider.github.io

Google DOM Change? #131

Open OmarMonterrey opened 3 years ago

OmarMonterrey commented 3 years ago

URL: https://www.google.com/search?q=download+youtube+thumbnail
Expected: Correct parsing
What I'm getting: "Unable to check javascript status. Google DOM has possibly changed and an update may be required."

The HTML is OK and my Composer dependencies are completely up to date. I'm attaching a screenshot and the HTML content.

invalid_dom.zip

alexgarciab commented 3 years ago

This happened to me as well. The SERPS implementation for Google is not able to parse the HTML correctly. Please fix it ASAP.

kucugum commented 3 years ago

Yes, it looks like Google DOM has changed.

The function below reads the "class" attribute of the body tag. Since that lookup comes back empty, every function that uses javascriptIsEvaluated() breaks, for example getNaturalResults and getAdwordsResults.

public function javascriptIsEvaluated()
{
    $body = $this->getXpath()->query('//body');

    if ($body->length != 1) {
        throw new Exception('No body found');
    }

    $body = $body->item(0);
    /** @var $body \DOMElement */
    $class = $body->getAttribute('class');

    if ($class=='hsrp') {
        return false;
    } elseif (strstr($class, 'srp')) {
        return true;
    } else {
        throw new InvalidDOMException('Unable to check javascript status.');
    }
}
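
To see why an unexpected class ends up in the exception branch, the same check can be reproduced outside the package. This is a minimal sketch, assuming PHP's bundled DOM extension; `bodyClassState()` and its return labels are hypothetical names used only for illustration and are not part of the library:

```php
<?php
// Hypothetical standalone version of the class check, for illustration only.
function bodyClassState(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = (new DOMXPath($doc))->query('//body');
    if ($body->length != 1) {
        return 'no-single-body';
    }

    // getAttribute() returns an empty string when the attribute is missing.
    $class = $body->item(0)->getAttribute('class');

    if ($class == 'hsrp') {
        return 'js-not-evaluated';
    }
    if (strstr($class, 'srp')) {
        return 'js-evaluated';
    }
    // This is the branch where the package throws InvalidDOMException.
    return 'unknown';
}

echo bodyClassState('<html><body class="srp"><p>x</p></body></html>'), PHP_EOL;
echo bodyClassState('<html><body><p>x</p></body></html>'), PHP_EOL;
```

Any body class that does not contain "srp", including a missing one, falls through to the last branch.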

Do you have a plan for solving this issue?

Thank you

OmarMonterrey commented 3 years ago

You were right, the issue was right there, but the body tag does have the proper attributes. Since I'm only using "getNaturalResults", I implemented a little hack:

    $html = preg_replace('/^.*?(<body)/is','$1', $html);

Basically I removed everything before the <body tag. That way the DOM is parsed as expected and the classes are checked, so it's working for me now.
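
For anyone who wants to try the hack in isolation, here is a minimal sketch, assuming PHP's bundled DOM extension. The sample markup is a made-up stand-in for a real results page, and the variable names are only illustrative:

```php
<?php
// Made-up sample page; real Google markup is far larger.
$html = '<!doctype html><html><head><title>t</title></head>'
      . '<body class="srp vasq"><div id="result-stats">About 10 results</div></body></html>';

// Drop everything before the first "<body".
// Flags: i = case-insensitive, s = "." also matches newlines.
$trimmed = preg_replace('/^.*?(<body)/is', '$1', $html);

$doc = new DOMDocument();
// Keep libxml warnings about imperfect markup quiet.
libxml_use_internal_errors(true);
$doc->loadHTML($trimmed);
libxml_clear_errors();

// Same lookup the package performs before checking the class.
$body  = (new DOMXPath($doc))->query('//body');
$class = $body->item(0)->getAttribute('class');

echo $body->length, ' ', $class, PHP_EOL;
```

The pattern is non-greedy, so it stops at the first occurrence of "<body" in the document. If that string appears earlier, for example inside a script in the head, the cut happens there instead.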

kucugum commented 3 years ago

Thank you, it works as a temporary fix. I hope the package gets an update with a permanent fix.

alexgarciab commented 3 years ago

So I have talked with the developer of this library. He told me that he does not have the time to maintain it, so sadly there won't be any updates from now on. 🙃

pedropamn commented 3 years ago

This explains why a lot of pull requests are being "ignored"...

migliori commented 3 years ago

The DOM element that holds the number of results has changed too. I applied @OmarMonterrey's hack:

// in vendor/serps/core/src/Core/Http/SearchEngineResponse.php
    public function getPageContent()
    {
        $this->pageContent = preg_replace('/^.*?(<body)/is','$1', $this->pageContent);
        return $this->pageContent;
    }

And changed this to get the number of results:

// in vendor/serps/search-engine-google/src/Page/GoogleSerp.php
    public function getNumberOfResults()
    {
        $item = $this->cssQuery('#result-stats');
        // ... etc
    }

LunarDevelopment commented 3 years ago

I've been running the following for about a year now and it's kept this change at bay:


    // in vendor/serps/search-engine-google/src/Page/GoogleSerp.php
    /**
     * Get the total number of results available for the search terms
     * @return int|null the number of results, or null if no stats element is found
     * @throws InvalidDOMException
     */
    public function getNumberOfResults()
    {
        $item = $this->cssQuery('#resultStats');

        if ($item->length < 1) {

            $item = $this->cssQuery('#result-stats');

            if ($item->length < 1) {
                return null;
            }
        }