serp-spider / search-engine-google

:spider: Google client for SERPS
https://serp-spider.github.io

Dom change - possible fix suggestion & examples attached. #105

Closed LunarDevelopment closed 6 years ago

LunarDevelopment commented 6 years ago

Hey,

I'm getting a Classical Result DOM error every 10 to 30 requests.

I've attached:

BadDom.zip, a ZIP of 4 example DOM files

Bridal Accessories Bakewell7560.html.zip, a ZIP of 1 rendered HTML file

Here's a script to parse any of the BadDom files:


<?php

require_once __DIR__ . '/vendor/autoload.php';

use Serps\SearchEngine\Google\GoogleUrl;
use Serps\SearchEngine\Google\Page\GoogleSerp;

$url = GoogleUrl::fromString('https://google.com/?q=your+keywords');

$fileContent = file_get_contents(__DIR__ .  '/BadDom/A4 Landscape Dorchester890.html');

// Create a serp with the file content and the url
$serp = new GoogleSerp($fileContent, $url);

// and analyse it
$naturalResults = $serp->getNaturalResults();

print_r($naturalResults);

Here's an example of a classical result element from the rendered HTML page:

<div class="g"><!--m-->
    <div data-hveid="90" data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQFQhaKAAwAg">
        <div class="rc">
            <div class="r"><a href="http://www.bridesofbakewell.com/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=http://www.bridesofbakewell.com/&amp;ved=0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQFghbMAI"><h3
                    class="LC20lb">Brides of Bakewell: Home</h3><br>
                <div style="display:inline-block" class="TbwUpd"><cite class="iUh30">www.bridesofbakewell.com/</cite></div>
            </a><span><div class="action-menu ab_ctl"><a class="GHDvEf ab_button" href="#" id="am-b2" aria-label="Result options" aria-expanded="false" aria-haspopup="true" role="button"
                                                         jsaction="m.tdd;keydown:m.hbke;keypress:m.mskpe" data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQ7B0IXDAC"><span class="mn-dwn-arw"></span></a><div
                    class="action-menu-panel ab_dropdown" role="menu" tabindex="-1" jsaction="keydown:m.hdke;mouseover:m.hdhne;mouseout:m.hdhue" data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQqR8IXTAC"><ol><li
                    class="action-menu-item ab_dropdownitem" role="menuitem"><a class="fl"
                                                                                href="http://webcache.googleusercontent.com/search?q=cache:NcvOB9MEhX0J:www.bridesofbakewell.com/+&amp;cd=3&amp;hl=en&amp;ct=clnk&amp;gl=jp"
                                                                                ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=http://webcache.googleusercontent.com/search%3Fq%3Dcache:NcvOB9MEhX0J:www.bridesofbakewell.com/%2B%26cd%3D3%26hl%3Den%26ct%3Dclnk%26gl%3Djp&amp;ved=0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQIAheMAI">Cached</a></li><li
                    class="action-menu-item ab_dropdownitem" role="menuitem"><a class="fl"
                                                                                href="/search?near=middlesbrough,UK&amp;pws=0&amp;q=related:www.bridesofbakewell.com/+Bridal+Accessories+Bakewell&amp;tbo=1&amp;sa=X&amp;ved=0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQHwhfMAI">Similar</a></li></ol></div></div></span>
            </div>
            <div class="s">
                <div><span
                        class="st">Welcome to <em>Brides</em> of <em>Bakewell</em> – a beautiful <em>wedding</em> boutique in the Derbyshire ... <em>accessories</em> when you also buy a <em>wedding</em> dress from <em>Brides</em> of <em>Bakewell</em>.</span>
                </div>
            </div>
            <div jsl="$t t--ddbPTeIsNI;$x 0;" class="r-iMXdFOFX_Ubc" data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQ2Z0BCGAwAg">
                <div class="AUiS2 iMXdFOFX_Ubc-7_jVsFT_9Io" id="eobm_2" data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQx40DCGEwAg">
                    <div id="eobd_2" class="iMXdFOFX_Ubc-uhagcrfPmuU" style="display:none">
                        <div data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQsKwBCGIoADAC">alexandra bridal</div>
                        <div data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQsKwBCGMoATAC">stella york 6245</div>
                        <div data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQsKwBCGQoAjAC">bridesmaid dresses</div>
                        <div data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQsKwBCGUoAzAC">wedding dresses</div>
                        <div data-ved="0ahUKEwiWt6KQgq7cAhWKXrwKHQjyCaUQsKwBCGYoBDAC">asos</div>
                    </div>
                    <span class="XCKyNd" id="eobs_2" aria-label="Dismiss suggested follow ups" role="button" tabindex="0" jsaction="r.pz0qjfJrMDo" data-rtid="iMXdFOFX_Ubc" jsl="$x 2;"></span>
                    <div>
                        <div class="d8lLoc iMXdFOFX_Ubc-eEjGhTK0s34" id="eobc_2"><h4 class="eJ7tvc iMXdFOFX_Ubc-ZgH0LU9o8RU" id="eobp_2">People also search for</h4>
                            <div class="hYkSRb iMXdFOFX_Ubc-ICxnu-SGsqE" id="eobr_2"></div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div><!--n--></div>

The error seems to occur because the h3 class is not always r. In the sample above the h3 has the class LC20lb and sits inside the a tag, while the r class is on the surrounding div.

Your library is better than anything I've ever produced, but my observation is that in ClassicalResult->parseNode() the old query ->xpathQuery("descendant::h3[@class='r'][1]/a", $node) may be too specific now that the h3 element sometimes comes back with a class of LC20lb. Should it be replaced with the more generic ->xpathQuery("descendant::h3", $node)?

I've tested the following in ClassicalResult->parseNode() and it seems to work for returning the title:


        // find the title/url
        /* @var $aTag \DOMElement */
        $aTag = $dom
            // ->xpathQuery("descendant::h3[@class='r'][1]/a", $node) // old
            ->xpathQuery("descendant::h3", $node) // new
            ->item(0);
        if (!$aTag) {
            throw new InvalidDOMException('Cannot parse a classical result.');
        }

If you're happy with it, shall I open a pull request with the amendment?

gsouf commented 6 years ago

Hi @LunarDevelopment, thanks. I'll look at it ASAP.

LunarDevelopment commented 6 years ago

In retrospect, I don't think ->xpathQuery("descendant::h3", $node) is a good fix, but I'm not good with XPath, so I'll leave the syntax to you.
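For illustration, a query that is narrower than descendant::h3 but still covers both layouts could look roughly like the sketch below. It is a standalone example built on PHP's DOMDocument and DOMXPath rather than the library's own DOM wrapper. The names findResultLink and parseFirstResult are hypothetical, the markup is a trimmed-down stand-in for the attached files, and this is not necessarily the fix that ended up in the library.

```php
<?php

// Hypothetical helper (not part of the serps library): returns the result's
// <a> element for either layout, or null if neither pattern matches.
function findResultLink(DOMXPath $xpath, DOMNode $node): ?DOMElement
{
    // Old layout: <h3 class="r"><a href="...">Title</a></h3>
    // New layout: <div class="r"><a href="..."><h3 class="LC20lb">Title</h3>...</a></div>
    $links = $xpath->query(
        "descendant::h3[@class='r'][1]/a | descendant::div[@class='r']/a[h3]",
        $node
    );
    $first = $links->item(0);

    return $first instanceof DOMElement ? $first : null;
}

// Hypothetical wrapper: parses an HTML string and extracts the title and URL
// of the first <div class="g"> block.
function parseFirstResult(string $html): ?array
{
    libxml_use_internal_errors(true); // keep markup warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $node = $xpath->query("//div[@class='g']")->item(0);
    if ($node === null) {
        return null;
    }

    $link = findResultLink($xpath, $node);
    if ($link === null) {
        return null;
    }

    // In the new layout the <a> also wraps the <cite>, so prefer the nested
    // <h3> text for the title when there is one.
    $h3 = $xpath->query('h3', $link)->item(0);
    $title = trim(($h3 ?? $link)->textContent);

    return ['title' => $title, 'url' => $link->getAttribute('href')];
}

// Simplified stand-ins for the two layouts (assumed shapes, not real SERP data).
$oldMarkup = '<div class="g"><h3 class="r"><a href="http://example.com/old">Old layout</a></h3></div>';
$newMarkup = '<div class="g"><div class="rc"><div class="r">'
    . '<a href="http://example.com/new"><h3 class="LC20lb">New layout</h3><br>'
    . '<cite>example.com/new</cite></a></div></div></div>';

print_r(parseFirstResult($oldMarkup));
print_r(parseFirstResult($newMarkup));
```

Selecting the a element rather than the h3 means the href is still available for the URL, which a bare descendant::h3 query does not provide in the new layout.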

Happy to help though, love your library!

gsouf commented 6 years ago

Hi @LunarDevelopment

Can you confirm that this is merged in version 0.4.3?