serp-spider / search-engine-google

:spider: Google client for SERPS
https://serp-spider.github.io

Google DOM change #115

Open msiemens opened 5 years ago

msiemens commented 5 years ago

Google seems to be running some sort of A/B test with a new DOM, which serp-spider can't parse.

Error message: Unable to check javascript status. Google DOM has possibly changed and an update may be required.
Date: 2018-11-19
URL: https://www.google.de/search?q=trinkgeld+steuerfrei&hl=de&gl=de&sourceid=chrome&ie=UTF-8&num=100
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36
HTML sample: trinkgeld+steuerfrei.html.zip

From a quick look at the sample, Google has HTML inside JS strings, which seems to throw off the DomDocument parser:

/* ... */ a=_.Tb('<head><base href="'+_.xb(window.document.baseURI)+'"></head><body><iframe id="'+a+'" name="'+a+'"></iframe>',null)) /* ... */

Later in the file the actual body tag follows as usual:

<body class="srp tbo vasq" ...>

The DomDocument parser seems to treat the JS string as the start of the actual body tag, which of course doesn't have the expected class attribute.
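
One possible workaround, sketched here in Python purely as an illustration (serp-spider itself is PHP, so this would need porting), is to drop the script elements before the markup reaches the DOM parser, so that HTML inside JS strings can no longer be mistaken for real tags. The helper names `strip_scripts` and `first_body_class` and the shortened sample page are made up for this example, and whether the same preprocessing fixes the DomDocument behaviour described above is untested.

```python
import re
from html.parser import HTMLParser

# Matches a whole <script>...</script> element, including its contents.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Remove script elements so markup inside JS strings never reaches the parser."""
    return SCRIPT_RE.sub("", html)


class BodyClassFinder(HTMLParser):
    """Records the class attribute of the first <body> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen_body = False
        self.body_class = None

    def handle_starttag(self, tag, attrs):
        if tag == "body" and not self.seen_body:
            self.seen_body = True
            self.body_class = dict(attrs).get("class")


def first_body_class(html):
    """Return the class attribute of the first <body> tag after stripping scripts."""
    finder = BodyClassFinder()
    finder.feed(strip_scripts(html))
    return finder.body_class


# Hypothetical, heavily shortened stand-in for the attached HTML sample.
sample = (
    "<html><head><script>"
    "a=_.Tb('<head></head><body><iframe id=\"x\"></iframe>',null)"
    "</script></head>"
    '<body class="srp tbo vasq"><div id="search"></div></body></html>'
)

print(first_body_class(sample))
```

Python's own `html.parser` already treats script contents as raw text, so the stripping step is the part that matters for a parser that does not. A regex is a blunt tool here; it assumes script elements are not nested and are terminated by the first `</script>`, which is also how browsers end a script element.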

dragonattack commented 5 years ago

Is there any solution to this?