opsdisk / yagooglesearch

Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches.
BSD 3-Clause "New" or "Revised" License

Search always returns an empty result list #43

Open rombru opened 3 months ago

rombru commented 3 months ago

Hello, I'm using version 1.10.0 of the package (Python 3.12) on Windows, from Belgium. Every time I call the search() function, it returns an empty result list. When I run the same search in my browser, it works and returns results. The package https://github.com/MarioVilas/googlesearch works too.

I managed to reproduce the issue by opening the link in a private window, and noticed that the cause is the content of the page, shown in these screenshots: [image] [image]

My problem looks similar to issue #5, but it is not exactly the same. I guess this has something to do with cookies, but I don't really know how to solve it. I tried multiple configurations of the SearchClient, but the problem is always the same.

Here are the logs: logs.txt

Do you have any idea?

opsdisk commented 3 months ago

Hi @rombru - apologies it took a few days to answer back. Can you provide me the entire command and switches you used?

rombru commented 3 months ago

Here is the code:

import yagooglesearch

client = yagooglesearch.SearchClient(
    "Paris",
    tld="com",
    lang_html_ui="fr",
    lang_result="lang_fr",
    tbs="li:1",
    max_search_result_urls_to_return=20,
    http_429_cool_off_time_in_minutes=45,
    http_429_cool_off_factor=1.5,
    verbosity=5,
    verbose_output=True,
)
client.assign_random_user_agent()

results = []

for result in client.search():
    print(result)

And here are the logs:

2024-05-10 15:13:24,125 [MainThread  ] [INFO] Requesting URL: https://www.google.com/
2024-05-10 15:13:24,885 [MainThread  ] [DEBUG]     status_code: 200
2024-05-10 15:13:24,885 [MainThread  ] [DEBUG]     headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4'}
2024-05-10 15:13:24,885 [MainThread  ] [DEBUG]     cookies: <RequestsCookieJar[<Cookie SOCS=CAAaBgiAx_WxBg for .google.com/>, <Cookie AEC=AQTF6Hx69P_kYG3EtIvcfGwbu_B-BX2NuoUD64fZgXUxLQmc99S60GpfTw for .google.com/>, <Cookie __Secure-ENID=19.SE=lY6fEcOnWjImUW4gHGpjFStmEmqTMePJ1iKBNVDNHgWYxXhgKbsAHfYv5no0t2F09H3rVAwBLp6dMbnXnEnLf5wj1oTxwrVRCPFfepWLhxAVEATkWO5q1x14qQULH8a1HndOsGPGfDIhWymH_kBJfZdsEWKHZa_hxTSmlVtzGqN7Gg73afgOD3ogSw for .google.com/>]>
2024-05-10 15:13:24,887 [MainThread  ] [DEBUG]     proxy: 
2024-05-10 15:13:24,887 [MainThread  ] [DEBUG]     verify_ssl: True
2024-05-10 15:13:24,888 [MainThread  ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=20
2024-05-10 15:13:24,888 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=fr&lr=lang_fr&q=Paris&num=100&btnG=Google+Search&tbs=li:1&safe=off&cr=&filter=0
2024-05-10 15:13:25,828 [MainThread  ] [DEBUG]     status_code: 200
2024-05-10 15:13:25,828 [MainThread  ] [DEBUG]     headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4'}
2024-05-10 15:13:25,828 [MainThread  ] [DEBUG]     cookies: <RequestsCookieJar[]>
2024-05-10 15:13:25,828 [MainThread  ] [DEBUG]     proxy: 
2024-05-10 15:13:25,828 [MainThread  ] [DEBUG]     verify_ssl: True
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://accounts.google.com/ServiceLogin?hl=fr&continue=https://www.google.com/search?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-none
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://accounts.google.com/ServiceLogin?hl=fr&continue=https://www.google.com/search?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-none
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://accounts.google.com/ServiceLogin?hl=fr&continue=https://www.google.com/search?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-none
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://accounts.google.com/ServiceLogin?hl=fr&continue=https://www.google.com/search?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-none
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/technologies/cookies?hl=fr&utm_source=ucb
2024-05-10 15:13:25,835 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/technologies/cookies?hl=fr&utm_source=ucb
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0&gl=NL&hl=fr&cm=2&pc=srp&uxe=none&src=1
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0&gl=NL&hl=fr&cm=2&pc=srp&uxe=none&src=1
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/privacy?hl=fr&utm_source=ucb
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/privacy?hl=fr&utm_source=ucb
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/terms?hl=fr&utm_source=ucb
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/terms?hl=fr&utm_source=ucb
2024-05-10 15:13:25,836 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2024-05-10 15:13:25,836 [MainThread  ] [INFO] No valid search results found on this page.  Moving on...

opsdisk commented 3 months ago

Thanks for that.

  1. I'm getting results with the pastables you provided from a US IP, but just to check, can you try again with these?
import yagooglesearch

query = "Paris"

client = yagooglesearch.SearchClient(
    query,
    tbs="li:1",
    max_search_result_urls_to_return=20,
    http_429_cool_off_time_in_minutes=45,
    http_429_cool_off_factor=1.5,
    verbosity=5,
    verbose_output=True,
)
client.assign_random_user_agent()

urls = client.search()

# Print the count explicitly; a bare len(urls) only shows output in a REPL.
print(len(urls))

for url in urls:
    print(url)
  2. Your source IP is in a European country. There have been some issues with this in the past, so https://github.com/opsdisk/yagooglesearch/blob/master/src/yagooglesearch/__init__.py#L374 was added. Are you able to source the search from a different IP (through a VPS, SSH tunnel, VPN, etc.)?

  3. If you're familiar with inspecting network traffic in browser dev tools (https://developer.chrome.com/docs/devtools/network), you can inspect the cookies, look for GOOGLE_ABUSE_EXEMPTION, copy that string, and pass it to google_exemption when instantiating yagooglesearch.SearchClient.

  4. It looks like googlesearch is accessing the local cookie jar (https://github.com/MarioVilas/googlesearch/blob/master/googlesearch/__init__.py#L89) and would possibly use the cookies you got when you accepted the terms on the page in the screenshot you provided. If you're comfortable in Python, you could try adding code to support that in yagooglesearch. yagooglesearch uses Python requests, though, so it would take some research: https://requests.readthedocs.io/en/latest/user/quickstart/#cookies
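
For the GOOGLE_ABUSE_EXEMPTION suggestion above, here is a minimal sketch of pulling one cookie's value out of a raw Cookie header copied from the browser's network tab. The helper name cookie_value and the sample header are hypothetical, and the exemption value is a made-up placeholder; only the SOCS value comes from the logs in this thread.

```python
from typing import Optional


def cookie_value(cookie_header: str, name: str) -> Optional[str]:
    """Return the value of cookie `name` from a raw Cookie header, or None."""
    for pair in cookie_header.split(";"):
        # Split on the first "=" only, since cookie values may contain "=".
        key, sep, value = pair.strip().partition("=")
        if sep and key == name:
            return value
    return None


# Hypothetical header copied from the browser's network tab; the
# GOOGLE_ABUSE_EXEMPTION value here is a made-up placeholder.
raw_header = "SOCS=CAAaBgiAx_WxBg; GOOGLE_ABUSE_EXEMPTION=ID=placeholder:TM=0"

exemption = cookie_value(raw_header, "GOOGLE_ABUSE_EXEMPTION")
print(exemption)

# Per the suggestion in this thread, the value would then be passed along:
# client = yagooglesearch.SearchClient("Paris", google_exemption=exemption)
```

If the cookie is absent, as it may be when Google serves a consent page rather than a captcha, the helper returns None and there is nothing to pass.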

rombru commented 3 months ago

Thanks, I already tried several options:

  1. It doesn't work and gives me the same empty result list.
  2. Tried with my US VPN, and it does work.
  3. I haven't been able to find that cookie. Since it's not a captcha page, I'm not sure Google sets that kind of cookie. From what I saw, I only have a "SOCS" cookie.

I will probably try to look into it a bit more when I have some time.

opsdisk commented 3 months ago

I haven't run into a "SOCS" cookie yet. I would love to see a screenshot or pastable of what's in it. I wish the library didn't depend on geolocation to work correctly, and I hate saying "Just use your US VPN", but that may be the easiest solution for you right now.
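
In the debug logs earlier in this thread, every extracted link pointed at a Google host, and one of them was on consent.google.com. That suggests, though it does not prove, that the response was a cookie-consent page rather than a results page. A rough sketch of flagging that case from a list of extracted links follows; the function name looks_like_consent_page is hypothetical and not part of yagooglesearch, and the Wikipedia URL is only an illustrative non-Google result.

```python
from urllib.parse import urlparse

# Host seen in the "Excluding URL" log lines from this thread.
CONSENT_HOST = "consent.google.com"


def looks_like_consent_page(links):
    """Return True if any extracted link points at Google's consent host."""
    return any(urlparse(link).hostname == CONSENT_HOST for link in links)


# Two of the links taken from the logs posted above.
logged_links = [
    "https://policies.google.com/technologies/cookies?hl=fr&utm_source=ucb",
    "https://consent.google.com/dl?continue=https://www.google.com/search"
    "?hl%3Dfr%26lr%3Dlang_fr%26q%3DParis%26num%3D100%26btnG%3DGoogle%2BSearch"
    "%26tbs%3Dli:1%26safe%3Doff%26cr%3D%26filter%3D0"
    "&gl=NL&hl=fr&cm=2&pc=srp&uxe=none&src=1",
]

print(looks_like_consent_page(logged_links))
print(looks_like_consent_page(["https://fr.wikipedia.org/wiki/Paris"]))
```

A check like this only distinguishes "no results" from "consent page served"; it does not get past the consent page itself.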