opsdisk / yagooglesearch

Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches.
BSD 3-Clause "New" or "Revised" License

Got no results #5

Closed LeMoussel closed 2 years ago

LeMoussel commented 2 years ago

I wrote this simple function and ran it, and it returned an empty list. This was my first time using the module, so it shouldn't be a rate-limiting issue. I also waited a long time and retried, but still got no results.

    import yagooglesearch

    gg_query = "topic cluster"
    gg_search = yagooglesearch.SearchClient(
        gg_query,
        verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
    )
    gg_search.assign_random_user_agent()

    urls = gg_search.search()

Result:

2021-11-06 09:07:23,558 [MainThread  ] [INFO] Requesting URL: https://www.google.com/
2021-11-06 09:07:23,727 [MainThread  ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-06 09:07:23,727 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=en&filter=0
2021-11-06 09:07:23,906 [MainThread  ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page.  That implies there won't be any search results on the next page either.  Moving on...

whereas with the original googlesearch library, I get a result with this code:

    from googlesearch import search
    for url in search(gg_query, stop=20):
        print(url)

opsdisk commented 2 years ago

Thanks for submitting an issue @LeMoussel

1) Do you mind providing the Operating System, Python version, yagooglesearch version, and how you installed yagooglesearch (pip vs git clone)?

2) Do you get any results through the Web GUI when you use the URL below?

https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=en&filter=0

3) Shot in the dark, but you could try changing the language (assuming your IP is sourcing from France):

import yagooglesearch

gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
gg_search.assign_random_user_agent()
gg_search.lang = "fr"
gg_search.update_urls()

urls = gg_search.search()

Once in a while, no results will show up the first time, but it will work the second time. Haven't quite nailed down that issue though or why it happens. I tried your search and didn't have any issues.

image
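To compare the generated search URL with what a browser sends, its query string can be decoded with the standard library. This is only a minimal sketch: the helper name `query_params` is made up for illustration and is not part of yagooglesearch.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the log output in this issue.
URL = (
    "https://www.google.com/search?hl=en&q=topic+cluster&num=100"
    "&btnG=Google+Search&tbs=0&safe=off&cr=en&filter=0"
)


def query_params(url):
    """Return the query-string parameters of url as a flat dict (hypothetical helper)."""
    # keep_blank_values=True keeps parameters that are present but empty.
    pairs = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Each parameter appears once in this URL, so keep only the first value.
    return {name: values[0] for name, values in pairs.items()}


params = query_params(URL)
for name, value in params.items():
    print(f"{name} = {value!r}")
```

Printing the parameters one per line makes it easier to spot differences such as the language (`hl`) or country (`cr`) values between two runs.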

LeMoussel commented 2 years ago

"Do you get any results through the Web GUI when you use the URL below?" Yes: screenshot

"assuming your IP is sourcing from France": Yes, but I want to get results in the 'en' language. Note: I tested your code with gg_search.lang = "fr" and still don't get any results. image

LeMoussel commented 2 years ago

I may have found the reason. I get this HTML page: result_gg.txt. It looks as if the cookies were missing or misplaced. I notice that the cookie handling in your get_page() code is different from that of MarioVilas/googlesearch/get_page().

opsdisk commented 2 years ago

Interesting. The Python requests library manages the cookies/cookiejar. I tried it on Windows 10 with Python 3.9.6 and didn't have any issues.

image

You could try

1) A fresh virtual environment and re-install

2) Printing out the cookie before and after the search. Should be none for the first one, and then populated for the next after calling .search()

import yagooglesearch

gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
print(gg_search.cookies)

gg_search.assign_random_user_agent()

urls = gg_search.search()
print(gg_search.cookies)

Not sure what else to recommend at this point. We can keep the issue open to see if anyone else runs into it as well.

kusinhavre commented 2 years ago

I also used this library for the first time today, and I think I ran into the same issue. For the search I did it returns an empty list, although when I open the generated url for the query in an ordinary web browser window it shows me 3 results.

opsdisk commented 2 years ago

Appreciate the additional data point, @kusinhavre. To assist me, do you mind providing the Operating System, Python version, yagooglesearch version, and how you installed yagooglesearch (pip vs git clone)?

kusinhavre commented 2 years ago

Absolutely:

(venv) %pip show yagooglesearch
Name: yagooglesearch
Version: 1.2.0
Summary: A Python library for executing intelligent, realistic-looking, and tunable Google searches.
Home-page: https://github.com/opsdisk/yagooglesearch
Author: Brennon Thomas
Author-email: info@opsdisk.com
License: BSD 3-Clause "New" or "Revised" License
Location: /venv/lib/python3.9/site-packages
Requires: requests, beautifulsoup4, requests
Required-by: 
(venv)  % python --version
Python 3.9.4
>>> import platform
>>> platform.platform()
'macOS-10.16-x86_64-i386-64bit'
>>> platform.system()
'Darwin'
>>> platform.release()
'20.6.0'
>>> import os 
>>> os.name
'posix'

Regards /H

LeMoussel commented 2 years ago

    import yagooglesearch

    gg_query = "topic cluster"
    gg_search = yagooglesearch.SearchClient(
        gg_query,
        verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
    )
    yagooglesearch.ROOT_LOGGER.info(f'Cokkies: {gg_search.cookies}')

    gg_search.assign_random_user_agent()

    urls = gg_search.search()
    yagooglesearch.ROOT_LOGGER.info(f'len urls: {len(urls)}')
    yagooglesearch.ROOT_LOGGER.info(f'Cokkies: {gg_search.cookies}')

Got this:

2021-11-11 08:05:35,656 [MainThread  ] [INFO] Cokkies: None
2021-11-11 08:05:35,656 [MainThread  ] [INFO] Requesting URL: https://www.google.com/
2021-11-11 08:05:35,794 [MainThread  ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-11 08:05:35,794 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2021-11-11 08:05:35,995 [MainThread  ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page.  That implies there won't be any search results on the next page either.  Moving on...
2021-11-11 08:05:35,995 [MainThread  ] [INFO] len urls: 0
2021-11-11 08:05:35,995 [MainThread  ] [INFO] Cokkies: <RequestsCookieJar[]>

As you say, it's None before the search and a RequestsCookieJar after calling .search(), although the jar is empty, and the search still returns an empty list.

In debug mode, at __init__.py#L395, I got this HTML result: yGG.html.txt, which suggests the cookie is not being used correctly.

opsdisk commented 2 years ago

@kusinhavre I'm stumped right now. I ran it on a Mac without any issues. Let me know if I overlooked something with the os and platform output.

macos_pagodo

opsdisk commented 2 years ago

@LeMoussel, looks like it may be a cookie thing. I'll dig into requests some more and see what I can find. The screenshot you provided should show a lot more info. Something like:

image

opsdisk commented 2 years ago

@LeMoussel / @kusinhavre mind setting verbosity=5, re-running, and pasting the output here?
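For reference, the code comments in the snippets above pair verbosity values with Python logging levels (INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6). The sketch below assumes the linear relation those three pairs imply. It has not been checked against the yagooglesearch source, and `verbosity_to_log_level` is a hypothetical helper.

```python
import logging


def verbosity_to_log_level(verbosity: int) -> int:
    """Map a verbosity value to a numeric logging level (assumed relation)."""
    # Assumption inferred from the comment's pairs: 4 -> 20, 5 -> 10, 6 -> 0.
    return (6 - verbosity) * 10


for verbosity in (4, 5, 6):
    level = verbosity_to_log_level(verbosity)
    print(verbosity, level, logging.getLevelName(level))
```

Under that assumption, verbosity=4 only emits INFO lines, which matches the earlier logs, while verbosity=5 adds the DEBUG lines.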

LeMoussel commented 2 years ago

Log with verbosity=5

2021-11-12 09:47:00,633 [MainThread ] [DEBUG] cookies: <RequestsCookieJar[]> => Perhaps the problem is that this RequestsCookieJar is empty.

Note: I don't know if this has any effect, but @kusinhavre and I are both using Python 3.9.

2021-11-12 09:46:50,576 [MainThread  ] [INFO] Cokkies: None
2021-11-12 09:46:50,577 [MainThread  ] [INFO] Requesting URL: https://www.google.com/
2021-11-12 09:46:50,781 [MainThread  ] [DEBUG]     status_code: 200
2021-11-12 09:46:50,782 [MainThread  ] [DEBUG]     headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.134 Safari/534.16'}
2021-11-12 09:46:50,782 [MainThread  ] [DEBUG]     cookies: <RequestsCookieJar[<Cookie CONSENT=PENDING+741 for .google.com/>]>
2021-11-12 09:46:50,783 [MainThread  ] [DEBUG]     proxy: 
2021-11-12 09:46:50,784 [MainThread  ] [DEBUG]     verify_ssl: True
2021-11-12 09:46:50,785 [MainThread  ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-12 09:46:50,785 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2021-11-12 09:47:00,632 [MainThread  ] [DEBUG]     status_code: 200
2021-11-12 09:47:00,633 [MainThread  ] [DEBUG]     headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.134 Safari/534.16'}
2021-11-12 09:47:00,633 [MainThread  ] [DEBUG]     cookies: <RequestsCookieJar[]>
2021-11-12 09:47:00,634 [MainThread  ] [DEBUG]     proxy: 
2021-11-12 09:47:00,634 [MainThread  ] [DEBUG]     verify_ssl: True
2021-11-12 09:47:00,713 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 09:47:00,714 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 09:47:00,714 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,715 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 09:47:00,716 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 09:47:00,717 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,718 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=FR&hl=en&pc=srp&src=1
2021-11-12 09:47:00,718 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=FR&hl=en&pc=srp&src=1
2021-11-12 09:47:00,719 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,720 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 09:47:00,721 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 09:47:00,722 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,723 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 09:47:00,724 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 09:47:00,725 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,726 [MainThread  ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page.  That implies there won't be any search results on the next page either.  Moving on...
2021-11-12 09:47:00,726 [MainThread  ] [INFO] len urls: 0
2021-11-12 09:47:00,727 [MainThread  ] [INFO] Cokkies: <RequestsCookieJar[]>

kusinhavre commented 2 years ago

Log with verbosity=5:

import yagooglesearch
query = 'hello'
client = yagooglesearch.SearchClient(query, max_search_result_urls_to_return=10, http_429_cool_off_time_in_minutes=45, http_429_cool_off_factor=1.5, verbosity=5)
client.assign_random_user_agent()
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.669.0 Safari/534.20'
urls = client.search()
2021-11-12 23:09:47,572 [MainThread  ] [INFO] Requesting URL: https://www.google.com/
2021-11-12 23:09:47,739 [MainThread  ] [DEBUG]     status_code: 200
2021-11-12 23:09:47,739 [MainThread  ] [DEBUG]     headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.669.0 Safari/534.20'}
2021-11-12 23:09:47,739 [MainThread  ] [DEBUG]     cookies: <RequestsCookieJar[<Cookie CONSENT=PENDING+510 for .google.com/>]>
2021-11-12 23:09:47,739 [MainThread  ] [DEBUG]     proxy: 
2021-11-12 23:09:47,739 [MainThread  ] [DEBUG]     verify_ssl: True
2021-11-12 23:09:47,740 [MainThread  ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=10
2021-11-12 23:09:47,740 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=hello&num=100&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2021-11-12 23:09:47,922 [MainThread  ] [DEBUG]     status_code: 200
2021-11-12 23:09:47,922 [MainThread  ] [DEBUG]     headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.669.0 Safari/534.20'}
2021-11-12 23:09:47,922 [MainThread  ] [DEBUG]     cookies: <RequestsCookieJar[]>
2021-11-12 23:09:47,922 [MainThread  ] [DEBUG]     proxy: 
2021-11-12 23:09:47,922 [MainThread  ] [DEBUG]     verify_ssl: True
2021-11-12 23:09:47,931 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 23:09:47,931 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 23:09:47,931 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,931 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=SE&hl=en&pc=srp&src=1
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=SE&hl=en&pc=srp&src=1
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread  ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread  ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page.  That implies there won't be any search results on the next page either.  Moving on...
urls
[]

opsdisk commented 2 years ago

You both have a CONSENT=PENDING... cookie when yagooglesearch first browses to google.com. I did find something in this project based off sourcing from an EU country (https://github.com/benbusby/whoogle-search/issues/243)
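The pending-consent state seen in both logs can be checked with a few lines of code. This is a minimal sketch, not the library's implementation: `consent_is_pending` is a hypothetical helper, and it assumes the cookies have already been converted to a plain dict (for a requests cookie jar, `requests.utils.dict_from_cookiejar()` does that).

```python
from typing import Mapping


def consent_is_pending(cookies: Mapping[str, str]) -> bool:
    """Return True if the CONSENT cookie is still in the PENDING state."""
    # A missing CONSENT cookie is treated as "not pending".
    return cookies.get("CONSENT", "").startswith("PENDING")


# Cookie values copied from the two debug logs in this issue.
print(consent_is_pending({"CONSENT": "PENDING+741"}))
print(consent_is_pending({"CONSENT": "PENDING+510"}))
# An empty cookie set, as printed after the search request.
print(consent_is_pending({}))
```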

Assuming you are sourcing from France @LeMoussel, are you also sourcing from an EU country @kusinhavre?

To test this, I spun up a VPS server in France and ran into the same issue:

image

Not sure what I can do within yagooglesearch besides warn the user to source from a different IP address. It may work through the GUI because you're logged into a Google account (and have accepted their cookie consent legalese). For now, you'll have to source from a non-EU IP address to get around this. Let me know your thoughts...
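One way such a warning could be triggered is to look at the links extracted from the response. In both debug logs every link was on a Google host and one of them pointed at consent.google.com. The sketch below is a heuristic under that assumption, with a hypothetical helper name, and is not what yagooglesearch does.

```python
from urllib.parse import urlsplit


def looks_like_consent_page(links):
    """Heuristic: all links are on Google hosts and one is consent.google.com."""
    hosts = [urlsplit(link).hostname or "" for link in links]
    if not hosts:
        return False
    all_google = all(
        host == "google.com" or host.endswith(".google.com") for host in hosts
    )
    return all_google and "consent.google.com" in hosts


# Links abbreviated from the debug log (the long "continue=" parameter is dropped).
links_from_log = [
    "https://accounts.google.com/ServiceLogin?hl=en",
    "https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb",
    "https://consent.google.com/dl?gl=FR&hl=en&pc=srp&src=1",
    "https://policies.google.com/privacy?hl=en&utm_source=ucb",
    "https://policies.google.com/terms?hl=en&utm_source=ucb",
]

if looks_like_consent_page(links_from_log):
    print("Response looks like a cookie consent page, not search results.")
```

A check like this only explains why the result list is empty; it does not get past the consent page.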

LeMoussel commented 2 years ago

Yes, I'm sourcing from France. There must be a solution to this, because with the googlesearch library I get the results and whoogle-search also seems to have solved the EU cookie consent management.

kusinhavre commented 2 years ago

I can confirm that I'm sourcing from an EU country. For the rest, I agree with @LeMoussel that the googlesearch library works fine for me as well, and that is what I'm using for the moment instead.

opsdisk commented 2 years ago

Thanks for that info @LeMoussel / @kusinhavre - agree there's a solution to this. Give me a week or two to dig into it some more.

opsdisk commented 2 years ago

@LeMoussel / @kusinhavre

Mind testing this out when you have some time?

Branch is issue-5-eu-countries-require-cookie-modification: https://github.com/opsdisk/yagooglesearch/tree/issue-5-eu-countries-require-cookie-modification

PR: https://github.com/opsdisk/yagooglesearch/pull/6/files

kusinhavre commented 2 years ago

Hi again and sorry for replying this late, but I have not had time to test this until today.

Anyway, I did a test today, and it seems to work as expected, at least in my view. I passed the same query to both googlesearch and yagooglesearch, and got the same result:

urls
['https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger', 'https://sverigesradio.se/vast']
list_urls
['https://sverigesradio.se/vast', 'https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger']

urls is the result from yagooglesearch and list_urls is from googlesearch. Only the order differs, which doesn't matter to me.

Excellent work, thank you so much for the effort!
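The comparison described above, same URLs in a different order, can be checked programmatically. A minimal sketch using the two lists from this comment; `same_results_ignoring_order` is a hypothetical helper name.

```python
def same_results_ignoring_order(first, second):
    """Return True if both lists hold the same URLs, regardless of order."""
    # Sorting (rather than using sets) also catches differing duplicate counts.
    return sorted(first) == sorted(second)


# Result from yagooglesearch.
urls = [
    "https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger",
    "https://sverigesradio.se/vast",
]
# Result from googlesearch.
list_urls = [
    "https://sverigesradio.se/vast",
    "https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger",
]

print(same_results_ignoring_order(urls, list_urls))  # True
print(urls == list_urls)  # False, because the order differs
```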

opsdisk commented 2 years ago

Thanks for testing it out @kusinhavre - I'll merge it into master soon.

LeMoussel commented 2 years ago

Hi. Sorry for replying this late, but I have not had time to test this until today. It's OK and it seems to work as expected. Good job! Here is the log with verbosity=5: yagooglesearch.py.log

opsdisk commented 2 years ago

Merged https://github.com/opsdisk/yagooglesearch/pull/6 into master