Closed LeMoussel closed 2 years ago
Thanks for submitting an issue @LeMoussel
1) Do you mind providing the Operating System, Python version, yagooglesearch version, and how you installed yagooglesearch (pip vs git clone)?
2) Do you get any results through the Web GUI when you use the URL below?
3) Shot in the dark, but you could try changing the language (assuming your IP is sourcing from France):
import yagooglesearch
gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
gg_search.assign_random_user_agent()
gg_search.lang = "fr"
gg_search.update_urls()
urls = gg_search.search()
Once in a while, no results will show up the first time, but it will work the second time. Haven't quite nailed down that issue though or why it happens. I tried your search and didn't have any issues.
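Since an empty first attempt sometimes succeeds on the second try, a small retry wrapper can be a stopgap. This is only a sketch: `search_with_retry` is a hypothetical helper, not part of yagooglesearch, and the attempt count and delay are arbitrary assumptions. It takes any zero-argument callable that returns a list.

```python
import time
from typing import Callable, List


def search_with_retry(
    search_fn: Callable[[], List[str]],
    attempts: int = 2,
    delay_seconds: float = 30.0,
) -> List[str]:
    """Call search_fn until it returns a non-empty list or attempts run out.

    Hypothetical helper for illustration; not part of yagooglesearch.
    """
    results: List[str] = []
    for attempt in range(1, attempts + 1):
        results = search_fn()
        if results:
            break
        if attempt < attempts:
            # Pause before retrying so the retries are not sent back to back.
            time.sleep(delay_seconds)
    return results


# Assumed usage with the client from the snippet above:
# urls = search_with_retry(gg_search.search)
```

Keeping the attempt count low and the delay generous seems prudent, since repeated automated queries are what usually trigger rate limiting.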
Do you get any results through the Web GUI when you use the URL below? Yes.
assuming your IP is sourcing from France
Yes, but I want to produce results in the 'en' language.
Note: I tested your code with gg_search.lang = "fr". I don't get any results.
I may have found the reason. I get this HTML page: result_gg.txt. It looks as if the cookies were missing or misplaced. I notice that the cookie handling in your get_page() code is different from that of MarioVilas/googlesearch/get_page()
Interesting. The Python requests library manages the cookies/cookiejar. I tried it on Windows 10 with Python 3.9.6 and didn't have any issues.
You could try
1) A fresh virtual environment and re-install
2) Printing out the cookie before and after the search. Should be none for the first one, and then populated for the next after calling .search()
import yagooglesearch
gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
print(gg_search.cookies)
gg_search.assign_random_user_agent()
urls = gg_search.search()
print(gg_search.cookies)
Not sure what else to recommend at this point. We can keep the issue open to see if anyone else runs into it as well.
I also used this library for the first time today, and I think I ran into the same issue. For the search I did it returns an empty list, although when I open the generated url for the query in an ordinary web browser window it shows me 3 results.
Appreciate the additional data point @kusinhavre To assist me, do you mind providing the Operating System, Python version, yagooglesearch version, and how you installed yagooglesearch (pip vs git clone)?
Absolutely:
(venv) %pip show yagooglesearch
Name: yagooglesearch
Version: 1.2.0
Summary: A Python library for executing intelligent, realistic-looking, and tunable Google searches.
Home-page: https://github.com/opsdisk/yagooglesearch
Author: Brennon Thomas
Author-email: info@opsdisk.com
License: BSD 3-Clause "New" or "Revised" License
Location: /venv/lib/python3.9/site-packages
Requires: requests, beautifulsoup4, requests
Required-by:
(venv) % python --version
Python 3.9.4
>>> import platform
>>> platform.platform()
'macOS-10.16-x86_64-i386-64bit'
>>> platform.system()
'Darwin'
>>> platform.release()
'20.6.0'
>>> import os
>>> os.name
'posix'
Regards /H
import yagooglesearch
gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: INFO (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
yagooglesearch.ROOT_LOGGER.info(f'Cokkies: {gg_search.cookies}')
gg_search.assign_random_user_agent()
urls = gg_search.search()
yagooglesearch.ROOT_LOGGER.info(f'len urls: {len(urls)}')
yagooglesearch.ROOT_LOGGER.info(f'Cokkies: {gg_search.cookies}')
Got this:
2021-11-11 08:05:35,656 [MainThread ] [INFO] Cokkies: None
2021-11-11 08:05:35,656 [MainThread ] [INFO] Requesting URL: https://www.google.com/
2021-11-11 08:05:35,794 [MainThread ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-11 08:05:35,794 [MainThread ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2021-11-11 08:05:35,995 [MainThread ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page. That implies there won't be any search results on the next page either. Moving on...
2021-11-11 08:05:35,995 [MainThread ] [INFO] len urls: 0
2021-11-11 08:05:35,995 [MainThread ] [INFO] Cokkies: <RequestsCookieJar[]>
As you say, it's None before the search and set after calling .search() (though the jar itself is empty), but the search returns an empty list.
In debug mode, at __init__.py#L395, I got this HTML result: yGG.html.txt, which suggests that the cookie is not being used correctly.
@kusinhavre I'm stumped right now. I ran it on a Mac without any issues. Let me know if I overlooked something with the os and platform output.
@LeMoussel, looks like it may be a cookie thing. I'll dig into requests some more and see what I can find. The screenshot you provided should show a lot more info. Something like:
@LeMoussel / @kusinhavre mind setting verbosity=5, re-running, and pasting the output here?
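For anyone wondering why 5 rather than 4: the comment in the snippets above pairs verbosity 4 with INFO (20), 5 with DEBUG (10), and 6 with NOTSET (0). The arithmetic below reproduces those three pairs. The formula is inferred from that comment, and extending it to verbosity 1 through 3 is an assumption; `verbosity_to_log_level` is a hypothetical name.

```python
import logging


def verbosity_to_log_level(verbosity: int) -> int:
    """Map a verbosity value to a Python logging level.

    Inferred from the comment in the snippets: 4 -> INFO (20),
    5 -> DEBUG (10), 6 -> NOTSET (0). Values 1-3 are an extrapolation.
    """
    return (6 - verbosity) * 10


for verbosity in (4, 5, 6):
    level = verbosity_to_log_level(verbosity)
    # logging.getLevelName turns the numeric level into its standard name.
    print(verbosity, level, logging.getLevelName(level))
```

Under this mapping, verbosity=4 hides the DEBUG lines (status code, headers, cookies) that are needed to diagnose the empty result.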
Log with verbosity=5
2021-11-12 09:47:00,633 [MainThread ] [DEBUG] cookies: <RequestsCookieJar[]>
=> This is perhaps the problem: the RequestsCookieJar is empty.
Note: I don't know if this has any effect, but @kusinhavre and I are both using Python 3.9.
2021-11-12 09:46:50,576 [MainThread ] [INFO] Cokkies: None
2021-11-12 09:46:50,577 [MainThread ] [INFO] Requesting URL: https://www.google.com/
2021-11-12 09:46:50,781 [MainThread ] [DEBUG] status_code: 200
2021-11-12 09:46:50,782 [MainThread ] [DEBUG] headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.134 Safari/534.16'}
2021-11-12 09:46:50,782 [MainThread ] [DEBUG] cookies: <RequestsCookieJar[<Cookie CONSENT=PENDING+741 for .google.com/>]>
2021-11-12 09:46:50,783 [MainThread ] [DEBUG] proxy:
2021-11-12 09:46:50,784 [MainThread ] [DEBUG] verify_ssl: True
2021-11-12 09:46:50,785 [MainThread ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-12 09:46:50,785 [MainThread ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2021-11-12 09:47:00,632 [MainThread ] [DEBUG] status_code: 200
2021-11-12 09:47:00,633 [MainThread ] [DEBUG] headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.134 Safari/534.16'}
2021-11-12 09:47:00,633 [MainThread ] [DEBUG] cookies: <RequestsCookieJar[]>
2021-11-12 09:47:00,634 [MainThread ] [DEBUG] proxy:
2021-11-12 09:47:00,634 [MainThread ] [DEBUG] verify_ssl: True
2021-11-12 09:47:00,713 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 09:47:00,714 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 09:47:00,714 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,715 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 09:47:00,716 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 09:47:00,717 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,718 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=FR&hl=en&pc=srp&src=1
2021-11-12 09:47:00,718 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dtopic%2Bcluster%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=FR&hl=en&pc=srp&src=1
2021-11-12 09:47:00,719 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,720 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 09:47:00,721 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 09:47:00,722 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,723 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 09:47:00,724 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 09:47:00,725 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 09:47:00,726 [MainThread ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page. That implies there won't be any search results on the next page either. Moving on...
2021-11-12 09:47:00,726 [MainThread ] [INFO] len urls: 0
2021-11-12 09:47:00,727 [MainThread ] [INFO] Cokkies: <RequestsCookieJar[]>
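The DEBUG lines above show why the list ends up empty: every link on the returned page points at a Google host, so each one is excluded. A minimal sketch of that kind of filter, assuming a plain substring check on the hostname (the library's actual filter_search_result_urls() may work differently, and `drop_google_links` is a hypothetical name):

```python
from typing import List
from urllib.parse import urlsplit


def drop_google_links(links: List[str]) -> List[str]:
    """Keep only links whose hostname does not contain "google"."""
    kept = []
    for link in links:
        host = urlsplit(link).hostname or ""
        if "google" in host.lower():
            # Mirrors the 'Excluding URL because it contains "google"' log line.
            continue
        kept.append(link)
    return kept


# Three of the links from the log above.
consent_page_links = [
    "https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb",
    "https://policies.google.com/privacy?hl=en&utm_source=ucb",
    "https://policies.google.com/terms?hl=en&utm_source=ucb",
]
print(drop_google_links(consent_page_links))  # prints []
```

When a page contains nothing but such links, it is most likely an interstitial rather than a results page.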
Log with verbosity=5:
import yagooglesearch
query = 'hello'
client = yagooglesearch.SearchClient(query, max_search_result_urls_to_return=10, http_429_cool_off_time_in_minutes=45, http_429_cool_off_factor=1.5, verbosity=5)
client.assign_random_user_agent()
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.669.0 Safari/534.20'
urls = client.search()
2021-11-12 23:09:47,572 [MainThread ] [INFO] Requesting URL: https://www.google.com/
2021-11-12 23:09:47,739 [MainThread ] [DEBUG] status_code: 200
2021-11-12 23:09:47,739 [MainThread ] [DEBUG] headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.669.0 Safari/534.20'}
2021-11-12 23:09:47,739 [MainThread ] [DEBUG] cookies: <RequestsCookieJar[<Cookie CONSENT=PENDING+510 for .google.com/>]>
2021-11-12 23:09:47,739 [MainThread ] [DEBUG] proxy:
2021-11-12 23:09:47,739 [MainThread ] [DEBUG] verify_ssl: True
2021-11-12 23:09:47,740 [MainThread ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=10
2021-11-12 23:09:47,740 [MainThread ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=hello&num=100&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2021-11-12 23:09:47,922 [MainThread ] [DEBUG] status_code: 200
2021-11-12 23:09:47,922 [MainThread ] [DEBUG] headers: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.669.0 Safari/534.20'}
2021-11-12 23:09:47,922 [MainThread ] [DEBUG] cookies: <RequestsCookieJar[]>
2021-11-12 23:09:47,922 [MainThread ] [DEBUG] proxy:
2021-11-12 23:09:47,922 [MainThread ] [DEBUG] verify_ssl: True
2021-11-12 23:09:47,931 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 23:09:47,931 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gae=cb-
2021-11-12 23:09:47,931 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,931 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/technologies/cookies?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=SE&hl=en&pc=srp&src=1
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://consent.google.com/dl?continue=https://www.google.com/search?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0%26safe%3Doff%26cr%3D%26filter%3D0&gl=SE&hl=en&pc=srp&src=1
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/privacy?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] pre filter_search_result_urls() link: https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] Excluding URL because it contains "google": https://policies.google.com/terms?hl=en&utm_source=ucb
2021-11-12 23:09:47,932 [MainThread ] [DEBUG] post filter_search_result_urls() link: None
2021-11-12 23:09:47,932 [MainThread ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page. That implies there won't be any search results on the next page either. Moving on...
urls
[]
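Both logs contain a consent.google.com link, one with gl=FR and one with gl=SE. That parameter appears to carry a country code, which is an inference from these logs rather than documented behavior. A small sketch for pulling it out of such a link, with `consent_region` as a hypothetical helper name:

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def consent_region(link: str) -> Optional[str]:
    """Return the gl value of a consent.google.com link, else None."""
    parts = urlsplit(link)
    if parts.hostname != "consent.google.com":
        return None
    values = parse_qs(parts.query).get("gl")
    return values[0] if values else None


# The consent link from the log above, split across lines for readability.
link = (
    "https://consent.google.com/dl?continue=https://www.google.com/search"
    "?hl%3Den%26q%3Dhello%26num%3D100%26btnG%3DGoogle%2BSearch%26tbs%3D0"
    "%26safe%3Doff%26cr%3D%26filter%3D0&gl=SE&hl=en&pc=srp&src=1"
)
print(consent_region(link))  # prints SE
```

Seeing this link among the excluded URLs is a quick hint that the response was a consent page and not search results.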
You both have a CONSENT=PENDING... cookie when yagooglesearch first browses to google.com. I did find something in this project related to sourcing from an EU country (https://github.com/benbusby/whoogle-search/issues/243).
Assuming you are sourcing from France @LeMoussel, are you also sourcing from an EU country @kusinhavre?
To test this, I spun up a VPS server in France and ran into the same issue:
Not sure what I can do within yagooglesearch besides warn the user to source from a different IP address. It may work through the GUI because you're logged into a Google account (and have accepted their cookie consent legalese). For now, you'll have to source from a non-EU IP address to get around this. Let me know your thoughts...
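As a starting point for such a warning, the pending consent cookie could be detected before searching. This is only a sketch under assumptions: `consent_is_pending` is a hypothetical helper, it works on a plain name-to-value mapping, and the "PENDING" prefix is taken from the CONSENT=PENDING+741 and CONSENT=PENDING+510 values in the logs above.

```python
from typing import Mapping


def consent_is_pending(cookies: Mapping[str, str]) -> bool:
    """Return True if the CONSENT cookie value starts with "PENDING".

    Hypothetical helper. A RequestsCookieJar can be converted to a plain
    dict with requests.utils.dict_from_cookiejar() before calling this.
    """
    return cookies.get("CONSENT", "").startswith("PENDING")


print(consent_is_pending({"CONSENT": "PENDING+741"}))  # prints True
print(consent_is_pending({}))  # prints False
```

A check like this only detects the situation; it does not get past the consent page on its own.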
Yes, I'm sourcing from France. There must be a solution to this, because with the googlesearch library I get the results and whoogle-search also seems to have solved the EU cookie consent management.
I can confirm that I'm sourcing from an EU country. For the rest, I agree with @LeMoussel that the googlesearch library works fine for me as well, and that is what I'm using for the moment instead.
Thanks for that info @LeMoussel / @kusinhavre - agree there's a solution to this. Give me a week or two to dig into it some more.
@LeMoussel / @kusinhavre
Mind testing this out when you have some time?
Branch is issue-5-eu-countries-require-cookie-modification: https://github.com/opsdisk/yagooglesearch/tree/issue-5-eu-countries-require-cookie-modification
Hi again and sorry for replying this late, but I have not had time to test this until today.
Anyway, I did a test today, and it seems to work as expected, at least in my view. I passed the same query to both googlesearch and yagooglesearch, and got the same result:
urls
['https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger', 'https://sverigesradio.se/vast']
list_urls
['https://sverigesradio.se/vast', 'https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger']
urls is the result from yagooglesearch and list_urls is from googlesearch. Only the order is different, but that is not important to me.
Excellent work, thank you so much for the effort!
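When comparing the output of the two libraries, an order-insensitive check avoids false mismatches like the one above. A minimal sketch using the two lists from this comment:

```python
# Results copied from the comment above.
urls = [
    "https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger",
    "https://sverigesradio.se/vast",
]
list_urls = [
    "https://sverigesradio.se/vast",
    "https://sverigesradio.se/artikel/modo-med-fullt-os-sakrade-seger",
]

# Sets ignore ordering, so this compares only which URLs were returned.
same_results = set(urls) == set(list_urls)
print(same_results)  # prints True
```

Sets also drop duplicates, so this check is only appropriate when repeated URLs do not matter.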
Thanks for testing it out @kusinhavre - I'll merge it into master soon.
Hi. Sorry for replying this late, but I have not had time to test this until today. It's OK and it seems to work as expected. Good job! Here is the log with verbosity=5: yagooglesearch.py.log
Merged https://github.com/opsdisk/yagooglesearch/pull/6 into master
I wrote a simple function and ran it. It returned an empty array. This was my first time using the module, so it shouldn't be a rate-limiting issue. I also waited a long time and retried, but still got no results.
Result:
whereas with the original googlesearch library, I get a result with this code: