opsdisk / yagooglesearch

Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches.
BSD 3-Clause "New" or "Revised" License

Always no results on next page #11

Closed NicoloCipiti closed 2 years ago

NicoloCipiti commented 2 years ago

First of all, thanks for the great tool. It seems to fail on the second page of any search, though: "No valid search results found on this page. Moving on..."

opsdisk commented 2 years ago

Hi @Hoculus - thanks for submitting an issue.

NicoloCipiti commented 2 years ago

query = "inurl:xxx/xxxxx"
client = yagooglesearch.SearchClient(
    query,
    lang="it",
    verbosity=4,
    tbs="li:1",
    proxy=PROXY,
    verify_ssl=False,
    num=10,
    max_search_result_urls_to_return=100,
    minimum_delay_between_paged_results_in_seconds=3,
    yagooglesearch_manages_http_429s=False,  # Add to manage HTTP 429s.
)

I've tried a bunch of settings, but I always get the same output:

No valid search results found on this page. Moving on...

and then it stops.

opsdisk commented 2 years ago

Thanks @Hoculus

FYI, any URLs with google.com in them are thrown out, so if your query contains it, that would be why.
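As a rough illustration of that filtering, here is a minimal sketch. The helper name `drop_google_urls` and the hostname check are assumptions made for this example and are not copied from the yagooglesearch source, which may match URLs differently.

```python
from urllib.parse import urlparse


def drop_google_urls(urls):
    """Hypothetical helper: keep only URLs whose host does not contain google.com."""
    kept = []
    for url in urls:
        # Compare against the host only, so a path that merely mentions
        # google.com does not cause the URL to be discarded in this sketch.
        host = urlparse(url).netloc.lower()
        if "google.com" in host:
            continue
        kept.append(url)
    return kept


candidates = [
    "https://twitter.com/faojobs",
    "https://www.google.com/search?q=site%3Atwitter.com",
    "https://support.google.com/websearch",
]
filtered = drop_google_urls(candidates)
```

With the three sample URLs above, only the twitter.com entry survives, which is why a query that targets Google-hosted pages could look like it returns nothing.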

It's possible there isn't another page. Currently, the logic to determine that hasn't been implemented (the HTML class names are randomized, so it's hard to pick out with the BeautifulSoup HTML parser) so it has to browse to a page and determine there are no results.

When you run it, there should be a Requesting URL: INFO line near the beginning of the output (see screenshot below).

For example...within an ipython shell...

import yagooglesearch

query = "site:github.com"

client = yagooglesearch.SearchClient(
    query,
    lang="it",
    verbosity=4,
    tbs="li:1",
    verify_ssl=False,
    num=10,
    max_search_result_urls_to_return=100,
    minimum_delay_between_paged_results_in_seconds=3,
    yagooglesearch_manages_http_429s=False,
)
client.assign_random_user_agent()

urls = client.search()

len(urls)

image

If you browse to that link in a browser and scroll to the bottom, does it show a "2" or "Avanti" signifying another page of search results?

image

NicoloCipiti commented 2 years ago

Look at this, for example:

2022-02-11 23:16:03,202 [MainThread  ] [INFO] Requesting URL: https://www.google.com/
2022-02-11 23:16:05,295 [MainThread  ] [WARNING] Looks like your IP address is sourcing from a European Union location...your search results may vary, but I'll try and work around this by updating the cookie.
2022-02-11 23:16:05,295 [MainThread  ] [INFO] Updating cookie to: {'CONSENT': 'YES+shp.gws-20211108-0-RC1.fr+F+027'}
2022-02-11 23:16:05,305 [MainThread  ] [INFO] Stats: start=0, num=10, total_valid_links_found=0 / max_search_result_urls_to_return=1000
2022-02-11 23:16:05,306 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=site%3Atwitter.com&btnG=Google+Search&tbs=0&safe=off&cr=&filter=0
2022-02-11 23:16:06,217 [MainThread  ] [INFO] Found unique URL #1: https://twitter.com/teachthought
2022-02-11 23:16:06,217 [MainThread  ] [INFO] Found unique URL #2: https://twitter.com/communia_eu
2022-02-11 23:16:06,217 [MainThread  ] [INFO] Found unique URL #3: https://twitter.com/greenpeace_pl
2022-02-11 23:16:06,217 [MainThread  ] [INFO] Found unique URL #4: https://twitter.com/faojobs
2022-02-11 23:16:06,225 [MainThread  ] [INFO] Found unique URL #5: https://twitter.com/businessbecause
2022-02-11 23:16:06,225 [MainThread  ] [INFO] Found unique URL #6: https://twitter.com/osticket
2022-02-11 23:16:06,226 [MainThread  ] [INFO] Found unique URL #7: https://twitter.com/themarkfdn
2022-02-11 23:16:06,226 [MainThread  ] [INFO] Found unique URL #8: https://twitter.com/yoasobi_staff
2022-02-11 23:16:06,226 [MainThread  ] [INFO] Found unique URL #9: https://twitter.com/israelacademy
2022-02-11 23:16:06,227 [MainThread  ] [INFO] Found unique URL #10: https://twitter.com/unsgsa
2022-02-11 23:16:06,228 [MainThread  ] [INFO] Sleeping 2 seconds until retrieving the next page of results...
2022-02-11 23:16:08,248 [MainThread  ] [INFO] Stats: start=10, num=10, total_valid_links_found=10 / max_search_result_urls_to_return=1000
2022-02-11 23:16:08,248 [MainThread  ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=site%3Atwitter.com&start=10&tbs=0&safe=off&cr=&filter=0
2022-02-11 23:16:09,092 [MainThread  ] [INFO] No valid search results found on this page.  Moving on...

This is the output for the second example you provide in the README.

I actually use start=x as a workaround, so this issue doesn't really matter to me. I'm just curious why it isn't working in my case.
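For reference, the paged URLs in the log differ only in the `start` offset. The sketch below rebuilds URLs of that shape with the standard library. The function name `paged_search_urls` is hypothetical, the parameter set is copied from the log line for the second page, and always sending `start` (even for the first page) is an assumption. It does not call yagooglesearch.

```python
from urllib.parse import urlencode


def paged_search_urls(query, pages, num=10):
    """Hypothetical helper: build one search URL per page, stepping start by num."""
    urls = []
    for page in range(pages):
        # Parameter names and order mirror the second "Requesting URL" log line.
        params = {
            "hl": "en",
            "q": query,
            "start": page * num,
            "tbs": "0",
            "safe": "off",
            "cr": "",
            "filter": "0",
        }
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls


urls = paged_search_urls("site:twitter.com", pages=3)
```

Requesting each offset explicitly is one way to read the start=x workaround: a page that returns nothing does not prevent the next offset from being tried.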

opsdisk commented 2 years ago

Yeah, that is odd. There was an issue (https://github.com/opsdisk/yagooglesearch/issues/5) a few months back about Google not liking traffic sourcing from EU countries, which yagooglesearch detects ("Looks like your IP address is sourcing from a European Union location"). Would be curious if you got the same results sourcing your IP from a non-EU country?

One thing that comes to mind is that this value (https://github.com/opsdisk/yagooglesearch/blob/master/yagooglesearch/__init__.py#L361) is no longer "valid" for paged Google searches. I grabbed it from a browser a few months back. You'll notice there's a date and country code ("fr") that worked at the time. You could try swapping that with what is found in your browser (using your browser's DevTools) when you browse to google.com and have to "Accept" their search terms.

Anecdotally, I've seen my own search results vary across different searches and I don't know why. I chalk it up to Google not being as consistent as I'd hope.

One last thing, you mind letting me know what version you're running? It's the __version__ string found in __init__.py

https://github.com/opsdisk/yagooglesearch/blob/master/yagooglesearch/__init__.py#L15

NicoloCipiti commented 2 years ago

I'll try a non-EU proxy ASAP. Meanwhile, here's the cookie payload when the cookie banner is in Italian:

consentCookiePayload='YES+shp.gws-20220209-0-RC2.it+FX+020'

Changing the language in the banner leads to a different payload, with the country code in the string itself, so maybe the solution is to always switch the language to a known one.
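To compare payloads like the two seen in this thread, a small parser can help. This is only a sketch: the CONSENT format is not documented here, so the field names (`answer`, `build`, `date`, `lang`) are guesses based on the two example strings, and `split_consent_payload` is a hypothetical helper.

```python
def split_consent_payload(payload):
    """Hypothetical helper: split a CONSENT payload into its visible parts.

    Assumes the shape "<answer>+<build>.<lang>+<flag>+<number>", inferred
    only from the two payloads quoted in this thread.
    """
    answer, middle, flag, number = payload.split("+")
    # The two-letter code follows the last dot of the middle segment.
    build, lang = middle.rsplit(".", 1)
    # The build looks like "shp.gws-YYYYMMDD-0-RCn"; the date is the second dash field.
    date = build.split("-")[1]
    return {
        "answer": answer,
        "build": build,
        "date": date,
        "lang": lang,
        "flag": flag,
        "number": number,
    }


library_default = split_consent_payload("YES+shp.gws-20211108-0-RC1.fr+F+027")
italian_banner = split_consent_payload("YES+shp.gws-20220209-0-RC2.it+FX+020")

# Cookie dict in the same shape as the "Updating cookie to:" log line.
cookie = {"CONSENT": "YES+shp.gws-20220209-0-RC2.it+FX+020"}
```

The two payloads differ in more than the two-letter code (the date, RC number, and trailing fields also change), so copying the whole string from the browser is probably safer than editing the code alone.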

Here's the version I'm using right now:

__version__ = "1.6.0"

opsdisk commented 2 years ago

Good to see that the version is the most recent.

If you're comfortable doing it, you could modify the cookie here and see if that helps. If you're using a virtual environment, it will be in a path similar to .venv/lib/python3.7/site-packages/yagooglesearch-1.6.0-py3.7.egg/yagooglesearch/__init__.py
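If the exact path is hard to find, Python can report where an installed package lives. A minimal sketch, assuming the package is importable in the active environment. The helper name `installed_source_path` is made up for this example, and it is demonstrated with the standard-library `json` package so it runs anywhere.

```python
import importlib.util


def installed_source_path(package_name):
    """Hypothetical helper: return the file a package would be imported from, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        # Not installed in this environment, or no single source file to edit.
        return None
    return spec.origin


# Demonstrated with a standard-library package; pass "yagooglesearch" instead
# inside the environment where it is installed.
example_path = installed_source_path("json")
missing_path = installed_source_path("no_such_package_for_this_example")
```

For a regular package the result points at its `__init__.py`, which is the file mentioned above.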

I'm doubtful, but if that does the trick, I'll look at modifying it on the fly.

opsdisk commented 2 years ago

Hi @Hoculus - any updates with this?

NicoloCipiti commented 2 years ago

Hi @opsdisk, it actually solved the problem! Thanks.

truskk commented 6 months ago

How exactly did you fix the issue? I've tried changing the consent cookie country code and it still doesn't work.

opsdisk commented 6 months ago

Hey @truskk - you mind opening a new issue with some more details of what you're trying to do and the results you're seeing?