ranahaani / GNews

A happy and lightweight Python package that provides an API to search for articles on Google News and returns a JSON response.
https://pypi.org/project/gnews/
MIT License
745 stars 107 forks

URL no longer working due to cookie consent page #62

Open murnanedaniel opened 1 year ago

murnanedaniel commented 1 year ago

Running

from gnews import GNews
google_news = GNews(period="1d")
results = google_news.get_news("russia")

gives results such as

[{'title': "Russia Is Gaining Influence in Africa, at West's Expense - Foreign Policy",
  'description': "Russia Is Gaining Influence in Africa, at West's Expense  Foreign Policy",
  'published date': 'Sat, 18 Mar 2023 07:00:00 GMT',
  'url': 'https://consent.google.com/m?continue=https://news.google.com/rss/articles/CBMiYmh0dHBzOi8vZm9yZWlnbnBvbGljeS5jb20vMjAyMy8wMy8xOC9ydXNzaWFuLW1lcmNlbmFyaWVzLWFyZS1wdXNoaW5nLWZyYW5jZS1vdXQtb2YtY2VudHJhbC1hZnJpY2Ev0gEA?oc%3D5&gl=DK&m=0&pc=n&cm=2&hl=en-US&src=1',
  'publisher': {'href': 'https://foreignpolicy.com/',
   'title': 'Foreign Policy'}},
...

The URL now leads to Google's cookie consent screen instead of the article:

[screenshot of the cookie consent page]

Is there a way to consent to the cookies somehow?

themetalleg commented 1 year ago

The problem is already described in #53, but I can't get the fix to work.

ranahaani commented 1 year ago

import requests
orig_url = requests.get(google_news.get_news("russia")[0]['url']).url  # get_news() returns a list of dicts

Can you try that?
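
Before making any request, the consent wrapper itself can be removed by reading the `continue` query parameter from the URL. This is only a sketch: `strip_consent_wrapper` is a hypothetical helper, not part of GNews, and the sample URL is shortened with a placeholder article ID. It assumes the consent link keeps the shape shown in this issue, and it recovers the news.google.com link, not the publisher's URL.

```python
from urllib.parse import parse_qs, urlsplit

CONSENT_HOST = "consent.google.com"  # host seen in the URLs reported in this issue


def strip_consent_wrapper(url: str) -> str:
    """Return the URL held in the `continue` parameter of a consent link.

    Hypothetical helper, not part of GNews. URLs that are not consent
    links, or that lack a `continue` parameter, are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname != CONSENT_HOST:
        return url
    # parse_qs percent-decodes the value, so "%3D" becomes "=".
    targets = parse_qs(parts.query).get("continue")
    return targets[0] if targets else url


# Shortened sample: "EXAMPLE" stands in for the long encoded article ID.
wrapped = (
    "https://consent.google.com/m?continue="
    "https://news.google.com/rss/articles/EXAMPLE?oc%3D5&gl=DK&hl=en-US"
)
print(strip_consent_wrapper(wrapped))
```

The remaining parameters (`gl`, `hl`, and so on) belong to the consent page and are dropped.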

izdrail commented 1 year ago

I've done something like this, in case it helps anyone. I found the answer on Stack Overflow:


# Ref: https://stackoverflow.com/a/59023463/

import base64
import functools
import re

_ENCODED_URL_PREFIX = "https://news.google.com/rss/articles/"
_ENCODED_URL_PREFIX_WITH_CONSENT = "https://consent.google.com/m?continue=https://news.google.com/rss/articles/"
# Accept both the consent-wrapped prefix and the plain prefix in a single pattern.
_ENCODED_URL_RE = re.compile(
    fr"^(?:{re.escape(_ENCODED_URL_PREFIX_WITH_CONSENT)}|{re.escape(_ENCODED_URL_PREFIX)})(?P<encoded_url>[^?]+)"
)
# The decoded payload is expected to start with \x08\x13" and to end the URL with \xd2\x01.
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')

@functools.lru_cache(2048)
def _decode_google_news_url(url: str) -> str:
    match = _ENCODED_URL_RE.match(url)
    if match is None:
        return url  # not an encoded Google News link
    encoded_text = match.group("encoded_url")
    encoded_text += "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)
    if match is None:
        return url  # payload layout not recognized; leave the link untouched
    return match.group("primary_url").decode()

def decode_google_news_url(url: str) -> str:
    # Both prefixes are checked so that consent-wrapped links are decoded as well.
    prefixes = (_ENCODED_URL_PREFIX, _ENCODED_URL_PREFIX_WITH_CONSENT)
    return _decode_google_news_url(url) if url.startswith(prefixes) else url
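
To see what the `_DECODED_URL_RE` pattern expects, here is a round trip on a synthetic link. This is a sketch under assumptions: the byte layout (three marker bytes, one length byte, the URL, then `\xd2\x01`) is inferred only from the pattern and the sample URL in this issue, Google does not document the format, and `make_synthetic_link` and `decode_link` are hypothetical names used just for this check.

```python
import base64
import re

PREFIX = "https://news.google.com/rss/articles/"
PAYLOAD_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


def make_synthetic_link(article_url: str) -> str:
    """Build a link with the byte layout the pattern expects (hypothetical helper)."""
    raw = article_url.encode()
    # Marker bytes, one length byte, the URL, then the \xd2\x01 terminator.
    # The trailing \x00 mirrors the sample URL in this issue.
    payload = b'\x08\x13"' + bytes([len(raw)]) + raw + b"\xd2\x01\x00"
    token = base64.urlsafe_b64encode(payload).decode().rstrip("=")
    return PREFIX + token + "?oc=5"


def decode_link(link: str) -> str:
    """Decode a link produced by make_synthetic_link (hypothetical helper)."""
    token = link[len(PREFIX):].split("?", 1)[0]
    # Extra "=" characters are tolerated by the non-strict base64 decoder.
    decoded = base64.urlsafe_b64decode(token + "===")
    match = PAYLOAD_RE.match(decoded)
    if match is None:
        raise ValueError("payload does not have the expected layout")
    return match.group("primary_url").decode()


link = make_synthetic_link("https://example.com/story")
print(decode_link(link))
```

If Google changes the payload layout, the pattern stops matching, so a caller should be ready for the decode step to fail.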