ranahaani / GNews

A Happy and lightweight Python Package that Provides an API to search for articles on Google News and returns a JSON response.
https://pypi.org/project/gnews/
MIT License

Can't get full text of articles for the last few days #101

Open vincenzon opened 1 month ago

vincenzon commented 1 month ago

I have an automated process that searches for and downloads articles every few hours. As of July 19th, 2024 it stopped getting the article text. I traced a few examples, and the problem appears to be here:

https://github.com/ranahaani/GNews/blob/a322163a40a0db2294b68ab50b1a6243fb69d2d4/gnews/utils/utils.py#L25C15-L25C62

The Google News URL is supposed to be dereferenced to the original source URL, but that is no longer happening. If I manually decode the Google URL to the original source URL, things work as expected.

I'm unsure whether a change on Google's side or on mine broke this. For now I am inserting a base64 decode of the Google link into my processing pipeline. If there is a cleaner or more permanent fix, I'd like to hear it.

T3z3nis commented 1 month ago

I have the same problem. How do you decode the link?

vincenzon commented 1 month ago

I found this, I think via Stack Overflow:

import base64
from urllib.parse import urlparse

def decode_google_news_url(source_url):
    """Decode an old-style news.google.com/rss/articles/... link back to
    the original source URL. The path component is a URL-safe base64 blob
    that wraps the target URL in a few framing bytes."""
    url = urlparse(source_url)
    path = url.path.split('/')
    if (
        url.hostname == "news.google.com" and
        len(path) > 1 and
        path[-2] == "articles"
    ):
        base64_str = path[-1]
        # Pad and decode; latin1 keeps a 1:1 byte-to-character mapping.
        decoded_bytes = base64.urlsafe_b64decode(base64_str + '==')
        decoded_str = decoded_bytes.decode('latin1')

        # Strip the fixed framing bytes around the embedded URL.
        prefix = bytes([0x08, 0x13, 0x22]).decode('latin1')
        if decoded_str.startswith(prefix):
            decoded_str = decoded_str[len(prefix):]

        suffix = bytes([0xd2, 0x01, 0x00]).decode('latin1')
        if decoded_str.endswith(suffix):
            decoded_str = decoded_str[:-len(suffix)]

        # The leading byte(s) encode the length of the URL that follows:
        # one byte if the value is < 0x80, two bytes otherwise.
        bytes_array = bytearray(decoded_str, 'latin1')
        length = bytes_array[0]
        if length >= 0x80:
            decoded_str = decoded_str[2:length + 1]
        else:
            decoded_str = decoded_str[1:length + 1]

        return decoded_str
    else:
        return source_url

# n is one item from a GNews result list, e.g. google_news.get_news(...)[0]
url = decode_google_news_url(n['url'])

caiolivf commented 1 month ago

Same problem here! The article.title output is "Google News".

from gnews import GNews
google_news = GNews()
json_resp = google_news.get_news('Pakistan')
article = google_news.get_full_article(json_resp[0]['url'])  # a newspaper3k Article; all newspaper3k attributes are available on it
article.title

# Google News
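
For what it's worth, wiring @vincenzon's decoder in front of get_full_article appears to restore the old behavior for old-format links. A sketch, reusing decode_google_news_url from above:

from gnews import GNews

google_news = GNews()
json_resp = google_news.get_news('Pakistan')

# Decode the Google News redirect link first, then fetch the article.
real_url = decode_google_news_url(json_resp[0]['url'])
article = google_news.get_full_article(real_url)
print(article.title)  # should now be the real headline, not "Google News"
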
Isaaq-Khader commented 1 month ago

I also have the same problem with the articles. I tried running the base64 decoder mentioned here, but it returns what looks like a random series of characters. I'm curious whether Google changed the article link format in a way that breaks the decoding; either way, this looks like a decoding failure that falls back to no output and the title "Google News".

Example:

    source_url = 'https://news.google.com/rss/articles/CBMiWkFVX3lxTE80Y0I5WjZtTlBBcTJYM2hVTkN1R0oxd0JLQk9tUHFlV3pKRVFsZzk2RnRETnd5RmJuOVdQTlM5VG1tYlQyMmNvenpRN0FNcndZdm4xdnJ3Qk90UdIBX0FVX3lxTE05ekJYblBXZkJGQ0gwRGgwaXcyeDJkMERSRWxvTkpfN09YNDdiX295N3g2UlVBbFhUUkoxVzVKeU12Nk1yQ1o5UThMTnJhOTZ0T1FWV19ta1p6SnBZajlr?oc=5&hl=en-US&gl=US&ceid=US:en'
    print(decode_google_news_url(source_url))

Output: AU_yqLO4cB9Z6mNPAq2X3hUNCuGJ1wBKBOmPqeWzJEQlg96FtDNwyFbn9WPNS9TmmbT22cozzQ7AMrwYvn1vrwBOtQ

When it should link to: https://www.bbc.com/news/articles/ce58p0048r0o
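Judging by the outputs above, the new-style IDs base64-decode to a payload whose embedded string starts with "AU_yqL" rather than a URL. A quick heuristic like the following can flag links the old decoder cannot handle; the prefix check is an assumption drawn only from the examples in this thread:

import base64
from urllib.parse import urlparse

def is_new_format_gnews_url(source_url):
    """Heuristic: new-style article IDs decode to a blob containing
    'AU_yqL' instead of a plain URL, so the old decoder returns junk."""
    path = urlparse(source_url).path.split('/')
    if len(path) < 2 or path[-2] != 'articles':
        return False
    decoded = base64.urlsafe_b64decode(path[-1] + '==')
    return b'AU_yqL' in decoded[:16]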

phytal commented 1 month ago

Having the same decoding issue :(

ranahaani commented 1 month ago

@vincenzon Thanks for this. Can you please create a PR for this patch?

sif-gondy commented 1 month ago

The fix from @vincenzon initially worked for a couple of days, but now I'm getting the same output as @Isaaq-Khader for the links.

bckenstler commented 1 month ago

Same!

sif-gondy commented 1 month ago

Possible workaround found here: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=4500912

Source code: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=5132769#gistcomment-5132769
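
For convenience, here is a condensed sketch of what that gist does, as I read it: fetch the article's interstitial page to pick up a signature and timestamp, then replay Google's internal "garturlreq" RPC against the batchexecute endpoint. The endpoint, the Fbv4je RPC id, the payload shape, and the data-n-a-sg / data-n-a-ts attributes are all taken from the gist; they are undocumented internals and may change without notice.

import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote, urlparse

def get_decoding_params(article_id):
    # The interstitial page embeds the signature and timestamp that the
    # RPC below requires (attribute names per the gist).
    page = requests.get(f"https://news.google.com/articles/{article_id}")
    page.raise_for_status()
    div = BeautifulSoup(page.text, "html.parser").select_one("c-wiz > div")
    return div["data-n-a-sg"], div["data-n-a-ts"]

def decode_gnews_url_v2(source_url):
    article_id = urlparse(source_url).path.split("/")[-1]
    signature, timestamp = get_decoding_params(article_id)
    # Replay the "garturlreq" RPC; the opaque array is copied verbatim
    # from the gist.
    rpc = [
        "Fbv4je",
        f'["garturlreq",[["X","X",["X","X"],null,null,1,1,"US:en",null,1,'
        f'null,null,null,null,null,0,1],"X","X",1,[1,1,1],1,1,null,0,0,'
        f'null,0],"{article_id}",{timestamp},"{signature}"]',
    ]
    response = requests.post(
        "https://news.google.com/_/DotsSplashUi/data/batchexecute",
        headers={"content-type": "application/x-www-form-urlencoded;charset=UTF-8"},
        data=f"f.req={quote(json.dumps([[rpc]]))}",
    )
    response.raise_for_status()
    # The response is framed; the second chunk is JSON, and the decoded
    # URL sits at index 1 of the doubly nested payload.
    payload = json.loads(response.text.split("\n\n")[1])
    return json.loads(payload[0][2])[1]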

xiyuanHou commented 1 month ago

You can try the decoder function from https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=5132769#gistcomment-5132769

It works for me.

Isaaq-Khader commented 1 month ago

That worked for me! I have it in my code now and it allows me to fetch the articles as before. Hopefully, this is a nice, permanent fix. Thank you guys for sharing :)

TomoyaKuroda commented 1 month ago

Great! Can someone make a pull request for this issue?

jun0-ds commented 1 month ago

I thought Google had blocked the base64 decoding, so I solved it another way.

I get the original URL by using Selenium's current_url:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

def get_driver():
    # Headless Chrome; webdriver_manager fetches a matching chromedriver.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    return driver

def url_changes(old_url):
    # Predicate for WebDriverWait: true once the browser has been
    # redirected away from the Google News interstitial.
    def predicate(driver):
        return driver.current_url != old_url
    return predicate

def original_url_selenium(url: str, driver) -> str:
    driver.get(url)
    # Wait up to 5 seconds for the redirect to the publisher's site.
    WebDriverWait(driver, 5).until(url_changes(url))
    return driver.current_url

driver = get_driver()
url = "https://news.google.com/rss/articles/CBMifEFVX3lxTE9OVFhuaUsxTTFkZ3J6RURXNHRfOVRfMUp4aUQyV19FbXFITTRhQlQyTG9yd2lwb2lTWUY5cGU3YnV2R0JfbEVUZGhRWDN3cVluYTR1eFNRNDhzeUVJUHJZc196Zkxmb0t5U05veURCaDNoc0dMYXlTcWRzWng?oc=5&hl=en-US&gl=US&ceid=US:en"

original_url = original_url_selenium(url, driver)
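
One small usage note on the Selenium route: the headless Chrome process should be shut down when you are done, e.g.:

try:
    original_url = original_url_selenium(url, driver)
finally:
    driver.quit()  # release the headless Chrome process
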
edmundman commented 6 days ago

It seems like this is happening again.

sif-gondy commented 5 days ago

A new solution for the decoding is available here: I tested it and it seems to solve the issue.

neeley-pate commented 4 days ago

How do you resolve the HTTP 429 (rate-limit) errors when decoding the URLs?
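
One common mitigation is to throttle requests and retry with exponential backoff on 429. A minimal sketch, assuming the requests-based decode_gnews_url_v2 from the gist sketch above:

import time
import requests

def decode_with_backoff(source_url, max_retries=5):
    # Retry on HTTP 429 with exponential backoff; any other HTTP error
    # is re-raised immediately.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return decode_gnews_url_v2(source_url)
        except requests.HTTPError as exc:
            if exc.response is not None and exc.response.status_code == 429:
                time.sleep(delay)  # back off before the next attempt
                delay *= 2
            else:
                raise
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")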