ranahaani / GNews

A Happy and lightweight Python Package that Provides an API to search for articles on Google News and returns a JSON response.
https://pypi.org/project/gnews/
MIT License
745 stars 107 forks

Google News URL format update #39

Closed (sif-gondy closed 2 years ago)

sif-gondy commented 2 years ago

Hello,

Thanks for providing this piece of code.

I have recently come across weird behavior with the period parameter: with e.g. 7d, you can get news from weeks prior. More importantly, the number of articles returned has dropped dramatically recently when combining countries and languages, or even when just providing a language and leaving the country parameter as None (for English).

Setting the language parameter to any other language (e.g. French, 'fr') systematically returns 0 articles, even for popular searches.

I suspect Google has changed their URL format and/or the available countries/languages!

wpdevs commented 2 years ago

I was actually just wrestling with this over the weekend with the when= parameter: I was getting stuff from months ago with when = '3d'.

In my browser, running the same query on their site itself (in the U.S.), when= worked fine.

sif-gondy commented 2 years ago

Yes, on closer look it's definitely the time_query formatting with the period parameter:

    def _ceid(self):
        time_query = ''
        if self._start_date or self._end_date:
            if inspect.stack()[2][3] != 'get_news':
                warnings.warn(message=("Only searches using the function get_news support date ranges. Review the "
                                       f"documentation for {inspect.stack()[2][3]} for a partial workaround. \nStart "
                                       "date and end date will be ignored"), category=UserWarning, stacklevel=4)
                if self._period:
                    time_query += 'when%3A'.format(self._period)   #<---- maybe the "when%3A" formatting with the period string
            if self._period:
                warnings.warn(message=f'\nPeriod ({self.period}) will be ignored in favour of the start and end dates',
                              category=UserWarning, stacklevel=4)
            if self.end_date is not None:
                time_query += '%20before%3A{}'.format(self.end_date) 
            if self.start_date is not None:
                time_query += '%20after%3A{}'.format(self.start_date)
        elif self._period:
            time_query += 'when%3A'.format(self._period)

        return time_query + '&ceid={}:{}&hl={}-{}&gl={}'.format(self._country,
                                                                self._language,
                                                                self._language,
                                                                self._country,
                                                                self._country)

It works perfectly if I provide language='fr' and country='FR' and omit a period. So you can disregard my comment about Google updating their available countries/languages.
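
For reference, the `%3A` and `%20` fragments in `_ceid` are just the percent-encoded forms of `:` and a space, so the intended search terms are `when:<period>`, ` before:<date>` and ` after:<date>`. A minimal sketch with the standard library, where `encode_term` is a hypothetical helper for illustration and not part of GNews:

```python
from urllib.parse import quote


def encode_term(term: str) -> str:
    """Percent-encode a search operator as it would appear in a query string."""
    # Hypothetical helper; GNews hard-codes the encoded fragments instead.
    return quote(term, safe="")


print(encode_term("when:7d"))             # when%3A7d
print(encode_term(" before:2022-08-01"))  # %20before%3A2022-08-01
print(encode_term(" after:2022-07-25"))   # %20after%3A2022-07-25
```

The dates above are made-up sample values, only there to show the encoding.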

wpdevs commented 2 years ago

Oh interesting -- so the time filters will work if language and country are specified?

sif-gondy commented 2 years ago

No. Pretty much every time a time_query is passed, whether period or start_date/end_date, the get_news method returns very few results (or even None for very popular keywords in 5y searches) and/or results outside the desired time range. For example, a simple request:

from gnews import GNews
from pprint import pprint

google_news = GNews(language='en', country="US", period="7d")

news = google_news.get_news('Pakistan')

pprint(news, indent=4)

Returns:

[   {   'description': "Babar Azam: 'Pakistan's lower order falling cheaply "
                       "was disappointing'  ESPNcricinfo",
        'published date': 'Thu, 28 Jul 2022 07:00:00 GMT',
        'publisher': {   'href': 'https://www.espncricinfo.com',
                         'title': 'ESPNcricinfo'},
        'title': "Babar Azam: 'Pakistan's lower order falling cheaply was "
                 "disappointing' - ESPNcricinfo",
        'url': 'https://www.espncricinfo.com/story/sl-vs-pak-2022-2nd-test-babar-azam-says-pakistans-lower-order-falling-cheaply-was-disappointing-1326527'}]

(Note the date is outside the desired range)
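
As a possible stop-gap while the query itself misbehaves, the results could be filtered client-side on the `published date` field, which in the output above looks like an RFC 2822 style timestamp. A rough sketch, assuming every article dict carries that key in that format (`filter_by_period` is a hypothetical helper, not a GNews function, and the `now` value below is an arbitrary example date):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def filter_by_period(articles, days, now=None):
    """Keep only articles published within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    recent = []
    for article in articles:
        published = parsedate_to_datetime(article["published date"])
        # Treat timestamps without an offset as UTC so the comparison is valid.
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if published >= cutoff:
            recent.append(article)
    return recent


# Sample data reusing the timestamp from the output above.
sample = [{"published date": "Thu, 28 Jul 2022 07:00:00 GMT"}]
example_now = datetime(2022, 8, 15, tzinfo=timezone.utc)  # assumed "today"

print(len(filter_by_period(sample, days=7, now=example_now)))   # 0
print(len(filter_by_period(sample, days=30, now=example_now)))  # 1
```

This only hides out-of-range articles; it does not bring back the articles that are missing from the response.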

This happens not only for English (US) but for other languages as well. If you remove every time_query parameter you get the full output, so this bug has to be about a change in the URL formatting of these:

time_query += '%20before%3A{}'.format(self.end_date) 
time_query += '%20after%3A{}'.format(self.start_date)
time_query += 'when%3A'.format(self._period)

Shouldn't this line have curly brackets somewhere for the .format()?:

time_query += 'when%3A'.format(self._period)
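
A quick check in plain Python suggests so: `str.format()` silently ignores positional arguments that have no matching placeholder, so the period never reaches the query string. A minimal sketch, where the `period` variable stands in for `self._period`:

```python
period = "7d"  # stands in for self._period

# No placeholder: the argument is silently dropped.
without_braces = "when%3A".format(period)
print(without_braces)  # when%3A

# With a placeholder the period is appended as intended.
with_braces = "when%3A{}".format(period)
print(with_braces)  # when%3A7d
```

Whether adding the braces alone restores the expected results would still need to be checked against Google News itself.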

wpdevs commented 2 years ago

Hopefully someone can chime in here to confirm @sif-gondy's suspicions.

ranahaani commented 2 years ago

This issue has been fixed; see https://github.com/ranahaani/GNews/pull/41. Thanks, @sif-gondy!

speenoo commented 1 year ago

@ranahaani Hello, period still isn't working for me.