ranahaani / GNews

A Happy and lightweight Python Package that Provides an API to search for articles on Google News and returns a JSON response.
https://pypi.org/project/gnews/
MIT License
708 stars 106 forks

The number of crawled paper is very small #88

Open Ricksanchez000 opened 6 months ago

Ricksanchez000 commented 6 months ago

Hi, could someone please help me with this:

I used GNews to crawl news about the "Red Sea Crisis" from 2023-10-01 to 2024-03-10, but only got about 80 articles. When I search the same keyword in Factiva for the same period, it returns about 3,000 articles. I am doing NLP analysis, so the volume of articles is essential.

Is the number of articles being limited by GNews, or does Google News simply not have that many articles?

MonikaBarget commented 5 months ago

We have the same issue. I previously tried scraping Google News with different code and got a maximum of 100 results per query. It seems we need pagination, but I am not sure how to implement it here. One option would be to work with the start and end dates, stepping through very small time windows to collect more results for consecutive days.
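A minimal sketch of the date-window idea, in case it helps. The helper names (`daily_windows`, `collect`, `gnews_fetch`) are made up for illustration. The `GNews(...)` arguments (`language`, `max_results`, `start_date`, `end_date` as `(year, month, day)` tuples) and the `url` key on each result reflect my understanding of the package, so please check them against the version you have installed.

```python
from __future__ import annotations

from datetime import date, timedelta
from typing import Callable, Iterator


def daily_windows(start: date, end: date, days: int = 1) -> Iterator[tuple[date, date]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    step = timedelta(days=days)
    current = start
    while current < end:
        yield current, min(current + step, end)
        current += step


def collect(fetch: Callable[[date, date], list[dict]], start: date, end: date) -> list[dict]:
    """Call fetch once per window and keep each article URL only once."""
    seen: set[str] = set()
    articles: list[dict] = []
    for window_start, window_end in daily_windows(start, end):
        for article in fetch(window_start, window_end):
            url = article.get("url")
            if url and url not in seen:
                seen.add(url)
                articles.append(article)
    return articles


def gnews_fetch(keyword: str) -> Callable[[date, date], list[dict]]:
    """Build a fetch function backed by GNews (constructor arguments are assumptions)."""
    from gnews import GNews  # imported lazily so the helpers above work without it

    def fetch(window_start: date, window_end: date) -> list[dict]:
        client = GNews(
            language="en",
            max_results=100,
            start_date=(window_start.year, window_start.month, window_start.day),
            end_date=(window_end.year, window_end.month, window_end.day),
        )
        # Treat an empty or missing response as "no articles for this window".
        return client.get_news(keyword) or []

    return fetch
```

Something like `collect(gnews_fetch("Red Sea Crisis"), date(2023, 10, 1), date(2024, 3, 10))` would then issue one query per day and merge the results. Each window is still subject to whatever per-query cap applies, so this raises the ceiling to roughly the cap times the number of days, not more.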

MonikaBarget commented 5 months ago

This is a related issue suggesting some workarounds: https://github.com/ranahaani/GNews/issues/31

omswack commented 2 weeks ago

I think the max I can get is 100 a day. Can anyone do better?

MonikaBarget commented 2 weeks ago

No, we tried crawling by the hour, but we did not get any additional results. All posts published on the same day seem to carry the same timestamp, so you get the same 100 results wherever you start. In the end, we adjusted the research question a bit to work with 100 results per day, but crawl through several months of data.

omswack commented 2 weeks ago

I think I managed to get a few more by iterating every hour, but this is okay for now. Thanks for the response. However, I think the warning 'must be 1 day apart else no results will return' should be changed, because you can get results by querying from hour to hour. That said, in my experience iterating by hour doesn't work 100% of the time.
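For anyone who wants to check whether hourly iteration adds anything for their query, here is a rough sketch that splits one day into hourly windows and counts how many previously unseen URLs each window contributes. `hourly_windows` and `new_per_window` are hypothetical helpers, and `fetch` is any function you supply that takes a start and end `datetime` and returns a list of article dicts with a `url` key. Whether GNews honours sub-day `datetime` bounds is exactly what is uncertain in this thread, so treat that part as an assumption.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Callable, Iterator


def hourly_windows(day: datetime, hours: int = 1) -> Iterator[tuple[datetime, datetime]]:
    """Yield (start, end) datetime pairs that split one calendar day into slices."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    end_of_day = start + timedelta(days=1)
    step = timedelta(hours=hours)
    while start < end_of_day:
        yield start, min(start + step, end_of_day)
        start += step


def new_per_window(
    fetch: Callable[[datetime, datetime], list[dict]], day: datetime
) -> list[int]:
    """Count how many previously unseen URLs each hourly window contributes."""
    seen: set[str] = set()
    counts: list[int] = []
    for window_start, window_end in hourly_windows(day):
        urls = {a["url"] for a in fetch(window_start, window_end) if a.get("url")}
        counts.append(len(urls - seen))
        seen |= urls
    return counts
```

If every count after the first is zero, the hourly queries are returning the same set as the daily one. If some later windows contribute new URLs, hourly iteration is helping for that day.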