lodenrogue closed 3 years ago
Uh oh, it looks like I'm bumping up against the News API's rate limit. I'm not sure what I can do about this besides migrating to a different underlying news API. I'll take a look.
I tried looking for another news API. Seems like they are all on the pay-per-use model. I'm working on a news aggregator library. I have Hacker News done. Thinking about adding others.
I added:
ABC news extraction: https://github.com/lodenrogue/abc-news
CNN news extraction: https://github.com/lodenrogue/cnn-news
BBC news extraction: https://github.com/lodenrogue/bbc-news
NPR news extraction: https://github.com/lodenrogue/npr-news
This extraction happens on your machine so no API token or rate limits to worry about. Let me know if you are interested in this functionality. I can add more news sources and provide a single API for all news extraction so you don't have to call each library independently.
Update: I wrote a service that generates about 95% of the code for news aggregation so now it only takes about 15-20 minutes to complete aggregation for a news source.
Do you know what endpoints you're using from NewsAPI? I can try to recreate those in a single service for you.
Hmm this is really interesting, I'm using the NewsApi package on npm, which is just a thin wrapper around the news api's /v2/everything and /v2/top-headlines endpoints. Can you give me a quick tldr about your packages? Are you scraping directly from the site (hence no API token needed)?
Yes. I'm scraping directly with BeautifulSoup4 and Newspaper3k, which does NLP to detect article text. That means no API token or rate limits.
My libraries return the top n requested articles per news source.
If you can tell me what the difference is between the /everything and /top-headlines I can reproduce that functionality for you in a single library.
Edit: Ok. I did some research. Seems the /everything is for search queries and /top-headlines just returns the breaking headlines. I can definitely replicate /top-headlines functionality. /everything can probably be done with keyword search but that's a little more advanced than what I have right now.
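To make the distinction concrete, here is a minimal sketch of how the two behaviours could be approximated over articles that have already been extracted. The `Article` record and the function names are hypothetical, not the actual API of the extraction libraries, and the keyword match is a plain substring test rather than real search.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; the real libraries may return a different shape.
    title: str
    text: str
    source: str


def top_headlines(articles, n):
    """Return the first n articles, assuming the list is already ordered by prominence."""
    return articles[:n]


def everything(articles, query):
    """Return articles whose title or text contains every keyword in the query.

    Matching is case-insensitive substring matching, a rough stand-in for search.
    """
    keywords = query.lower().split()
    return [
        a for a in articles
        if all(k in (a.title + " " + a.text).lower() for k in keywords)
    ]
```

Something like `everything(articles, "rate limit")` would then keep only articles mentioning both words, while `top_headlines(articles, 5)` just slices the list.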
Question for you: What percentage of the requests you get are for /everything vs /top-headlines? If we can replace a big chunk of those calls it would reduce the probability of hitting your rate limit on /everything. At least until I can replicate the functionality for that endpoint.
Consolidated news extraction services into a single project and added documentation on how to get headlines.
Wait a minute, your project is in Python. What did you have in mind? I see several options here, but I would not want to do any cross language calls within this project.
My project is intended to have as low a barrier to entry for the end user as possible, only requiring a curl request to an endpoint. Even without requiring an API token, there is also the chance in the future that my box hosting this project could be IP banned by the end services for sending thousands of requests to them.
I think it might actually be better for your service to have its own standalone CLI. See my other project doclt as an example. This CLI needs to be installed and is intended to be distributed via npm install -g, but it runs entirely on the user's machine rather than a web server.
IP Banning is a valid concern.
I was thinking you would place and call this service on your machine from your current application (command line or http). My service can cache results and serve those to you from the cache. Maybe refreshing every x hours to see if new articles are available.
The whole idea behind this is that the news aggregation happens on your machine. So you're not dependent on some external API which charges for their services and/or limits your usage.
Caching would actually be more efficient for my service and remove the concern of IP Banning. I could add that.
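A rough sketch of the caching idea, assuming a hypothetical `fetch` callable that scrapes one news source: results are kept in memory and only re-fetched once they are older than a fixed time-to-live. The class name, the `clock` parameter, and the in-memory dict are all illustrative choices, not how the service is actually built.

```python
import time


class TTLCache:
    """Cache the result of a fetch function and refresh it after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable: source name -> articles
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            # Still fresh: serve from cache, no request to the news site.
            return entry[1]
        # Missing or expired: fetch once and remember the result.
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a TTL of a few hours, each source would be hit at most once per window no matter how many requests come in, which is what takes the pressure off the IP banning concern.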
I really love your getnews.tech service because I can use it with a curl call. I would love to be able to solve this rate limit issue for you. Are you interested in me adding caching to my news service and helping you integrate it into your code?
Edit: Just to clarify, my service is its own standalone application. You would just install it on your machine and call it via CLI or http.
I work full time, so I can only commit weekend time to working on this, but I'm down to migrate away from NewsAPI to get around this rate limiting issue.
Let's consider the options then:
1) getnews.tech makes a call to your news service, hosted elsewhere, which caches results and serves them from cache.
2) I also host the news service locally on the same box as getnews.tech and send calls to it.
3) Change the news aggregation service to a library instead, which caches results to something like redis (which I am currently using for url shortening).
Let me know what you think, open to your ideas and thoughts!
Sounds good. Let me look into packaging for npm and adding redis caching.
Caching is done. Now I have to see about npm. Not sure I want to package the project directly to npm since that functionality is just for this use case.
If anything I'll make a new project just for npm (w)rapping. I am a fan of hip hop so how hard can it be?
Making a new project SGTM. The logic doesn't sound too hard to port. Link it to me and I'll see what I can do to help.
Just so I'm clear. Is the plan creating a new project and implementing the functionality in javascript or is there some way to wrap the python code in an npm module?
Making a new project with this logic in js so that it can be packaged and distributed via npm sounds the most modular to me.
I don't think it's a good idea to wrap python code with a js interface because clients would need to explicitly depend on python as well.
That makes sense. The issue would be finding a library like newspaper3k that does NLP detection to determine what is article body vs everything else on the page.
Let me take a few days to think about this and do some research.
Do you know which version of node you're using?
I haven't found a library that is high enough quality to detect just the article text in javascript but I have another idea. I was thinking, most of these news websites have an rss feed. I could read through there and grab the info we need.
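For illustration, reading headlines out of an RSS 2.0 feed needs nothing beyond the Python standard library. This is only a sketch: `parse_rss` is a hypothetical helper, the sample feed is made up, and real feeds may use Atom or extra namespaces that this does not handle.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text, limit=None):
    """Extract title, link, and description from the <item> elements of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when a child element is missing, so fall back to "".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items if limit is None else items[:limit]


# Invented sample feed, used only to show the expected input shape.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

for entry in parse_rss(SAMPLE_FEED, limit=2):
    print(entry["title"], entry["link"])
```

The feed already carries the title, link, and summary, so no article-body detection is needed for a headline listing.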
Also, after doing some preliminary tests with caching and multi threading enabled the response times are still a bit slow. Depending on your internet connection it can take a few seconds to connect to all the pages and grab what you need. I was thinking of exposing this as an API and having you access the data from there. I can host the aggregation API on my servers and offer this as a service to others as well. That way I can write this in any language and do any kind of scheduling in the background to get fresh data every N hours without having all of that running on your machine.
I'd like to hear your thoughts on this
> Do you know which version of node you're using?
Just use latest LTS I think, and I'll upgrade to match.
> Also, after doing some preliminary tests with caching and multi threading enabled the response times are still a bit slow. Depending on your internet connection it can take a few seconds to connect to all the pages and grab what you need. I was thinking of exposing this as an API and having you access the data from there. I can host the aggregation API on my servers and offer this as a service to others as well. That way I can write this in any language and do any kind of scheduling in the background to get fresh data every N hours without having all of that running on your machine.
This seems okay if you plan on leaving this up in perpetuity. What is the median response time? I think if it's <5s, packaging your code as a library is still acceptable to reduce the number of moving parts. I am okay with having getnews.tech have that amount of latency to query news.
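One way to get that number, sketched under the assumption that the aggregation can be invoked as a plain Python callable (the `call` argument below is a placeholder for whatever actually fetches the news):

```python
import statistics
import time


def median_latency(call, runs=5):
    """Call `call()` `runs` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)
```

Comparing the returned value against the 5 second threshold would settle whether the library route is acceptable. The median is less sensitive than the mean to a single slow page load.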
Upon further exploration of RSS feeds I found a tool that delivers news feeds to my terminal. I'm going to depart from further work on this solution because I've solved my problem. Let me know if you still need the code solution and I can try to work on it at a later time.
Getting the following error when calling service: