What
Create a news scraper for CoinDesk, CoinTelegraph and CCN. We can use NewsAPI for CCN.
Investigate which vector database to use. It needs to support queries by both time range and content.
Save the scraped news to the vector database.
Create a demo script for RAG-style queries against the database.
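To make the "query by time range and content" requirement concrete, here is a minimal in-memory sketch. It is not a proposal for the actual database: `NewsIndex`, `Article`, and `embed` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a real embedding model, and the sample articles are made up.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Article:
    title: str
    body: str
    published: datetime


def embed(text):
    # Toy stand-in for a real embedding model: lowercase word counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class NewsIndex:
    """Hypothetical stand-in for the vector database."""

    def __init__(self):
        self._rows = []  # list of (Article, vector) pairs

    def add(self, article):
        self._rows.append((article, embed(article.title + " " + article.body)))

    def query(self, text, start, end, k=3):
        q = embed(text)
        # Filter on the time range first, then rank the survivors by similarity.
        hits = [
            (cosine(q, vec), art)
            for art, vec in self._rows
            if start <= art.published < end
        ]
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return [art for score, art in hits[:k] if score > 0]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up sample articles, for illustration only.
index = NewsIndex()
index.add(Article("Bitcoin ETF inflows rise", "Spot bitcoin funds saw inflows", utc(2024, 1, 10)))
index.add(Article("Ethereum upgrade scheduled", "Developers set a date", utc(2024, 1, 12)))
index.add(Article("Bitcoin miners expand", "Mining capacity grows", utc(2023, 6, 1)))

# Only articles from January 2024 that are similar to the query text.
results = index.query("bitcoin", utc(2024, 1, 1), utc(2024, 2, 1))
for art in results:
    print(art.published.date(), art.title)
```

Whatever database we pick should be able to do the time filter as metadata filtering next to the similarity search, rather than scanning every row as this sketch does.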
Why
The current demo code can search news via NewsAPI, but NewsAPI has only one crypto news provider (CCN). It also doesn't allow the fine-grained semantic and time-based search that a vector DB allows.
I looked into the robots.txt and site HTML of CoinDesk and CoinTelegraph today, and both of them allow scraping their news articles. I also remember seeing an ML researcher talk about more modern / faster web scraping methods than the traditional Scrapy / BeautifulSoup approach, but I still need to find that.
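The robots.txt check could also be automated in the scraper rather than done by hand. A minimal sketch using Python's standard `urllib.robotparser`, assuming a made-up robots.txt (this is not the real file of CoinDesk, CoinTelegraph, or any other site), with a hypothetical `can_scrape` helper and `news-scraper` user agent:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; not the real file of any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def can_scrape(robots_txt, url, user_agent="news-scraper"):
    # Parse the robots.txt text and ask whether this user agent may fetch the URL.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article_allowed = can_scrape(SAMPLE_ROBOTS_TXT, "https://example.com/markets/some-article")
admin_allowed = can_scrape(SAMPLE_ROBOTS_TXT, "https://example.com/admin/login")
print(article_allowed, admin_allowed)
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the file over the network instead of calling `parse` on a local string.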
The scraping bot should live in a separate repository from the Hummingbot AI repository. I intend to keep its code separate from the chat agent and turn the news indexing service into a paid API service.