pkweitai / hummingbotAI

Hummingbot AI enablement contributions from community

[feature] Create a news scraper and vector db indexer for CoinDesk, CoinTelegraph and CCN #13

Open martinkou opened 5 months ago

martinkou commented 5 months ago

Why

The current demo code can search news via NewsAPI, but NewsAPI has only one crypto news provider (CCN). It also doesn't support the fine-grained semantic and time-based search that a vector db allows.
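To make "semantic and time-based search" concrete, here is a minimal in-memory sketch of the idea, not a real vector db. The `Article` class, the `search` function and the toy 2-d embeddings are all hypothetical; a real indexer would get embeddings from a model and delegate filtering and ranking to the database.

```python
from dataclasses import dataclass
from datetime import datetime
from math import sqrt


@dataclass
class Article:
    title: str
    published: datetime
    embedding: list[float]  # toy vector (assumption); normally produced by an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(articles: list[Article], query_vec: list[float],
           since: datetime, top_k: int = 3) -> list[Article]:
    """Keep articles published on or after `since`, then rank by similarity."""
    recent = [a for a in articles if a.published >= since]
    recent.sort(key=lambda a: cosine(a.embedding, query_vec), reverse=True)
    return recent[:top_k]


# Hypothetical sample data for illustration only.
ARTICLES = [
    Article("A", datetime(2024, 1, 10), [1.0, 0.0]),
    Article("B", datetime(2024, 3, 1), [0.9, 0.1]),
    Article("C", datetime(2024, 3, 5), [0.0, 1.0]),
]
```

With this data, `search(ARTICLES, [1.0, 0.0], since=datetime(2024, 2, 1))` drops article A because of the time filter, even though A is the closest semantic match.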

I looked into the robots.txt and site HTML of CoinDesk and CoinTelegraph today, and both of them allow scraping their news articles. I also remember seeing an ML researcher talk about more modern / faster web scraping methods than the traditional scrapy / BeautifulSoup approach, but I still need to find that.
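The robots.txt check could be automated before each crawl, roughly like the sketch below using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs and the `news-indexer-bot` user agent are made up for illustration; they are not the actual CoinDesk or CoinTelegraph rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration, NOT copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def is_allowed(robots_txt: str, url: str,
               user_agent: str = "news-indexer-bot") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` would fetch the real file instead of parsing a string.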

The scraping bot should live in a repository separate from the Hummingbot AI repository. I intend to keep its code separate from the chat agent and turn the news indexing service into a paid API service.

What

martinkou commented 5 months ago

Some modern scraper options to consider

https://jina.ai/reader/

https://github.com/trancethehuman/entities-extraction-web-scraper