vezaynk opened this issue 5 years ago
This task has very little dependency on the rest of the project, so it has its own project board here: https://github.com/knyzorg/FeedMeLater/projects/2
To aid with parsing pages: https://mercury.postlight.com/web-parser/
Two more options for parsing pages:
Outline: https://outlineapi.com/v2/parse_article?source_url=
Readability: https://github.com/mozilla/readability
Outline seems to have refined this further than anyone else, but it creates an external dependency I would want to break away from as soon as possible.
Ideally, crawling would be as easy as grabbing an RSS feed or reading semantic HTML. Sadly, many websites have neither. Sometimes a site has an RSS feed, but it is not linked in the meta headers.
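As a sketch of the "easy" path: feed discovery usually means scanning a page's `<head>` for `<link rel="alternate">` tags with an RSS/Atom MIME type. A real crawler would use a proper HTML parser; the regex scan and sample HTML below are only illustrative.

```javascript
// Sketch: discover a site's RSS/Atom feed URLs from its HTML.
// Regex-based for illustration only; use an HTML parser in production.
function discoverFeeds(html) {
  const feeds = [];
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    if (/type=["']application\/(rss|atom)\+xml["']/i.test(tag)) {
      const href = tag.match(/href=["']([^"']+)["']/i);
      if (href) feeds.push(href[1]);
    }
  }
  return feeds;
}

// Hypothetical sample page.
const sampleHtml = `
<html><head>
  <link rel="alternate" type="application/rss+xml" href="/feed.xml">
  <link rel="stylesheet" href="/style.css">
</head><body></body></html>`;

console.log(discoverFeeds(sampleHtml)); // → [ '/feed.xml' ]
```

When this returns nothing, the site either has no feed or has one that is not advertised in the markup, which is exactly the case that forces a fallback.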
Most websites have a sitemap, which can be located via robots.txt and often includes the last-modified dates of pages, so it can serve as a fallback. However, this requires further scraping right off the bat.
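The fallback path could look roughly like this: pull `Sitemap:` lines out of robots.txt, then extract `<loc>`/`<lastmod>` pairs from the sitemap XML. The sample inputs are hypothetical, and a real crawler would fetch both over HTTP and use an XML parser.

```javascript
// Sketch: locate sitemaps via robots.txt and pull page URLs with their
// <lastmod> dates as a feed substitute.
function sitemapUrlsFromRobots(robotsTxt) {
  return robotsTxt
    .split('\n')
    .filter((line) => /^sitemap:/i.test(line.trim()))
    .map((line) => line.trim().slice('sitemap:'.length).trim());
}

function pagesFromSitemap(sitemapXml) {
  const pages = [];
  const entries = sitemapXml.match(/<url>[\s\S]*?<\/url>/g) || [];
  for (const entry of entries) {
    const loc = entry.match(/<loc>(.*?)<\/loc>/);
    const lastmod = entry.match(/<lastmod>(.*?)<\/lastmod>/);
    if (loc) pages.push({ url: loc[1], lastmod: lastmod ? lastmod[1] : null });
  }
  return pages;
}

// Hypothetical inputs.
const robotsTxt = 'User-agent: *\nSitemap: https://example.com/sitemap.xml';
const sitemapXml = `
<urlset>
  <url><loc>https://example.com/post-1</loc><lastmod>2020-01-15</lastmod></url>
  <url><loc>https://example.com/post-2</loc></url>
</urlset>`;

console.log(sitemapUrlsFromRobots(robotsTxt));
console.log(pagesFromSitemap(sitemapXml));
```

Note that `<lastmod>` is optional in the sitemap protocol, so the date-based ordering this enables is best-effort.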
This process will be very resource-intensive and is ill-suited to a single-threaded environment like Node. We will need to evaluate the viability of Node multi-threading (e.g. worker threads) or use a language that makes it viable, such as Rust.