vezaynk opened this issue 5 years ago
This task has very little dependency on the rest of the project, so it has its own project board here: https://github.com/knyzorg/FeedMeLater/projects/2
To aid with parsing pages: https://mercury.postlight.com/web-parser/
Two more options for parsing pages:
Outline: https://outlineapi.com/v2/parse_article?source_url=
Readability: https://github.com/mozilla/readability
Outline seems to have refined this further than anyone else, but it creates an external dependency I would want to break away from as soon as possible.
Ideally, crawling would be as easy as grabbing an RSS feed or reading semantic HTML. Sadly, many websites have neither. Sometimes a site has an RSS feed, but it is not linked in the meta headers.
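As a sketch of the "easy" path: feed discovery usually means scanning a page's `<head>` for `<link rel="alternate">` tags with an RSS/Atom MIME type. A real crawler would use a proper HTML parser; the regex scan and sample HTML below are only illustrative.

```javascript
// Sketch: discover a site's RSS/Atom feed URLs from its HTML.
// Regex-based for illustration only; use an HTML parser in production.
function discoverFeeds(html) {
  const feeds = [];
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    if (/type=["']application\/(rss|atom)\+xml["']/i.test(tag)) {
      const href = tag.match(/href=["']([^"']+)["']/i);
      if (href) feeds.push(href[1]);
    }
  }
  return feeds;
}

// Hypothetical sample page.
const sampleHtml = `
<html><head>
  <link rel="alternate" type="application/rss+xml" href="/feed.xml">
  <link rel="stylesheet" href="/style.css">
</head><body></body></html>`;

console.log(discoverFeeds(sampleHtml)); // → [ '/feed.xml' ]
```

When this returns nothing, the site either has no feed or has one that is not advertised in the markup, which is exactly the case that forces a fallback.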
Most websites have a sitemap, which can be located via robots.txt and often includes the last-modified dates of pages, so it can serve as a fallback. However, this requires further scraping right off the bat.
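The fallback path could look roughly like this: pull `Sitemap:` lines out of robots.txt, then extract `<loc>`/`<lastmod>` pairs from the sitemap XML. The sample inputs are hypothetical, and a real crawler would fetch both over HTTP and use an XML parser.

```javascript
// Sketch: locate sitemaps via robots.txt and pull page URLs with their
// <lastmod> dates as a feed substitute.
function sitemapUrlsFromRobots(robotsTxt) {
  return robotsTxt
    .split('\n')
    .filter((line) => /^sitemap:/i.test(line.trim()))
    .map((line) => line.trim().slice('sitemap:'.length).trim());
}

function pagesFromSitemap(sitemapXml) {
  const pages = [];
  const entries = sitemapXml.match(/<url>[\s\S]*?<\/url>/g) || [];
  for (const entry of entries) {
    const loc = entry.match(/<loc>(.*?)<\/loc>/);
    const lastmod = entry.match(/<lastmod>(.*?)<\/lastmod>/);
    if (loc) pages.push({ url: loc[1], lastmod: lastmod ? lastmod[1] : null });
  }
  return pages;
}

// Hypothetical inputs.
const robotsTxt = 'User-agent: *\nSitemap: https://example.com/sitemap.xml';
const sitemapXml = `
<urlset>
  <url><loc>https://example.com/post-1</loc><lastmod>2020-01-15</lastmod></url>
  <url><loc>https://example.com/post-2</loc></url>
</urlset>`;

console.log(sitemapUrlsFromRobots(robotsTxt));
console.log(pagesFromSitemap(sitemapXml));
```

Note that `<lastmod>` is optional in the sitemap protocol, so the date-based ordering this enables is best-effort.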
This process will be very resource-intensive and is ill-suited to a single-threaded environment like Node. We will need to evaluate the viability of Node multi-threading (e.g. worker threads) or use a language that makes it viable, such as Rust.