AlexKashuba opened 5 years ago
Also need to filter stop words.
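A minimal sketch of what stop-word filtering could look like; the word list here is an illustrative assumption, not a fixed specification:

```python
# Illustrative stop-word set; a real crawler would use a larger, curated list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def filter_stop_words(tokens):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(filter_stop_words(["The", "quick", "fox", "and", "the", "dog"]))
# → ['quick', 'fox', 'dog']
```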
Possible extension: Add fault tolerance.
Problem: A crawler parses a page, extracts the hyperlinks, and distributes these to the responsible crawler. If that crawler crashes or leaves before crawling some of the links, it's possible that those pages will never be crawled.
Possible solution: Replication. Instead of sending the URLs to a single crawler, we send them to K crawlers. To avoid duplicate crawls of the same page, every crawler hashes each page it has crawled and saves the hash in the DHT. Each crawler does a lookup before parsing any newly downloaded page and discards it if it notices that it is already present in the DHT.
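The dedup check described above could be sketched like this. Here a plain `dict` stands in for the DHT, and the `Crawler` class and its method names are illustrative assumptions:

```python
import hashlib

def content_key(page_bytes: bytes) -> str:
    # Stable key derived from the page content; identical pages hash to
    # the same key regardless of which replica downloaded them.
    return hashlib.sha256(page_bytes).hexdigest()

class Crawler:
    """Toy crawler replica; `dht` is a shared dict standing in for the real DHT."""

    def __init__(self, dht: dict):
        self.dht = dht
        self.parsed = []  # URLs this replica actually parsed

    def process(self, url: str, page_bytes: bytes) -> bool:
        key = content_key(page_bytes)
        if key in self.dht:      # another replica already crawled this page
            return False         # discard without parsing
        self.dht[key] = url      # record the crawl before parsing
        self.parsed.append(url)
        return True

# With K=2 replication the same page reaches two crawlers, but only the
# first one to check the DHT actually parses it; the other discards it.
dht = {}
a, b = Crawler(dht), Crawler(dht)
page = b"<html>hello</html>"
print(a.process("http://example.com", page))         # → True
print(b.process("http://example.com/mirror", page))  # → False
```

Note this sketch checks the DHT after downloading, as the issue describes; a race between two replicas looking up the same key simultaneously would still need an atomic put-if-absent in the real DHT.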