mvidigueira / Peerster

Course Project - Decentralized Systems Engineering 18 (EPFL)

Web Crawling #2

Open AlexKashuba opened 5 years ago

AlexKashuba commented 5 years ago

Also need to filter stop words
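
A minimal sketch of how stop-word filtering could be plugged in before tokens are indexed. The word list and the `filterStopWords` helper are illustrative only, not existing project code:

```go
package main

import (
	"fmt"
	"strings"
)

// A small illustrative stop-word set; a real crawler would load a fuller list.
var stopWords = map[string]struct{}{
	"a": {}, "an": {}, "and": {}, "the": {}, "is": {},
	"in": {}, "of": {}, "to": {}, "it": {}, "for": {},
}

// filterStopWords drops stop words from the tokens of a crawled page
// before they are handed to the indexer.
func filterStopWords(tokens []string) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, isStop := stopWords[strings.ToLower(t)]; !isStop {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterStopWords([]string{"the", "Peerster", "crawler", "is", "decentralized"}))
	// Output: [Peerster crawler decentralized]
}
```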

jakobsvenningsson commented 5 years ago

Possible extension: Add fault tolerance.

Problem: A crawler parses a page, extracts the hyperlinks, and distributes them to the responsible crawler. If that crawler crashes or leaves before crawling some of the links, it's possible that those pages will never be crawled.

Possible solution: Replication. Instead of sending the URLs to only a single crawler, we send them to K crawlers. To avoid duplicate crawls of the same page, every crawler hashes each page it has crawled and saves the hash in the DHT. Each crawler does a lookup before parsing any newly downloaded page and discards it if it notices that the hash is already present in the DHT.
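
A rough sketch of what this could look like in Go. The `DHT` interface, `Crawler` struct, and helper names are assumptions made for illustration, not the project's actual API:

```go
package crawler

import (
	"crypto/sha1"
	"encoding/hex"
)

// DHT is a stand-in for the existing Peerster DHT; the real API may differ.
type DHT interface {
	Contains(key string) bool
	Store(key string, value []byte)
}

// Crawler holds the replication factor K and the hooks it needs.
type Crawler struct {
	dht         DHT
	replication int // K: number of crawlers each URL is forwarded to
	// responsibleFor returns the K peers responsible for a URL.
	responsibleFor func(url string, k int) []string
	sendURL        func(peer string, url string)
	extractLinks   func(page []byte) []string
}

// pageKey hashes the downloaded page so any replica can detect that it has
// already been crawled.
func pageKey(page []byte) string {
	sum := sha1.Sum(page)
	return hex.EncodeToString(sum[:])
}

// handlePage discards pages whose hash is already in the DHT; otherwise it
// records the hash and forwards every extracted link to K responsible crawlers.
func (c *Crawler) handlePage(page []byte) {
	key := pageKey(page)
	if c.dht.Contains(key) {
		return // another replica already crawled this page
	}
	c.dht.Store(key, []byte("crawled"))
	for _, link := range c.extractLinks(page) {
		for _, peer := range c.responsibleFor(link, c.replication) {
			c.sendURL(peer, link) // replicate the URL to K crawlers instead of one
		}
	}
}
```

Storing only the page hash keeps DHT entries small; the trade-off is that up to K replicas may each download the same page once before the hash lands in the DHT, but no page is lost as long as at least one of the K responsible crawlers stays up.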