AlexKashuba opened 5 years ago
Also need to filter stop words.
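A minimal sketch of what stop-word filtering could look like; the word list here is an illustrative assumption, not a fixed specification:

```python
# Illustrative stop-word set; a real crawler would use a larger, curated list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def filter_stop_words(tokens):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(filter_stop_words(["The", "quick", "fox", "and", "the", "dog"]))
# → ['quick', 'fox', 'dog']
```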
Possible extension: Add fault tolerance.
Problem: A crawler parses a page, extracts the hyperlinks, and distributes these to the responsible crawler. If that crawler crashes or leaves before crawling some of the links, it's possible that those pages will never be crawled.
Possible solution: Replication. Instead of sending the URLs to a single crawler, we send them to K crawlers. To avoid duplicate crawls of the same page, every crawler hashes each page it has crawled and saves the hash in the DHT. Each crawler does a lookup before parsing any newly downloaded page and discards it if it notices that it is already present in the DHT.
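The dedup check described above could be sketched like this. Here a plain `dict` stands in for the DHT, and the `Crawler` class and its method names are illustrative assumptions:

```python
import hashlib

def content_key(page_bytes: bytes) -> str:
    # Stable key derived from the page content; identical pages hash to
    # the same key regardless of which replica downloaded them.
    return hashlib.sha256(page_bytes).hexdigest()

class Crawler:
    """Toy crawler replica; `dht` is a shared dict standing in for the real DHT."""

    def __init__(self, dht: dict):
        self.dht = dht
        self.parsed = []  # URLs this replica actually parsed

    def process(self, url: str, page_bytes: bytes) -> bool:
        key = content_key(page_bytes)
        if key in self.dht:      # another replica already crawled this page
            return False         # discard without parsing
        self.dht[key] = url      # record the crawl before parsing
        self.parsed.append(url)
        return True

# With K=2 replication the same page reaches two crawlers, but only the
# first one to check the DHT actually parses it; the other discards it.
dht = {}
a, b = Crawler(dht), Crawler(dht)
page = b"<html>hello</html>"
print(a.process("http://example.com", page))         # → True
print(b.process("http://example.com/mirror", page))  # → False
```

Note this sketch checks the DHT after downloading, as the issue describes; a race between two replicas looking up the same key simultaneously would still need an atomic put-if-absent in the real DHT.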