Closed jfischer closed 12 years ago
Use case:
Users bookmark web pages, and each page is downloaded along with its related content (source code, JavaScript, CSS, images, etc.) and indexed. Users can then search for text in their bookmarks, and the entire web page is displayed in the results. This gives users a snapshot of the document as they saw it: if a page is later changed or deleted, they still have a working copy.
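To make "related content" concrete, here is a minimal sketch of how the asset URLs of a bookmarked page could be discovered before download. It is not the crawler block's actual code; the names `AssetCollector` and `related_assets` are hypothetical, and it only handles scripts, stylesheets and images.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects absolute URLs of scripts, stylesheets and images (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            # Resolve relative paths against the bookmarked page's URL.
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(urljoin(self.base_url, attrs["href"]))


def related_assets(page_url, html):
    """Return the asset URLs referenced by html, in document order."""
    collector = AssetCollector(page_url)
    collector.feed(html)
    return collector.assets
```

Each returned URL would then be fetched and stored next to the page itself, so that the snapshot still renders if the original disappears.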
UI:
The UI includes a standard website to:
In addition, it includes a bookmarklet for adding bookmarks which interfaces with the website.
Blox design:
We will have the following blocks:
Some design parameters:
Implementation:
All the blocks except for indexing and search are done, including the UI. Currently the crawler fetches content immediately, and the content store uses a file-based de-duplication strategy.
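As an illustration of file-based de-duplication, the sketch below stores each downloaded file under the hash of its contents, so an identical file from a later snapshot is not written twice. This assumes content hashing with SHA-256; the class name `ContentStore` and its methods are hypothetical and not the content store block's real interface.

```python
import hashlib
import os


class ContentStore:
    """Stores each distinct file once, keyed by the SHA-256 of its bytes (illustrative only)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data):
        """Store data; return (digest, is_new). Duplicate content is not rewritten."""
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, digest)
        if os.path.exists(path):
            return digest, False
        with open(path, "wb") as f:
            f.write(data)
        return digest, True

    def get(self, digest):
        """Return the bytes previously stored under digest."""
        with open(os.path.join(self.root, digest), "rb") as f:
            return f.read()
```

A snapshot would then only need to record the list of digests it refers to. Files that differ by even one byte (for example rotating ad images) are stored as new entries.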
Evaluation: TBD
Evaluation (performance analysis of Bookmarks application):
Setup:
The application is asked to bookmark a list of 10 URLs in various scenarios:
The following measurements are done:
The application is run on a single node (my computer) and on multiple nodes (2 from the MPI cluster).
Results:
Sequential download, my computer:
Parallel download, my computer (2 crawler blocks):
Parallel download, my computer (3 crawler blocks):
Parallel download, my computer (4 crawler blocks):
Re-downloading already downloaded 10 URLs (3 crawler blocks):
Summary: The framework overhead relative to sequential downloading is 4.76%. This indicates that the framework does not add significant overhead and is therefore suitable for this application.
The framework's advantage kicks in when multiple crawler blocks run simultaneously, cutting the runtime almost in half. On my internet connection, the network caps out at 2 simultaneous downloads, and performance degrades from 4 onwards.
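The effect of running several crawlers at once can be approximated outside the framework with a bounded worker pool. This is only an analogy: it uses threads from the Python standard library rather than datablox blocks, and `crawl`, `fetch` and `num_crawlers` are names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(urls, fetch, num_crawlers=2):
    """Fetch every URL using at most num_crawlers concurrent workers.

    fetch is any callable that takes a URL and returns its content.
    Returns a dict mapping each URL to its fetched result.
    """
    with ThreadPoolExecutor(max_workers=num_crawlers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))
```

Varying `num_crawlers` (2, 3, 4) with a real download function is one way to find where a given connection stops benefiting from more parallel downloads.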
The file-based de-duplication strategy works reasonably well for snapshots taken at fairly close intervals. Some of the sites serve different advertising images between snapshots, which accounts for the extra additions.
I observed similar results on the MPI machines, but I can't SSH into them right now. I'll compile those results tomorrow if I can SSH in from UCLA.
Multinode tests using 2 MPI VMs:
Single crawler:
Time to run entire topology with:
MPI servers allow for many more simultaneous downloads, and datablox takes advantage of this.
Time spent on hashing, storing and other framework bookkeeping is 2 sec, compared to 51 sec for running the entire topology in the sequential case. Hence the framework overhead is less than 4%.
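For reference, the overhead figure follows directly from the two timings above. The helper name `overhead_percent` is just for this snippet.

```python
def overhead_percent(bookkeeping_sec, total_sec):
    """Share of the total runtime spent on framework bookkeeping, in percent."""
    return 100.0 * bookkeeping_sec / total_sec


# 2 sec of hashing/storing/bookkeeping out of a 51 sec sequential run.
print(round(overhead_percent(2, 51), 2))  # prints 3.92
```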
Need to define the following:
Questions to answer: