mpi-sws-rse / datablox

A dataflow language and runtime
Apache License 2.0

Design for bookmark demo app #28

Closed jfischer closed 12 years ago

jfischer commented 12 years ago

Need to define the following:

Questions to answer:

t-saideep commented 12 years ago

Use case:

Users bookmark web pages, and these pages, along with related content (source code, JavaScript, CSS, images, etc.), are downloaded from the web and indexed. Users can then search for text in their bookmarks, and the entire webpage is displayed in the results. In this way, users can ensure that they have a snapshot of the document as they saw it. If webpages are later changed or deleted, users will still have a working copy.

UI:

The UI includes a standard website to:

  1. Add bookmarks
  2. List all the archived bookmarks
  3. Once a bookmark is selected, show its archived snapshot
  4. Delete bookmarks
  5. Shut down the topology
  6. (Optionally) Search for text in bookmarks

In addition, it includes a bookmarklet for adding bookmarks which interfaces with the website.
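The bookmarklet only needs to send the current page's URL to the website. As a minimal sketch of the receiving side, assuming a plain HTTP endpoint (the port, path, and parameter name here are made up for illustration, not the actual app's):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class BookmarkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/add":
            # The bookmarklet passes the page URL as a query parameter.
            url = parse_qs(parsed.query).get("url", [""])[0]
            # Hand the URL off to the bookmark client block here.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(("bookmarked " + url).encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), BookmarkHandler).serve_forever()
```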

Blox design:

We will have the following blocks:

  1. Bookmark client: This passes the bookmarks clicked by users on to the crawler. It will be part of the client UI for bookmarking.
  2. Web crawler: This takes a link, fetches the source code and related content from the web, and passes them on to the bookmark manager.
  3. Hash: Computes a hash for each file. The hash acts as the key for the content store and allows it to avoid storing duplicates.
  4. Bookmark manager: This takes the link, source code, and related content, stores all the content in the content store, and gets the ID of each file in the store. It stores the IDs associated with each link in the metadata store and sends the source for indexing.
  5. Content store: Efficiently stores files, avoiding duplicates based on the hashes computed by the hash block. Keeps reference counts on the hashes so that deletions can reclaim space (see the sketch after this list).
  6. Metadata store: This is the standard MongoDB interface for storing metadata.
  7. (optional) Indexer: This is the standard Solr indexer, which can be reused from the file analytics example. We can probably make it aware that it is indexing HTML files; if not, indexing the plain HTML text will do.
  8. (optional) Search: This takes a query, queries the indexer to get results, and for each result gets the complete webpage from the recovery manager.
  9. Recovery manager: This takes a link, gets the related store IDs from the metadata store, fetches the URL of each file from the store, rewrites the HTML file to point to those URLs, and returns the page.
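To make items 3-5 concrete, here is a minimal in-memory sketch of how the hash block and content store could cooperate; the class and method names are invented for illustration, and the real store would persist files to disk rather than keep blobs in a dict:

```python
import hashlib

def content_hash(data: bytes) -> str:
    # The hash block: the content hash doubles as the storage key.
    return hashlib.sha1(data).hexdigest()

class ContentStore:
    def __init__(self):
        self.blobs = {}      # hash -> file contents
        self.refcounts = {}  # hash -> number of bookmarks using this file

    def put(self, data: bytes) -> str:
        key = content_hash(data)
        if key not in self.blobs:
            self.blobs[key] = data  # first copy: actually store the bytes
        self.refcounts[key] = self.refcounts.get(key, 0) + 1
        return key                  # this ID goes into the metadata store

    def delete(self, key: str) -> None:
        # Reference counting: only reclaim the bytes once the last
        # bookmark referencing this file is deleted.
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.blobs[key]
            del self.refcounts[key]
```

Storing the same page twice then only bumps the reference counts, which is what scenario 3 of the evaluation below exercises.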

Some design parameters:

  1. The crawler can either fetch pages right away or queue them and fetch every hour. Queuing might improve network performance, but it would need more implementation effort.
  2. The content store can choose among several strategies for efficient storage: file- or block-based de-duplication, different replication strategies, etc. We already have some support for block-based de-duplication, and file-based de-duplication is not hard to add (the two are contrasted in the sketch after this list). Replication would require more work, but we can tackle it as part of the cloud storage example.
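The two de-duplication granularities in item 2 differ only in what gets hashed. A minimal sketch (the block size and function names are illustrative, not the actual implementation):

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

def file_keys(data: bytes):
    # File-based: one hash per file, so two files dedupe only if
    # they are byte-for-byte identical.
    return [hashlib.sha1(data).hexdigest()]

def block_keys(data: bytes):
    # Block-based: one hash per fixed-size block, so files with large
    # identical runs share the corresponding blocks.
    return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

page_v1 = b"x" * 10000
page_v2 = page_v1[:-1] + b"y"  # differs only in the last byte
print(set(file_keys(page_v1)) & set(file_keys(page_v2)))    # empty: nothing shared
print(set(block_keys(page_v1)) & set(block_keys(page_v2)))  # leading blocks shared
```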

Implementation:

All the blocks except indexing and search are done, including the UI. Currently the crawler fetches pages right away, and the content store uses the file-based de-duplication strategy.

Evaluation: TBD

t-saideep commented 12 years ago

Evaluation (performance analysis of the Bookmarks application):

Setup:

The application is asked to bookmark a list of 10 URLs in various scenarios:

  1. Single crawler block (downloads are done sequentially)
  2. Multiple crawler blocks under a shard (parallel downloads are allowed; a timing sketch follows this list)
  3. The application has already downloaded the 10 URLs previously (to see how efficiently we are storing data)
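For reference, the gap between scenarios 1 and 2 can be approximated outside the framework with plain Python. This only times the raw downloads, not the full topology, and the URL list is a stand-in for the 10 test URLs:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = ["http://example.com/"] * 10  # stand-in for the 10 test URLs

def fetch(url):
    return urllib.request.urlopen(url).read()

def timed(label, fn):
    start = time.time()
    fn()
    print(label, round(time.time() - start, 2), "s")

# Scenario 1: a single crawler block, downloads run sequentially.
timed("sequential", lambda: [fetch(u) for u in URLS])

# Scenario 2: several crawler blocks under a shard, downloads in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    timed("parallel (3 crawlers)", lambda: list(pool.map(fetch, URLS)))
```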

The following measurements are done:

  1. Total time taken to download the files from the web
  2. Total time taken to run the topology
  3. Disk space saved

The application is run on a single node (my computer) and on multiple nodes (2 from the MPI cluster).

Results:

Sequential download, my computer:

Parallel download, my computer (2 crawler blocks):

Parallel download, my computer (3 crawler blocks):

Parallel download, my computer (4 crawler blocks):

Re-downloading already downloaded 10 URLs (3 crawler blocks):

Summary: The framework overhead over sequential downloading is 4.76%. This indicates that the framework does not add any significant overhead and is therefore suitable for this application.

The framework's advantage kicks in when we allow multiple crawler blocks to run simultaneously, cutting the runtime almost in half. On my internet connection, the network caps out at 2 simultaneous downloads, and performance degrades from 4 crawlers onwards.

The file-based de-duplication strategy works reasonably well for snapshots taken at fairly close intervals. Some of the sites serve different advertising images on each fetch, hence the extra data stored.

I observed similar results on the MPI machines, but I can't SSH into them right now; I'll compile those results tomorrow if I can SSH from UCLA.

t-saideep commented 12 years ago

Multinode tests using 2 MPI VMs:

Single crawler:

Time to run entire topology with:

The MPI servers allow for many more simultaneous downloads, and datablox takes advantage of this.

The time spent on hashing, storing, and other framework bookkeeping is 2 seconds, compared to 51 seconds to run the entire topology in the sequential case. Hence the framework overhead is less than 4%.