mpi-sws-rse / datablox

A dataflow language and runtime
Apache License 2.0

Design for bookmark demo app #28

Closed jfischer closed 12 years ago

jfischer commented 12 years ago

Need to define the following:

Questions to answer:

t-saideep commented 12 years ago

Use case:

Users bookmark web pages, and these pages, along with related content (source code, JavaScript, CSS, images, etc.), are downloaded from the web and indexed. Users can then search for text in their bookmarks, and the entire webpage is displayed in the results. In this way, users can ensure that they have a snapshot of the document as they saw it. If webpages are later changed or deleted, users will still have a working copy.

UI:

The UI includes a standard website to:

  1. Add bookmarks
  2. List all the archived bookmarks
  3. Once a bookmark is selected, show its archived snapshot
  4. Delete bookmarks
  5. Shut down the topology
  6. (Optionally) Search for text in bookmarks

In addition, it includes a bookmarklet for adding bookmarks which interfaces with the website.
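The bookmarklet only needs to send the current page's URL to the website. As a minimal sketch of the receiving side, assuming a plain HTTP endpoint (the port, path, and parameter name here are made up for illustration, not the actual app's):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class BookmarkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/add":
            # The bookmarklet passes the page URL as a query parameter.
            url = parse_qs(parsed.query).get("url", [""])[0]
            # Hand the URL off to the bookmark client block here.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(("bookmarked " + url).encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), BookmarkHandler).serve_forever()
```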

Blox design:

We will have the following blocks:

  1. Bookmark client: This passes the bookmarks clicked by users on to the crawler. It will be part of the client UI for bookmarking.
  2. Web crawler: This takes a link, fetches the source code and related content from the web, and passes them on to the bookmark manager.
  3. Hash: Computes a hash for each file. The hash acts as the key for the content store and allows it to avoid storing duplicates.
  4. Bookmark manager: This takes the link, source code, and related content, stores all the content in the content store, and gets the ID of each file in the store. It stores the IDs associated with each link in the metadata store and sends the source for indexing.
  5. Content store: Efficiently stores files, avoiding duplicates based on the hashes computed by the hash block. Keeps reference counts on the hashes so that deletions can reclaim space (see the sketch after this list).
  6. Metadata store: This is the standard MongoDB interface for storing metadata.
  7. (optional) Indexer: This is the standard Solr indexer, which can be reused from the file analytics example. We can probably make it aware that it is indexing HTML files; if not, indexing the plain HTML text will do.
  8. (optional) Search: This takes a query, queries the indexer to get results, and for each result gets the complete webpage from the recovery manager.
  9. Recovery manager: This takes a link, gets the related store IDs from the metadata store, fetches the URL of each file from the store, rewrites the HTML file to point to those URLs, and returns the page.
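To make items 3-5 concrete, here is a minimal in-memory sketch of how the hash block and content store could cooperate; the class and method names are invented for illustration, and the real store would persist files to disk rather than keep blobs in a dict:

```python
import hashlib

def content_hash(data: bytes) -> str:
    # The hash block: the content hash doubles as the storage key.
    return hashlib.sha1(data).hexdigest()

class ContentStore:
    def __init__(self):
        self.blobs = {}      # hash -> file contents
        self.refcounts = {}  # hash -> number of bookmarks using this file

    def put(self, data: bytes) -> str:
        key = content_hash(data)
        if key not in self.blobs:
            self.blobs[key] = data  # first copy: actually store the bytes
        self.refcounts[key] = self.refcounts.get(key, 0) + 1
        return key                  # this ID goes into the metadata store

    def delete(self, key: str) -> None:
        # Reference counting: only reclaim the bytes once the last
        # bookmark referencing this file is deleted.
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.blobs[key]
            del self.refcounts[key]
```

Storing the same page twice then only bumps the reference counts, which is what scenario 3 of the evaluation below exercises.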

Some design parameters:

  1. The crawler can either fetch pages right away or queue them and fetch every hour. Queuing might improve network performance, but it would need more implementation effort.
  2. The content store can choose among several strategies for efficient storage: file- or block-based de-duplication, different replication strategies, etc. We already have some support for block-based de-duplication, and file-based de-duplication is not hard to add (the two are contrasted in the sketch after this list). Replication would require more work, but we can tackle it as part of the cloud storage example.
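The two de-duplication granularities in item 2 differ only in what gets hashed. A minimal sketch (the block size and function names are illustrative, not the actual implementation):

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

def file_keys(data: bytes):
    # File-based: one hash per file, so two files dedupe only if
    # they are byte-for-byte identical.
    return [hashlib.sha1(data).hexdigest()]

def block_keys(data: bytes):
    # Block-based: one hash per fixed-size block, so files with large
    # identical runs share the corresponding blocks.
    return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

page_v1 = b"x" * 10000
page_v2 = page_v1[:-1] + b"y"  # differs only in the last byte
print(set(file_keys(page_v1)) & set(file_keys(page_v2)))    # empty: nothing shared
print(set(block_keys(page_v1)) & set(block_keys(page_v2)))  # leading blocks shared
```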

Implementation:

All the blocks except indexing and search are done, including the UI. Currently the crawler fetches pages right away, and the content store uses the file-based de-duplication strategy.

Evaluation: TBD

t-saideep commented 12 years ago

Evaluation (performance analysis of the Bookmarks application):

Setup:

The application is asked to bookmark a list of 10 URLs in various scenarios:

  1. Single crawler block (downloads are done sequentially)
  2. Multiple crawler blocks under a shard (parallel downloads are allowed; a timing sketch follows this list)
  3. The application has already downloaded the 10 URLs previously (to see how efficiently we are storing data)
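For reference, the gap between scenarios 1 and 2 can be approximated outside the framework with plain Python. This only times the raw downloads, not the full topology, and the URL list is a stand-in for the 10 test URLs:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = ["http://example.com/"] * 10  # stand-in for the 10 test URLs

def fetch(url):
    return urllib.request.urlopen(url).read()

def timed(label, fn):
    start = time.time()
    fn()
    print(label, round(time.time() - start, 2), "s")

# Scenario 1: a single crawler block, downloads run sequentially.
timed("sequential", lambda: [fetch(u) for u in URLS])

# Scenario 2: several crawler blocks under a shard, downloads in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    timed("parallel (3 crawlers)", lambda: list(pool.map(fetch, URLS)))
```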

The following measurements are done:

  1. Total time taken to download the files from the web
  2. Total time taken to run the topology
  3. Disk space saved

The application is run on a single node (my computer) and on multiple nodes (2 from the MPI cluster).

Results:

Sequential download, my computer:

Parallel download, my computer (2 crawler blocks):

Parallel download, my computer (3 crawler blocks):

Parallel download, my computer (4 crawler blocks):

Re-downloading already downloaded 10 URLs (3 crawler blocks):

Summary: The framework overhead over sequential downloading is 4.76%. This indicates that the framework does not add any significant overhead and is therefore suitable for this application.

The framework's advantage kicks in when we allow multiple crawler blocks to run simultaneously, cutting the runtime almost in half. On my internet connection, the network caps out at 2 simultaneous downloads, and performance degrades from 4 crawlers onwards.

The file-based de-duplication strategy works reasonably well for snapshots taken at fairly close intervals. Some of the sites serve different advertising images on each fetch, hence the extra data stored.

I observed similar results on the MPI machines, but I can't SSH into them right now; I'll compile those results tomorrow if I can SSH from UCLA.

t-saideep commented 12 years ago

Multinode tests using 2 MPI VMs:

Single crawler:

Time to run entire topology with:

The MPI servers allow for many more simultaneous downloads, and datablox takes advantage of this.

The time spent on hashing, storing, and other framework bookkeeping is 2 seconds, compared to 51 seconds to run the entire topology in the sequential case. Hence the framework overhead is less than 4%.