netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

Subjective speed up of HTML page display #4

Open tokee opened 6 years ago

tokee commented 6 years ago

When an archived HTML page is displayed, all links to its inlined resources need to be resolved to WARC-files and offsets. For pages with hundreds of such resources, this is a costly process. Currently three methods for resolving are known:

  1. Resolve all links to fileserver:WARC#offset up front and send the re-written HTML to the browser.
  2. Rewrite all links to fileresolver:url:timestamp up front and send the re-written HTML to the browser.
  3. Send the unmodified HTML and use a proxy-server to handle resolving to WARC-entries.

The first method is the fastest, measured in total time, as it performs batch-lookups to map the original links to WARC-entries. Subjectively it is very slow, as the user looks at a blank browser window until the resolving process has finished. The second and third methods are the inverse of the first: slower in total time, but the browser receives the HTML right away, so they might be preferable from an interactive perspective. For thumbnail rendering and similar, the first method is clearly favourable.

A fourth method is hereby suggested, combining the best of both worlds.

a) All links are rewritten to fileresolver:url:timestamp and the re-written HTML is delivered, similar to method #2 above. The URLs are also added to an internal queue in solrwayback.

b) solrwayback keeps a background thread running that resolves the fileresolver:url:timestamp URLs to fileserver:WARC#offset. This is done in FIFO-order and in appropriately sized batches, balancing quick resolving of the beginning of the queue with overall throughput. The result is stored in a map acting as lookup-cache. Non-matches are also stored.

c) When the browser requests a resource in the form of fileresolver:url:timestamp, it is either delivered from the lookup-cache or added to the resolver queue, with appropriate checks for existence in the resolver queue and waits for the resolving to finish.

This method makes it possible to tweak interactivity vs. throughput.
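As a rough illustration of steps a) to c), here is a minimal Java sketch of the queue plus lookup-cache. It is not solrwayback code: the class `BackgroundResolver`, its methods, the `batchLookup` function (standing in for a batched index lookup) and the `maxBatch` parameter are all hypothetical names chosen for this example. Keys are assumed to be plain `url:timestamp` strings and results plain `WARC#offset` strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical sketch of the resolver queue and lookup-cache; not part of solrwayback. */
class BackgroundResolver {
    // Lookup-cache: "url:timestamp" -> future holding "WARC#offset", or empty for a non-match.
    private final Map<String, CompletableFuture<Optional<String>>> cache = new ConcurrentHashMap<>();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(); // FIFO
    private final Function<List<String>, Map<String, String>> batchLookup;   // assumed batch resolver
    private final int maxBatch;

    BackgroundResolver(Function<List<String>, Map<String, String>> batchLookup, int maxBatch) {
        this.batchLookup = batchLookup;
        this.maxBatch = maxBatch;
        Thread worker = new Thread(this::drainLoop, "link-resolver");
        worker.setDaemon(true);
        worker.start();
    }

    /** Step a): queue a key unless it is already cached or pending. */
    CompletableFuture<Optional<String>> enqueue(String key) {
        CompletableFuture<Optional<String>> fresh = new CompletableFuture<>();
        CompletableFuture<Optional<String>> existing = cache.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing; // already resolved or already waiting in the queue
        }
        queue.add(key); // the cache entry exists before the worker can see the key
        return fresh;
    }

    /** Step c): answer from the cache, or queue the key and wait for the worker. */
    Optional<String> resolve(String key, long timeoutMs) {
        try {
            return enqueue(key).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("Could not resolve " + key, e);
        }
    }

    /** Step b): take keys in FIFO order, at most maxBatch per lookup. */
    private void drainLoop() {
        try {
            while (true) {
                List<String> batch = new ArrayList<>();
                batch.add(queue.take());            // block until there is work
                queue.drainTo(batch, maxBatch - 1); // add whatever else is waiting
                try {
                    Map<String, String> hits = batchLookup.apply(batch);
                    for (String key : batch) {
                        // Non-matches are stored too, as an empty Optional.
                        cache.get(key).complete(Optional.ofNullable(hits.get(key)));
                    }
                } catch (RuntimeException e) {
                    for (String key : batch) {
                        cache.get(key).completeExceptionally(e);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In this sketch the HTML rewriter would call `enqueue` for every link on the page, and the resource endpoint would call `resolve` when the browser asks for one. `maxBatch` is the knob for the trade-off: a small value gets the first resources resolved sooner, a large value favours overall throughput. The cache here grows without bound, so a real version would need some eviction policy.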

thomasegense commented 2 years ago

A quick win will be to have the frontend lazy load images when the record is visible in the browser. That way only the top 2-3 records will load images and not all 20.

1) Call the images service for the record.
2) Fetch the images.

The frontend already does lazy image loading in several other places (like image search and GPS search).

tokee commented 2 years ago

I am pretty sure that any modern browser already tries to be smart about the load order of images on webpages :wink:

Image search and GPS search are special as they have delayed definition of the image content (as I understand it).

thomasegense commented 2 years ago

Sorry, you are talking about playback. I am talking about the image loading for result sets (which is a bottleneck).