netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

Optional lenient URL resolver #255

Status: Closed (tokee closed this 1 year ago)

tokee commented 1 year ago

Heritrix has problems with <img srcSet...> support (see https://github.com/internetarchive/heritrix3/pull/179/), which results in attempted harvesting of img1%20img2 instead of img1 and img2. Surprisingly, this works with some image servers, which return the image data for img1 when img1%20img2 is requested (my guess: because of the space).

NetarchiveSolrClient.findNearestDocuments is used for locating page resources (images, CSS etc.) for playback. We should consider adding an option for lenient resolving: Start by searching for the URL directly and if that fails, split on first space (maybe also comma and semicolon?) and do a wildcard search on the first part, e.g. img1*.
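A minimal sketch of the fallback order described above, written in Python for brevity rather than as the actual Java code. The function name lenient_lookups is hypothetical, and the sketch assumes the URL arrives with a literal (already percent-decoded) space. It only produces the lookup strings; escaping Solr special characters and issuing the searches is left out.

```python
import re


def lenient_lookups(url: str) -> list[str]:
    """Return the lookups to try, in order: exact URL first, then a prefix wildcard."""
    lookups = [url]
    # Split on the first space; comma and semicolon are the "maybe" delimiters.
    head = re.split(r"[ ,;]", url, maxsplit=1)[0]
    if head and head != url:
        lookups.append(head + "*")
    return lookups


print(lenient_lookups("http://example.com/img1 img2"))
```

A caller would try each entry in turn and stop at the first lookup that returns a document.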

tokee commented 1 year ago

This seems related to URLs that end with &_=123456789 (used for invalidating the browser cache), which can also present a problem. Maybe the lenience handling could be rule-based and suggest alternative lookups depending on the originating URL?

"([^ ]*) .*" → "$1*"
"(.*)&_=[0-9]+" → "$1*"
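The two rules above could be applied roughly as follows. This is a Python sketch under assumptions: the rules are taken to be whole-string matches, the "$1" back-reference is written as "\1" in Python syntax, and the names RULES and alternative_lookups are made up for illustration.

```python
import re

# (pattern, replacement template) pairs mirroring the two rules above.
RULES = [
    (re.compile(r"([^ ]*) .*"), r"\1*"),      # cut at the first space
    (re.compile(r"(.*)&_=[0-9]+"), r"\1*"),   # drop a trailing cache-buster argument
]


def alternative_lookups(url: str) -> list[str]:
    """Return one alternative lookup for every rule that matches the whole URL."""
    alternatives = []
    for pattern, template in RULES:
        match = pattern.fullmatch(url)
        if match:
            alternatives.append(match.expand(template))
    return alternatives


print(alternative_lookups("http://example.com/a?x=1&_=123456789"))
```

Keeping the rules in a list means further patterns can be added without touching the lookup logic.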

tokee commented 1 year ago

Thinking more about this: The field url_search tokenizes the URL and should be usable for this.

Let's say we have the URL http://example.com/images/search?q=horse&fq=animals&_=12345 indexed and we are looking for http://example.com/images/search?q=horse&fq=animals&_=67890 (the ending is different).

We consider the domain and the path to be fixed, with all arguments as optional. This can be expressed with the query

url:"http://example.com/images/search?q=horse&fq=animals&_=67890"^1000
OR
(
  ( domain:"example.com" AND url_search:"example.com/images/search" )
  OR url_search:"q=horse"
  OR url_search:"fq=animals"
  OR url_search:"_=67890"
)

The direct search on url with a massive boost ensures that a verbatim match wins. The ORs in the deconstructed URL ensure that none of the URL arguments are required, and Solr's ranking should place the closest matching URLs first. Note that example.com is also part of the first url_search clause to ensure that the path is matched from the root of the site.
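Deconstructing a URL into this kind of query could look like the Python sketch below. It is an illustration, not the SolrWayback implementation: the function names are hypothetical, the host name is used directly as the value for domain (an assumption that only holds when the host equals the indexed domain, as in the example), and only backslashes and double quotes are escaped inside the phrases.

```python
from urllib.parse import urlsplit


def _phrase(value: str) -> str:
    """Quote a value as a Solr phrase, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_lenient_query(url: str) -> str:
    """Build: exact url match with a boost, OR the deconstructed url_search clauses."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    clauses = [
        f"( domain:{_phrase(host)} AND url_search:{_phrase(host + parts.path)} )"
    ]
    # Every non-empty argument becomes its own optional clause.
    clauses += [f"url_search:{_phrase(arg)}" for arg in parts.query.split("&") if arg]
    return f"url:{_phrase(url)}^1000 OR ( " + " OR ".join(clauses) + " )"


print(build_lenient_query(
    "http://example.com/images/search?q=horse&fq=animals&_=67890"))
```

For the example URL this yields the same clauses as the query shown above, on a single line.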

Under the hood, after tokenization, this becomes

url:"http://example.com/images/search?q=horse&fq=animals&_=67890"^1000
OR
(
  ( domain:"example.com" AND url_search:"example com images search" )
  OR url_search:"q horse"
  OR url_search:"fq animals"
  OR url_search:"67890"
)

Notice that _ disappears. The match is not ideal compared to dedicated argument indexing, but it matches pretty well and (importantly here) is quite fast, with usable ranking.
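The effect of the tokenization can be approximated with a few lines of Python. This is only an approximation that reproduces the examples above by keeping runs of ASCII letters and digits; the real analyzer chain of url_search is defined in the Solr schema and may behave differently for other input. The function name is hypothetical.

```python
import re


def approximate_url_search_tokens(text: str) -> str:
    """Keep runs of ASCII letters and digits; punctuation such as . / = _ is dropped."""
    return " ".join(re.findall(r"[A-Za-z0-9]+", text))


for clause in ["example.com/images/search", "q=horse", "fq=animals", "_=67890"]:
    print(clause, "->", approximate_url_search_tokens(clause))
```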

thomasegense commented 1 year ago

This has been implemented. The GUI still needs more support for it.

For now, the playback URL can be changed from /web/ to /lenient/web/ to try lenient playback. Testing will take time, since it requires finding URLs with poor playback and comparing their normal playback with lenient playback.
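For side-by-side comparison, a playback link can be rewritten mechanically. The Python sketch below assumes only what is stated above, namely that /web/ becomes /lenient/web/. The helper name and the example host, port and timestamp are hypothetical.

```python
def to_lenient_playback(playback_url: str) -> str:
    """Switch a playback URL from /web/ to /lenient/web/, leaving lenient URLs as-is."""
    index = playback_url.find("/web/")
    if index == -1 or playback_url[:index].endswith("/lenient"):
        return playback_url
    # Only the first /web/ is changed, so a /web/ inside the archived URL is untouched.
    return playback_url[:index] + "/lenient" + playback_url[index:]


# Hypothetical playback link, for illustration only.
print(to_lenient_playback(
    "http://localhost:8080/solrwayback/services/web/20200101000000/http://example.com/web/a.png"))
```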