ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Url path parameters indexed in new multivalued field. #184

Open thomasegense opened 6 years ago

thomasegense commented 6 years ago

All playback solutions have problems displaying a resource on a page if the resource URL has a dynamic path parameter, because that specific URL has not been harvested.

If the parameters were indexed, it would be possible to offer a lenient playback option that loads a URL matching some of the parameters.

Example URL: http://test.dk/imagerenerator/test.png?image=horse&timestamp=12345678901234

If the parameters were indexed into a multivalued field as

1) image=horse
2) timestamp=12345678901234

then it would be easy to match another request such as http://test.dk/imagerenerator/test.png?image=horse&timestamp=11111111111111, provided a match on just 1 of the 2 parameters is accepted.
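To make the idea concrete, here is a rough sketch in Python of the matching described above. It is only an illustration, not how webarchive-discovery does or would implement it: the function names `query_params` and `lenient_match` and the `min_shared` threshold are made up for this example, and the list stands in for the values of the proposed multivalued field.

```python
from urllib.parse import urlsplit, parse_qsl


def query_params(url):
    """Split a URL's query string into 'key=value' strings (hypothetical helper)."""
    query = urlsplit(url).query
    return [f"{key}={value}" for key, value in parse_qsl(query, keep_blank_values=True)]


def lenient_match(requested_url, indexed_params, min_shared=1):
    """Return True if the request shares at least `min_shared` parameters
    with the values that would be stored in the multivalued field."""
    shared = set(query_params(requested_url)) & set(indexed_params)
    return len(shared) >= min_shared


# Values that would be indexed for the harvested example URL.
indexed = query_params(
    "http://test.dk/imagerenerator/test.png?image=horse&timestamp=12345678901234"
)
print(indexed)  # ['image=horse', 'timestamp=12345678901234']

# A later request with a different timestamp still matches on image=horse.
request = "http://test.dk/imagerenerator/test.png?image=horse&timestamp=11111111111111"
print(lenient_match(request, indexed))                # True
print(lenient_match(request, indexed, min_shared=2))  # False
```

Raising `min_shared` makes the match stricter; with the default of 1 any single shared parameter is enough, which is the "1 of 2" case from the example.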

It is possible to do this programmatically by running a prefix search on the URL, then going through the results and doing the matching yourself, but this is slower and requires a little more code.
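For comparison, the client-side workaround could look roughly like the Python sketch below. It assumes the prefix search has already been run and its hits are available as a plain list of URLs (`harvested_urls`); the names `split_url` and `best_lenient_candidate` are hypothetical and no real index query is shown.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Return (URL without query string, set of 'key=value' parameters)."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    params = {f"{k}={v}" for k, v in parse_qsl(parts.query, keep_blank_values=True)}
    return base, params


def best_lenient_candidate(requested_url, harvested_urls, min_shared=1):
    """Pick the harvested URL with the same base that shares the most
    parameters with the request, or None if none shares `min_shared` (>= 1)."""
    base, wanted = split_url(requested_url)
    best, best_score = None, 0
    for candidate in harvested_urls:
        candidate_base, candidate_params = split_url(candidate)
        if candidate_base != base:
            continue  # the prefix search can return longer, unrelated paths
        score = len(wanted & candidate_params)
        if score >= min_shared and score > best_score:
            best, best_score = candidate, score
    return best


# Stand-in for the hits of a prefix search on the URL without its query string.
harvested_urls = [
    "http://test.dk/imagerenerator/test.png?image=cow&timestamp=12345678900000",
    "http://test.dk/imagerenerator/test.png?image=horse&timestamp=12345678901234",
]
request = "http://test.dk/imagerenerator/test.png?image=horse&timestamp=11111111111111"
print(best_lenient_candidate(request, harvested_urls))
```

Every hit has to be fetched and parsed on the client, which is where the extra cost comes from compared with querying an indexed multivalued field directly.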

anjackson commented 6 years ago

n.b. this is not dissimilar to https://github.com/webrecorder/pywb/wiki/Fuzzy-Match-Rules

anjackson commented 6 years ago

Following OH-SOS discussion: looks like a good idea. Happy to accept a pull request along these lines as long as it's optional (opt-in) rather than hard-coded.