ukwa / ukwa-pywb

GNU General Public License v3.0

Better handling of /index.html URLs #52

Open anjackson opened 4 years ago

anjackson commented 4 years ago

We have apparent 'gaps' under OutbackCDX + pywb, as records for

https://www.webarchive.org.uk/wayback/archive/*/http://www.webarchive.org.uk/index.html

are separate from records for

https://www.webarchive.org.uk/wayback/archive/*/http://www.webarchive.org.uk/

Our users expect to see these together. This seems to be a URL canonicalisation issue with OutbackCDX, but I'm recording the issue here for now.
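
To illustrate the kind of canonicalisation meant here, this is a minimal sketch that folds a trailing index page into its parent directory URL before lookup. The function name `collapse_index_url` and the `INDEX_FILENAMES` list are hypothetical, and this is not how OutbackCDX or pywb canonicalise URLs. It also assumes that the index page and the directory URL really do serve the same content, which is not true for every site.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of filenames treated as directory index pages.
INDEX_FILENAMES = ("index.html", "index.htm")


def collapse_index_url(url: str) -> str:
    """Return url with a trailing index filename removed from its path.

    URLs whose path does not end in one of INDEX_FILENAMES are returned
    unchanged. Any query string or fragment is preserved as-is.
    """
    parts = urlsplit(url)
    # Split the path into everything before the last "/" and the filename.
    head, sep, filename = parts.path.rpartition("/")
    if sep and filename.lower() in INDEX_FILENAMES:
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(collapse_index_url("http://www.webarchive.org.uk/index.html"))
    print(collapse_index_url("http://www.webarchive.org.uk/"))
```

With a rule like this applied at both index time and query time, the two URLs above would share one key, so their captures would be listed together.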

ldbiz commented 11 months ago

This issue is a few years old. Is it still to be investigated here?

anjackson commented 11 months ago

This is related to https://github.com/ukwa/ukwa-services/issues/81 and https://github.com/ukwa/w3act/issues/614

I experimented with a solution to this for ukwa/ukwa-services#81, by adding an alias record to OutbackCDX, but I got stuck because I wasn't sure which URLs were aliases. This might be a good thing to work through with @nicolabingham?
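
For reference, a rough sketch of how candidate alias records could be generated for review. It assumes the `@alias <source-url> <target-url>` line format as I understand it from the OutbackCDX README, so that should be checked against the deployed version. The function name `index_alias_lines` is hypothetical. It treats every URL ending in `/index.html` as an alias of its parent directory, which is exactly the unverified assumption described above, so the output is a list to check by hand rather than something to load directly.

```python
INDEX_SUFFIX = "index.html"


def index_alias_lines(urls):
    """Build candidate '@alias <source> <target>' lines.

    Only URLs ending in '/index.html' produce a line; the target is the
    same URL with the filename removed. Other URLs are skipped.
    """
    lines = []
    for url in urls:
        if url.endswith("/" + INDEX_SUFFIX):
            target = url[: -len(INDEX_SUFFIX)]
            lines.append(f"@alias {url} {target}")
    return lines


if __name__ == "__main__":
    candidates = [
        "http://www.webarchive.org.uk/index.html",
        "http://www.webarchive.org.uk/about",
    ]
    for line in index_alias_lines(candidates):
        print(line)
```

Only the first candidate yields a record, because the second does not end in `/index.html`.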