ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Protocol as field (http/https etc.) #185

Open thomasegense opened 5 years ago

thomasegense commented 5 years ago

It would be really useful if there was a field for the protocol. The reason is that the following two URLs have the exact same url_norm: http://test.uk/ and https://test.uk/. url_norm is doing this correctly. But when I have to resolve a given URL in SolrWayback I use url_norm, and during playback I cannot tell the two URLs above apart. Knowing it was HTTPS etc. would solve the problem. Using the url field instead is not an option, since I rely on all the heavy lifting done by the normalization.
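To illustrate the collision, here is a minimal sketch of a scheme-dropping normaliser. This is *not* the actual webarchive-discovery canonicaliser; `urlNorm` is a hypothetical stand-in that only lower-cases the host and forces the scheme to `http`, which is enough to show why the HTTP and HTTPS variants end up with the same url_norm:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlNormSketch {
    // Hypothetical normaliser: forces the scheme to "http" and
    // lower-cases the host. The real pipeline does much more, but
    // any normaliser that rewrites the scheme loses it the same way.
    static String urlNorm(String url) {
        try {
            URI u = new URI(url);
            String path = u.getPath() == null || u.getPath().isEmpty() ? "/" : u.getPath();
            return "http://" + u.getHost().toLowerCase() + path;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String a = urlNorm("http://test.uk/");
        String b = urlNorm("https://test.uk/");
        System.out.println(a);           // http://test.uk/
        System.out.println(a.equals(b)); // true: the scheme is gone
    }
}
```

Both inputs map to the same url_norm, so nothing in that field can distinguish them afterwards.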

anjackson commented 5 years ago

I would usually just look up the scheme using the un-normalised url field. Is there a big advantage in having a separate field? i.e. look up using the normalised version and then pull the un-normalised versions to use.

Or, in other words, I don't understand:

Using the url field instead is not an option, since I rely on all the heavy lifting done by the normalization.

thomasegense commented 5 years ago

Again, this is a playback issue with a performance impact. I have to look up a given url_norm (since only url_norm is reliable), but I also need to know that the original was HTTPS. So this search would work: url_norm:"http://test.uk" AND url:"https*"
But that trailing wildcard search performs poorly. The way I handle it now is to search for url_norm:"http://test.uk" and then iterate through the search results and filter away the wrong protocol. This performs well but is not pretty.

The best way would be a single search: url_norm:"http://test.uk" AND protocol:"HTTPS"

Hope that explains.
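The workaround described above can be sketched as a client-side post-filter. This is an assumption about how SolrWayback might do it, not its actual code; the Solr documents are simplified to plain `url` strings here:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SchemeFilterSketch {
    // Hypothetical post-filter: given the stored (un-normalised) 'url'
    // values returned by a url_norm:"http://test.uk" query, keep only
    // the hits whose original scheme matches.
    static List<String> keepScheme(List<String> urls, String scheme) {
        String prefix = scheme + "://";
        return urls.stream()
                   .filter(u -> u.startsWith(prefix))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Both hits share the same url_norm, so Solr returns both;
        // the scheme has to be separated out after the fact.
        List<String> hits = List.of("http://test.uk/", "https://test.uk/");
        System.out.println(keepScheme(hits, "https")); // [https://test.uk/]
    }
}
```

A dedicated protocol field would replace this post-processing with a single indexed query, e.g. url_norm:"http://test.uk" AND protocol:"HTTPS".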

tokee commented 5 years ago

@thomasegense I think the real question is why you need to know whether it is http or https. What do you need it for?

thomasegense commented 5 years ago

Any search using the url field is not reliable and would miss other representations of that URL that a browser would resolve to the same page, which is why we have the url_norm field. The one thing it does not capture is whether the URL is HTTP or HTTPS, and this does matter: browsers will return different content for the two schemes, whereas they will not return different content for different URL encodings of the same URL.

anjackson commented 5 years ago

Hm, I'm still not getting it. I thought Heritrix assumed the HTTP, HTTPS and all wwwN? variants return the same content and only downloads one of them, relying on URL canonicalisation to route the user to the content...

Hm, actually, now I mention it I'm not sure. I think we record HTTP-to-HTTPS redirects, so we must treat them as different URLs during the crawl.

Okay, so maybe there are two different normalisations: one for "users probably mean this when they search" and another for "browsers mean this".

thomasegense commented 5 years ago

No; even Heritrix has several inconsistencies when writing the URL field. I have made a small website ('evil website'), harvested it, and seen the errors.

But the URLs I have to look up do not even always come from the index. They come from "a href" tags (or img, or script, ...) on a given webpage when I have to play back that URL. I have to resolve all URLs on the page to see if they are in the index. I can play back some websites that OpenWayback cannot, because it does not use the normalization.