iipc / openwayback

The OpenWayback Development
http://www.netpreserve.org/openwayback
Apache License 2.0

Retrieving results after search #429

Open ntorrescsuc opened 4 years ago

ntorrescsuc commented 4 years ago

We installed the latest OpenWayback version, reindexed all crawled content using CDX, and started searching. Reviewing the results table after querying for a URL, some of the results have more than one entry for a date even though only one crawl was done using Heritrix. Why? Sometimes more than one date has an asterisk (*). I looked for its meaning but couldn't find any information.

ato commented 4 years ago

One possible reason is that multiple URLs with slight variations (e.g. www vs. no-www, http vs. https, or uppercase vs. lowercase) are grouped together by URL canonicalization. It's also possible that Heritrix really did collect the same URL multiple times (check the crawl log).
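To illustrate how grouping by a canonical form can produce several entries for one date, here is a minimal Python sketch. It is not OpenWayback's actual canonicalizer; `canonical_key` and `group_captures` are hypothetical helpers, and the rules (ignore scheme, a leading "www.", and letter case) are a simplified assumption based on the variants listed above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Hypothetical, simplified key: ignores scheme, a leading 'www.', and case."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def group_captures(captures):
    """Group (timestamp, url) pairs by their simplified canonical key."""
    groups = defaultdict(list)
    for timestamp, url in captures:
        groups[canonical_key(url)].append((timestamp, url))
    return dict(groups)


# Three distinct URLs captured during one crawl (made-up example data).
captures = [
    ("20200101120000", "http://example.com/Page"),
    ("20200101120005", "https://www.example.com/page"),
    ("20200101120009", "http://WWW.EXAMPLE.COM/page"),
]
groups = group_captures(captures)
for key, entries in groups.items():
    print(key, len(entries))
```

All three captures fall under the same key, so a query for any of them would list three entries for the same day even though the crawler fetched each URL only once.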

The * means the content of the page changed on this date, as determined by comparing its SHA-1 digest with that of the previous snapshot.
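The digest comparison can be sketched roughly as follows. This is an illustration only, not OpenWayback's code: `mark_changes` is a hypothetical helper, the digests are hex-encoded for simplicity, and leaving the first capture unmarked is an assumption.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """SHA-1 digest of the captured payload, hex-encoded for this example."""
    return hashlib.sha1(content).hexdigest()


def mark_changes(captures):
    """Flag each capture whose digest differs from the previous capture's.

    captures: list of (timestamp, digest) pairs sorted by timestamp.
    Returns a list of (timestamp, changed) pairs. The first capture has
    nothing to compare against and is left unmarked (an assumption).
    """
    marked = []
    previous = None
    for timestamp, digest in captures:
        changed = previous is not None and digest != previous
        marked.append((timestamp, changed))
        previous = digest
    return marked


# Made-up example data: the page content changes on the third capture.
captures = [
    ("20200101120000", sha1_hex(b"<html>version 1</html>")),
    ("20200201120000", sha1_hex(b"<html>version 1</html>")),
    ("20200301120000", sha1_hex(b"<html>version 2</html>")),
]
marked = mark_changes(captures)
for timestamp, changed in marked:
    print(timestamp, "*" if changed else "")
```

Only the third capture is flagged, because it is the only one whose digest differs from the capture immediately before it.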