iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

Make canonicalizer be able to strip session id params even if they ar… #54

Closed vonrosen closed 8 years ago

vonrosen commented 8 years ago

…e the first params in the query string. Also add a session ID stripping test, and change IAURLCanonicalizer.java so that if the query is empty once all query-string transformations have completed, no trailing ? is appended to the URL.
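To illustrate the behavior described above, here is a minimal, self-contained sketch. It is not the code from this PR or from IAURLCanonicalizer.java: the class name `SessionIdStripSketch`, the method `stripSessionParams`, and the list of session parameter names are all assumptions made for illustration. It only shows the two points in the description: a session ID parameter is removed regardless of its position in the query, and no trailing ? is left when the query ends up empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SessionIdStripSketch {

    // Hypothetical set of session parameter names (lowercase).
    // The real canonicalizer's rules may differ.
    private static final Set<String> SESSION_PARAMS =
            new HashSet<>(Arrays.asList("jsessionid", "phpsessid", "sid"));

    // Removes session ID parameters from the query string of a URL.
    // Fragments are not handled in this sketch.
    static String stripSessionParams(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to do
        }
        String base = url.substring(0, q);
        String query = url.substring(q + 1);

        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = (eq < 0 ? pair : pair.substring(0, eq))
                    .toLowerCase(Locale.ROOT);
            // Position in the query does not matter: first, middle, or last.
            if (!SESSION_PARAMS.contains(name)) {
                kept.add(pair);
            }
        }

        // If nothing is left, return the URL without a trailing "?".
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

Under these assumptions, `stripSessionParams("http://example.com/page?jsessionid=ABC123&a=1")` returns `http://example.com/page?a=1`, and `stripSessionParams("http://example.com/page?jsessionid=ABC123")` returns `http://example.com/page` with no trailing ?.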

johnerikhalse commented 8 years ago

Will this require reindexing CDX files?

If so, I propose that this go into at least a minor release, or perhaps a major release, rather than a bugfix release.

If IAURLCanonicalizer.java is not the default for CDX-indexer, OpenWayback and CDX-Server, then this change should be fine.

kris-sigur commented 8 years ago

Yes, this should probably be deferred to 1.2 as it changes existing behavior.

Alternatively, if this is an IA-only class, perhaps it should be deprecated?

johnerikhalse commented 8 years ago

After reviewing OWB, it seems to me that the default configuration does not use IAURLCanonicalizer. The exception is CDX-Server, but since CDX-Server is not the default at the moment, I think this PR is OK for the next bugfix release.

kris-sigur commented 8 years ago

Merged, mostly on the understanding that only IA uses this. Canonicalization really needs to be better standardized.