iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0
49 stars 72 forks source link

Switch to Google Guava for public suffix API #4

Open anjackson opened 10 years ago

anjackson commented 10 years ago

While porting for #1, this happened:

One issue I noticed was that the archive-access code brings in entire heritrix-commons just for one class, which appears to be quite general purpose:

import org.archive.net.PublicSuffixes;

(indeed, there is a Google Guava class that does pretty much the same thing). This seems a little over the top, so I copied the PublicSuffixes to iipc-web-commons under the org.archive.url package, along with the corresponding unit tests and effective_tld data file.

This is rather clumsy, and given this is provided by Google Guava, there seems little point maintaining our own code (assuming theirs is kept up to date). The task is then to check that the Google one is well maintained and switch over to that instead of copying in code from elsewhere.

nlevitt commented 9 years ago

I'm not seeing what #12 has to do with this?

kris-sigur commented 9 years ago

My mistake

johnerikhalse commented 9 years ago

I had a look at Guava. I assume that it is com.google.common.net.InternetDomainName @anjackson has in mind. As far as I can see this class is not a real replacement for org.archive.net.PublicSuffixes. The latter is looking up real world suffixes from https://publicsuffix.org/ while the former is just evaluating patterns and have no knowledge if the domain names are real.

anjackson commented 9 years ago

Ah, so, I was going on the basis of this documentation that indicated that the Guava classes also just used the publicsuffix.org list. Maybe that documentation is out of date?

johnerikhalse commented 9 years ago

My fault. You are right I didn't read the code well enough.

Regarding which implementation is best maintained, both implementations uses a local copy of the list from publicsuffix.org (PSL). Webarchive-commons' list was last updated sometime late in 2013, while Guava master branch was updated August 20, 2014. In both cases freshness is dependent on the release frequency of the library and that we always depend on the latest version.

For freshness, I think moving to Guava is good. But if we are using other part of Guava and that part gets API-changes, then we must update our code just to get the updated PSL. This is probably not a big problem assuming Guava is concerned about backward compatibility.

johnerikhalse commented 9 years ago

I've found that Heritrix is already using com.google.common.net.InternetDomainName. I think we should do the move as well.