iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0
49 stars 72 forks source link

Replace the org.json dependency by openjson library #84

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

Replace the org.json dependency by openjson to

ato commented 4 years ago

I might be missing it but I don't see a claim in either article that the JSON license is actually incompatible with the Apache license, merely that it doesn't meet the FSF and OSI definitions of free or open source and therefore is not allowed to be used by official Apache Software Foundation projects and is treated as non-free by Debian. Although there's obviously an expectation that webarchive-commons is open source and therefore ideally all its dependencies are as well.

This patch will have some level of impact in dependent applications such as Heritrix and OpenWayback. At very least both rely on the json.org library being pulled in as a transitive dependency. I'm not sure yet whether it will cause an API incompatibility, most of the usages in webarchive-commons seem internal.

The danger is that this forces Heritrix to change to openjson and that in turn breaks everyone's custom modules. It's a pity they used a different package name as otherwise it might actually be drop in bytecode compatible. It appears openjson was forked from com.tdunning/json which actually does use the org.json package name:

https://mvnrepository.com/artifact/com.tdunning/json/1.8

We could try using that instead if this change does lead to an API breakage headache.

@ldko, @nlevitt, @anjackson do you have any thoughts?

sebastian-nagel commented 4 years ago

Hi @ato, the option to rely on Ted Dunning's package has been discussed in commoncrawl/ia-web-commons#17 and tdunning/open-json#13. It was the initial trial but turned out to be significantly slower.

nlevitt commented 4 years ago

https://en.wikipedia.org/wiki/Douglas_Crockford#.22Good.2C_not_Evil.22

Affected open source developers have asked Crockford to change the license,[16][17][18] but he has generally refused to do so.[19 ] He has, however, granted "IBM, its customers, partners, and minions" permission "to use JSLint for evil", a solution which appeared to satisfy IBM's lawyers.[20]

Maybe we can also get a special allowance to do evil? :)

I added json-simple as a dependency to heritrix-contrib at one point: https://github.com/internetarchive/heritrix3/pull/242 I don't remember why and apparently didn't it write down. :(

My feeling is pretty much that the license thing is a non-issue. Nobody's suing anybody over this. I haven't heard of anybody being prevented from using heritrix it. The only way I can imagine it becoming an issue is if people (i.e. us) start making noise about it.

If this is really about performance, do we have any numbers from benchmarking?

sebastian-nagel commented 4 years ago

the license thing is a non-issue

One point is that Apache projects (hosted by the ASF) must not depend on webarchive-commons resp. need to fix the license issue by managing transitive dependencies and replacing org.json by a bytecode compatible package.

That's possible and was one of the trials in commoncrawl/ia-web-commons#16. @ato, I understand that the potential issues in downstream projects (namely heritrix3) can be blocker.

If this is really about performance, do we have any numbers from benchmarking?

See the profiles in commoncrawl/ia-web-commons#16 - the JSON serialization is almost twice as fast, the entire WARC -> WAT conversion becomes 6-7% faster.

tfmorris commented 5 months ago

Since the original discussion, Doug Crockford has dedicated the JSON library to the Public Domain, which should make it acceptable to most open source organizations. https://github.com/stleary/JSON-java/pull/688