joshua-decoder / joshua

Joshua Statistical Machine Translation Toolkit
http://joshua-decoder.org/

Any interest in bringing the project to Apache? #204

Closed chrismattmann closed 8 years ago

chrismattmann commented 9 years ago

Hi @mjpost and others of the Joshua community. Is there any interest in the project coming to the Apache Software Foundation? I brought this up offlist to Matt and there was some interest, but I never followed up so thought I would do so publicly and transparently here.

Apache has a guide for new projects: http://incubator.apache.org/guides/proposal.html

I would be very happy to champion this project in the ASF if there is interest.

mjpost commented 9 years ago

Hi @chrismattmann, I think it would be hard to break away from JHU at this point, but I wouldn't say that it's an impossibility. The costs seem clear to me (loss of control); can you help us understand the benefits (and perhaps present a more complete picture of the costs as well)?

CC: @callison-burch

chrismattmann commented 9 years ago

No problem at all, @mjpost. The reality is you wouldn't really lose control: take a look at the Apache ICLA (you license your contributions to the ASF). Moreover, if you look at how ASF project management committees (PMCs) work, the ASF is really a home for independent, separately managed PMCs.

PMCs are autonomous entities that share a belief in open source that we call the "Apache Way". These are a loose set of principles that keep us together:

  1. release software under the permissive ALv2 license.
  2. ensure diversity on the PMC which helps sustainability and resiliency against the loss of contributors, funding, etc., and helps ensure the project goes on.
  3. encourage addition of new contributors over time
  4. encourage releases over time
  5. leverage ASF trademarks, legal, and PR/marketing support, and support from the 4000+ contributors there

Many science projects are looking at the ASF, especially in the age of dwindling grants, etc. Have a look at http://ctakes.apache.org/ and http://airavata.apache.org, which came out of the DHHS (SHARP initiative) and NSF (XSEDE), respectively. http://oodt.apache.org/ is another example from NASA, and there are more coming, like OCW (http://climate.apache.org) and CMDA (Climate Model Diagnostic Analyzer).

Here are some refs:

  http://www.apache.org/foundation/how-it-works.html
  http://www.apache.org/dev/pmc.html
  http://www.apache.org/dev/new-committers-guide.html
  http://community.apache.org

I'd also be happy to help.

chrismattmann commented 9 years ago

cc @lewismc

lewismc commented 9 years ago

I am very interested in this, folks.

callison-burch commented 9 years ago

This sounds good to me too.

lewismc commented 9 years ago

@mjpost, in addition to @chrismattmann's comment: you mentioned costs. In terms of the financial aspect of infrastructure (website, CI, CMS, SCM, mailing lists, etc.), that is all facilitated by the foundation. Another advantage of joining the ASF is that Joshua would most likely have more cross-community collaboration with other machine learning folks over in Apache Mahout, Apache Spark, cTAKES, etc., as mentioned by @chrismattmann.

Oh by the way, did I also mention a small insignificant project called Apache Tika? ;) I think it is fair to say that Chris and I would both very much like to see Joshua come to the ASF and grow. The Hadoop aspect of the Joshua codebase would undoubtedly improve pretty radically as well once Joshua starts releasing and announcing.

lewismc commented 9 years ago

BTW, regarding "Oh by the way, did I also mention a small insignificant project called Apache Tika? ;)": this is a joke. We would love to have better integration with Joshua over in Apache Tika. Tika is a very well-used library with an excellent, dynamic, bustling community. Joshua would certainly benefit from better engagement.

mjpost commented 9 years ago

Okay, this seems pretty appealing. I have a licensing question, though. Joshua contains an LGPL'd library for handling language models (KenLM). There is an alternative (BerkeleyLM), but it is not actively maintained any more and is not quite as good as KenLM in a few key respects. A quick glance at the incubator page suggests that this dependency would keep the project from becoming a full-fledged one. Can you comment on this?

callison-burch commented 9 years ago

We could also ask Kenneth if he would consider offering it with an Apache license.

lewismc commented 9 years ago

@Chris, this is a very important suggestion. An initial path which I've pursued is to ask the entire Apache Incubator community for an alternative to the library Joshua currently consumes [0]. Licensing over time can and does become an issue, as you guys know [1]. I would like to mention that although the library is an issue, this is not a blocker at all for building a stronger community around Joshua. Let's see how the Incubator thread goes.

[0] http://www.mail-archive.com/general%40incubator.apache.org/msg49043.html
[1] http://www.apache.org/licenses/GPL-compatibility.html

mjpost commented 9 years ago

Any thoughts on this, @kpu?

chrismattmann commented 9 years ago

Hey guys, yeah, it's not a total blocker. We've dealt with similar issues, e.g., with Apache OpenOffice, which had a strict dependency on LGPL dictionaries and so forth, and the Apache legal committee granted an exception (not broad, but in that particular situation). We could ask for a similar exception during incubation and get it sorted out. If @kpu is willing to relicense, of course that would be awesome. In addition, the dependency is a runtime dependency, not a static binding one, right, @mjpost?

lewismc commented 9 years ago

ACK

mjpost commented 9 years ago

KenLM actually contains several components that are important to the Joshua tool chain. We use the lmplz binary (which itself has a Boost dependency) to build language models during training, the build_binary component to compile the resulting ARPA-formatted text file to a packed trie, and then a library ($JOSHUA/lib/libken.so) to efficiently query that file via a JNI bridge. There are alternatives to all of these, but they are not as good.

But to answer your question, yes, the most important of these (the library) is a dynamic dependency.
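For readers unfamiliar with the KenLM steps described above, here is a minimal sketch of the two standalone executables as command lines. It only builds the argument lists and runs nothing. The file names, the default n-gram order, and the `kenlm_build_commands` helper are assumptions made for illustration; they are not taken from Joshua's pipeline, and the JNI query step through libken.so is not shown.

```python
from pathlib import Path


def kenlm_build_commands(corpus, order=5, work_dir="lm"):
    """Return the two KenLM command lines as argv lists (nothing is executed).

    The paths and the default order are illustrative assumptions.
    """
    work = Path(work_dir)
    arpa = work / "lm.arpa"      # ARPA-formatted text model (hypothetical name)
    binary = work / "lm.kenlm"   # compiled model file (hypothetical name)

    # Step 1: lmplz estimates an n-gram model and writes it as ARPA text.
    estimate = ["lmplz", "-o", str(order),
                "--text", str(corpus), "--arpa", str(arpa)]

    # Step 2: build_binary compiles the ARPA file; "trie" selects the
    # trie data structure rather than the default probing hash table.
    compile_cmd = ["build_binary", "trie", str(arpa), str(binary)]

    return [estimate, compile_cmd]


if __name__ == "__main__":
    for cmd in kenlm_build_commands("corpus.en"):
        print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` on a machine where the user has installed KenLM separately.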

chrismattmann commented 9 years ago

Thanks @mjpost that helps to answer it. Because it's a runtime dependency we can likely get an exception to this and deal with it via intelligent packaging and so forth. This isn't a blocker at all.

kpu commented 9 years ago

For purposes of Joshua, it's standalone executables that could just be documented with a pointer, a shared library via JNI, and a bit of Java-side wrapper code. The Java code can just be considered part of Joshua.

The question is whether it's up to me to relicense. I do not have contributor license agreements in place. There are several contributors with varying employers who may or may not have a work-for-hire claim. My guess is most do not care what open-source license is used, but they would likely need to be contacted.

Also, don't forget your other LGPL dependency: https://twitter.com/joshuadecoder/status/563072947586613248 "@mosessmt Also we make heavy use of your pipeline, keep the improvements coming on that :)"

mjpost commented 9 years ago

The Moses dependency is a good point — we have occasionally borrowed scripts and other miscellaneous tools, and currently rely on Moses for a portion of the phrase-based model building. However, removing that piece entirely wouldn't be too much work.

There are other important tools, however, such as word aligners (GIZA++ and the Berkeley aligner), which are both GPL licensed. I have packaged a lot of this with Joshua in order to remove external dependencies, and try to make it easier for people to build models from start to finish. Airing all of this would require a close look at the whole pipeline. Much of it could be replaced if there were more hands on deck, or also just left to the user to install.

chrismattmann commented 9 years ago

yep @mjpost - leaving things to the user to install and not packaging directly with Joshua but making intelligent packaging tools is a common practice in a lot of these situations and we could employ them here. @kpu thanks for the quick reply. If you are able to relicense, something on the permissive end ALv2, BSD and/or MIT would be much appreciated. see: http://www.apache.org/legal/resolved.html#category-a
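One way to picture the "leave it to the user to install" approach is a small check that reports which external tools are missing instead of bundling them. This is only a sketch under stated assumptions: the tool names in the example list and the `missing_tools` helper are hypothetical and do not describe how Joshua actually packages or detects its dependencies.

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the names in `required` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    # Hypothetical list of separately installed executables.
    external = ["lmplz", "build_binary", "GIZA++"]
    for name in missing_tools(external):
        print(f"{name} not found on PATH; please install it separately")
```

Keeping the check separate from the distribution means the packaged project ships none of the externally licensed code and only points the user at what to install.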

lewismc commented 9 years ago

ACK

kpu commented 9 years ago

Moving the JNI code over to kenlm (and preferably cleaning it up so it's generically useful, not just for Joshua) would solve most of this.
The following places might have to agree to a license change: CMU, Edinburgh, Adam Mickiewicz University, Stanford, Bloomberg, NAIST, UIUC, Yandex, and SDL.

chrismattmann commented 9 years ago

Thanks @kpu - not a blocker at all, and this can be dealt with during incubation. @mjpost, let me know what you think about moving forward. Maybe we can wait a few days, get some more feedback, and then decide whether to proceed.

mjpost commented 9 years ago

I'm considering it positively. Can it wait till early August (i.e., six weeks)? I am involved with a summer workshop which is consuming most of my time at the moment (including keeping me from reading thoroughly through your docs).

I suppose this move would also involve changing hosting? Does Apache support git? Skimming, I only saw notes about using SVN....

(I can likely answer these questions myself and will have more time to do so a week from now, if you want to let them lie)

chrismattmann commented 9 years ago

Hey @mjpost, thanks. Sure, we can revisit in early August, that's totally fine, no rush. Apache does support Git, and it even has writeable Git repositories that mirror out to GitHub. So Joshua would move to Apache writeable Git at something like https://git-wip-us.apache.org/repos/asf/joshua.git (for a working link, see: https://git-wip-us.apache.org/repos/asf/tajo.git ) and then could be mirrored to GitHub at http://github.com/apache/joshua.

Talk soon. Thanks!

mjpost commented 9 years ago

Okay, great, I've made a note to come back to this then.

lewismc commented 9 years ago

Sounds good @mjpost

chrismattmann commented 9 years ago

Should this be left open? It's not really closed yet, @mjpost.

lewismc commented 9 years ago

Chris, the thread on general@ started off well, then diverged at an incredible rate. That being said, the thread up until just after your reply provides us with valuable commentary from the general@ community to progress with the dependency issues. I'll make an effort to set the options out below (it will not be an exhaustive list; there is more than one way to skin a cat).

On Wednesday, July 1, 2015, Matt Post notifications@github.com wrote:

Reopened #204 https://github.com/joshua-decoder/joshua/issues/204.

chrismattmann commented 9 years ago

hi @mjpost ready to pick this back up?

lewismc commented 8 years ago

@chrismattmann close this off ;)

chrismattmann commented 8 years ago

Yep, we can close this. Joshua is now an Apache Incubator podling! :) https://issues.apache.org/jira/browse/INFRA-11264

lewismc commented 8 years ago

Dynamite

On Saturday, February 13, 2016, Chris Mattmann notifications@github.com wrote:

Closed #204 https://github.com/joshua-decoder/joshua/issues/204.
