laito / cleartk

Automatically exported from code.google.com/p/cleartk

restructure ClearTK-toolkit into smaller sub-projects by dependencies and/or functionality #160

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Philip:

 I would like to propose a pretty significant change to the way our code is organized.  Here are my motivations:

- simplify dependencies so that they are more localized to the ClearTK code 
that is dependent on them
- simplify license compliance with the various libraries we are using
- make the code more modular so that it is straightforward to use only what is 
needed

In yesterday's email I gave a quick summary of how I would like to see the 
framework project split up (see below).  I think this will go a long way 
towards easing people's fears of open source license entanglements as well as 
improving our ability to communicate what the implications are of using the 
various wrappers.  This change seems like a slam dunk to me - so I won't spend 
time justifying why we need to do this unless objections are raised.

Similarly, I would like to see the ClearTK toolkit project split up into 
meaningful subprojects.  I have a use case where I would like to use just the 
tokenizer and part-of-speech tagger in ClearTK.  Should I really have to import 
all of ClearTK along with the mountain of dependencies to do this?  No.  Here's 
a sketch of how the projects might be split up:

- ClearTK-tokens
  - tokenization
  - stemming
  - part-of-speech tagging
  - sentence segmentation

- ClearTK-corpora
  - all the code we've written for parsing various formats

- ClearTK-semantic-roles

- ClearTK-syntax
   - type system
   - feature extractors

- ClearTK-temporal

- ClearTK-util
   - collection readers

I think some of the code that is currently in the toolkit project should not 
survive the restructuring.  The code in the ne package is crap and barely does 
anything, for example.  All of the OpenNLP wrappers need to disappear or go 
into a separate project(s).  Certainly, we could write our own sentence 
segmenter.  The parser wrapper could be put in a subproject that depends on 
ClearTK-syntax.  I have a wrapper for the Berkeley Parser - but it is GPL and 
needs to be a separate project that also depends on ClearTK-syntax.  Jinho has 
a dependency parser that will almost certainly get a ClearTK/UIMA wrapper - if 
not be refactored to actually use ClearTK.  So, ClearTK-dependency-parsing 
would be a logical subproject.

So, there are a lot of details I've glossed over.  But does the rough sketch 
make sense? 

Philipp:

We talked about this before, so you know that I basically agree with you. I 
think splitting up the framework by dependencies (especially on the ML 
libraries) is the obvious thing to do. I'm not entirely sure how the toolkit 
should be split up. On the one hand, I think we also want to isolate major 
dependencies from each other where we can, but it also makes sense to group 
things by task. I'm not entirely convinced by your proposed split below; for 
example, it would seem that the CoNLL'05 collection reader should then go into 
"corpora", but it could just as well go into "semantic roles", because it's so 
tightly linked to that task. And if we group collection readers in "corpora", 
why is there another set of collection readers in "util"?

As an alternative / modification, how about a "collection readers" package, 
which contains only the generic collection readers (like FileSystem), which are 
general purpose and don't depend on any specific types; separate "tokens", 
"semantic roles", "syntax", "temporal", ... packages for the separate tasks, 
each of which may contain type system additions, and collection readers that 
are either specific to the task (e.g. CoNLL'05), or they require the task's type 
system to function (e.g. Treebank in "syntax"), but these packages only contain 
ClearTK implementations; and then packages for each of the major dependencies 
(such as OpenNLP), which may contain components for multiple tasks (and thus 
may depend on the task-specific ClearTK packages).
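
As a rough illustration of that last point, a dependency-specific module could declare the task modules it builds on in its pom. This is only a sketch, assuming a Maven build (later comments in this thread refer to poms): the artifact names `cleartk-opennlp`, `cleartk-tokens` and `cleartk-syntax`, the `org.cleartk` groupId, and the version are hypothetical, not existing projects.

```xml
<!-- Hypothetical pom for a wrapper module; every name here is an assumption. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.cleartk</groupId>
  <artifactId>cleartk-opennlp</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <!-- Task-specific ClearTK modules whose types the wrappers would need -->
    <dependency>
      <groupId>org.cleartk</groupId>
      <artifactId>cleartk-tokens</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.cleartk</groupId>
      <artifactId>cleartk-syntax</artifactId>
      <version>${project.version}</version>
    </dependency>
    <!-- The third-party library itself would be declared here, and only here,
         so its license stays out of the task modules. Coordinates omitted. -->
  </dependencies>
</project>
```

With this layout the task modules never reference the third-party library, so someone who only needs "tokens" would not pull it in.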

Philip:
There are two ways that dependencies can be a pain - license incompatibilities 
and version incompatibilities.  If I download a package that is dependent on 
the universe and all of its dependencies are out-of-date with ones I have 
already added to my package - this is really annoying.  But this is less of a 
show-stopper than having licensing issues with dependencies.  I want it to be 
easy for people to download and use ClearTK without worrying about dependencies 
with other licenses.  Code that depends on an LGPL library, for example, will 
have to be moved into its own library.  However, if we simply structure 
sub-projects based on dependencies we will end up with something rather 
unintuitive and difficult to navigate I think.  So, there will be some art to 
how we restructure everything and we will just have to discuss it when we get 
started on it.  

Original issue reported on code.google.com by pvogren@gmail.com on 9 Sep 2010 at 3:18

GoogleCodeExporter commented 9 years ago
I have made good progress on this issue this week.  I have a first pass of the 
toolkit project reorg complete now.  All the projects are in place, the code is 
compiling, and the tests are passing.  I still have a laundry list of things I 
need to take care of before I merge this branch back into trunk.  So, now is a 
good time to take a look at this branch and give me some feedback.  There will 
still be opportunities to rearrange things after it's moved back to trunk too.  
I would like to do the merge and create a new cleartk release next week.  

Original comment by pvogren@gmail.com on 30 Dec 2010 at 9:00

GoogleCodeExporter commented 9 years ago
Seems to be some problems still. Loading into a new workspace, I get some 
errors like:

Missing artifact edu.umass.cs.mallet:grmm-mod:jar:0.1.3:compile
Project 'cleartk-util' is missing required source folder: 
'target/generated-sources/jcasgen'

I think the former may be an issue of not listing the cleartk repository? The 
latter goes away if I add the specified folder, but probably we should do 
something so it works out of the box.

Aside from that, everything looks basically okay to me.

One request though: could you change cleartk-temporal to cleartk-timeml? That 
package now includes stuff for both events and temporal relations so "temporal" 
is kind of a misnomer. I plan to reorganize those classes a bit when you're 
done with the reorg - all the Event* classes should be in a package 
org.cleartk.timeml.event, and all the VerbClause* classes should be in a 
package org.cleartk.timeml.tlink.

Original comment by steven.b...@gmail.com on 31 Dec 2010 at 12:06

GoogleCodeExporter commented 9 years ago
I'm not sure how to make the "missing artifact" problem go away.  Here is 
another way to get it to compile that may be helpful (from Marshall Schor):

Using m2eclipse Eclipse plugin for maven - by default it will "miss" these
generated directories the first time you import a project as a Maven project. 
However, the recovery is simple, and only needs doing once: right click the
project and select Maven -> update project configuration.

It would be nice if one didn't have to do it once - this is a pretty tricky 
detail that is easy to miss.  

I can change the project name.

Also, it just occurred to me that it might make sense to merge 
cleartk-test-util and cleartk-util.  I can't think of any good reason why we 
should maintain them separately.  I think every other project depends on both.  

Original comment by pvogren@gmail.com on 31 Dec 2010 at 4:14

GoogleCodeExporter commented 9 years ago
Update project configuration doesn't help. Again, I think the problem is that 
the pom for cleartk-ml-grmm is missing the cleartk repo declaration.
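
For reference, a repository declaration in a pom generally looks like the fragment below. This is a sketch only: the `id` and `url` are placeholders, not the actual ClearTK repository location, which would have to be filled in.

```xml
<!-- Goes inside the <project> element of the cleartk-ml-grmm pom.
     The id and url below are placeholders, not the real repository. -->
<repositories>
  <repository>
    <id>cleartk-repo</id>
    <url>http://repository.example.org/cleartk</url>
  </repository>
</repositories>
```

Without such a declaration (here or in a parent pom), Maven only searches its default repositories, which would explain a "Missing artifact" error for a jar hosted elsewhere.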

One reason to keep cleartk-test-util separate would be that as it is, we can 
declare it as a "test" dependency. If we merge the two, then we have to declare 
it as a "compile" dependency, and since it depends on "junit", we'll have to 
pull in "junit" as a "compile" dependency too.
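
To make the scope point concrete, here is roughly how a sibling project could declare the dependency today. The `org.cleartk` groupId and the use of `${project.version}` are assumptions for illustration; the real coordinates may differ.

```xml
<!-- Sketch: cleartk-test-util kept separate and declared with test scope.
     groupId and version are assumed, not taken from the actual poms. -->
<dependency>
  <groupId>org.cleartk</groupId>
  <artifactId>cleartk-test-util</artifactId>
  <version>${project.version}</version>
  <scope>test</scope>
</dependency>
```

A test-scoped dependency is only on the test classpath and is not passed on transitively, so junit does not leak to anyone who depends on our projects. A merged cleartk-util would have to be declared without the `<scope>test</scope>` line, and junit would come along with it.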

Original comment by steven.b...@gmail.com on 31 Dec 2010 at 5:00

GoogleCodeExporter commented 9 years ago
Ok.  The merge is complete.  All the code is compiling without warnings and the 
tests are running both from maven on the command line and from within eclipse.  
I renamed cleartk-temporal to cleartk-timeml but otherwise didn't make any 
additional changes to that project.  I didn't make time to consider your point 
about merging cleartk-test-util with cleartk-util.  More next week, er next 
year.  

Original comment by pvogren@gmail.com on 31 Dec 2010 at 9:19

GoogleCodeExporter commented 9 years ago
I am going to close this issue.  The dangling issue with possibly merging 
cleartk-test-util and cleartk-util can be raised in another issue if we decide 
this is a useful thing to do (and it seems that it is not.)  Anything else 
related to the restructuring of the former project ClearTK-toolkit can be 
addressed in a separate issue. 

Original comment by pvogren@gmail.com on 5 Jan 2011 at 6:31

GoogleCodeExporter commented 9 years ago
Issue 54 has been merged into this issue.

Original comment by pvogren@gmail.com on 5 Jan 2011 at 6:34

GoogleCodeExporter commented 9 years ago

Original comment by pvogren@gmail.com on 14 Jan 2011 at 10:33