noushadali / cleartk

Automatically exported from code.google.com/p/cleartk

restructure ClearTK modules to isolate dependencies better #264

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
So after working with the new release structure, I have some thoughts about how 
we could improve the module structure of ClearTK:

(1) When providing a wrapper to some library, we should make a single module 
named for that library. E.g. cleartk-clearparser, instead of splitting the 
ClearParser wrapper between cleartk-token and cleartk-syntax-dependency-clear. 
Having it spread over many modules makes releasing more difficult because any 
time you upgrade that dependency you then have to release all the modules that 
have that dependency *and* all modules that depend on the modules with the 
dependency. So for example, when clearparser 0.4.0 is released in, say, 
January, we'll have to release pretty much a whole new ClearTK because 
cleartk-token currently depends on clearparser and almost everything depends on 
cleartk-token.
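
To make the release cascade concrete, here is a rough Python sketch that 
computes which modules would need a new release when one dependency is 
upgraded. The dependency graph is an assumption for illustration: only the 
cleartk-token -> clearparser edge is stated above, and the other edges just 
stand in for "almost everything depends on cleartk-token". The names 
DEPENDS_ON and modules_to_release are hypothetical.

```python
from collections import deque

# Hypothetical graph: module -> the things it depends on.
# Only "cleartk-token depends on clearparser" comes from this issue;
# every other edge is an illustrative assumption.
DEPENDS_ON = {
    "cleartk-util": set(),
    "cleartk-token": {"clearparser", "cleartk-util"},
    "cleartk-syntax-dependency-clear": {"clearparser", "cleartk-token"},
    "cleartk-named-entity": {"cleartk-token"},
    "cleartk-timeml": {"cleartk-token", "cleartk-named-entity"},
}


def modules_to_release(upgraded, depends_on):
    """Return every module that directly or transitively depends on `upgraded`."""
    # Invert the graph so we can walk from a dependency to its dependents.
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk over the inverted graph.
    affected = set()
    queue = deque([upgraded])
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in affected:
                affected.add(module)
                queue.append(module)
    return affected


if __name__ == "__main__":
    for name in sorted(modules_to_release("clearparser", DEPENDS_ON)):
        print(name)
```

With a single cleartk-clearparser module holding the wrapper, the same walk 
would reach only that module, which is the point of the proposal.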

(2) We should separate out the type system from our type-specific annotators. 
Note that almost everything depends on mallet (and all its dependencies) and 
lucene-snowball because cleartk-token depends on them. Yet for something like 
cleartk-stanford-corenlp, this is totally unnecessary - all 
cleartk-stanford-corenlp needs is the type system, not all the annotators in 
there. The same is true for most of the wrapper modules - they have a bunch of 
extra dependencies because they depend on a module that has both machine 
learning and a type system together, while all they really want is the type 
system.
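
As a rough illustration of the difference, the Python sketch below computes 
the transitive dependencies of cleartk-stanford-corenlp under a simplified 
"current" layout and under the proposed one. The graphs are assumptions: 
the artifact name stanford-corenlp and the exact edges are made up for the 
example, and mallet's own dependencies are left out. CURRENT_LAYOUT, 
PROPOSED_LAYOUT and transitive_dependencies are hypothetical names.

```python
def transitive_dependencies(module, depends_on):
    """Return everything `module` pulls in, directly or indirectly."""
    seen = set()
    stack = [module]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Simplified, assumed picture of the layout described above.
CURRENT_LAYOUT = {
    "cleartk-stanford-corenlp": {"stanford-corenlp", "cleartk-token"},
    "cleartk-token": {"mallet", "lucene-snowball", "clearparser"},
}

# Assumed picture after extracting the type system into its own module.
PROPOSED_LAYOUT = {
    "cleartk-stanford-corenlp": {"stanford-corenlp", "cleartk-typesystem"},
    "cleartk-typesystem": set(),
}

if __name__ == "__main__":
    for label, layout in (("current", CURRENT_LAYOUT), ("proposed", PROPOSED_LAYOUT)):
        deps = transitive_dependencies("cleartk-stanford-corenlp", layout)
        print(label, sorted(deps))
```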

So here's a concrete proposal for a new module organization:

// basic util packages
cleartk-test-util - ClearTK testing infrastructure (same as now)
cleartk-util - ClearTK basic readers, annotators and utility code (same as now)

// machine learning packages, basically the same as now
cleartk-ml - ClearTK machine learning APIs (same as now, but with 
cleartk-chunker merged in)
cleartk-eval - ClearTK machine learning evaluation APIs (same as now)
cleartk-ml-libsvm - ClearTK wrapper for libsvm (same as now)
cleartk-ml-mallet - ClearTK wrapper for mallet (same as now, but with 
cleartk-ml-grmm merged in)
cleartk-ml-opennlp-maxent - ClearTK wrapper for opennlp-maxent (same as now)
cleartk-ml-svmlight - ClearTK wrapper for svmlight (same as now)
cleartk-ml-tksvmlight - ClearTK wrapper for tree kernel svmlight (same as now)

// type system and type utilities
cleartk-typesystem - Full ClearTK type system, extracted from cleartk-token, 
cleartk-syntax, etc. (the only Java code in this package would be utility code 
for operating on the type system)

// wrapper annotators that should only depend on the type system, not ML modules
cleartk-berkeleyparser - ClearTK wrapper for the Berkeley parser (a.k.a. 
cleartk-syntax-berkeley)
cleartk-clearparser - ClearTK wrapper for ClearParser and ClearParser tokenizer 
(a.k.a. cleartk-syntax-dependency-clear and part of cleartk-token)
cleartk-maltparser - ClearTK wrapper for MaltParser (a.k.a. 
cleartk-syntax-dependency-malt)
cleartk-opennlp-tools - ClearTK wrapper for OpenNLP tools (a.k.a. 
cleartk-syntax-opennlp)
cleartk-snowball - ClearTK wrapper for the Snowball stemmer (was part of 
cleartk-token)
cleartk-stanford-corenlp - ClearTK wrapper for Stanford CoreNLP (same as now)

// home-grown annotators that depend on the type system and may depend on 
various ML modules
cleartk-token - ClearTK annotators for sentence segmentation and tokenization 
(no wrappers, only home-grown)
cleartk-named-entity - ClearTK annotators for named entity recognition (no 
wrappers, only home-grown)
cleartk-syntax - ClearTK annotators for syntactic parsing (no wrappers, only 
home-grown, with cleartk-syntax-dependency merged in)
cleartk-semantic-roles - ClearTK annotators for semantic role labeling (no 
wrappers, only home-grown)
cleartk-timeml - ClearTK annotators for temporal information extraction (same 
as now)
cleartk-examples - ClearTK example annotators (same as now)
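
The rule that wrapper modules should not depend on ML modules could be 
checked mechanically. Below is a small Python sketch of such a check. The 
two module sets are copied from the list above; the declared dependencies 
passed in at the bottom are invented purely to exercise the check, and the 
function name layering_violations is hypothetical. It only looks at direct 
dependencies, not transitive ones.

```python
# Module groups taken from the proposal above.
ML_MODULES = {
    "cleartk-ml",
    "cleartk-eval",
    "cleartk-ml-libsvm",
    "cleartk-ml-mallet",
    "cleartk-ml-opennlp-maxent",
    "cleartk-ml-svmlight",
    "cleartk-ml-tksvmlight",
}
WRAPPER_MODULES = {
    "cleartk-berkeleyparser",
    "cleartk-clearparser",
    "cleartk-maltparser",
    "cleartk-opennlp-tools",
    "cleartk-snowball",
    "cleartk-stanford-corenlp",
}


def layering_violations(declared):
    """Return sorted (wrapper, ml_module) pairs where a wrapper module
    directly depends on an ML module."""
    return sorted(
        (module, dep)
        for module, deps in declared.items()
        if module in WRAPPER_MODULES
        for dep in deps
        if dep in ML_MODULES
    )


if __name__ == "__main__":
    # Invented dependency declarations, for illustration only.
    example = {
        "cleartk-snowball": {"cleartk-typesystem", "cleartk-ml-mallet"},
        "cleartk-maltparser": {"cleartk-typesystem"},
        "cleartk-token": {"cleartk-typesystem", "cleartk-ml-mallet"},
    }
    for wrapper, ml_module in layering_violations(example):
        print(wrapper, "should not depend on", ml_module)
```

Home-grown modules such as cleartk-token are allowed to depend on ML 
modules, so they are deliberately not flagged.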

Original issue reported on code.google.com by steven.b...@gmail.com on 24 Oct 2011 at 2:32

GoogleCodeExporter commented 9 years ago
Issue 257 has been merged into this issue.

Original comment by steven.b...@gmail.com on 24 Oct 2011 at 2:33

GoogleCodeExporter commented 9 years ago
Issue 258 has been merged into this issue.

Original comment by steven.b...@gmail.com on 24 Oct 2011 at 2:33

GoogleCodeExporter commented 9 years ago
This is nearly complete. I have added a wiki page, Modules, that describes the 
current organization and what's in each module.

There is one outstanding issue: ParentheticalAnnotator and its UIMA type 
system Parenthetical.xml are in cleartk-util. Parenthetical.xml should be moved 
to cleartk-type-system, but then I don't know where to move 
ParentheticalAnnotator. I see the following options:

* A new module, say, cleartk-parentheses
* cleartk-syntax, since parentheticals are a type of syntactic constituent
* cleartk-token, since it's already a grab-bag of sentence segmenters, 
tokenizers and part of speech taggers

I guess I lean towards the last option there, but I don't feel strongly about 
it. (I only feel strongly that we shouldn't have cleartk-util depend on 
cleartk-type-system.)

Original comment by steven.b...@gmail.com on 14 Nov 2011 at 1:35

GoogleCodeExporter commented 9 years ago
Moved ParentheticalAnnotator into cleartk-token in r3433.

Original comment by steven.b...@gmail.com on 27 Nov 2011 at 3:03