consider moving evaluation out of cleartk-ml into cleartk-ml-eval

noushadali / cleartk

Automatically exported from code.google.com/p/cleartk

0 stars 0 forks source link

consider moving evaluation out of cleartk-ml into cleartk-ml-eval #228

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

So I've been playing around with the evaluation package, and it feels pretty 
clunky to me in a number of ways. I don't like the idea of labeling it as 0.9.9 
(which is what it will get if it stays in cleartk-ml) because I think that may 
give the impression that the API are pretty much the way we want them. Could we 
move this out into a new module, say cleartk-ml-eval, and label it with 0.5.0 
or 0.7.0 or something a little less committal than 0.9.9?

Some things that I think are clunky:

* I don't like that EngineFactory forces you to use Descriptions. In many 
cases, it may be possible to use a single AnalysisEngine for both data writing 
and classifying by simply flipping the isTraining flag (e.g. via a few 
setConfigurationParameterValue calls and a reconfigure call). But to support 
this, the EngineFactory methods would need to return AnalysisEngines instead of 
AnalysisEngineDescriptions, so that you could have a single instance and just 
reconfigure it when the interface methods were called.

* There aren't any default implementations of anything, so it makes it hard to 
be confident we have the abstraction right. Some things I think we should 
implement before we declare the APIs satisfactory:
  * A FilesCorpusFactory that takes a list of training files and a list of testing files (to be read in as plain text using a FilesCollectionReader) and fills in all the CorpusFactory methods using this information
  * An AnnotationSpanEvaluationFactory that takes an annotation type, and compares the spans of the gold annotations of this type to the model annotations using precision, recall and F-measure. And probably similarly for an AnnotationAttributeEvaluationFactory using accuracy.

Don't get me wrong - I don't think the evaluation package is horribly broken or 
anything. I just think we should label it as something less than a 0.9.9 
release until we've pushed its boundaries a bit.

Original issue reported on code.google.com by steven.b...@gmail.com on 7 Feb 2011 at 9:14

GoogleCodeExporter commented 9 years ago

It has occurred to me a number of times recently that the evaluation code 
should sit in another module/project.  It seems that most evaluation code we 
create could be re-used for inter-annotator agreement and in that context there 
probably isn't any ML involved.  So, I would vote for cleartk-eval.  

Is your first concern related to performance?  

I agree that the version of the eval code shouldn't be the same as cleartk-ml.  
That said, I actually have 2 implementations of the interfaces - 1 for 
part-of-speech tagging (surprise!) and another for sentence segmentation.  I 
also have code that the needs to be refactored for evaluating coordination 
structures.  So, I have a bit more confidence that the interfaces as they are 
usable and I do have a fair bit of code lying around that will be added soon.  
I'm actually working on this now, sort of.  

I agree that a FilesCorpusReader makes sense to have  - but you might also 
consider using XmiCorpusFactory which I think is a better approach.

Original comment by phi...@ogren.info on 7 Feb 2011 at 9:44

GoogleCodeExporter commented 9 years ago

My concern is partly performance and partly ease of implementing the API. I 
feel like in most cases people have a single pipeline, and they just switch 
their cleartk annotator to isTraining=true during training and their cleartk 
annotator to isTraining=false during testing. I think we could accomplish this 
by working with a single pipeline directly and having the user implement 
methods like:

    configureForDataWriting(AnalysisEngine pipeline);

I also feel like the EvaluationFactory is not an ideal API. For every 
evaluation I can imagine, I just want to implement a method like:

    public void evaluation(JCas goldView, JCas systemView);

and maybe write results to stdout and/or a directory in 
collectionProcessComplete(). This should work exactly the same way for a cross 
validation - the evaluation infrastructure should stream the correct CASes from 
each fold through this one single annotator (not create the same evaluation 
annotator many times).

Original comment by steven.b...@gmail.com on 8 Feb 2011 at 7:23

GoogleCodeExporter commented 9 years ago

For comparison, here's the code that I ended up writing for cross-validation, 
which never re-creates an AnalysisEngine, and only reconfigures the 
AnalysisEngines that have to be reconfigured.

  public static void crossValidate(int totalFolds, EvaluationPipeline pipeline) throws Exception {

    // get the readers and engines
    CollectionReader reader = pipeline.getCollectionReader();
    AnalysisEngine[] trainEngines = pipeline.getTrainAnalysisEngines();
    AnalysisEngine[] testEngines = pipeline.getTestAnalysisEngines();

    // collect common meta data
    List<ResourceMetaData> metaData = new ArrayList<ResourceMetaData>();
    metaData.add(pipeline.getCollectionReader().getMetaData());
    for (Resource resource : pipeline.getTrainAnalysisEngines()) {
      metaData.add(resource.getMetaData());
    }
    for (Resource resource : pipeline.getTestAnalysisEngines()) {
      metaData.add(resource.getMetaData());
    }

    // repeat for each cross-validation fold...
    for (int fold = 0; fold < totalFolds; ++fold) {
      pipeline.configureForTraining(fold, totalFolds);

      // run through the training files
      CAS cas = CasCreationUtils.createCas(metaData);
      while (reader.hasNext()) {
        cas.reset();
        reader.getNext(cas);
        for (AnalysisEngine engine : trainEngines) {
          engine.process(cas);
        }
      }

      // close the training engines
      for (AnalysisEngine engine : trainEngines) {
        engine.collectionProcessComplete();
      }

      // train the model
      pipeline.train(fold, totalFolds);

      // configure for testing
      pipeline.configureForTesting(fold, totalFolds);

      // run through the testing files
      cas = CasCreationUtils.createCas(metaData);
      while (reader.hasNext()) {
        cas.reset();
        reader.getNext(cas);
        for (AnalysisEngine engine : testEngines) {
          engine.process(cas);
        }
      }

      // for testing, only a "batch" is done, not the whole collection
      for (AnalysisEngine engine : testEngines) {
        engine.batchProcessComplete();
      }
    }

    // close the test engines after all folds are complete
    for (AnalysisEngine engine : testEngines) {
      engine.collectionProcessComplete();
    }
  }

  public static interface EvaluationPipeline {
    public CollectionReader getCollectionReader();

    public AnalysisEngine[] getTrainAnalysisEngines();

    public AnalysisEngine[] getTestAnalysisEngines();

    public void configureForTraining(int fold, int totalFolds)
        throws ResourceConfigurationException;

    public void train(int fold, int totalFolds) throws Exception;

    public void configureForTesting(int fold, int totalFolds) throws ResourceConfigurationException;
  }

As an example, here's what my implementation of configureForTesting looks like:

    public void configureForTesting(int fold, int totalFolds) throws ResourceConfigurationException {
      // collect the test names for this fold
      List<String> foldTestNamesList = new ArrayList<String>();
      for (int i = 0; i < this.trainNames.size(); ++i) {
        if (i % totalFolds == fold) {
          foldTestNamesList.add(this.trainNames.get(i));
        }
      }

      // configure for testing
      reader.setConfigParameterValue(
          FilesCollectionReader.PARAM_FILE_NAMES,
          foldTestNamesList.toArray(new String[foldTestNamesList.size()]));
      reader.reconfigure();
      goldAnnotator.setConfigParameterValue(StoryEventGoldAnnotator.PARAM_VIEW_NAME, "GoldView");
      goldAnnotator.reconfigure();
      systemAnnotator.setConfigParameterValue(CleartkAnnotator.PARAM_IS_TRAINING, false);
      systemAnnotator.setConfigParameterValue(
          GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH,
          new File(this.getFoldDirectory(fold, totalFolds), "model.jar").getPath());
      systemAnnotator.reconfigure();
    }

Basically I make some setConfigParameterValue calls followed by reconfigure 
calls for the annotators that need it. (If you look closely, you'll see I had 
to add a parameter CleartkAnnotator.PARAM_IS_TRAINING, but I think that's 
probably a useful parameter that we should add anyway.)

Original comment by steven.b...@gmail.com on 8 Feb 2011 at 4:30

GoogleCodeExporter commented 9 years ago

So, here are a few things to consider:

The difference between the pipeline that trains and the one that classifies 
might be quite small as you mention.  It might also be quite large.  You might 
have a lot of gold-standard data that you use for training (like tokens, 
sentences, pos tags, depedendencies, and semantic roles, for example.)  When 
you classify you may not assume any of these things.  For my coordination work 
I had three different corpora that had gold-standard coordination structures 
but differing amounts of other kinds of gold-standard data.  One corpus had 
nothing but coordination structures (i.e. no tokens, no sentences, no 
part-of-speech tags.)  And, of course, for the two corpora that had more data I 
experimented with using an automatic tokenizer and gold-standard tokens.  
Another difference between the two pipelines is that one operated on the gold 
view and the other worked on the system view and was begun with 
ViewTextCopierAnnotator(sp).  I also had some pre-computed part-of-speech tags 
that were derived from "leave-one-file-out" tagger training across 36 files.  
When I ran my training pipeline and I wanted to use these tags the pre-computed 
tags were copied over the gold-standard tags for that run.  When I ran the 
classifying pipeline I would copy the pre-computed tags from the gold view to 
the tokens in system view.  So, while the two pipelines have a lot of 
similarities and they used a lot of the same convenience methods I created for 
setting them up - there were enough differences between the two that I was glad 
they were created by separate methods and I would not like to see them put back 
together.  

Larry's group does not use views to separate gold annotations from system 
annotations - they do this using the type system.  They have a notion of an 
annotation set.  So, your proposed interface will not work for them.  I like 
the idea of having a default implementation that has something like this.  

I do not like the idea of pushing all folds through the same annotator.  With 
the current interfaces it would be trivial to extend 
org.cleartk.evaluation.Evaluation (see runCrossValidation) so that you could 
run one fold at a time.  This is nice for parallelization.  It's great when you 
can run an entire cross-validation experiment on a single machine in a single 
run.  But we should make it really easy to run each fold on 10 different 
machines - and not force sequential processing where it is not required.  Also, 
it's nice to run a single fold - for whatever reason.  

I also like having each fold have a complete set of evaluation results.  I 
suppose you can have collectionProcessComlete() called multiple times (?) - but 
to me it feels right that a single fold is a complete pipeline in and of 
itself.

Original comment by phi...@ogren.info on 8 Feb 2011 at 4:56

GoogleCodeExporter commented 9 years ago

So I'm not arguing that the interface I ended up with is the right one for 
every task. What I'm saying is that the current evaluation interfaces make it 
*impossible* to do what I've done. I think the evaluation interfaces should 
make both my use case and your use cases possible. So the goal here should be 
to develop some interface that meets both of our needs.

A few specific comments:

"Larry's group does not use views to separate gold annotations from system 
annotations - they do this using the type system.  They have a notion of an 
annotation set.  So, your proposed interface will not work for them" I think 
that's not true. There's nothing about the EvaluationPipeline interface I 
proposed in Comment 3 that would keep them from doing that. In fact, with the 
EvaluationPipeline interface, there's no such thing as a special 
EvaluationFactory interface - you just create a normal AnalysisEngine and it 
does whatever you want. In my case, it compared two views, but in Larry's case, 
it would compare the differently typed annotations.

"I do not like the idea of pushing all folds through the same annotator." I 
like pushing them all through the same annotator because it simplifies the 
evaluation code - it can all be in the same plain old AnalysisEngine. Our APIs 
should support both use cases.

"I also like having each fold have a complete set of evaluation results." This 
is absolutely possible (and I'm doing it right now) with my proposed API - each 
fold is considered a "batch" and batchProcessComplete is called after each 
fold. So for me, fold = batch and cv = collection, while for you fold = 
collection and cv = multiple collections.

Original comment by steven.b...@gmail.com on 8 Feb 2011 at 5:44

GoogleCodeExporter commented 9 years ago


Somehow I didn't see the code in comment 3 when I posted comment 4 earlier.  

The main, if not only, advantage I see is that when you run 
collectionProcessComplete you can write out all of the evaluation results that 
you've accumulated for all n folds rather than having to "aggregate" them (i.e. 
read results from individual folds from respective directories) with 
EvaluationFactory.aggregateEvaluationResults().  I agree that this is nice when 
everything can be run sequentially on a single box.  But if you get five folds 
in and it blows up - can you recover the evaluation results that you need from 
the four folds that completed successfully?  It doesn't seem very natural.  
With my solution, the evaluation engine writes out everything it needs to so 
that the results can be aggregated later.  I suppose I could still do this.   

I'm not crazy about having the collection reader coupled with the evaluation 
pipeline.  I generally pre-process my corpus so that I have XMI files.  
However, for one corpus I didn't (for some very specific reason I can't 
remember now) and it requires special reading and pre-processing.  Do I have to 
have a new implementation of EvaluationPipeline for this other corpus?  It 
seems so.  

Also, I like the simplicity of Evaluation.buildCorpusModel().  I think that 
there should be an obvious way to build *the* model for a corpus.  

Similarly, I would like to keep the datawriting or classifying analysis 
engine(s) decoupled from the evaluation.  Suppose you finish evaluation and you 
want to build *the* model and then tag a whole bunch of unlabeled text.  With 
the current interface you just pass in a new collection reader and get an 
aggregate from EngineFactory.createClassifierAggregate() and go ahead and tag a 
bunch of text without worrying about an evalation AE blowing up because 
gold-standard data isn't available.  This may seem out-of-scope for the project 
- but it seems so easy to do here since all the pieces are available.  But why 
rebuild the part of the pipeline generated by your getTestAnalysisEngines() 
method somewhere else?

My feedback probably all sounds negative and/or defensive.  I actually get what 
you are doing now and somewhat like it (the code helps!)  I agree that we could 
just as easily have e.g. EngineFactory return an AnalysisEngine instead of an 
AnalysisEngineDescription.  I'm not clear why it needs to be an array if we 
have aggregates.  

I think the main differences between our approaches are:

- My interfaces returns an AnalysisEngineDescription and yours returns an 
AnalysisEngine[].  I am fine with the latter but would prefer it not to be an 
array since aggregates are so easy to build and this simplifies the code of the 
caller.  
- My interfaces decouples the collection reader from the the 
training/classifying aggregates, from the evaluation AEs.  Your interface puts 
them all together.  I am not very happy about this.
- Your interface makes intuitive use of batchProcessComplete and 
collectionProcessComplete.  I like this a lot - esp. for aggregating the 
results of the evaluation AEs.  I like my approach too - but I think this is a 
point where both approaches ought to be accomodated.  
- My approach (see Evaluation) imposes/assumes some structure of the folders 
that are produced during cross-validation.  In particular, it assumes folders 
at all (see nearly every interface method signature, except, ironically 
CorpusFactory.)  Your approach does not.  My solution could be abstracted 
better to avoid assuming directories for everything.  It makes just as much 
sense to have a scenario in which all evaluation for every fold and the 
aggregate results are all written to a single file.  Your approach facilitates 
this.  

Hopefully, this is a fairly comprehensive listing of the issues.

Original comment by phi...@ogren.info on 8 Feb 2011 at 7:32

GoogleCodeExporter commented 9 years ago

I'm not committed to AnalysisEngine[] vs. AnalysisEngine. I just used 
AnalysisEngine[] because that's what made the user's code the simplest. It's 
not much overhead for the user to wrap that in an aggregate.

I have no problem splitting the CollectionReader stuff from the AnalysisEngine 
stuff. In my case, I'll probably just have one class implement the two 
interfaces, but that's easy enough. I was considering factoring that out 
because it would be natural to have an implementation based on a 
FilesCollectionReader that does all the file name splitting my code above did.

I also think the core interfaces should not assume a directory. That doesn't 
mean the various annotators shouldn't use one - e.g. it would be trivial to add 
a PARAM_OUTPUT_DIRECTORY to my SpanEvaluationAnnotator so that it writes 
results for each fold (and for the overall evaluation) to the directory. But 
the basic evaluation interfaces should not force a directory upon you.

Other specific comments:

"But if you get five folds in and it blows up - can you recover the evaluation 
results" - sure, in the same way you do now - add a configuration parameter 
that gives you an output directory, and write intermediate results there.

"I like the simplicity of Evaluation.buildCorpusModel()" - In general, I think 
we should be trying to make user code simpler, and not worry too much how 
complicated our code looks. That said, the EvaluationPipeline equivalent would 
look like:

  public static void train(EvaluationPipeline pipeline) throws Exception {
    CollectionReader reader = pipeline.getCollectionReader();
    AnalysisEngine[] engines = pipeline.getTrainAnalysisEngines();
    pipeline.configureForTraining();
    SimplePipeline.runPipeline(reader, engines);
    pipeline.train()
  }

So that's certainly no more complicated than Evaluation.buildCorpusModel().

"I would like to keep the datawriting or classifying analysis engine(s) 
decoupled from the evaluation" - I would say that the datawriting and 
classifying analysis engines are 100% decoupled from the evaluation with the 
EvaluationPipeline interface - there is *no* assumption that you'll actually 
run an evaluation annotator, and no API required for such an annotator. You can 
do whatever you want in the evaluation annotator, including not adding one at 
all.

"Suppose you finish evaluation and you want to build *the* model and then tag a 
whole bunch of unlabeled text... This may seem out-of-scope for the project" 
Yes, I think this is out of scope for the evaluation project. How hard is it to 
create an separate aggregate for tagging new text when we're already creating a 
separate aggregate for training and testing?

Original comment by steven.b...@gmail.com on 8 Feb 2011 at 11:38

GoogleCodeExporter commented 9 years ago

I spent some time trying to refactor my solution to accommodate your use case.  
Attached is a diff file for the cleartk-ml project.  The javadocs are out of 
date hear so don't squint too hard at them.  There's a lot to say 
point-by-point about what I did here - but I don't have time to comment 
extensively just now.

Original comment by phi...@ogren.info on 9 Feb 2011 at 12:10

Attachments:

issue228-pvo-patch.diff

GoogleCodeExporter commented 9 years ago

Ok, I think we're converging. ;-) I refactored my solution to look a bit more 
like yours. Here are the interfaces I ended up with (an Evaluation class and a 
sample FilesReaderFactory implemented on top of these are attached):

public interface ReaderFactory {
  public CollectionReader getTrainFoldReader(int fold, int totalFolds) throws UIMAException;
  public CollectionReader getTestFoldReader(int fold, int totalFolds) throws UIMAException;
  public CollectionReader getTrainReader() throws UIMAException;
  public CollectionReader getTestReader() throws UIMAException;
  public CollectionReader getReader() throws UIMAException;
}

public interface EngineFactory {
  public AnalysisEngine[] getTrainFoldDataWriterEngines(int fold, int totalFolds) throws UIMAException;
  public AnalysisEngine[] getTestFoldClassifierEngines(int fold, int totalFolds) throws UIMAException;
  public AnalysisEngine[] getTrainDataWriterEngines() throws UIMAException;
  public AnalysisEngine[] getTestClassifierEngines() throws UIMAException;
  public AnalysisEngine[] getDataWriterEngines() throws UIMAException;
  public void trainFold(int fold, int totalFolds) throws Exception;
  public void train() throws Exception;
}

So what are the remaining differences?

(1) I ended up still returning AnalysisEngine[] instead of a single 
AnalysisEngine because as far as I can tell, there's no way with UimaFIT to 
create an aggregate from an AnalysisEngine[], only from an 
AnalysisEngineDescription[], and for my use case it has to be AnalysisEngines. 
I don't know enough about aggregate AnalysisEngines to know whether this is 
just not possible in UIMA or if we're just missing the convenience method in 
UimaFIT.

(2) I don't have any use for EvaluationFactory, so my Evaluation methods don't 
take one. This is easily solved - we should just have two versions of each 
method, one that takes an EvaluationFactory and one that doesn't.

(3) My EngineFactory doesn't assume that you'll be writing to a directory. So 
the EngineFactory methods parallel the ReaderFactory methods, e.g. 
getTrainFoldDataWriterEngines(int, int) parallels getTrainFoldReader(int, int). 
In your solution, EngineFactory.createDataWritingAggregate and 
EngineFactory.createClassifierAggregate both take a File parameter. (Note also 
that these two methods are also broken for my use case because they return 
AnalysisEngineDescriptions, not AnalysisEngines.)

(4) I don't have a getPreprocessor() method. I don't really understand the 
point of this method - if you have a preprocessor, why aren't you just 
including that in the aggregates created by createDataWritingAggregate() and 
createClassifierAggregate()?

(5) My ReaderFactory doesn't assume that there's a single set number of 
cross-validation folds for each corpus. So instead of making a corpus define 
CorpusFactory.numberOfFolds(), I just pass the total number of folds to all the 
get*FoldReader methods. This allows the user to decide how many folds they want 
at the time that they call Evaluation.crossValidate instead of hard-coding it 
in the CorpusFactory.

(6) In Evaluation.crossValidate, I still have 1 fold = 1 batch and all folds = 
1 collection, while you still have 1 fold = 1 collection and all folds = many 
collections. This is the one issue I'm not sure how to resolve. Perhaps we 
should add two versions of the crossValidate method, one for each 
interpretation.

(7) Our naming schemes are a bit different. These are all minor points compared 
with the above, but I'll mention them here anyway. I like ReaderFactory better 
than CorpusFactory because it's returning CollectionReader objects, not Corpus 
objects, and I like having "Fold" or something like that in the methods like 
getTrainFoldReader or getTestFoldClassifierEngines because it makes it clearer 
to me that these methods are used for cross validations.

Original comment by steven.b...@gmail.com on 9 Feb 2011 at 11:19

Attachments:

GoogleCodeExporter commented 9 years ago

(1) I am fine with returning an array

(2) fine by me.  I like the evaluation factory.  I think the evaluation 
pipeline is decidedly different and separate from the training/classifying 
pipeline.  My evaluation pipeline for my coordination task has 3 AEs in it.  I 
also have two approaches for coordination that have a completely different set 
of AEs.  So, I do not want to deal with cross-product issues either.  

(3) No.  The EngineFactory was refactored in the patch I attached to comment 8. 
 The methods do not return AEDs and they do not take a directory.  

(4) Preprocessing is specific to the corpus you are dealing with.  If you are 
using XMI files, then you typically don't need this method because you have 
done all your preprocessing ahead of time.  However, if you are dealing with 
some native file format for some annotated corpus, you will likely want to run 
some preprocessing annotators that, for example, parse XML (or PTB, etc.) into 
the annotations that are needed downstream.  Corpora that ultimately provide 
the exact same kinds of annotations can come in many formats....

(5) I knew we would disagree about this one.  I think that the number of folds 
should be determined by the corpus factory regardless of whether this is 
decided at runtime (this could be accomplished by an initialization parameter) 
or hard-wired in (for corpora where the folds are fixed.)  Not every corpus is 
easily split into an arbitrary n number of folds.  I don't understand why 
EngineFactory needs to know how many folds there are.  

(6) In the patch I added to comment 8 I created a "complete" method for a given 
fold which is implemented by the EngineFactory.  In your solution this method 
would call batchProcessComplete there.  There is also an aggregateResults 
method where you would call collectionProcessComplete.  I haven't decided 
exactly how I will use these two methods for the case where you write out 
evaluation data to be read in later for aggregating.  I will say that after 
scrutinizing your crossValidate method in comment three - I do not think your 
use of batchProcessComplete and collectionProcessComplete is very consistent or 
intuitive.  Why should a user call collectionProcessComplete for the training 
engines for each fold but call batchProcessComplete for the testing engines for 
the same folds?  This is very confusing I think.  This is another reason to 
abstract away these calls and put them in the factory as I have done here.

(7) I like "corpus" because it communicates a specific set of data that 
generally has a name and specific properties.  How about CorpusReaderFactory?  
I wasn't consistent with my naming - I used runName and foldName where I meant 
to use runName consistently.  I'm ambivalent about this.  I am confused about 
what the name should be because there is no "fold" per se when you run holdout 
evaluation.  

I am not crazy about "EngineFactory" - I would be happy to hear other 
suggestions.

Original comment by phi...@ogren.info on 9 Feb 2011 at 4:12

GoogleCodeExporter commented 9 years ago

(3) Yep, I misread the diff. They don't take a directory any more - they take a 
name string. The assumption is that the AnalysisEngine pipeline should never 
need to know whether it's being used for regular training or for cross 
validation training. I was trying not to make any assumption like that, but I 
guess it's not a crazy assumption to make. So I guess I'm okay with this. (Note 
that this answers part of your (5) question, why EngineFactory took the fold 
number.)

(4) I'm not really sold on the 1 corpus = 1 preprocessor. How I use the corpus 
depends on the preprocessing - if I want to use Penn Treebank to train a part 
of speech tagger, I may not want to load a bunch of trees. I think the 
preprocessing is pipeline dependent, not corpus dependent. But I guess it 
doesn't really hurt as long as it gets some better documentation explaining 
when you would put something with the corpus reader factory and when you would 
put something with the EngineFactory.

(5) But why not just throw an exception if you get a number of folds that is 
invalid for your corpus? You have to do that somewhere anyway. I guess I've 
just never worked with a corpus before that had set cross-validation folds, so 
conceptually it seems like the wrong place to put that.

(6a) "I created a "complete" method for a given fold which is implemented by 
the EngineFactory". You mean EvaluationFactory, right? So for my use case, how 
does the EvaluationFactory get access to the test engine to call 
batchProcessComplete in "complete" and collectionProcessComplete  in 
"aggregateResults"?

(6b) "Why should a user call collectionProcessComplete for the training engines 
for each fold but call batchProcessComplete for the testing engines for the 
same folds?" Because you want to train N classifiers but you only want 1 final 
evaluation number. So there should be N collectionProcessComplete calls for 
training, and 1 collectionProcessComplete call for testing. But we don't have 
to agree on this as long as we support both approaches.

(7)
"CorpusReaderFactory" - better for sure, though not amazing. ;-)
"runName" - I think you should just call it "name" or "id" - it's just an 
arbitrary identifier made up by Evaluation, so reading anything more into it 
would be a mistake.
"EngineFactory" - yeah, I didn't like this either because it has a "train" 
method.

Original comment by steven.b...@gmail.com on 9 Feb 2011 at 5:33

GoogleCodeExporter commented 9 years ago

(4) You can imagine that you might have a PTBCorpusReader_ImplBase from which 
two implementations that have different preprocess() implementations.  

(5) My preference is to split my corpus up into fixed folds for maximum 
repeatability.  So, that's what I've always done.  It seems that some 
FileCorpusReader_ImplBase could be initialized with the number of folds at 
runtime and you could push the logic you had in configureForTesting in comment 
3 into that class.  

(6a) my mistake.  Yes - the evaluation factory

(6b) this is confusing.  

(7) yes - 'name' is better.  One candidate might be CleartkAnnotatorFactory - 
it seems a bit narrow for what it does (considering it has a train and could 
have any number of preprocessing AEs for the target CleartkAnnotator) - but it 
communicates better than EngineFactory.  Open for suggestions.

Are we still disagreeing on anything?  I think I will commit my diff since it 
seems to be closer to a solution than the current code and we can go from 
there.

Original comment by phi...@ogren.info on 9 Feb 2011 at 6:17

GoogleCodeExporter commented 9 years ago

Yes, please commit your current diff. And if it's not too much work, move 
things to cleartk-eval with maybe version 0.7.0 or something like that? I think 
we're converging to a good solution here, but we should definitely give it a 
bit of time and experience before we mark it 0.9.9.

For (5), how about we have a getNumberOfFolds and setNumberOfFolds? That would 
make it feel less like there was a single "correct" number of folds for a given 
corpus, and you could put in setNumberOfFolds any code you need for checking 
that you've received a valid number of folds.

For (6), I'm still not sure how my EvaluationFactory is supposed to get to the 
AnalysisEngines to call collectionProcessComplete and/or batchProcessComplete. 
This is the one serious problem remaining for me.

For (7), I'm not convinced yet that CleartkAnnotatorFactory is better than 
EngineFactory, so I guess just leave it for the moment.. I'll try to brainstorm 
for better names.

Original comment by steven.b...@gmail.com on 9 Feb 2011 at 9:18

GoogleCodeExporter commented 9 years ago

(repeated from email I meant to post as a comment)

I committed the code earlier today. 

I will also create the new project and move it over.  

(5) sounds good.

(6) look at how Evaluation.java calls EvaluationFactory.writeResults() and 
EvaluationFactory.aggregateResults().  (I just renamed writeResults in a commit 
from a couple minutes ago.)

(7) I renamed EngineFactory to CleartkAnnotatorFactory before I committed the 
code.  I will leave it that way until we come up with a better name.  Given 
that everything called "Factory" has no "create" methods anymore we should stop 
calling them factories.  I described it to Chris and he suggested "Cache".  
Maybe "Provider" would work.  I sort of like that better.  Here's a straw man:

CorpusReaderProvider
CleartkPipelineProvider
EvaluationProvider

Original comment by phi...@ogren.info on 9 Feb 2011 at 10:28

GoogleCodeExporter commented 9 years ago

(5) Since I have a corpus of tutorial dialogue transcripts, I was thinking that 
instead of n-fold cross-validation, I would be doing leave-one-student-out 
cross-validation.  How does that play with this situation?  Is it the same as 
when I have pre-specified folds?

Original comment by lee.becker on 9 Feb 2011 at 10:33

GoogleCodeExporter commented 9 years ago

@Philip:

(6) I still don't understand how I can implement 
EvaluationFactory.writeResults() so that it calls batchProcessComplete() - the 
EvaluationFactory doesn't have any access to the pipeline, only the 
EngineFactory has access to the pipeline. Or are you suggesting that I write my 
own Evaluation.java that calls batchProcessComplete() instead of 
EvaluationFactory.writeResults() and collectionProcessComplete() instead of 
EvaluationFactory.aggregateResults()? (Hopefully not - I think we should 
arrange the API so that Evaluation.java works for all our use cases.)

(7) +0.5 on all the Provider names - definitely the best I've seen from what we 
had so far.

@Lee

(5) If you know you'll only ever do leave-one-out cross-validation on that 
corpus, then define getNumberOfFolds() to return the number of transcripts and 
setNumberOfFolds() to throw UnsupportedOperationException. If you think you 
might do other types of cross-validation, but you want to make it easy to do 
leave-one-out cross-validation when you want to, then you might add to that 
specific CorpusReaderProvider a method like getNumberOfTranscripts() which you 
can then pass to setNumberOfFolds().

Original comment by steven.b...@gmail.com on 9 Feb 2011 at 11:05

GoogleCodeExporter commented 9 years ago

as long as you know how many "instances" you have then it should be easy to 
specify the number of folds for leave-one-out.  Still, it seems that there 
should be an ImplBase that does this for you.  This should be possible with the 
current interface.

Original comment by phi...@ogren.info on 9 Feb 2011 at 11:07

GoogleCodeExporter commented 9 years ago

(5) I should clarify.  I am doing dialog move classification on the turns 
within a transcript.  Also a given student can have multiple transcripts, 
consequently my folds will be designed to ensure I'm not training and testing 
on the utterances of a student.  In my case leave-one-out is not 
leave-one-instance out but leave all of a student's utterance instances out.

Original comment by lee.becker on 9 Feb 2011 at 11:14

GoogleCodeExporter commented 9 years ago

@Steve:
(6) right.  I forgot that you are going to put your evaluation AEs in the list 
of analysis engines returned by (current working name) CleartkPipelineProvider. 
 This would work if you would be willing to implement an (current working name) 
EvaluationPipelineProvider.  The idea is that your implementation of 
EvaluationPipelineProvider would call batchProcessComplete in writeResults on 
those analysis engines that that need to write evaluation data.  

Let me fix up the code that I'm working on assuming you would be willing to do 
this so that we have a version we can look at with the current names and some 
of the additional changes above.  If you can't warm up to the notion of using 
the EvaluationPipelineProvider, then we will have to figure out how to give you 
the hooks that you need.

Original comment by phi...@ogren.info on 9 Feb 2011 at 11:16

GoogleCodeExporter commented 9 years ago

Yep, fix up the code and I'll take a look at it tomorrow.

I was already assuming i would implement a EvaluationPipelineProvider to call 
batchProcessComplete and collectionProcessComplete (which I'm fine with 
implementing) but I'm still not clear on how that EvaluationPipelineProvider 
would get access to the pipeline that was being finished because writeResults 
takes only a String parameter, not an AnalysisEngine or AnalysisEngine[] 
parameter.

I'm pretty sure that even for the other use cases, your code isn't even calling 
collectionProcessComplete() for the "classifierAggregate". (It is for the 
"dataWritingAggregate" implicitly through SimplePipeline.runPipeline though.) 
At first I though that was because you intended to call 
collectionProcessComplete() from evaluationFactory.writeResults() but that's 
not possible with the current API.

(Also, we probably shouldn't use JCasIterable or SimplePipeline.runPipeline in 
Evaluation.java - these are convenience methods that swallow some UIMA 
exceptions and turn them into RuntimeExceptions, which is fine for a simple 
main method, but probably shouldn't be used in other code. We could fix 
SimplePipeline.runPipeline to not use JCasIterable, and perhaps we should do 
that, but there's no real way to fix JCasIterable, so we should just avoid 
using it for anything but simple main methods.)

Original comment by steven.b...@gmail.com on 9 Feb 2011 at 11:29

GoogleCodeExporter commented 9 years ago

I have committed all the changes that I have for now.  Please scrutinize.  

I think I am both confused and confusing.  Because EvaluationPipelineProvider 
produced a list of analysis engines for a given name it can call 
batchProcessComplete (as it sees fit) on the same analysis engines (since it is 
given the name.)  However, callers of aggregateResults will have to be aware 
that the EvaluationPipelineProvider will aggregate results for whatever 
pipelines it has served up - i.e. you wouldn't want to run the holdout 
evaluation and then cross validation because then the aggregation will include 
evaluation data from both.  Maybe we should have a reset() method of some sort? 

I cleaned up the code a bit so that it isn't calling SimplePipeline any more 
and it is clear where I am calling collectionProcessComplete and where I'm not. 
 That said, the code is wrong and I am not sure what to do.  I rather think 
that we shouldn't be calling collectionProcessComplete on any of the analysis 
engines and that we should send this back to the providers to do as they see 
fit.  So, this makes me think that CleartkPipelineProvider needs methods 
analogous to writeResults() and aggregateResults().  It would be nice if the 
methods were the same for both.  Maybe pipelineComplete(name) and 
evaluationComplete()?

Original comment by phi...@ogren.info on 10 Feb 2011 at 1:06

GoogleCodeExporter commented 9 years ago

I just noticed a potential hole in our CleartkPipelineProvider.  The train 
method accepts a name and a single set of training arguments.  This assumes 
that there is a single model being built, or that all the models being built 
will take the same training arguments.  This is fine for many scenarios (both 
of which I've found myself in.)  But I can imagine that for Lee's use - he is 
going to want to train different models using different learners each with 
different training arguments to evaluate his system.  This is true not just for 
CleartkMultiAnnotator but also for cases where a pipeline has multiple 
CleartkAnnotators.  I suppose since he will be implementing the train method 
for his own evaluation infrastructure he can specify the strings to be whatever 
he wants them to be - perhaps using Args4j to parse argument variables.

Original comment by phi...@ogren.info on 10 Feb 2011 at 1:08

GoogleCodeExporter commented 9 years ago

While I would love the flexibility to tweak each classifier, it doesn't seem 
practical to be tweaking the learner parameters for 30+  different classifiers. 
  As it stands now, CleartkMultiAnnotator only takes one learner, and we would 
need a more sophisticated mechanism to pair each classifier with a different 
learning algorithm.

FWIW, I will probably focus more on experiments (in the near term at least) 
where I turn sets of features off globally for all of my classifiers.

Original comment by lee.becker on 10 Feb 2011 at 1:19

GoogleCodeExporter commented 9 years ago

Ok, I see how it works now. EvaluationPipelineProvider.getEvaluationPipeline 
doesn't provide the whole testing + results pipeline, just the evaluation bit 
at the end.

I tried to update my code to use the new interfaces. Some problems I ran into:

= get*Reader call order =
Evaluation is currently broken for my use case because it calls getTrainReader, 
then getTestReader, and then runs pipelines. But since my train and test 
readers are the same reader, just reconfigured, the train pipeline actually 
gets the test reader files, and the test pipeline gets no CASes at all because 
they were used up during training. The rule should be "never call get*Reader 
until the moment you're actually going to use it". Basically, we need to get 
rid of Evaluation.run, which violates this assumption.

= Fold Numbering =
runCrossValidation assumes that folds are numbered starting with 1. This seems 
inconsistent with standard Java conventions. Can we have 0 based indexing here?

= Declared Exceptions =
We need to think about what exceptions all the methods should throw. For 
example, for my use case, I'm typically calling 
ConfigurableResource.reconfigure() in all the CorpusReaderPipeline methods and 
the the CleartkPipelineProvider.get*Pipeline methods, and that throws a 
ResourceConfigurationException, not a ResourceInitializationException. As 
another example, in EvaluationPipelineProvider.writeResults, I'm calling 
batchProcessComplete, and that throws a AnalysisEngineProcessException. Perhaps 
all the methods should be throwing UIMAException? The one exception would be 
org.cleartk.eval.provider.CleartkPipelineProvider.train which should throw 
Exception to match JarClassifierBuilder.trainClassifier.

= Evaluation.buildCorpusModel =
I think Evaluation.buildCorpusModel still has the wrong signature. Why is there 
a File there? I think it should just be a String id (a.k.a. runName) like 
everywhere else. Also, Evaluation.buildCorpusModel shouldn't be calling 
dataWritingPipeline.add - we shouldn't be modifying the list returned by 
getTrainingPipeline.

= JarClassifierBuilder.trainAndPackage =
I don't think Evaluation.train should delegate to Train.main. Instead, it 
should just use the following code:

    JarClassifierBuilder<?> classifierBuilder = JarClassifierBuilder.fromTrainingDirectory(modelDirectory);
    classifierBuilder.trainClassifier(modelDirectory, trainingArguments);
    classifierBuilder.packageClassifier(modelDirectory);

Perhaps instead of having Evaluation.train, we should add a static method to 
JarClassifierBuilder? I mean, if we're going to assume that it's a 
JarClassifierBuilder directory (as Train.main does), then JarClassifierBuilder 
is probably the right place for that method.

= collectionProcessComplete =
"CleartkPipelineProvider needs methods analogous to writeResults() and 
aggregateResults() ... Maybe pipelineComplete(name) and evaluationComplete()?" 
Yes, this sounds good to me, but in your use case you don't keep a reference to 
the training and evaluation pipelines around, so you won't be able to call 
collectionProcessComplete from just a String identifier.  So I think the 
signatures need to be:

    void trainingPipelineComplete(String, List<AnalysisEngine>)
    void classifyingPipelineComplete(String, List<AnalysisEngine>)
    void evaluationPipelineComplete(String, List<AnalysisEngine>)

These methods would be called every time you go through a pipeline, e.g. once 
for a train/test run, and once for each fold for a cross-validation run. We'd 
also need the methods:

    void trainingComplete()
    void classifyingComplete()
    void evaluationComplete()

These methods would be called exactly once, at the end of runCrossValidation(), 
runHoldoutEvaluation() and buildCorpusModel(). I think that should work for 
both of our use cases:

* For training and classifying, I would call collectionProcessComplete() in 
trainingPipelineComplete() and classifyingPipelineComplete(). For evaluation, I 
would call batchProcessComplete() in evaluationPipelineComplete() and 
collectionProcessComplete() in evaluationComplete().

* You would call collectionProcessComplete() in all the xxxPipelineComplete() 
methods, and do whatever you used to do in aggregateResults() in 
evaluationComplete().

Attached is a patch which makes all the changes proposed above.

Original comment by steven.b...@gmail.com on 10 Feb 2011 at 12:22

Attachments:

cleartk-eval.diff

GoogleCodeExporter commented 9 years ago

= get*Reader call order =
My bad.  Looks better now.

= Fold Numbering =
true.  I wanted to have the folder name of the first fold to be "fold-1" 
instead of "fold-0".  This can easily be done in createFoldName (which has a 
minor bug related to this issue - either the digit that is used for the file 
name needs to be incremented or the total folds needs to be decremented.)  I 
will fix the method.  Btw - I asked you for that string formatting trickiery a 
few weeks ago and you didn't know what I was talking about....  ;)

= Declared Exceptions =
sounds good.

= Evaluation.buildCorpusModel =
I forgot to revisit this method.  looks better now.  

I agree with the other points you make above.

Hopefully, we haven't worked too hard to accommodate our two use cases.  The 
code doesn't look all that complicated - so hopefully the abstraction is good.  
I will have a better feeling about it after I port my evaluation code to this.  
We definitely need to document this code extensively before closing this issue.

I am refactoring XmiCorpusFactory so that it doesn't assume fixed folds.  I 
will commit this today.

Original comment by phi...@ogren.info on 10 Feb 2011 at 6:22

GoogleCodeExporter commented 9 years ago

Yep, let me know how your port goes. If everything works out with your port, 
we'll go for a documentation spree on these interfaces.

"createFoldName (which has a minor bug related to this issue - either the digit 
that is used for the file name needs to be incremented or the total folds needs 
to be decremented.)"  Are you sure? It looks right to me:

scala> import org.cleartk.eval.Evaluation.createFoldName
import org.cleartk.eval.Evaluation.createFoldName

scala> for (i <- 0 until 9) yield createFoldName(i, 9)  
res1: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(0, 1, 2, 
3, 4, 5, 6, 7, 8)

scala> for (i <- 0 until 10) yield createFoldName(i, 10)
res2: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(0, 1, 2, 
3, 4, 5, 6, 7, 8, 9)

scala> for (i <- 0 until 11) yield createFoldName(i, 11)
res3: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(00, 01, 
02, 03, 04, 05, 06, 07, 08, 09, 10)

But yes, I'm fine with you modifying it so that it numbers from 1 instead of 0 
when it creates the name.

"I asked you for that string formatting trickiery a few weeks ago and you 
didn't know what I was talking about". Actually you asked me about "creating a 
string of n repeating characters" which is different from asking to fill in 
zeros before an integer. There's a special string format operation for 
prefixing zeros (e.g. %04d means 4 digits, filling with zeros as necessary) 
while there's no special string format operation for the general case.

Original comment by steven.b...@gmail.com on 10 Feb 2011 at 6:48

GoogleCodeExporter commented 9 years ago

When I run EvalationTest I get the following error:

java.util.DuplicateFormatFlagsException: Flags = '0'
    at java.util.Formatter$Flags.parse(Formatter.java:4140)
    at java.util.Formatter$FormatSpecifier.flags(Formatter.java:2555)
    at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2625)
    at java.util.Formatter.parse(Formatter.java:2480)
    at java.util.Formatter.format(Formatter.java:2414)
    at java.util.Formatter.format(Formatter.java:2367)
    at java.lang.String.format(String.java:2769)
    at org.cleartk.eval.Evaluation.createFoldName(Evaluation.java:114)
    at org.cleartk.eval.EvaluationTest.testCreateFoldName(EvaluationTest.java:43)

Original comment by phi...@ogren.info on 10 Feb 2011 at 7:34

GoogleCodeExporter commented 9 years ago

Did you mean to name org.cleartk.eval.provider.CorpusReaderPipeline 
org.cleartk.eval.provider.CorpusReaderProvider?

Original comment by phi...@ogren.info on 10 Feb 2011 at 7:58

GoogleCodeExporter commented 9 years ago

@Comment 27
I assume you were passing in either 0 or 1 total folds? Shouldn't that method 
just throw an exception for 0 or 1 total folds? (A different, more informative 
exception, of course.)

@Comment 28:
Yeah, I didn't rename anything you hadn't renamed, but I agree that should be 
called Provider.

Original comment by steven.b...@gmail.com on 11 Feb 2011 at 10:47

GoogleCodeExporter commented 9 years ago

Fixed in r2445, with renamings etc. up through r2818.

Original comment by steven.b...@gmail.com on 28 Apr 2011 at 10:02

Changed state: Fixed