komalsahedani / cleartk

Automatically exported from code.google.com/p/cleartk

Multipass Pipelines for Evaluation #260

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
For several tasks, we need to compute statistics over the corpus for later use 
in feature extraction (consider TF*IDF, word co-occurrences, and language 
models).  Similarly, in a scenario like semi-supervised learning, you may want 
to run clustering first to identify labels for the subsequent training of a 
classifier.

Currently, the Evaluation package takes all of the pipelines provided by 
PreProcessorPipelineProvider, CleartkPipelineProvider, and 
EvaluationPipelineProvider and concatenates them into a single, linear 
pipeline.  For the above scenarios, we really need multiple passes: first a 
pipeline that computes the statistics over the whole corpus, and then a second 
run with the standard pipeline, which uses those statistics.
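
To make the two-pass requirement concrete, here is a minimal sketch in plain Java (no ClearTK or UIMA types). The class and method names (TwoPassSketch, documentFrequencies, tfIdf) are made up for illustration and are not part of any existing API. The point is only that pass 2 cannot start until pass 1 has seen every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in ClearTK.
class TwoPassSketch {

    /** Pass 1: count how many documents contain each token. */
    public static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            // De-duplicate so a token counts once per document.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Pass 2: weight each token by tf * ln(N / df), using the pass-1 counts. */
    public static List<Map<String, Double>> tfIdf(
            List<List<String>> corpus, Map<String, Integer> df) {
        int n = corpus.size();
        List<Map<String, Double>> features = new ArrayList<>();
        for (List<String> doc : corpus) {
            Map<String, Double> weights = new HashMap<>();
            for (String token : doc) {
                weights.merge(token, 1.0, Double::sum); // raw term frequency
            }
            weights.replaceAll((token, tf) -> tf * Math.log((double) n / df.get(token)));
            features.add(weights);
        }
        return features;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat", "sat"));
        Map<String, Integer> df = documentFrequencies(corpus); // first pass
        System.out.println(tfIdf(corpus, df));                 // second pass
    }
}
```

In a single concatenated pipeline, the feature extractor would run on the first document before the document frequencies for the rest of the corpus are known, which is why one linear pass is not enough here.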

There are a few ways we might address this.  The simplest would be to add 
another pipeline provider to the evaluation class.  In a sense, though, this 
creates a contract about what kinds of evaluation are possible.  
Alternatively, we could investigate how something like a ComplexPipeline that 
accepts lists of lists of analysis engines 
(List<List<AnalysisEngineDescriptor>>) would work inside of an evaluation 
flow.  While this would give us the most flexibility for any scenario, it may 
muddle the training flows within Evaluation.
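
As a rough sketch of the list-of-lists idea, again in plain Java rather than UIMA: each inner list is one pass, and every pass runs over the whole corpus before the next one starts. MultiPassPipeline and demo are hypothetical names, and java.util.function.Consumer stands in for an analysis engine here; this is not a proposal for the actual class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the "ComplexPipeline" idea; not a ClearTK class.
class MultiPassPipeline<D> {
    // Outer list: passes over the corpus. Inner list: steps within one pass.
    private final List<List<Consumer<D>>> passes;

    public MultiPassPipeline(List<List<Consumer<D>>> passes) {
        this.passes = passes;
    }

    /** Runs every pass to completion over the corpus before starting the next. */
    public void run(List<D> corpus) {
        for (List<Consumer<D>> pass : passes) {
            for (D document : corpus) {
                for (Consumer<D> step : pass) {
                    step.accept(document);
                }
            }
        }
    }

    /** Small demo that records the order in which steps see documents. */
    public static List<String> demo() {
        List<String> log = new ArrayList<>();

        List<Consumer<String>> statisticsPass = new ArrayList<>();
        statisticsPass.add(doc -> log.add("count:" + doc));

        List<Consumer<String>> extractionPass = new ArrayList<>();
        extractionPass.add(doc -> log.add("extract:" + doc));

        List<List<Consumer<String>>> passes = new ArrayList<>();
        passes.add(statisticsPass);
        passes.add(extractionPass);

        new MultiPassPipeline<String>(passes).run(Arrays.asList("a", "b"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [count:a, count:b, extract:a, extract:b]
    }
}
```

The flexibility comes at the cost mentioned above: nothing in this shape says which pass is "training" and which is "statistics", so Evaluation would need some other way to know where a classifier gets trained.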

Original issue reported on code.google.com by lee.becker on 20 Oct 2011 at 5:11

GoogleCodeExporter commented 9 years ago
Fixed in r3901 with the new evaluation APIs from Issue 304. You can now have 
whatever shape of pipeline(s) you want in the train method.

Original comment by steven.b...@gmail.com on 3 May 2012 at 9:38

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 5 Aug 2012 at 8:48