mkabbasi / cleartk

Automatically exported from code.google.com/p/cleartk
Implement ZeroMeanUnitStddev feature normalization #267

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Here's the proposal we wrote on my whiteboard for creating a ZeroMeanUnitStddev 
feature transform.  The idea is roughly:

* On the first pass through the pipeline, ZeroMeanUnitStddev creates fake 
features that serve as placeholders, and the DataWriter is set to a new class, 
InstanceWriter, which simply serializes the instances to a file.

* Outside of the pipeline, the user loads the instances with an InstanceWriter 
static method, trains the ZeroMeanUnitStddev, and writes it to a file.

* On the second pass through the pipeline, ZeroMeanUnitStddev is loaded from 
the file and creates real features with the normalized values, and a real 
DataWriter formats the instances for a classifier.

* (We could eliminate this second pass during training through additional 
methods for working with instances directly, but the second pass above is still 
what would happen during testing.)
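
As a rough illustration of the flow in the bullets above, here is a stand-alone sketch that uses plain Java maps instead of ClearTK instances and features. The class name, the method names, the "ZMUS_" prefix, the use of the population standard deviation, and the handling of zero-variance or unseen features are all assumptions made for the example, not part of the proposal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of the two-pass idea; no ClearTK or UIMA types are used. */
public class TwoPassNormalizationSketch {

  /** Returns per-feature statistics: index 0 is the mean, index 1 the standard deviation. */
  public static Map<String, double[]> train(List<Map<String, Double>> instances) {
    // gather every raw value by feature name, keeping duplicate values
    Map<String, List<Double>> valuesByName = new HashMap<>();
    for (Map<String, Double> instance : instances) {
      for (Map.Entry<String, Double> feature : instance.entrySet()) {
        valuesByName.computeIfAbsent(feature.getKey(), k -> new ArrayList<>())
            .add(feature.getValue());
      }
    }
    Map<String, double[]> stats = new HashMap<>();
    for (Map.Entry<String, List<Double>> entry : valuesByName.entrySet()) {
      List<Double> values = entry.getValue();
      double sum = 0.0;
      for (double v : values) {
        sum += v;
      }
      double mean = sum / values.size();
      double squared = 0.0;
      for (double v : values) {
        squared += (v - mean) * (v - mean);
      }
      // population standard deviation (divide by n); this choice is an assumption
      double stddev = Math.sqrt(squared / values.size());
      stats.put(entry.getKey(), new double[] {mean, stddev});
    }
    return stats;
  }

  /** Second pass: replaces each raw value with (value - mean) / stddev. */
  public static Map<String, Double> normalize(
      Map<String, Double> instance, Map<String, double[]> stats) {
    Map<String, Double> result = new HashMap<>();
    for (Map.Entry<String, Double> feature : instance.entrySet()) {
      double[] s = stats.get(feature.getKey());
      if (s == null) {
        continue; // feature never seen in training: skipped here (an assumption)
      }
      // a constant feature has stddev 0; map it to 0 rather than divide by zero
      double scaled = s[1] == 0.0 ? 0.0 : (feature.getValue() - s[0]) / s[1];
      result.put("ZMUS_" + feature.getKey(), scaled);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Map<String, Double>> instances = new ArrayList<>();
    for (double length : new double[] {2.0, 4.0, 4.0, 6.0}) {
      Map<String, Double> instance = new HashMap<>();
      instance.put("length", length);
      instances.add(instance);
    }
    Map<String, double[]> stats = train(instances); // between the two passes
    for (Map<String, Double> instance : instances) {
      System.out.println(normalize(instance, stats)); // second pass
    }
  }
}
```

The sketch trains on the same instances it later normalizes. At test time only `normalize` would run, using statistics loaded from the training run.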

== The train method ==

// first pass: just write instances serialized
SimplePipeline.runPipeline(
    reader,
    AnalysisEngineFactory.createPrimitiveDescription(
        MyAnnotator.class,
        CleartkAnnotator.PARAM_DATA_WRITER_FACTORY_CLASS_NAME,
        InstanceWriter.class.getName(),
        DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
        outputDirectory.getPath(),
        MyAnnotator.PARAM_MEAN_STDEV_URL,
        null)); // intentionally null to signal that ZeroMeanUnitStddev is not trained

// load instances and train the extractor
Iterable<Instance<String>> instances =
    InstanceWriter.loadFromDirectory(outputDirectory);
ZeroMeanUnitStddev extractor = new ZeroMeanUnitStddev();
extractor.train(instances);
URL meanStddevURL = ...
extractor.save(meanStddevURL);

// second pass: write data in the classifier format
SimplePipeline.runPipeline(
    reader,
    AnalysisEngineFactory.createPrimitiveDescription(
        MyAnnotator.class,
        CleartkAnnotator.PARAM_DATA_WRITER_FACTORY_CLASS_NAME,
        DefaultMostFrequentStringDataWriterFactory.class.getName(),
        DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
        outputDirectory.getPath(),
        MyAnnotator.PARAM_MEAN_STDEV_URL,
        meanStddevURL));
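
The save step between the two passes is left open above. One possible shape for it, sketched with plain Java serialization: the class name, the method signatures, and the restriction to file: URLs are assumptions made for this example, and the real ZeroMeanUnitStddev may well store its statistics differently.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that persists the means and stddevs between the two passes. */
public class MeanStddevStoreSketch {

  /** Writes both maps to the file behind the URL (only file: URLs are handled). */
  public static void save(URL url, Map<String, Double> means, Map<String, Double> stddevs)
      throws IOException, URISyntaxException {
    Path path = Paths.get(url.toURI());
    try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
      // copy into HashMap so the written objects are known to be Serializable
      out.writeObject(new HashMap<String, Double>(means));
      out.writeObject(new HashMap<String, Double>(stddevs));
    }
  }

  /** Reads the maps back: element 0 holds the means, element 1 the stddevs. */
  @SuppressWarnings("unchecked")
  public static List<Map<String, Double>> load(URL url)
      throws IOException, URISyntaxException, ClassNotFoundException {
    Path path = Paths.get(url.toURI());
    try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
      List<Map<String, Double>> result = new ArrayList<>();
      result.add((Map<String, Double>) in.readObject()); // means, written first
      result.add((Map<String, Double>) in.readObject()); // stddevs, written second
      return result;
    }
  }

  public static void main(String[] args) throws Exception {
    Map<String, Double> means = new HashMap<>();
    means.put("length", 4.0);
    Map<String, Double> stddevs = new HashMap<>();
    stddevs.put("length", 2.0);

    // a temporary file stands in for the location the user would choose
    Path file = Files.createTempFile("zmus", ".ser");
    URL url = file.toUri().toURL();
    save(url, means, stddevs);
    List<Map<String, Double>> loaded = load(url);
    System.out.println(loaded.get(0).equals(means) && loaded.get(1).equals(stddevs));
    Files.delete(file);
  }
}
```

Java serialization keeps the sketch short; a text format would be easier to inspect by hand and would not tie the file to particular class versions.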

== MyAnnotator initialization ==

class MyAnnotator extends JCasAnnotator_ImplBase {
  @ConfigurationParameter
  private URL meanStdevUrl;
  public static final String PARAM_MEAN_STDEV_URL = ...
  private ZeroMeanUnitStddev extractor;

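  @Override
  public void initialize(UimaContext context) throws ResourceInitializationException {
    // the superclass call lets uimaFIT fill in the @ConfigurationParameter fields
    super.initialize(context);
    this.extractor = new ZeroMeanUnitStddev(new SpannedTextExtractor()/* , ... */);
    // a null URL means this is the first pass, so the extractor stays untrained
    if (this.meanStdevUrl != null) {
      this.extractor.load(this.meanStdevUrl);
    }
  }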
  ...
}

== ZeroMeanUnitStddev and its fake feature ==

interface TrainableFromInstances<OUTCOME_TYPE> {
  public void train(Iterable<? extends Instance<? extends OUTCOME_TYPE>> instances);
  public void save(URL url);
  public void load(URL url);
}

class ZeroMeanUnitStddev implements SimpleFeatureExtractor,
    TrainableFromInstances<Object> {
  private boolean isTrained;
  private SimpleFeatureExtractor[] extractors;
  private Map<String, Double> means;
  private Map<String, Double> stddevs;

  public ZeroMeanUnitStddev(SimpleFeatureExtractor... extractors) {
    this.extractors = extractors;
    this.isTrained = false;
  }
  public List<Feature> extract(JCas view, Annotation focusAnnotation)
      throws CleartkExtractorException {
    List<Feature> result = new ArrayList<Feature>();
    for (SimpleFeatureExtractor extractor : this.extractors) {
      for (Feature feature : extractor.extract(view, focusAnnotation)) {
        if (!this.isTrained) {
          result.add(new ZeroMeanUnitStddevFeature(feature));
        } else {
          double mean = this.means.get(feature.getName());
          double stddev = this.stddevs.get(feature.getName());
          String name = Feature.createName("ZMUS", feature.getName());
          double value = ((Number) feature.getValue()).doubleValue();
          result.add(new Feature(name, (value - mean) / stddev));
        }
      }
    }
    return result;
  }
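  public void train(Iterable<? extends Instance<? extends Object>> instances) {
    // ArrayListMultimap keeps repeated values; HashMultimap stores each distinct
    // value only once per key, which would skew the mean and stddev
    Multimap<String, Double> featureValues = ArrayListMultimap.create();
    for (Instance<? extends Object> instance : instances) {
      for (Feature feature : instance.getFeatures()) {
        if (feature instanceof ZeroMeanUnitStddevFeature) {
          // the placeholder throws on getName()/getValue(), so read the wrapped
          // feature through an accessor (getWrappedFeature is added below)
          Feature wrapped = ((ZeroMeanUnitStddevFeature) feature).getWrappedFeature();
          featureValues.put(wrapped.getName(), ((Number) wrapped.getValue()).doubleValue());
        }
      }
    }
    this.means = ...
    this.stddevs = ...
    this.isTrained = true;
  }
  public void save(URL url) {
    ...write(this.means)...
    ...write(this.stddevs)...
  }
  public void load(URL url) {
    this.means = ...
    this.stddevs = ...
    this.isTrained = true;
  }
}

class ZeroMeanUnitStddevFeature extends Feature {
  private Feature feature;
  public ZeroMeanUnitStddevFeature(Feature feature) { this.feature = feature; }
  // accessor so that train() can reach the real name and value
  public Feature getWrappedFeature() { return this.feature; }
  public Object getValue() { throw new UnsupportedOperationException(); }
  public String getName() { throw new UnsupportedOperationException(); }
}
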

Original issue reported on code.google.com by steven.b...@gmail.com on 12 Jan 2012 at 2:01

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 12 Jan 2012 at 2:04

GoogleCodeExporter commented 9 years ago
A first pass at this has now been implemented and checked in.  We should now 
discuss how well we like this usage model and the naming conventions.  I have a 
feeling that we will want to refactor the recently checked-in files into 
different packages.

Original comment by lee.becker on 31 Jan 2012 at 8:14

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 5:46

GoogleCodeExporter commented 9 years ago

Original comment by lee.becker on 26 Jul 2012 at 3:59