N-Gram Extraction Creates FeatureNames with Invalid Characters

hcouch21 / styloproject

Automatically exported from code.google.com/p/styloproject


What steps will reproduce the problem?
1. Make sure the NGramFreq LinguisticFeature is enabled.
2. Train on any corpus.
3. Look inside the training file, for example <corpus>/stylo/training-weka.arff.

What is the expected output?
Each @attribute name should be a single word made up of alphanumeric characters
and underscores (no spaces allowed).

What do you see instead?
@attribute Bigram_walk ( numeric
@attribute Bigram_, across numeric
@attribute Unigram_. numeric

Some punctuation may be acceptable, but perhaps we want to replace every symbol
with its written-out form, so that the last one would be: @attribute
Unigram_Period
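
A minimal sketch of the written-out-form idea, in Python for illustration only. The function name `spell_out_symbols` and the `SYMBOL_NAMES` table are hypothetical and not part of the project, and turning spaces into underscores is an extra assumption made so the result is a single word. Symbols missing from the table pass through unchanged, so the table would need to cover whatever the corpus actually produces.

```python
# Hypothetical symbol table (an assumption, not taken from the project).
SYMBOL_NAMES = {
    ".": "Period",
    ",": "Comma",
    "(": "LeftParen",
    ")": "RightParen",
    "%": "Percent",
}


def spell_out_symbols(feature_name):
    """Replace whitespace with '_' and each listed symbol with its name."""
    parts = []
    for ch in feature_name:
        if ch.isspace():
            # Assumption: spaces become underscores so the name stays one word.
            parts.append("_")
        elif ch in SYMBOL_NAMES:
            parts.append(SYMBOL_NAMES[ch])
        else:
            # Letters, digits, underscores, and unlisted symbols are kept as-is.
            parts.append(ch)
    return "".join(parts)
```

With this table, the three attribute names reported above would become Bigram_walk_LeftParen, Bigram_Comma_across, and Unigram_Period.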

Original issue reported on code.google.com by Matthew.Tornetta@gmail.com on 6 Apr 2011 at 7:04

This is fixed now. It was actually a much larger issue: the punctuation wasn't the problem; it was the spaces, percent signs, and Unicode characters. Some of the changes that went in:
- All feature names are now in quotes (to handle the spaces and percent signs).
- The ARFF files are read and written in UTF-8 format (to handle Unicode characters).
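
For readers hitting the same problem, the two changes could look roughly like the sketch below. It is written in Python for illustration and is not the project's actual code: the function names are hypothetical, and the escaping rule (backslash-escaping backslashes and single quotes inside a single-quoted name) is an assumption that should be checked against whichever ARFF reader is in use.

```python
import io


def quote_attribute_name(name):
    """Wrap a feature name in single quotes (escaping rule is an assumption)."""
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def write_attribute_header(stream, feature_names):
    """Write one quoted numeric @attribute line per feature name."""
    for name in feature_names:
        stream.write("@attribute " + quote_attribute_name(name) + " numeric\n")


def save_attribute_header(path, feature_names):
    """Write the attribute lines to path, explicitly encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as stream:
        write_attribute_header(stream, feature_names)
```

Quoting every name, rather than only the ones that contain spaces or percent signs, keeps the writer simple and avoids having to decide which characters are risky.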

N-Gram Extraction Creates FeatureNames with Invalid Characters #4