hcouch21 / styloproject

Automatically exported from code.google.com/p/styloproject

N-Gram Extraction Creates FeatureNames with Invalid Characters #4

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Make sure the NGramFreq LinguisticFeature is enabled.
2. Train on any corpus.
3. Look inside the training file, for example <corpus>/stylo/training-weka.arff.

What is the expected output?
Each @attribute name should be a single word made up of alphanumeric characters 
and underscores (no spaces allowed).

What do you see instead?
@attribute Bigram_walk ( numeric
@attribute Bigram_, across numeric
@attribute Unigram_. numeric

Some punctuation may be acceptable, but perhaps we should replace every 
symbol with its written-out form, so that the last one would become: @attribute 
Unigram_Period
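
A minimal sketch of that renaming idea, written in Python only for illustration (it is not the project's code). The SYMBOL_NAMES table and the sanitize_feature_name helper are hypothetical; the issue itself only names "Period", so the other entries and the underscore-for-space rule are assumptions.

```python
# Hypothetical symbol table; only "Period" comes from the issue text.
SYMBOL_NAMES = {
    ".": "Period",
    ",": "Comma",
    "(": "LeftParen",
    ")": "RightParen",
    "%": "Percent",
}


def sanitize_feature_name(name):
    """Return a name built only from alphanumerics, underscores, and symbol names."""
    parts = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            parts.append(ch)  # already a legal character
        elif ch.isspace():
            parts.append("_")  # assumption: whitespace becomes an underscore
        else:
            # Fall back to a code-point tag for symbols missing from the table.
            parts.append(SYMBOL_NAMES.get(ch, "U%04X" % ord(ch)))
    return "".join(parts)


for raw in ["Bigram_walk (", "Bigram_, across", "Unigram_."]:
    print("@attribute", sanitize_feature_name(raw), "numeric")
```

One caveat with this approach: two different n-grams can map to the same name (for example a literal "Period" token and "."), so a real implementation would need a collision check.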

Original issue reported on code.google.com by Matthew.Tornetta@gmail.com on 6 Apr 2011 at 7:04

GoogleCodeExporter commented 9 years ago
This is fixed now. It was actually a much larger issue: the punctuation 
wasn't the problem; the spaces, percent signs, and Unicode characters were.

Some of the changes that went in:
- All feature names are now in quotes (to handle the spaces and percent signs)
- The ARFF files are read and written in UTF-8 format (to handle Unicode 
characters)
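
The two changes above can be sketched roughly as follows. This is Python for illustration only, not the actual fix; quote_arff_name, render_arff_header, and save_arff_header are hypothetical names, and the backslash escaping of embedded quotes is an assumption about what the ARFF reader accepts.

```python
def quote_arff_name(name):
    """Wrap a feature name in single quotes so spaces and '%' stay literal."""
    # Assumption: embedded backslashes and single quotes are backslash-escaped.
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def render_arff_header(relation, feature_names):
    """Build the @relation/@attribute header with every name quoted."""
    lines = ["@relation " + quote_arff_name(relation), ""]
    for name in feature_names:
        lines.append("@attribute " + quote_arff_name(name) + " numeric")
    return "\n".join(lines) + "\n"


def save_arff_header(path, relation, feature_names):
    """Write the header explicitly as UTF-8 instead of the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(render_arff_header(relation, feature_names))


print(render_arff_header("stylo", ["Bigram_walk (", "Bigram_, across", "Unigram_."]))
```

Reading the file back would need the same explicit encoding (open(path, encoding="utf-8")), otherwise non-ASCII feature names may be decoded differently on another platform.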

Original comment by ssbahe...@gmail.com on 11 Apr 2011 at 3:27