What steps will reproduce the problem?
1. Have a corpus that contains mixed-case text or punctuation
2. Run any of the algorithms
What is the expected output? What do you see instead?
The output should have tokens lower-cased where needed and punctuation handled
according to user-specified rules.
Ideally, we could support some type of filter that takes in a Document and
transforms it according to whatever rules the user specifies. Could this be
combined with the token filter and IteratorFactory? Or should it be a separate
step that lives entirely in GenericMain?
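
To make the proposal more concrete, here is a rough sketch of what such a filter could look like. Everything in it is hypothetical: the names DocumentFilter and DocumentFilters are made up for illustration and do not exist in the code base, and the sketch works on plain strings instead of the real Document type. It only shows how independent rules (lower-casing, punctuation replacement) might be chained in a user-chosen order.

```java
import java.util.Locale;
import java.util.regex.Matcher;

/** Hypothetical interface: transforms the raw text of one document. */
interface DocumentFilter {
    String apply(String text);

    /** Returns a filter that runs this filter first, then {@code next}. */
    default DocumentFilter andThen(DocumentFilter next) {
        return text -> next.apply(apply(text));
    }
}

/** Hypothetical factory for a few rules a user might select. */
final class DocumentFilters {
    private DocumentFilters() {}

    /** Lower-cases the text with a locale-independent rule. */
    static DocumentFilter lowerCase() {
        return text -> text.toLowerCase(Locale.ROOT);
    }

    /** Replaces each ASCII punctuation character with {@code replacement}. */
    static DocumentFilter replacePunctuation(String replacement) {
        // Quote the replacement so '$' and '\' are treated literally.
        String quoted = Matcher.quoteReplacement(replacement);
        return text -> text.replaceAll("\\p{Punct}", quoted);
    }

    /** Trims the text and collapses runs of whitespace to one space. */
    static DocumentFilter collapseWhitespace() {
        return text -> text.trim().replaceAll("\\s+", " ");
    }
}
```

With these pieces, `DocumentFilters.lowerCase().andThen(DocumentFilters.replacePunctuation(" ")).andThen(DocumentFilters.collapseWhitespace())` applied to `"Hello, World! It's 2011."` gives `"hello world it s 2011"`. Whether a chain like this would be applied inside IteratorFactory or as a separate step in GenericMain is still the open question above.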
Original issue reported on code.google.com by David.Ju...@gmail.com on 17 Jul 2011 at 12:16