laito / cleartk

Automatically exported from code.google.com/p/cleartk

MalletStringOutcomeDataWriter will use only the last word of an outcome when generating Mallet input instances #404

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

Train a Mallet classifier with String outcomes that contain spaces (" "), 
then use the trained model to classify instances.

What is the expected output?

A classification result based on the set of outcomes used for training.

What do you see instead?

A classification result based only on the last word of each outcome used for 
training.

What version of the product are you using? On what operating system?

ClearTK version 1.4.1.
OS: Mac OS 10.8.5 and Ubuntu Linux

Comments:

My outcome labels are strings that contain spaces (" "). The ClearTK code that 
writes the training instances to the training-data.mallet file puts the 
outcome label in the last field of each line. When this file is parsed to 
serialize the instances into the Mallet format, the parser assumes that the 
outcome label is the substring after the last space in the line. As a result, 
only the last word of each outcome label is used as the label for the trainer.

See org.cleartk.classifier.mallet.InstanceListCreator.DataIterator.next().
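
To illustrate the behavior described above, here is a minimal, self-contained sketch. It is not the ClearTK source: the class `OutcomeParsingSketch`, its methods, and the line format (space-separated features followed by the outcome) are all hypothetical stand-ins. It shows how taking the substring after the last space collapses two distinct outcomes into one label, and how a possible caller-side workaround (replacing spaces in the outcome before it is written) would keep them apart.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OutcomeParsingSketch {

    // Hypothetical stand-in for the parsing described in this report:
    // everything after the last space on the line is taken as the outcome.
    static String labelAfterLastSpace(String line) {
        return line.substring(line.lastIndexOf(' ') + 1);
    }

    // Hypothetical workaround on the caller's side: make the outcome a
    // single token by replacing spaces before it is written to the file.
    static String encodeOutcome(String outcome) {
        return outcome.replace(' ', '_');
    }

    // Assumed line format: space-separated features, outcome as last field.
    static String toLine(String features, String outcome) {
        return features + " " + outcome;
    }

    public static void main(String[] args) {
        List<String> outcomes = Arrays.asList("very positive", "not positive");
        Set<String> parsedRaw = new LinkedHashSet<>();
        Set<String> parsedEncoded = new LinkedHashSet<>();
        for (String outcome : outcomes) {
            String features = "f1:1.0 f2:1.0";
            parsedRaw.add(labelAfterLastSpace(toLine(features, outcome)));
            parsedEncoded.add(
                labelAfterLastSpace(toLine(features, encodeOutcome(outcome))));
        }
        System.out.println(parsedRaw);     // prints [positive]
        System.out.println(parsedEncoded); // prints [very_positive, not_positive]
    }
}
```

With the raw outcomes, both training labels are read back as "positive", which matches the symptom reported here. The underscore encoding is only one option and would need to be reversed when reading classification results; it also assumes no two outcomes differ only by a space versus an underscore.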

Original issue reported on code.google.com by MarceloT...@gmail.com on 2 May 2014 at 6:08

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 2 May 2014 at 7:25