paulgoetze / weka-jruby

Machine Learning & Data Mining with JRuby
MIT License

A deserialized classifier cannot be used to classify/evaluate/cross validate #10

Closed paulgoetze closed 6 years ago

paulgoetze commented 8 years ago

If a classifier is serialized and then deserialized again, you can’t call #classify, #evaluate, and #cross_validate on it, because an error is raised:

Weka::UnassignedTrainingInstancesError: Classifier is not trained with Instances. 
You can set the training instances with #train_with_instances.

It should be possible to call these directly on the deserialized classifier. See https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L94, which is called at the beginning of the mentioned methods.

kcning commented 7 years ago

This has little to do with serialization. Here's an example. bug_example.rb.txt

If we call build_classifier() and then classify(), we trigger the same error without any serialization involved. However, if we call train_with_instances() instead, it works fine.

The reason is that train_with_instances() sets @training_instances on the classifier while build_classifier() doesn't. @training_instances is a Ruby-side instance variable defined in Weka::Classifiers::Utils (https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L19); it does not exist on the Java object. The Java SerializationHelper doesn't know about it, so it is not written to the binary file together with the classifier (and it probably shouldn't be, because it can be huge).
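The mechanism can be reproduced in plain Ruby without Weka. This is only a sketch: ToyClassifier and its @model are hypothetical stand-ins, @model plays the role of the Java-side state that does get persisted, and the custom marshal_dump mimics a serializer that is unaware of @training_instances.

```ruby
# Hypothetical stand-in for a Weka classifier; not the weka-jruby implementation.
class ToyClassifier
  class UnassignedTrainingInstancesError < StandardError; end

  attr_reader :training_instances

  # Sets only the "Java-side" model state (here: the most frequent label).
  def build_classifier(instances)
    labels = instances.map(&:last)
    @model = labels.group_by(&:itself).max_by { |_, group| group.size }.first
    self
  end

  # Additionally remembers the training data in a Ruby instance variable.
  def train_with_instances(instances)
    @training_instances = instances
    build_classifier(instances)
  end

  # Guarded like #classify in the issue: refuses to run without training data.
  def classify(_values)
    if @training_instances.nil?
      raise UnassignedTrainingInstancesError, 'Classifier is not trained with Instances.'
    end
    @model
  end

  # Mimic a serializer that only knows about the model state.
  def marshal_dump
    @model
  end

  def marshal_load(model)
    @model = model
  end
end

data = [[5.1, 'yes'], [4.9, 'yes'], [6.0, 'no']]
trained  = ToyClassifier.new.train_with_instances(data)
restored = Marshal.load(Marshal.dump(trained))

trained.classify([5.0])      # works, @training_instances is set
restored.training_instances  # nil, so restored.classify would raise
```

A classifier created with ToyClassifier.new.build_classifier(data) ends up in the same state as the restored one, which matches the observation that the error is not specific to serialization.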

If we remove ensure_trained_with_instances! in https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L47 then calling classify() on a deserialized classifier will work.

Storing training dataset into @training_instances indeed is convenient but imho we shouldn't require its existence for deserialized classifiers.

paulgoetze commented 7 years ago

@kcning raising an error when not having built/trained the classifier is expected behaviour from Weka.

From http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html#buildClassifier-weka.core.Instances- (description for classifyInstance)

"The instance has to belong to a dataset when it's being classified"

Taking your example:

instances = Weka::Core::Instances.from_arff('dataset.arff')
instances.class_attribute = :class

j48 = Weka::Classifiers::Trees::J48.new
j48.classify_instance(instances.first) # using Java’s method classify_instance here

raises a Java::JavaLang::NullPointerException

The idea of storing the training dataset was to avoid this exception, which is not very useful to the user, and to substitute it with a more informative one (UnassignedTrainingInstancesError).

"If we remove ensure_trained_with_instances! in https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L47 then calling classify() on a deserialized classifier will work."

You are right. But this would leave us with the unhelpful Java exception and would no longer allow using both values and instances as input for #classify (to build an instance from values, it needs the training dataset's attribute information: https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L105-L117).

In my opinion, just passing an array of values to #classify is pretty convenient. So in order to make it all work, we could do the following:

In much the same manner, #distribution_for and also #evaluate and #cross_validate would have to be adjusted.

The overall restriction is then: you can only pass an array of values to classifiers that were not deserialized (i.e. those with training_instances available).
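A minimal plain-Ruby sketch of that restriction follows. It is not the weka-jruby code: SketchClassifier and ToyInstance are hypothetical names, and the threshold rule in classify_instance merely stands in for a trained model. An instance is always accepted, while an array of values is only accepted when training_instances is available.

```ruby
# Hypothetical stand-in for a Weka instance.
ToyInstance = Struct.new(:features)

# Hypothetical classifier illustrating the proposed input handling.
class SketchClassifier
  class UnassignedTrainingInstancesError < StandardError; end

  # nil for a deserialized classifier, set for a freshly trained one.
  attr_accessor :training_instances

  def classify(instance_or_values)
    instance =
      if instance_or_values.is_a?(ToyInstance)
        instance_or_values                 # usable even without training data
      else
        instance_from(instance_or_values)  # needs the training data's attributes
      end
    classify_instance(instance)
  end

  private

  def instance_from(values)
    if training_instances.nil?
      raise UnassignedTrainingInstancesError,
            'Pass an instance, or set the training instances to classify an array of values.'
    end
    ToyInstance.new(values)
  end

  # Placeholder for the underlying (Java) classification call.
  def classify_instance(instance)
    instance.features.first > 5.5 ? 'no' : 'yes'
  end
end

deserialized = SketchClassifier.new
deserialized.classify(ToyInstance.new([5.0]))  # accepted
# deserialized.classify([5.0])                 # would raise UnassignedTrainingInstancesError
```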

kcning commented 7 years ago

Umm, I guess I didn't express it well. Let's have another example. bug_example_2.zip

Here we build and serialize the J48 classifier. We then load it into another object in the same running program, and we can use it to classify instances if we call the Java method directly. We can't do so with classify() because it requires @training_instances to be non-nil.

The instance being classified doesn't need to be in the training dataset, as long as it has the same format (same number and types of attributes). The loaded classifier is trained and works fine, but we can't tell that from @training_instances, so perhaps we need another way to check whether a classifier is ready for its job.

Indeed, requiring users to pass a Weka Instance to classify() is pretty inconvenient. We shouldn't store the whole training dataset, though, if the purpose is simply to allow users to pass an array of values to classify().

For example, we can store the Attributes of the training dataset inside the classifier (in Ruby space) and write them to a second file whenever the classifier is serialized. When we deserialize a classifier, we also deserialize the corresponding file, so the loaded classifier knows its instance format.
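Roughly, the idea could look like the following plain-Ruby sketch. Everything here is an assumption for illustration: save_with_schema, load_with_schema, SchemaAttribute and the ".schema.json" suffix are made-up names, Marshal stands in for Java's SerializationHelper, and JSON is just one possible format for the extra file.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical, minimal description of one attribute of the training dataset.
SchemaAttribute = Struct.new(:name, :type)

# Write the model to `path` and the attribute schema to a sidecar file next to it.
def save_with_schema(model, attributes, path)
  File.binwrite(path, Marshal.dump(model))
  schema = attributes.map { |a| { 'name' => a.name, 'type' => a.type } }
  File.write("#{path}.schema.json", JSON.generate(schema))
end

# Read both files back and return the model together with its attribute schema.
def load_with_schema(path)
  model = Marshal.load(File.binread(path))
  schema = JSON.parse(File.read("#{path}.schema.json"))
  attributes = schema.map { |a| SchemaAttribute.new(a['name'], a['type']) }
  [model, attributes]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'j48.model')
  attributes = [SchemaAttribute.new('outlook', 'nominal'),
                SchemaAttribute.new('temperature', 'numeric')]

  save_with_schema({ 'threshold' => 5.5 }, attributes, path)
  model, restored_attributes = load_with_schema(path)
  # restored_attributes carries the instance format without the training data itself
end
```

The sidecar file stays small because it only contains attribute names and types, not the instances themselves.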

paulgoetze commented 7 years ago

Yeah, good point, I missed that we only need the attributes' "schema". Thanks for the examples. Your idea with the extra serialized file looks like a decent solution to me.

I will work on an implementation based on your ideas within the next few days.

BTW: it's the same scenario for clusterers: https://github.com/paulgoetze/weka-jruby/blob/01b2c08fe0779605cfe2da16f8e940531f3abd5e/lib/weka/clusterers/utils.rb