Closed by paulgoetze 6 years ago
This has little to do with serialization. Here's an example. bug_example.rb.txt
If we call build_classifier() and then classify(), we trigger the error even without serialization. However, if we call train_with_instances(), it works fine.
The reason is that train_with_instances() defines @training_instances on the classifier while build_classifier() doesn't. @training_instances is a Ruby-side instance variable defined in Weka::Classifiers::Utils (https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L19), not in Java. During serialization it is not stored with the classifier in the binary file (and it probably shouldn't be, because it can be huge), since the Java SerializationHelper doesn't know it exists.
If we remove ensure_trained_with_instances! in https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L47 then calling classify() on a deserialized classifier will work.
Storing the training dataset in @training_instances is indeed convenient, but imho we shouldn't require it to exist for deserialized classifiers.
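The effect described above can be imitated in plain Ruby without Weka. This is only an analogy under stated assumptions: FakeClassifier and its methods are hypothetical stand-ins, and marshal_dump plays the role of a serializer that only sees the Java-side state and knows nothing about extra Ruby instance variables.

```ruby
# Plain-Ruby analogy, not weka-jruby code. All names are hypothetical.
class UnassignedTrainingInstancesError < StandardError; end

class FakeClassifier
  attr_reader :training_instances

  # Stands in for the Java-side build_classifier: trains the model only.
  def build_classifier(instances)
    @model = instances.map(&:last).uniq
    self
  end

  # Stands in for the Ruby-side wrapper: also remembers the dataset.
  def train_with_instances(instances)
    @training_instances = instances
    build_classifier(instances)
  end

  def trained?
    !@model.nil?
  end

  # Mimics the guard discussed above: it checks the dataset, not the model.
  def classify(_values)
    raise UnassignedTrainingInstancesError, 'no training instances set' if @training_instances.nil?

    @model.first
  end

  # Only the "Java-side" state is written, as a serializer unaware of
  # Ruby instance variables would do.
  def marshal_dump
    @model
  end

  def marshal_load(model)
    @model = model
  end
end

data = [[5.1, 'yes'], [4.9, 'no']]
trained = FakeClassifier.new.train_with_instances(data)
restored = Marshal.load(Marshal.dump(trained))

puts restored.trained?                    # the model survived the round trip
puts restored.training_instances.inspect  # the Ruby-side dataset did not
```

In this sketch, calling classify on `restored` (or on an object that only went through build_classifier) raises the custom error even though the model itself is usable.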
@kcning raising an error when not having built/trained the classifier is expected behaviour from Weka.
From http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html#buildClassifier-weka.core.Instances- (description for classifyInstance):
The instance has to belong to a dataset when it's being classified
Taking your example:
instances = Weka::Core::Instances.from_arff('dataset.arff')
instances.class_attribute = :class
j48 = Weka::Classifiers::Trees::J48.new
j48.classify_instance(instances.first) # using Java’s method classify_instance here
raises a Java::JavaLang::NullPointerException
The idea of storing the training dataset was to avoid this exception, which is not very useful for the user, and substitute it with a more informative one (UnassignedTrainingInstancesError).
If we remove ensure_trained_with_instances! in https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L47 then calling classify() on a deserialized classifier will work.
You are right. But this would leave us with the unhelpful Java exception and would no longer allow both arrays of values and instances as input for #classify. (It needs to know the training dataset to get the attributes info: https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L105-L117)
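To illustrate why the attribute info is needed for raw values, here is a minimal sketch. It assumes that a nominal value is stored internally as the index of its label in the attribute's value list, which is how Weka represents nominal values as far as I know. AttributeInfo and values_to_internal are hypothetical names, not weka-jruby API.

```ruby
# Hypothetical sketch: converting a raw array of values into an internal
# numeric representation requires knowing each attribute's type and labels.
AttributeInfo = Struct.new(:name, :type, :labels)

def values_to_internal(attributes, raw_values)
  unless attributes.size == raw_values.size
    raise ArgumentError, "expected #{attributes.size} values, got #{raw_values.size}"
  end

  attributes.zip(raw_values).map do |attribute, value|
    case attribute.type
    when :numeric
      Float(value)
    when :nominal
      # A nominal value can only be encoded if the label list is known.
      index = attribute.labels.index(value.to_s)
      raise ArgumentError, "unknown label #{value.inspect} for #{attribute.name}" if index.nil?

      index.to_f
    else
      raise ArgumentError, "unsupported attribute type #{attribute.type.inspect}"
    end
  end
end

attributes = [
  AttributeInfo.new('outlook', :nominal, %w[sunny rainy]),
  AttributeInfo.new('temperature', :numeric, nil)
]

p values_to_internal(attributes, ['rainy', 70])
```

Without the attribute list there is no way to tell which position 'rainy' has among the labels, which is why #classify with an array depends on something like the training dataset.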
In my opinion, just passing an array of values to #classify is pretty convenient. So in order to make it all work, we could do the following:
- #classify to take only an Instance object (no arrays of values anymore). With this we could just catch the NullPointerException and raise the more informative error. Maybe the method #classify should then be deprecated and we should add a new/overwrite #classify_instance.
- #classify_values method which takes an array of values as input and creates an instance on the fly. This again would need to know about the training dataset to create this Instance.

In much the same manner, #distribution_for and also #evaluate and #cross_validate would have to be adjusted.
The overall restriction is then: you can only use an array of values for classifiers that have not been deserialized (those with training_instances available).
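The "catch the low-level exception and raise a more informative one" idea could look roughly like this. It is a plain-Ruby sketch: FakeJavaClassifier, NullPointerLikeError and InformativeClassify are hypothetical stand-ins. Under JRuby the rescue clause would name Java::JavaLang::NullPointerException instead.

```ruby
# Hypothetical sketch, not weka-jruby code.
class NullPointerLikeError < StandardError; end
class UnassignedTrainingInstancesError < StandardError; end

# Stands in for the Java classifier.
class FakeJavaClassifier
  def build_classifier(instances)
    @tree = instances.first
    self
  end

  # Fails with an uninformative low-level error when untrained.
  def classify_instance(_instance)
    raise NullPointerLikeError if @tree.nil?

    0.0
  end
end

# Wraps classify_instance and translates the low-level error.
module InformativeClassify
  def classify_instance(instance)
    super
  rescue NullPointerLikeError
    raise UnassignedTrainingInstancesError,
          'Classifier is not trained. Build or train it before classifying.'
  end
end

class FakeJavaClassifier
  prepend InformativeClassify
end

untrained = FakeJavaClassifier.new
begin
  untrained.classify_instance([1.0, 2.0])
rescue UnassignedTrainingInstancesError => e
  puts e.message
end

trained = FakeJavaClassifier.new.build_classifier([[1.0, 2.0]])
puts trained.classify_instance([1.0, 2.0])
```

The check here depends only on whether the underlying model fails, not on a stored dataset, so a trained classifier would pass it regardless of how it was created.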
Umm, I guess I didn't express it well. Let's have another example. bug_example_2.zip
Here we build and serialize the J48 classifier. We load it into another object in the same running program, and we can use it to classify instances if we call the Java method. We can't do so with classify() because it requires @training_instances not to be nil.
The instance being classified doesn't need to be in the training dataset, as long as it has the same format (the same number and types of attributes). The loaded classifier is trained and works fine, but we can't tell that from @training_instances, so perhaps we need another way to check whether a classifier is ready for its job.
Indeed, requiring users to pass a Weka Instance to classify() is pretty inconvenient. We shouldn't store the whole training dataset, though, if the purpose is simply to allow users to pass an array of values to classify().
For example, we can store the Attributes of the training dataset inside the classifier (in Ruby space) and write them to a separate file during serialization. When we deserialize a classifier, we also deserialize the corresponding file, so the loaded classifier knows its instance format.
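A rough sketch of that idea in plain Ruby, assuming the schema is a list of attribute names and types written as JSON next to the model file. SchemaAwareClassifier, the ".schema.json" suffix and the placeholder model are all hypothetical; in weka-jruby the model part would go through the Java SerializationHelper instead of Marshal.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical sketch: persist the attribute schema in a sidecar file.
class SchemaAwareClassifier
  attr_reader :attribute_schema

  def train(schema, rows)
    @attribute_schema = schema
    @model = rows.map(&:last).uniq # placeholder "model"
    self
  end

  # Writes the model and, next to it, the schema file.
  def serialize(path)
    File.binwrite(path, Marshal.dump(@model))
    File.write(self.class.schema_path(path), JSON.generate(@attribute_schema))
  end

  # Reads both files so the loaded classifier knows its instance format.
  def self.deserialize(path)
    classifier = new
    classifier.send(:restore,
                    Marshal.load(File.binread(path)),
                    JSON.parse(File.read(schema_path(path))))
    classifier
  end

  def self.schema_path(path)
    "#{path}.schema.json"
  end

  # Accepts a plain array of values and checks it against the stored schema.
  def classify(values)
    raise ArgumentError, "expected #{feature_count} values" unless values.size == feature_count

    @model.first
  end

  private

  def feature_count
    @attribute_schema.size - 1 # the last attribute is the class
  end

  def restore(model, schema)
    @model = model
    @attribute_schema = schema
  end
end

schema = [
  { 'name' => 'outlook', 'type' => 'nominal' },
  { 'name' => 'temperature', 'type' => 'numeric' },
  { 'name' => 'play', 'type' => 'nominal' }
]
rows = [['sunny', 85.0, 'no'], ['rainy', 70.0, 'yes']]

Dir.mktmpdir do |dir|
  path = File.join(dir, 'model.bin')
  SchemaAwareClassifier.new.train(schema, rows).serialize(path)

  loaded = SchemaAwareClassifier.deserialize(path)
  puts loaded.attribute_schema == schema
  puts loaded.classify(['sunny', 80.0])
end
```

The sidecar file stays small because it only holds attribute names and types, not the training rows.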
Yeah, good point, I missed that we only need the attributes' "schema". Thanks for the examples. Your idea with the extra serialized file looks like a decent solution to me.
I will work on an implementation based on your ideas within the next few days.
BTW: it's the same scenario for clusterers: https://github.com/paulgoetze/weka-jruby/blob/01b2c08fe0779605cfe2da16f8e940531f3abd5e/lib/weka/clusterers/utils.rb
If a classifier is serialized and then deserialized again, you can’t call #classify, #evaluate, and #cross_validate on it, because an error is raised.
It should be possible to call these directly on the deserialized classifier. See https://github.com/paulgoetze/weka-jruby/blob/develop/lib/weka/classifiers/utils.rb#L94, which is called at the beginning of the mentioned methods.