Open tracegilton opened 6 years ago
Interesting. I guess we should have thought this through before publishing the new code. @ibnesayeed do you have any thoughts?
Without having taken a look yet, I imagine you could do some metaprogramming to open the class and add the backend attribute. But, I'm not sure. I'll give it some thought.
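A minimal sketch of that metaprogramming idea, using a stand-in class rather than the real `ClassifierReborn::Bayes`. `LegacyModel`, `FakeBackend`, and `patch_backend` are hypothetical names made up for illustration; whether the same trick is enough for a real marshaled classifier is untested here.

```ruby
# Stand-in for a classifier that was marshaled before backends existed (hypothetical).
class LegacyModel
  def initialize
    @total_words = 7
  end
end

dumped = Marshal.dump(LegacyModel.new)

# Simulate upgrading the gem: the class now expects a @backend attribute.
class LegacyModel
  attr_reader :backend, :total_words
end

# Hypothetical placeholder for whatever backend object the new code expects.
FakeBackend = Struct.new(:name)

# Add the missing attribute to an object loaded from an old dump.
def patch_backend(model, backend)
  unless model.instance_variable_defined?(:@backend)
    model.instance_variable_set(:@backend, backend)
  end
  model
end

restored = Marshal.load(dumped)
puts restored.backend.inspect # the old dump carries no backend
patched = patch_backend(restored, FakeBackend.new("memory"))
puts patched.backend.name
```

Setting the attribute only makes the object look right. Any counts still held in the old instance variables would have to be copied into the backend separately.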
Marshaling a complex object always risks breaking compatibility when the data structure or any other attribute changes. I remember wasting a month figuring out why a Weka model was not predicting anything; the culprit turned out to be that the model had been built with a different version of Weka than the one used to load the model file.
I think we need to implement an importer/exporter that serializes just the data without tying it to classes or other object state. The output can be something like JSON, YAML, or even Google's Protocol Buffers. This will help migrate models not only from one version to another, but also from one backend to another.
Thank you for the replies. Exporting as YAML seems like a good way for me to move the trained data into the new structure, but I am too unfamiliar with the inner workings of the backends code to know what impact this will have.
I was able to use a new ClassifierReborn exported to YAML as a template and added in my old data and training totals. When re-importing that YAML structure, it looks correct and it can classify data but training new data still fails.
If nothing else, I can use the YAML export to re-train a new classifier, looping over each word for `weight` number of times :sweat_smile:
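That fallback could look roughly like the sketch below. `TinyClassifier` is a hypothetical stand-in with a `train(category, text)` method, and the `exported` hash assumes the word weights are integers keyed by category.

```ruby
# Stand-in classifier with a train(category, text) method (hypothetical).
class TinyClassifier
  attr_reader :counts

  def initialize
    @counts = Hash.new { |hash, category| hash[category] = Hash.new(0) }
  end

  def train(category, text)
    text.split.each { |word| @counts[category][word] += 1 }
  end
end

# Replay exported word weights: train each word `weight` times.
def retrain(classifier, exported_categories)
  exported_categories.each do |category, words|
    words.each do |word, weight|
      weight.times { classifier.train(category, word) }
    end
  end
  classifier
end

exported = {
  "Ham"  => { "sunday" => 1, "holiday" => 1, "work" => 2 },
  "Spam" => { "holiday" => 1, "winner" => 2 }
}
rebuilt = retrain(TinyClassifier.new, exported)
puts rebuilt.counts["Ham"]["work"]
```

One caveat: with a real classifier, every `train` call probably counts as a separate training, so the word counts would match but the training totals would likely end up higher than in the original model.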
The point of an object-independent, data-only serialization would be to decouple the data from the class structure and object state. Exporting such a data structure means looping through all the stored keys and serializing them in a backend-independent way. Importing it means populating the backend store with those keys and values from the serialized data rather than loading a ready-made object (as in the case of marshaling).
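As a rough illustration of that export/import loop, here is a sketch against a made-up key-value store. `SketchMemoryStore`, `export_data`, and `import_data` are hypothetical names, not part of any existing backend, and the flat key layout is an assumption.

```ruby
# Hypothetical in-memory store standing in for a real backend.
class SketchMemoryStore
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end

  def keys
    @data.keys
  end
end

# Export: walk every stored key and copy it into a plain Hash.
def export_data(store)
  store.keys.each_with_object({}) { |key, out| out[key] = store.get(key) }
end

# Import: populate a (possibly different) store from the plain Hash.
def import_data(store, data)
  data.each { |key, value| store.set(key, value) }
  store
end

old_store = SketchMemoryStore.new
old_store.set("total_words", 7)
old_store.set("Ham:work", 2)

new_store = import_data(SketchMemoryStore.new, export_data(old_store))
puts new_store.get("Ham:work")
```

Because only a plain Hash crosses the boundary, the target store could be a different class or a different version than the source.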
@ibnesayeed I feel like an import/export class is probably the right solution. I can take a crack at it maybe this weekend? If you want to take a look, feel free.
Once you have a PR in place, I will be happy to review it. My current priorities have my hands tied; otherwise I would have implemented it myself.
Sounds good. I'll dig in as soon as I can.
We need to implement the following two methods in each backend and have a proxy/alias method to call them from the main `Bayes` class:

    def import(yaml_data_file)
      # Read the yaml_data_file and populate the backend in use
    end

    def export(yaml_data_file)
      # Traverse the data structure in the used backend and serialize it to the yaml_data_file
    end

Instead of taking a file name as the parameter, these methods could accept and return plain objects, with the serialization/deserialization responsibility moved into a task or some other method. That way YAML support is not baked in, and alternate formats can be used without changing the underlying implementation.
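A small sketch of that separation, assuming the backend hands over a plain Hash. The `SERIALIZERS` table and the `serialize`/`deserialize` helpers are hypothetical; only Ruby's standard `yaml` and `json` libraries are used.

```ruby
require "json"
require "yaml"

# Hypothetical serializers: each maps a plain Hash to a String and back.
SERIALIZERS = {
  yaml: { dump: ->(data) { YAML.dump(data) }, load: ->(text) { YAML.safe_load(text) } },
  json: { dump: ->(data) { JSON.generate(data) }, load: ->(text) { JSON.parse(text) } }
}.freeze

def serialize(data, format)
  SERIALIZERS.fetch(format)[:dump].call(data)
end

def deserialize(text, format)
  SERIALIZERS.fetch(format)[:load].call(text)
end

# The backend only ever sees this Hash, never a file name or a format.
data = { "total_words" => 7, "categories" => { "Ham" => { "work" => 2 } } }
%i[yaml json].each do |format|
  puts deserialize(serialize(data, format), format) == data
end
```

Supporting another format would then mean adding one entry to the table, without touching the backends.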
The exported YAML data file (say, `bayes-data.yml`) will look something like this:

    ---
    # Imported from ClassifierReborn::Bayes
    total_words: 7
    total_trainings: 3
    category_counts:
      - Ham:
          - training: 2
          - word: 4
      - Spam:
          - training: 1
          - word: 3
    categories:
      - Ham:
          - sunday: 1
          - holiday: 1
          - work: 2
      - Spam:
          - holiday: 1
          - winner: 2

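To sanity-check a file like this before importing it, something along these lines could work. This is a sketch that assumes the nested list-of-mappings layout for the example data; `merge_pairs` and `consistent?` are hypothetical helpers.

```ruby
require "yaml"

SAMPLE = <<~EOS
  ---
  total_words: 7
  total_trainings: 3
  category_counts:
    - Ham:
        - training: 2
        - word: 4
    - Spam:
        - training: 1
        - word: 3
  categories:
    - Ham:
        - sunday: 1
        - holiday: 1
        - work: 2
    - Spam:
        - holiday: 1
        - winner: 2
EOS

# Flatten a list of single-key hashes ([{a: 1}, {b: 2}]) into one Hash.
def merge_pairs(list)
  list.reduce({}) { |merged, pair| merged.merge(pair) }
end

# Check that the top-level totals agree with the per-category numbers.
def consistent?(doc)
  counts = doc["category_counts"].map { |entry| merge_pairs(entry.values.first) }
  words = merge_pairs(doc["categories"]).values.map { |list| merge_pairs(list).values.sum }
  counts.sum { |c| c["word"] } == doc["total_words"] &&
    counts.sum { |c| c["training"] } == doc["total_trainings"] &&
    words.sum == doc["total_words"]
end

puts consistent?(YAML.safe_load(SAMPLE))
```

A check like this would catch hand-edited files where the per-category numbers no longer add up to the totals.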
I'm trying to think if we need to do a minor release of a pre-backend version to make this work. Thoughts?
That's a good idea indeed. This feature can be released as minor versions for both the pre- and post-backend releases at the same time.
Ok, I'll try building it against 2.1, and releasing this as 2.1.1 and 2.2.1. It kind of breaks semver to add new functionality in a patch version, but I don't see a way around that.
Yes, the backend change was big enough to warrant a major version bump, but we couldn't see it coming. So, for now, 2.1.1 and 2.2.1 will do the trick if we aren't too religious about semver.
I agree. I'll take a look tonight then and see if I can pull together a poc.
OK, I have a WIP PR at #174.
Hi all,
I am attempting to upgrade a classifier that I built using a previous version of ClassifierReborn::Bayes. It looks like I initialized the classifier before backends were added, so now I am running into compatibility issues. I store the classifier structure on disk with Marshal, and now when it is loaded it does not have the `backend` attribute that the newer gem expects. Is there a best practice for how to update the older classifier so that it will be compatible with the backends system?