Open karlnapf opened 7 years ago
Hi, I'm new here and wish to take up this issue. So, kindly guide me through.
Well, write something that trains a model, serializes it, deserializes it, and then checks that applying it to test data gives the same results. Then make this happen automatically for all models.
Actually, I'm new to Open Source and know only the basics of Machine Learning (with no experience in any project), so can you please explain it to me in detail or assign me an easier issue? I want to be involved in this project, so please guide me.
Please read the "getting involved" guide in the wiki, about starting with Shogun. We can't hold your hands unfortunately. I suggest you pick something simpler if you don't know how to start here.
@karlnapf i want to take this up. Is the serialisation module implemented or not? Or should we only implement the testing part of it?
We have serialization working at the moment (at least in theory). The details under the hood will soon be replaced, but that doesn't matter for this task.
Try to come up with a simple way to test serialization / de-serialization of all models.
Algorithm for supervised methods (CMachine subclasses)
Unsupervised methods and others will work similarly, but let's start with the supervised ones.
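The train → serialize → deserialize → compare round trip described above can be sketched like this. This is a hedged illustration, not Shogun code: `CentroidModel` is a toy stand-in for a `CMachine` subclass, and `pickle` stands in for Shogun's serialization; a real test would call the machine's train/apply methods and Shogun's save/load instead.

```python
import pickle

# Toy stand-in model (NOT a Shogun class): nearest-centroid classifier,
# just enough to illustrate the round-trip test pattern.
class CentroidModel:
    def train(self, X, y):
        # store the mean feature value per class
        groups = {}
        for xi, yi in zip(X, y):
            groups.setdefault(yi, []).append(xi)
        self.centroids = {c: sum(v) / len(v) for c, v in groups.items()}

    def apply(self, X):
        # predict the class whose centroid is closest
        return [min(self.centroids, key=lambda c: abs(self.centroids[c] - xi))
                for xi in X]

X_train, y_train = [0.0, 0.2, 1.0, 1.2], [0, 0, 1, 1]
X_test = [0.1, 1.1]

model = CentroidModel()
model.train(X_train, y_train)
before = model.apply(X_test)

# serialize, deserialize, and check predictions are unchanged
restored = pickle.loads(pickle.dumps(model))
after = restored.apply(X_test)
assert after == before
```

The test's only real assertion is the last line: predictions before and after the serialization round trip must match.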
The best thing would be if we did this automatically for models of similar classes (supervised, unsupervised, kernel machine, Gaussian process, etc.)
You can start by writing a unit test for a single model, as a prototype. However, eventually we want to use some template engine to generate unit tests for all implementations we have of a particular class. We did a similar thing for serialization itself here, using jinja2. I would start by writing a simple test for, say, a CLinearMachine, and then make it be generated automatically for all subclasses.
The cmake file that generates the tests using jinja2 is here
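To make the template-expansion step concrete, here is a rough sketch of what the generation does. The real build drives jinja2 from CMake; this sketch uses the stdlib `string.Template` instead, and the test name `TrainedModelSerialization` and the machine list are illustrative, not the actual generated output.

```python
from string import Template

# Illustrative C++ test template; the real one is a jinja2 file.
template = Template("""\
TEST(TrainedModelSerialization, ${name})
{
    auto machine = some<C${name}>();
    /* train, serialize, deserialize, compare predictions */
}
""")

# Example subclass names; the real list would be collected from the class hierarchy.
machines = ["LibLinear", "Perceptron", "AveragedPerceptron"]
generated = "\n".join(template.substitute(name=m) for m in machines)
print(generated)
```

One template is written once and stamped out per subclass, which is exactly the "write a simple test, then generate it for all subclasses" workflow described above.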
@karlnapf I've made a prototype for LinearMachine subclasses, both for binary classification and regression, using jinja2, which follows the steps listed above (it also checks whether the predictions pre- and post-serialization are equal; don't know if that's useful).
https://github.com/micmn/shogun/blob/feature/serialization-test/tests/unit/io/LinearMachineSerialization_unittest.cc.jinja2
Generates: https://gist.github.com/micmn/6e447c6c491350e22c0efcb5091701ad
Is there a convenience function to generate random train/test data?
@micmn this is a great start! i'd love to have that fixture class actually implemented somewhere else so that everybody could use it within the unit tests, as currently every unit test uses/implements its own data generators... @karlnapf ?
@micmn i guess i just answered your question implicitly... no, we don't have one.. but now that we are talking about it, i believe we would need such functionality within the library.. @karlnapf ?
The data should definitely be generated elsewhere. But in fact, why don't we just use the datasets we already have? See the meta examples; I made sure that all datasets used there actually make some sense.
I don't think you need to distinguish between classification and regression in the code. If you just use apply rather than, say, apply_regression, then you get a CLabels pointer, and serializing it should still work.
...of course this is modulo the data passed -- maybe we can have a class that you can give a CMachine instance and that returns a pair of CFeatures and CLabels that correspond to the machine's problem type? Then the test code would be agnostic of the underlying problem type and therefore a bit cleaner?
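A sketch of that idea, in the spirit of the suggestion above: a helper that maps a machine's problem type to a matching (features, labels) pair. Everything here is invented for illustration (the function name `generate_data`, the string problem-type keys); a Shogun version would take a CMachine instance, inspect its problem type, and return CFeatures/CLabels.

```python
import random

# Hypothetical data-generation fixture: given a problem type, return a
# matching (features, labels) pair. Names and types are illustrative.
def generate_data(problem_type, n=20, dim=2, seed=0):
    rng = random.Random(seed)
    features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    if problem_type == "binary":
        labels = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    elif problem_type == "multiclass":
        labels = [float(rng.randrange(3)) for _ in range(n)]
    elif problem_type == "regression":
        labels = [rng.gauss(0, 1) for _ in range(n)]
    else:
        raise ValueError(problem_type)
    return features, labels

X, y = generate_data("binary")
```

With such a helper, the test body never needs to know which kind of machine it is exercising.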
While the explicit list of things to be tested at the end of the file is great, I wonder whether we could apply this to all CMachine instances by default, and have a blacklist instead? For that, we need some code that works on generic CMachine pointers first (maybe using the data generation trick I mentioned above)
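The blacklist idea reduces to a simple set difference at generation time. This is a minimal sketch with placeholder names; the actual class registry and the blacklist contents would come from Shogun's build, not from a hard-coded list.

```python
# Placeholder registry of all CMachine subclasses (illustrative names only).
all_machines = ["LibLinear", "Perceptron", "KNN", "SomeBrokenMachine"]

# Opt-out list for models whose serialization is known broken or unsupported.
blacklist = {"SomeBrokenMachine"}

# Everything not blacklisted gets a generated serialization test.
to_test = [m for m in all_machines if m not in blacklist]
```

The advantage over an opt-in list is that newly added models are tested by default and must be explicitly excused.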
Great start indeed, this is exactly what I had in mind. Want to send a PR for further discussion?
BTW I think we want a travis option to disable this thing as well, like -DTRAVIS_DISABLE_SERIALIZATION_TESTS
Great start indeed, this is exactly what I had in mind. Want to send a PR for further discussion?
Yeah, which name and folder are more appropriate for the template?
I would put it as tests/unit/base/trained_model_serialization_unittest.cc.jinja2
btw i'm not sure if we wanna have this as part of the unit testing framework at all....
Yeah, I was wondering about that too, but I think it should be, because this should work for all models. The dataset should be tiny though...
This is done, no? @micmn @vigsterkr
These are the classes that are tested so far: CLinearMachine, CKernelMachine, CLinearMulticlassMachine, CKernelMulticlassMachine, CNativeMulticlassMachine
You know what is missing? Shall I re-open this to finish?
BTW I think we can also do these tests without jinja2, as done here https://github.com/shogun-toolbox/shogun/blob/develop/tests/unit/base/SGObjectAll_unittest.cc#L235
Yeah, in the python script, instead of calling jinja we could generate a header that lists all the machines and then use a typed test; it should be doable.
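The generated header for that typed-test approach could be produced by a few lines of Python. A hedged sketch: the machine list here is illustrative, and only the `::testing::Types<...>` typelist form is actual GoogleTest usage; the typedef name is invented.

```python
# Sketch: instead of expanding a jinja2 template per class, emit one header
# declaring a gtest typelist, then run a single TYPED_TEST over it.
machines = ["CLibLinear", "CPerceptron", "CKNN"]  # illustrative class names
header = "typedef ::testing::Types<{}> MachineTypes;\n".format(", ".join(machines))
print(header)
```

The C++ side would then declare `TYPED_TEST_CASE` (or `TYPED_TEST_SUITE` in newer gtest) over `MachineTypes` and write the serialization round trip once, as in the SGObjectAll test linked above.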
Missing: for sure we talked about GPs.
Tree Machines as well (they don't even serialize atm)
We need to make sure that all of Shogun can dump trained models to disk!
Basically, every learning algorithm in shogun that is based on training/testing (so everything starting with CMachine) should be trained on a (small) example dataset, then serialized, then de-serialized, and finally checked that it can be applied to test data.
This task is to automate that.