jmschrei / yahmm

Yet Another Hidden Markov Model repository.
MIT License

JSON representation of objects #33

Closed: jmschrei closed this issue 9 years ago

jmschrei commented 10 years ago

I think it would be better if we changed the underlying representation of all the objects to plain JSON. For example, a distribution might look something like this:

{
    "name": "NormalDistribution",
    "parameters": [ 5, 2 ],
    "summaries": []
}

and a state might look something like this:

{
    "name": "s1",
    "distribution": {
        "name": "NormalDistribution",
        "parameters": [ 5, 2 ],
        "summaries": []
    },
    "weight": 1.0
}

If we use the default JSON parser, then we can add more parameters at will without changing the read or write functions at all. The only drawback is that the text file becomes less readable.
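
A minimal sketch of that idea, assuming a hypothetical NormalDistribution class whose attributes mirror the JSON keys above (this is not yahmm's actual implementation). Because to_json and from_json pass the whole attribute dict through the standard json module, an attribute added later is written and read back without editing either function.

```python
import json


class NormalDistribution(object):
    """Hypothetical stand-in for a yahmm distribution; illustration only."""

    def __init__(self, parameters, summaries=None, **extra):
        self.name = "NormalDistribution"
        self.parameters = list(parameters)
        self.summaries = list(summaries or [])
        # Any additional keys found in the file are kept as plain attributes.
        self.__dict__.update(extra)

    def to_json(self):
        # Dump every attribute, so new attributes are written automatically.
        return json.dumps(self.__dict__, indent=4, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        fields = json.loads(text)
        fields.pop("name", None)  # the class itself sets the name
        return cls(**fields)


d = NormalDistribution([5, 2])
d.frozen = True  # a new attribute, added without editing to_json/from_json
restored = NormalDistribution.from_json(d.to_json())
```

This only works while every attribute is a type the json module already understands (numbers, strings, lists, dicts, booleans, None).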

adamnovak commented 10 years ago

Not sure JSON makes sense as the internal format (although proper Python objects aren't all that much smaller), but it could be a better serialization format than what we have now.

jmschrei commented 10 years ago

I meant as a serialization format, my mistake.

adamnovak commented 10 years ago

The Python objects we use already, probably. If we restrict ourselves to the types that the Python json module already knows how to serialize/deserialize, everything will have to be dicts and lists and other basic types, and we'll have to do silly things like monkey-patch HMM methods onto them.

At first glance it seems like we would want to implement the deserialization hooks described in <https://docs.python.org/2/library/json.html#json-to-py-table> and the serialization dispatch described in <https://docs.python.org/2/library/json.html#json.JSONEncoder.default> so we can save and load States and so on in JSON through the json module.

There are problems with the extensibility of this approach to new user-defined types though, because they would need their own serialization/deserialization hooks.
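
As a rough sketch of those two hooks, assuming a hypothetical, stripped-down State class and a made-up "class" tag key (neither is taken from yahmm): json.JSONEncoder.default handles serialization dispatch, and the object_hook argument of json.loads handles deserialization.

```python
import json


class State(object):
    """Hypothetical minimal State; yahmm's real class has more fields."""

    def __init__(self, name, weight=1.0):
        self.name = name
        self.weight = weight


class ModelEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects json cannot serialize on its own.
        if isinstance(obj, State):
            return {"class": "State", "name": obj.name, "weight": obj.weight}
        return json.JSONEncoder.default(self, obj)


def decode_object(d):
    # object_hook receives every decoded JSON object (a dict).
    if d.get("class") == "State":
        return State(d["name"], d["weight"])
    return d


text = json.dumps([State("s1"), State("s2", 0.5)], cls=ModelEncoder)
states = json.loads(text, object_hook=decode_object)
```

Each new type needs its own branch in both default and decode_object, which is the extensibility problem described above.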

How do you see a json in-memory representation working?

On Mon, Sep 15, 2014 at 4:58 PM, Jacob Schreiber wrote: "What do you think would be a good internal format?"

ewiger commented 10 years ago

Consider YAML, since JSON is a subset of YAML now.

With best regards, Y.Y.

jmschrei commented 10 years ago

I didn't mean we'd have JSON in-memory representation, we'd use Python objects. I meant considering that when we write a model out to a file, it's written out as a JSON as opposed to the format we use now. This could allow us to easily add or remove attributes without changing the reading and writing functions, if they were written correctly.

What advantages would YAML give that JSON would not? I don't know that much about it.

adamnovak commented 10 years ago

That sounds a lot better to me. We still have to solve the problem of deserializing user-created distributions that may or may not be in currently loaded modules. I think we should poke around inside pickle, and see how it manages to load the right module for things even when people use features like "import numpy as np".

jmschrei commented 10 years ago

Are you suggesting that we make it so that people can define custom distributions, write their models out, and have people without the code for that distribution still be able to use the model? I'm not sure that's possible with just a JSON format. Maybe we could provide two options: one that is pickle-like but only machine-readable, and one that is human-readable for the distributions which already have support.

adamnovak commented 10 years ago

No, I don't think we'll ever be able to really save the code for the distributions. But the use case I'm thinking of is more like this:

  1. A system-installed module provides a distribution
  2. I import the module, make an HMM using it, and save it
  3. I have a second script that doesn't import the module, and I load that HMM

In this case, loading the HMM ought to import the module; in fact, I don't really think we can see what the caller has imported, so we might need to import the module ourselves even if the script that wants to load the HMM has already done it.

It gets a little trickier if instead of a system-installed module, the place where the code lives is somewhere in the filesystem, like in MyDistributions.py next to the scripts. I don't know how well pickle's logic handles that case.
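
For reference, pickle stores a class by its module name and class name and imports that module when loading. A sketch of the same approach for a JSON format follows, with fractions.Fraction standing in for a distribution from a system-installed module; describe_class, load_class, and the "module"/"class" keys are hypothetical names for illustration.

```python
import importlib
import json
from fractions import Fraction


def describe_class(obj):
    # Record where the class lives, similar in spirit to what pickle stores.
    cls = type(obj)
    return {"module": cls.__module__, "class": cls.__name__}


def load_class(module_name, class_name):
    # import_module imports the module if needed, or returns the cached one
    # from sys.modules, regardless of what the caller has imported.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


record = json.dumps(describe_class(Fraction(1, 2)))
info = json.loads(record)
cls = load_class(info["module"], info["class"])
```

With this approach, a file such as MyDistributions.py would still have to be importable (for example, on sys.path) at load time. Importing whatever module a file names also means the file has to be trusted.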

jmschrei commented 10 years ago

I'm not sure if it's worth the extra effort when the user can simply write "from MyDistributions import *" to solve the problem. I'll take a look at what pickle does and see how difficult it would be.

adamnovak commented 10 years ago

That's the thing; I'm not sure if "from MyDistributions import *" is going to help if it's only in the caller's namespace, and not in yahmm. I don't really know much about serialization though.

jmschrei commented 10 years ago

It'd depend on how we implemented it. If the name attribute of the object were the same as the class name, as it is now, you could just use eval, or one of its safer brethren. That would have to be specified, though.
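
One possible safer alternative to eval, sketched under the assumption that the library keeps an explicit registry of distribution classes (register, lookup, and _REGISTRY are hypothetical names, not part of yahmm): the name stored in the file is only ever used as a dictionary key, and user-defined classes become visible to the library by registering themselves.

```python
_REGISTRY = {}


def register(cls):
    """Class decorator: make cls findable by its class name."""
    _REGISTRY[cls.__name__] = cls
    return cls


def lookup(name):
    # Unlike eval, this can only return classes that were registered.
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError("unknown distribution: %r" % name)


@register
class NormalDistribution(object):
    """Hypothetical stand-in for a distribution class."""

    def __init__(self, *parameters):
        self.parameters = list(parameters)


spec = {"name": "NormalDistribution", "parameters": [5, 2]}
dist = lookup(spec["name"])(*spec["parameters"])
```

An unregistered name such as "os.system" raises ValueError instead of being evaluated.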

jmschrei commented 9 years ago

This has been merged into pomegranate, as will all future changes.