marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License
186 stars 26 forks source link

NumPy float format serialization issues #159

Closed ejhumphrey closed 7 years ago

ejhumphrey commented 7 years ago

To be clear, what follows is arguably user error, but more graceful failure would have saved me some time (and I'm getting different behavior on different environments, which didn't help).

I made the mistake of populating an annotation with np.float32 values. You can too with the following (note that this isn't what I did, but is functionally equivalent).

>>> ann = jams.Annotation(namespace='tag_open', time=0, duration=2.0)
>>> data = np.array([[1.0, 1.0, 0.5]], dtype=np.float32)
>>> for time, duration, conf in data:
    ann.data.add_observation(
            time=time, duration=duration, value='foo',
            confidence=conf)

Cool, no obvious problems yet.

# I look mostly fine.
>>> print(ann.__json__)  

{'data': [{'duration': 1.0, 'confidence': 0.5, 'value': 'foo', 'time': 1.0}], 'duration': 2.0, 
 'annotation_metadata': {'version': '', 'corpus': '', 'annotator': {}, 'annotation_rules': '',
 'annotation_tools': '', 'data_source': '', 'curator': {'name': '', 'email': ''}, 'validation': ''}, 
 'namespace': 'tag_open', 'sandbox': {}, 'time': 0}

However, these are (silently) np.float32s, though they don't quite look it (other numbers that don't have the same representation in single / double precision floats have a better tell), and generally won't in tracebacks.

# I fail :o(
>>> jam = jams.JAMS(annotations=[ann]).save("deleteme.jams", strict=False)  

<snip json traceback>
TypeError: 0.5 is not JSON serializable

I'm wondering if the API could help cast non-standard numerical types here? perhaps as a converter dict like in pandas? or perhaps just over data arrays? If default, we'd also want to warn on the casting behavior.

Or, at the bare minimum, it'd be helpful to throw a more informative warning like "hey you have nonserializable types! they are X and they're in field Y. Go back and look at what you're putting in the data structure"

thoughts?

ejhumphrey commented 7 years ago

I suppose this could also potentially be covered in the add_observation method of the annotation, which could have a strict_dtypes=True parameter...?

bmcfee commented 7 years ago

We try to do some auto-casting through the serialize_obj helper function, but it's not comprehensive. Maybe there's some better way to do serialization callbacks?

ejhumphrey commented 7 years ago

hmmm, interesting ... I'm going to get what I was actually working on before stepping in this up and running, and then will take a peak at the method you mention, see if I can't suggest something.

bmcfee commented 7 years ago

For the problem you described above: would it suffice to force-cast time and duration fields to float upon construction of an Observation object?

It doesn't solve the more general problem, but I'm not sure the more general problem has a good solution.

UPDATE: guh, I just noticed that this would also apply to the confidence field too. I'll have to think on this.

bmcfee commented 7 years ago

Oh weird. It looks like if you drop the dtype=np.float32, everything works properly. Apparently float64 is json serializable, but float32 is not.

bmcfee commented 7 years ago

I guess we might have to roll back to the old days and implement our own json serialization mixin class, ala https://stackoverflow.com/questions/27050108/convert-numpy-type-to-python