marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License

[pyjams] discussion: data frame layer? #8

Closed bmcfee closed 9 years ago

bmcfee commented 9 years ago

I've been trying to work with pyjams, and keep running into the same stumbling blocks that we've discussed offline. To summarize:

After digging around a bit, I think it makes a lot of sense to use pandas DataFrames as an in-memory representation for several observation types. Rather than go into detail of how exactly this might look, I mocked up a notebook illustrating the main concepts with a chord annotation example here.

The key features are:

The way I see this playing out is that the json objects would stay pretty much the same on disk (modulo any schema changes we cook up for convenience). However, when serializing/deserializing, certain arrays get translated into pandas dataframes, which is the primary backing store in memory.

For example, a chord annotation record my_jams.chord[0].data, rather than being a list of Range objects, can instead be a single DataFrame which encapsulates all measurements associated with the annotation. The jams object still retains hierarchy and multi-annotation collections, but each annotation becomes a single object, rather than a list of more atomic types.
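To make that concrete, here's a rough sketch (not the notebook's actual mockup; the column names are purely illustrative) of one chord annotation's data held in a single table:

>>> import pandas as pd
>>> chord_data = pd.DataFrame({'start_time': [0.0, 1.5, 3.0],
...                            'end_time':   [1.5, 3.0, 5.0],
...                            'label':      ['C:maj', 'G:maj', 'A:min']})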

If people are on board with this idea, I'd like to propose adopting a couple of conventions to make life easier:

I realize that this adds yet another dependency and source of complexity to something that's already pretty monstrous, but I think it'll be worth it in the long run. Of course, this is all intended as a long-winded question/request for comments, so feel free to shoot it all down.

bmcfee commented 9 years ago

Discussed the above with @ejhumphrey in the context of the next version of the schema, and we're pretty close to having something working as of 981395502c7d782d824cbf151d4befc37c1a0277

A notebook demonstrating dataframe construction in jams objects is up here

This version of the schema reduces all observations to a simple structure: (time, duration, value, confidence). Note that value here can be either a scalar or string (eg, for onsets or chord labels), or a dense array (eg for melody); these correspond to "sparse" and "dense" types.
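For example (illustrative layout only, not the exact JSON), a sparse observation carries a single value, while a dense observation carries an array of values over its extent:

>>> sparse_obs = {'time': 0.0, 'duration': 0.5, 'value': 'C:maj', 'confidence': 1.0}
>>> dense_obs = {'time': 0.0, 'duration': 0.03,
...              'value': [220.0, 220.5, 221.0], 'confidence': None}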

The top-level structure is essentially the same as what's in the current master branch cb01e04811b3fd937b1010e6028bae96712180d1 , except that Annotation objects now store observations in pandas DataFrames, rather than a list of observation objects.

Some issues yet to be resolved:

justinsalamon commented 9 years ago

@bmcfee just a small comment/question about semantics - when you say "observation" what exactly are you referring to? In the published version of jams there are 4 "atomic" data types, of which observation is just one (observation, event, range, time_series).

bmcfee commented 9 years ago

This version has only a single atomic data type.

I think this covers all the use cases of the current schema, no?

justinsalamon commented 9 years ago

Yes, it does. It's pretty much what I assumed was going on, but I thought it important to clarify that by "observation" you were not referring to observation as defined in the current schema, to make the distinction explicit.

bmcfee commented 9 years ago

New notebook demonstrating functionality of 4e429394443938df7b44c7fffca03965ab4552c6

We ended up extending the pandas DataFrame object into a new type called JamsFrame. There is minimal difference between the two; the new class is designed to make two things easy:

  1. Serialization/deserialization with consistent field ordering (time, duration, value, confidence).
  2. Conversion of time fields (timedelta64[ns] types) to and from plain JSON floats.
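Roughly, the time conversion amounts to the following (a sketch using stock pandas calls, not the JamsFrame internals):

>>> import pandas as pd
>>> times = pd.to_timedelta([0.0, 0.5, 1.0], unit='s')  # timedelta64[ns] for in-memory use
>>> times.total_seconds()                               # back to plain floats for serialization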

This also allows us to do some cool things, like:

>>> ref_intervals, ref_labels = ref_jam.segment[0].data.to_interval_labels()
>>> est_intervals, est_labels = est_jam.segment[0].data.to_interval_labels()
>>> mir_eval.segment.pairwise(ref_intervals, ref_labels, est_intervals, est_labels)

I think this covers pretty much everything we need for sparse types.

What's left to do is implement a second DataFrame class for dense types (eg melody) that automatically handles packing and unpacking of array-valued observations, and computes time samples based on the specified interval.

Additionally, we will need to build some kind of a registry that maps namespaces to data types. We can perhaps simplify this by assuming JamsFrame (ie sparse) by default, unless the namespace belongs to the "dense" set.

Of course, this can be manipulated at runtime if necessary, or maybe even defined in the schema somehow? Otherwise, it can just be a static object that lives in pyjams, eg,

>>> pyjams.dense_spaces = set(['melody.hz', 'mood.xy', ...])

At construction time, the Annotation object will consult the dense_spaces registry to see which type should be used for representing a particular data object.

Anyone see any pitfalls to this general setup?

justinsalamon commented 9 years ago

About dense types, as noted in #5: there's the question of explicitly storing time samples vs computing them from a specified interval. According to the current schema, time samples are stored explicitly in the jams file. My personal preference would be to keep it this way - even though it is less elegant than simply storing a hop size, it ensures different users will be using exactly the same time values (down to the number of decimal places), which I think is essential when treating an annotation as a reference. Also, if times are implicit there's the risk that someone will compute them differently outside of the pyjams ecosystem. More complaints about implicit timestamps can be found in #5 :)

bmcfee commented 9 years ago

I'm sympathetic to both views, actually.

Let's look at three options:

  1. Use the sparse array for everything. Each sample in a time series gets its own object/dictionary.
    • PRO: simple to code, essentially all of our work is already done.
    • CON: really inefficient storage.
  2. Have a new type with almost the same signature, except that the value field is now array-valued. Samples are implicitly defined to be uniformly spaced over the range [time, time+duration), and the "hop" is inferred from the number of samples.
    • PRO: minimal storage overhead
    • CON: sample times are implicit (though well-defined by convention). However, the logic to export a dataframe to this format can be a little involved, since we'll have to infer the total duration based on the implied sampling rate. In the corner case of n=1, this is impossible. I think I don't like this.
  3. Have a new type in which time, duration, and value are all vector-valued. This differs from (1) by json-encoding the entire time series as a single object, rather than each row as a single object. This seems closest to what @justinsalamon is describing above (see the sketch after this list).
    • PRO: unambiguous, still reasonably efficient, easy to import/export, also allows dense non-uniform sampling (which 2 does not).
    • CON: can be inefficient in the common (uniform sampling) case.
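To make the difference concrete, here's a rough sketch of the same short series under options 2 and 3 (field layout and values are purely illustrative):

>>> option2 = {'time': 0.0, 'duration': 0.02,
...            'value': [220.0, 220.5], 'confidence': None}
>>> option3 = {'time': [0.0, 0.01], 'duration': [0.01, 0.01],
...            'value': [220.0, 220.5], 'confidence': [None, None]}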

Shall we put it to a vote?

justinsalamon commented 9 years ago

If we vote, I vote 3. Additional comments:

bmcfee commented 9 years ago

"(though well-defined by convention)" - HAH. Have you seen the existing melody f0 annotations?

of course not :) But since we're defining the spec and implementing the interface (at least, those interfaces most likely to be used by everyone), we're in a much more privileged position to dictate conventions than, say, folks using lab files.

If we cant ensure that everyone converts (n_samples,duration) into exactly the same set of timestamp values (and I think we can't), then we should prefer to store them explicitly, at the expense of inefficient storage

This actually seems like an argument in favor of doing it implicitly on load, rather than serializing each timestamp observation independently. (float<->string conversion can be dicey once rounding gets involved.) However, in that particular example, I think it's the evaluation's fault for being so sensitive to sub-microsecond deviations.

A stronger argument for storing timestamps and durations explicitly is that it allows for both non-uniform and incomplete coverage of dense observations. I see this as a bigger win than the efficiency loss due to often redundant encoding. As I said above, it also simplifies the logic for import/export since we would have no reason to infer any times, and that's a good thing.

bmcfee commented 9 years ago

I also think 3 is the best option, et voila, it's working already as of 3bb193e4001348dbb6c50d0155bdf68c122aed5f

Sparse and dense data get identical representation as JamsFrames, so we only need one class. The only difference is in how they serialize, as displayed in cells 47 and 48 of the notebook above. The really nice thing here is that DataFrame does all the heavy lifting, and we only need to change one parameter to differentiate between dense and sparse export: orient='list' vs orient='records'. Importing from json is totally transparent.
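For reference, the underlying pandas behavior looks roughly like this (toy data; the JamsFrame export adds the time conversion described earlier on top of this):

>>> import pandas as pd
>>> df = pd.DataFrame({'time': [0.0, 1.5], 'duration': [1.5, 1.5],
...                    'value': ['C:maj', 'G:maj'], 'confidence': [1.0, 1.0]})
>>> df.to_dict(orient='records')  # sparse: one record per observation
>>> df.to_dict(orient='list')     # dense: one list per field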

This is now controlled by a class attribute JamsFrame.dense, which as described in the previous post, can be toggled in the Annotation object according to schema and/or a dynamic registry mapping namespace -> (sparse | dense).

justinsalamon commented 9 years ago

Nice! Re: the schema: since there's only one type of annotation, we could rename ObservationAnnotation to just Annotation. Or have two types: DenseAnnotation and SparseAnnotation?

bmcfee commented 9 years ago

I think what it should be is:

I think either encoding should be considered valid for any task, but we should prefer Sparse by default except for certain namespaces (eg melody). It's probably not worth our time to enforce, at the schema level, that a namespace use one encoding vs the other.

bmcfee commented 9 years ago

I added a bit of syntactic sugar and constructor methods, and it's pretty simple to use at this point (in pyjams).

A simplified example notebook demonstrating how to translate a .lab chord annotation file into JAMS is up now.
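As a rough sketch of the same idea (not the notebook's code; the mir_eval loader and filename here are just for illustration):

>>> import mir_eval
>>> import pandas as pd
>>> intervals, labels = mir_eval.io.load_labeled_intervals('my_chords.lab')
>>> frame = pd.DataFrame({'time': intervals[:, 0],
...                       'duration': intervals[:, 1] - intervals[:, 0],
...                       'value': labels,
...                       'confidence': 1.0})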

urinieto commented 9 years ago

Just wanted to say that I finally reviewed this question and its multiple associated notebooks, and I fully support the use of pandas in jams.

I would also vote (3).

Moreover, the JamsFrame makes a lot of sense and seems to make things even clearer.

Since I already use pandas in msaf, I am really looking forward to integrating this new jams version with msaf, so that I can provide some feedback when using this version in the wild.

bmcfee commented 9 years ago

Over the weekend, I wrote an import script for the SMC beat tracking dataset.

Here's an ipynb showing how this all works out, now that the dust has settled. I'm pretty happy with it.
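Roughly, each beat becomes one sparse observation (a sketch, not the actual import script; the filename and the beat-index value convention are made up):

>>> import numpy as np
>>> import pandas as pd
>>> beat_times = np.loadtxt('SMC_001.txt')  # one beat time per line, in seconds
>>> beats = pd.DataFrame({'time': beat_times,
...                       'duration': 0.0,
...                       'value': np.arange(1, len(beat_times) + 1),
...                       'confidence': None})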

rabitt commented 9 years ago

This is great.