marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License

[pyjams] discussion: data frame layer? #8

Closed bmcfee closed 9 years ago

bmcfee commented 9 years ago

I've been trying to work with pyjams, and keep running into the same stumbling blocks that we've discussed offline. To summarize:

After digging around a bit, I think it makes a lot of sense to use pandas DataFrames as an in-memory representation for several observation types. Rather than go into detail of how exactly this might look, I mocked up a notebook illustrating the main concepts with a chord annotation example here.

The key features are:

The way I see this playing out is that the json objects would stay pretty much the same on disk (modulo any schema changes we cook up for convenience). However, when serializing/deserializing, certain arrays get translated into pandas dataframes, which is the primary backing store in memory.

For example, a chord annotation record my_jams.chord[0].data, rather than being a list of Range objects, can instead be a single DataFrame which encapsulates all measurements associated with the annotation. The jams object still retains hierarchy and multi-annotation collections, but each annotation becomes a single object, rather than a list of more atomic types.
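To make that concrete, here's a rough sketch (not the notebook's actual mockup; the column names are purely illustrative) of one chord annotation's data held in a single table:

>>> import pandas as pd
>>> chord_data = pd.DataFrame({'start_time': [0.0, 1.5, 3.0],
...                            'end_time':   [1.5, 3.0, 5.0],
...                            'label':      ['C:maj', 'G:maj', 'A:min']})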

If people are on board with this idea, I'd like to propose adopting a couple of conventions to make life easier:

I realize that this adds yet another dependency and source of complexity to something that's already pretty monstrous, but I think it'll be worth it in the long run. Of course, this is all intended as a long-winded question/request for comments, so feel free to shoot it all down.

bmcfee commented 9 years ago

Discussed the above with @ejhumphrey in the context of the next version of the schema, and we're pretty close to having something working as of 981395502c7d782d824cbf151d4befc37c1a0277

A notebook demonstrating dataframe construction in jams objects is up here

This version of the schema reduces all observations to a simple structure: (time, duration, value, confidence). Note that value here can be either a scalar or string (eg, for onsets or chord labels), or a dense array (eg for melody); these correspond to "sparse" and "dense" types.
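For example (illustrative layout only, not the exact JSON), a sparse observation carries a single value, while a dense observation carries an array of values over its extent:

>>> sparse_obs = {'time': 0.0, 'duration': 0.5, 'value': 'C:maj', 'confidence': 1.0}
>>> dense_obs = {'time': 0.0, 'duration': 0.03,
...              'value': [220.0, 220.5, 221.0], 'confidence': None}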

The top-level structure is essentially the same as what's in the current master branch cb01e04811b3fd937b1010e6028bae96712180d1 , except that Annotation objects now store observations in pandas DataFrames, rather than a list of observation objects.

Some issues yet to be resolved:

justinsalamon commented 9 years ago

@bmcfee just a small comment/question about semantics - when you say "observation" what exactly are you referring to? In the published version of jams there are 4 "atomic" data types, of which observation is just one (observation, event, range, time_series).

bmcfee commented 9 years ago

This version has only a single atomic data type.

I think this covers all the use cases of the current schema, no?

justinsalamon commented 9 years ago

Yes, it does. It's pretty much what I assumed was going on, but I thought it important to clarify that by "observation" you were not referring to observation as defined in the current schema, to make the distinction explicit.

bmcfee commented 9 years ago

New notebook demonstrating functionality of 4e429394443938df7b44c7fffca03965ab4552c6

We ended up extending the pandas DataFrame object into a new type called JamsFrame. There is minimal difference between the two; the new class is designed to make two things easy:

  1. Serialization/deserialization with consistent field ordering (time, duration, value, confidence).
  2. Conversion of time fields (timedelta64[ns] types) to and from plain JSON floats.
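Roughly, the time conversion amounts to the following (a sketch using stock pandas calls, not the JamsFrame internals):

>>> import pandas as pd
>>> times = pd.to_timedelta([0.0, 0.5, 1.0], unit='s')  # timedelta64[ns] for in-memory use
>>> times.total_seconds()                               # back to plain floats for serialization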

This also allows us to do some cool things, like:

>>> ref_intervals, ref_labels = ref_jam.segment[0].data.to_interval_labels()
>>> est_intervals, est_labels = est_jam.segment[0].data.to_interval_labels()
>>> mir_eval.segment.pairwise(ref_intervals, ref_labels, est_intervals, est_labels)

I think this covers pretty much everything we need for sparse types.

What's left to do is implement a second DataFrame class for dense types (eg melody) that automatically handles packing and unpacking of array-valued observations, and computes time samples based on the specified interval.

Additionally, we will need to build some kind of a registry that maps namespaces to data types. We can perhaps simplify this by assuming JamsFrame (ie sparse) by default, unless the namespace belongs to the "dense" set.

Of course, this can be manipulated at runtime if necessary, or maybe even defined in the schema somehow? Otherwise, it can just be a static object that lives in pyjams, eg,

>>> pyjams.dense_spaces = set(['melody.hz', 'mood.xy', ...])

At construction time, the Annotation object will consult the dense_spaces registry to see which type should be used for representing a particular data object.

Anyone see any pitfalls to this general setup?

justinsalamon commented 9 years ago

About dense types, as noted in #5: there's the question of explicitly storing time samples vs computing them from a specified interval. According to the current schema, time samples are stored explicitly in the jams file. My personal preference would be to keep it this way - even though it is less elegant than simply storing a hop size, it ensures different users will be using exactly the same time values (down to the number of decimal places), which I think is essential when treating an annotation as a reference. Also, if times are implicit there's the risk that someone will compute them differently outside of the pyjams ecosystem. More complaints about implicit timestamps can be found in #5 :)

bmcfee commented 9 years ago

I'm sympathetic to both views, actually.

Let's look at three options:

  1. Use the sparse array for everything. Each sample in a time series gets its own object/dictionary.
    • PRO: simple to code, essentially all of our work is already done.
    • CON: really inefficient storage.
  2. Have a new type with almost the same signature, except that the value field is now array-valued. Samples are implicitly defined to be uniformly spaced over the range [time, time+duration), and the "hop" is inferred from the number of samples.
    • PRO: minimal storage overhead
    • CON: sample times are implicit (though well-defined by convention). However, the logic to export a dataframe to this format can be a little involved, since we'll have to infer the total duration based on the implied sampling rate. In the corner case of n=1, this is impossible. I think I don't like this.
  3. Have a new type in which time, duration, and value are all vector-valued. This differs from (1) by json-encoding the entire time series as a single object, rather than each row as a single object. This seems closest to what @justinsalamon is describing above (see the sketch after this list).
    • PRO: unambiguous, still reasonably efficient, easy to import/export, also allows dense non-uniform sampling (which 2 does not).
    • CON: can be inefficient in the common (uniform sampling) case.
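To make the difference concrete, here's a rough sketch of the same short series under options 2 and 3 (field layout and values are purely illustrative):

>>> option2 = {'time': 0.0, 'duration': 0.02,
...            'value': [220.0, 220.5], 'confidence': None}
>>> option3 = {'time': [0.0, 0.01], 'duration': [0.01, 0.01],
...            'value': [220.0, 220.5], 'confidence': [None, None]}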

Shall we put it to a vote?

justinsalamon commented 9 years ago

If we vote, I vote 3. Additional comments:

bmcfee commented 9 years ago

"(though well-defined by convention)" - HAH. Have you seen the existing melody f0 annotations?

of course not :) But since we're defining the spec and implementing the interface (at least, those interfaces most likely to be used by everyone), we're in a much more privileged position to dictate conventions than, say, folks using lab files.

If we cant ensure that everyone converts (n_samples,duration) into exactly the same set of timestamp values (and I think we can't), then we should prefer to store them explicitly, at the expense of inefficient storage

This actually seems like an argument in favor of doing it implicitly on load, rather than serializing each timestamp observation independently. (float<->string conversion can be dicey once rounding gets involved.) However, in that particular example, I think it's the evaluation's fault for being so sensitive to sub-microsecond deviations.

A stronger argument for storing timestamps and durations explicitly is that it allows for both non-uniform and incomplete coverage of dense observations. I see this as a bigger win than the efficiency loss due to often redundant encoding. As I said above, it also simplifies the logic for import/export since we would have no reason to infer any times, and that's a good thing.

bmcfee commented 9 years ago

I also think 3 is the best option, et voila, it's working already as of 3bb193e4001348dbb6c50d0155bdf68c122aed5f

Sparse and dense data get identical representation as JamsFrames, so we only need one class. The only difference is in how they serialize, as displayed in cells 47 and 48 of the notebook above. The really nice thing here is that DataFrame does all the heavy lifting, and we only need to change one parameter to differentiate between dense and sparse export: orient='list' vs orient='records'. Importing from json is totally transparent.
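For reference, the underlying pandas behavior looks roughly like this (toy data; the JamsFrame export adds the time conversion described earlier on top of this):

>>> import pandas as pd
>>> df = pd.DataFrame({'time': [0.0, 1.5], 'duration': [1.5, 1.5],
...                    'value': ['C:maj', 'G:maj'], 'confidence': [1.0, 1.0]})
>>> df.to_dict(orient='records')  # sparse: one record per observation
>>> df.to_dict(orient='list')     # dense: one list per field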

This is now controlled by a class attribute JamsFrame.dense, which as described in the previous post, can be toggled in the Annotation object according to schema and/or a dynamic registry mapping namespace -> (sparse | dense).

justinsalamon commented 9 years ago

Nice! Re: the schema: since there's only one type of annotation, we could rename ObservationAnnotation to just Annotation. Or have two types: DenseAnnotation and SparseAnnotation?

bmcfee commented 9 years ago

I think what it should be is:

I think either encoding should be considered valid for any task, but we should prefer Sparse by default except for certain namespaces (eg melody). It's probably not worth our time to enforce, at the schema level, that a namespace use one encoding vs the other.

bmcfee commented 9 years ago

I added a bit of syntactic sugar and constructor methods, and it's pretty simple to use at this point (in pyjams).

A simplified example notebook demonstrating how to translate a .lab chord annotation file into JAMS is up now.
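As a rough sketch of the same idea (not the notebook's code; the mir_eval loader and filename here are just for illustration):

>>> import mir_eval
>>> import pandas as pd
>>> intervals, labels = mir_eval.io.load_labeled_intervals('my_chords.lab')
>>> frame = pd.DataFrame({'time': intervals[:, 0],
...                       'duration': intervals[:, 1] - intervals[:, 0],
...                       'value': labels,
...                       'confidence': 1.0})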

urinieto commented 9 years ago

Just wanted to say that I finally reviewed this question and its multiple associated notebooks, and I fully support the use of pandas in jams.

I would also vote (3).

Moreover, the JamsFrame makes a lot of sense and seems to make things even clearer.

Since I already use pandas in msaf, I am really looking forward to integrating this new jams version with msaf, so that I can provide some feedback when using this version in the wild.

bmcfee commented 9 years ago

Over the weekend, I wrote an import script for the SMC beat tracking dataset.

Here's an ipynb showing how this all works out, now that the dust has settled. I'm pretty happy with it.
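Roughly, each beat becomes one sparse observation (a sketch, not the actual import script; the filename and the beat-index value convention are made up):

>>> import numpy as np
>>> import pandas as pd
>>> beat_times = np.loadtxt('SMC_001.txt')  # one beat time per line, in seconds
>>> beats = pd.DataFrame({'time': beat_times,
...                       'duration': 0.0,
...                       'value': np.arange(1, len(beat_times) + 1),
...                       'confidence': None})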

rabitt commented 9 years ago

This is great.