Closed by @bmcfee 9 years ago
Discussed the above with @ejhumphrey in the context of the next version of the schema, and we're pretty close to having something working as of 981395502c7d782d824cbf151d4befc37c1a0277
A notebook demonstrating dataframe construction in jams objects is up here
This version of the schema reduces all observations to a simple structure: (time, duration, value, confidence). Note that `value` here can be either a scalar or string (e.g., for onsets or chord labels), or a dense array (e.g., for melody); these correspond to "sparse" and "dense" types.
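To make the two flavors concrete, here are illustrative (made-up) examples of what a single observation record could look like in each case:

```python
# Sparse observation: scalar/string value, e.g. a chord label over an interval
sparse_obs = {"time": 1.5, "duration": 0.5, "value": "C:maj", "confidence": 1.0}

# Dense observation: array value, e.g. an f0 contour sampled uniformly
# over [time, time + duration)
dense_obs = {"time": 0.0, "duration": 1.0,
             "value": [220.0, 221.3, 219.8, 220.5],
             "confidence": None}
```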
The top-level structure is essentially the same as what's in the current master branch cb01e04811b3fd937b1010e6028bae96712180d1, except that `Annotation` objects now store observations in pandas DataFrames, rather than a list of observation objects.
Some issues yet to be resolved:
- How to handle the split between `Annotation` and `DenseAnnotation` types.

@bmcfee just a small comment/question about semantics - when you say "observation", what exactly are you referring to? In the published version of jams there are 4 "atomic" data types, of which observation is just one (observation, event, range, time_series).
This version has only a single atomic data type. Events can be represented as observations with `duration=0`, and time series are supported by allowing `value` to take on array values. The namespace of the observation implies semantics for how the array is to be interpreted; probably always as uniformly sampled points spanning `time` and `time+duration`. Track-level observations can be encoded by setting `time=0` and letting `duration=NaN` (or Infinity, or the track duration, whichever convention we decide to adopt).

I think this covers all the use cases of the current schema, no?
Yes, it does. It's pretty much what I assumed was going on, but I thought it important to clarify that by "observation" you were not referring to observation as defined in the current schema, to make the distinction explicit.
New notebook demonstrating functionality of 4e429394443938df7b44c7fffca03965ab4552c6
We ended up extending the pandas DataFrame object into a new type called JamsFrame. There is minimal discrepancy between the two; the new class is designed to make two things easy:
This also allows us to do some cool things, like:
>>> ref_intervals, ref_labels = ref_jam.segment[0].data.to_interval_labels()
>>> est_intervals, est_labels = est_jam.segment[0].data.to_interval_labels()
>>> mir_eval.segment.pairwise(ref_intervals, ref_labels, est_intervals, est_labels)
I think this covers pretty much everything we need for sparse types.
What's left to do is implement a second DataFrame class for dense types (e.g., melody) that automatically handles packing and unpacking of array-valued observations, and computes time samples based on the specified interval.
Additionally, we will need to build some kind of registry that maps namespaces to data types. We can perhaps simplify this by assuming JamsFrame (i.e., sparse) by default, unless the namespace belongs to the "dense" set.
Of course, this can be manipulated at runtime if necessary, or maybe even defined in the schema somehow? Otherwise, it can just be a static object that lives in pyjams, e.g.,
>>> pyjams.dense_spaces = set(['melody.hz', 'mood.xy', ...])
At construction time, the `Annotation` object will consult the `dense_spaces` registry to see which type should be used for representing a particular data object.
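A minimal sketch of what that lookup could look like (names here are illustrative, not the actual pyjams API):

```python
# Hypothetical registry of namespaces that use the dense encoding;
# everything else defaults to sparse.
DENSE_SPACES = {"melody.hz", "mood.xy"}

def is_dense(namespace):
    """Return True if observations in this namespace use the dense type."""
    return namespace in DENSE_SPACES

print(is_dense("melody.hz"))  # True
print(is_dense("beat"))       # False
```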
Anyone see any pitfalls to this general setup?
About dense types, as noted in #5: there's the question of explicitly storing time samples vs computing them from a specified interval. According to the current schema, time samples are stored explicitly in the jams file. My personal preference would be to keep it this way - even though it is less elegant than simply storing a hop size, it ensures different users will be using exactly the same time values (down to the number of decimal places), which I think is essential when treating an annotation as a reference. Also, if times are implicit there's the risk that someone will compute them differently outside of the pyjams ecosystem. More complaints about implicit timestamps can be found in #5 :)
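To make the concern concrete, here's a toy sketch (not from the thread) of how implicitly derived timestamps can diverge once they pass through another tool's precision:

```python
# Derive timestamps from an implicit hop, then round-trip them through a
# fixed-precision serialization, as an external tool might:
hop = 1.0 / 3.0
times = [i * hop for i in range(4)]
reparsed = [float("%.6f" % t) for t in times]

print(times == reparsed)  # False: 1/3 survives as 0.333333, not 0.3333333333333333
```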
I'm sympathetic to both views, actually.
Let's look at three options:

1. Each sample is stored as its own observation row, with its timestamp serialized explicitly.
2. The `value` field is now array-valued. Samples are implicitly defined to be uniform over the range [time, time+duration), and the "hop" is inferred from the number of samples. When there is only one sample (`n=1`), this is impossible. I think I don't like this.
3. `time`, `duration`, and `value` are all vector-valued. This differs from (1) by json-encoding the entire time series as a single object, rather than each row as a single object. This seems closest to what @justinsalamon is describing above.
Shall we put it to a vote?
If we vote, I vote 3. Additional comments:
"(though well-defined by convention)" - HAH. Have you seen the existing melody f0 annotations?
of course not :) But since we're defining the spec and implementing the interface (at least, those interfaces most likely to be used by everyone), we're in a much more privileged position to dictate conventions than, say, folks using lab files.
If we can't ensure that everyone converts (n_samples, duration) into exactly the same set of timestamp values (and I think we can't), then we should prefer to store them explicitly, at the expense of inefficient storage.
This actually seems like an argument in favor of doing it implicitly on load, rather than serializing each timestamp observation independently. (float<->string conversion can be dicey once rounding gets involved.) However, in that particular example, I think it's the evaluation's fault for being so sensitive to sub-microsecond deviations.
A stronger argument for storing timestamps and durations explicitly is that it allows for both non-uniform and incomplete coverage of dense observations. I see this as a bigger win than the efficiency loss due to often redundant encoding. As I said above, it also simplifies the logic for import/export since we would have no reason to infer any times, and that's a good thing.
I also think 3 is the best option, et voila, it's working already as of 3bb193e4001348dbb6c50d0155bdf68c122aed5f
Sparse and dense data get identical representation as JamsFrames, so we only need one class. The only difference is in how they serialize, as displayed in cells 47 and 48 of the notebook above. The really nice thing here is that DataFrame does all the heavy lifting, and we only need to change one parameter to differentiate between dense and sparse export: `orient='list'` vs `orient='records'`. Importing from json is totally transparent.
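For reference, plain pandas shows the distinction directly. This sketch uses `DataFrame.to_dict` (which accepts both orientations) on a toy annotation; it is illustrative, not the actual JamsFrame serialization code:

```python
import pandas as pd

# A toy annotation: two observations with (time, duration, value, confidence)
df = pd.DataFrame({
    "time": [0.0, 0.5],
    "duration": [0.5, 0.5],
    "value": ["N", "C:maj"],
    "confidence": [1.0, 1.0],
})

# Sparse export: one object per observation (row)
sparse = df.to_dict(orient="records")
# [{'time': 0.0, 'duration': 0.5, 'value': 'N', 'confidence': 1.0}, ...]

# Dense export: one list per field (column)
dense = df.to_dict(orient="list")
# {'time': [0.0, 0.5], 'duration': [0.5, 0.5], ...}
```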
This is now controlled by a class attribute `JamsFrame.dense`, which, as described in the previous post, can be toggled in the Annotation object according to schema and/or a dynamic registry mapping namespace -> (sparse | dense).
Nice! Schema: since there's only one type of annotation, we could rename ObservationAnnotation to just Annotation. Or have 2 types: DenseAnnotation and SparseAnnotation?
I think what it should be is:
I think either encoding should be considered valid for any task, but we should prefer Sparse by default except for certain namespaces (eg melody). It's probably not worth our time to enforce namespaces using one encoding vs the other at the schema level.
I added a bit of syntactic sugar and constructor methods, and it's pretty simple to use at this point (in pyjams).
Simplified example notebook demonstrating how to translate a .lab chord annotation file into JAMS is now up.
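The core of such a conversion is only a few lines of pandas. A sketch (illustrative, not the notebook's actual code), assuming the usual three-column lab layout of start time, end time, label:

```python
import io
import pandas as pd

# A standard .lab chord file: <start_time> <end_time> <label> per line
lab_text = """0.000 1.500 N
1.500 3.000 C:maj
3.000 4.500 G:maj"""

lab = pd.read_csv(io.StringIO(lab_text), sep=r"\s+",
                  header=None, names=["start", "end", "label"])

# Convert intervals into the (time, duration, value, confidence) structure
obs = pd.DataFrame({
    "time": lab["start"],
    "duration": lab["end"] - lab["start"],
    "value": lab["label"],
    "confidence": float("nan"),  # lab files carry no confidence information
})
```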
Just wanted to say that I finally reviewed this question with its multiple associated notebooks, and I fully support the usage of pandas in jams.
I would also vote (3).
Moreover, the JamsFrame makes a lot of sense and seems to make things even clearer.
Since I already use pandas in msaf, I am really looking forward to integrating this new jams version with msaf, so that I can provide some feedback when using this version in the wild.
Over the weekend, I wrote an import script for the SMC beat tracking dataset.
Here's an ipynb showing how this all works out, now that the dust has settled. I'm pretty happy with it.
This is great.
I've been trying to work with pyjams, and keep running into the same stumbling blocks that we've discussed offline. To summarize:
After digging around a bit, I think it makes a lot of sense to use pandas DataFrames as an in-memory representation for several observation types. Rather than go into detail of how exactly this might look, I mocked up a notebook illustrating the main concepts with a chord annotation example here.
The key features are:
The way I see this playing out is that the json objects would stay pretty much the same on disk (modulo any schema changes we cook up for convenience). However, when serializing/deserializing, certain arrays get translated into pandas dataframes, which is the primary backing store in memory.
For example, a chord annotation record `my_jams.chord[0].data`, rather than being a list of `Range` objects, can instead be a single DataFrame which encapsulates all measurements associated with the annotation. The jams object still retains hierarchy and multi-annotation collections, but each annotation becomes a single object, rather than a list of more atomic types.

If people are on board with this idea, I'd like to propose adopting a couple of conventions to make life easier:
- Time fields should use pandas' `timedelta[ns]` data type, rather than a float. Aside from making time values easy to find, pandas also provides some nice functionality for working with time series data.
- Missing fields should be stored as `nan` rather than hallucinating data or leaving the record empty. (I'm looking at you, confidence values.)

I realize that this adds yet another dependency and source of complexity to something that's already pretty monstrous, but I think it'll be worth it in the long run. Of course, this is all intended as a long-winded question/request for comments, so feel free to shoot it all down.
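A sketch of what those two conventions look like in plain pandas (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical raw annotation with float times and no confidence values
df = pd.DataFrame({
    "time": [0.0, 0.5, 1.0],
    "duration": [0.5, 0.5, 0.5],
    "value": ["C:maj", "G:maj", "C:maj"],
})

# Convention 1: store times as timedelta64[ns] rather than floats
df["time"] = pd.to_timedelta(df["time"], unit="s")
df["duration"] = pd.to_timedelta(df["duration"], unit="s")

# Convention 2: missing fields are explicit NaN, never omitted or fabricated
df["confidence"] = np.nan
```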