marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License

Discussion: revisiting the pandas / jamsframe idea #148

Closed: bmcfee closed this issue 7 years ago

bmcfee commented 7 years ago

Opening this up for discussion again: what do people think about dropping pandas from jams? I'm not yet advocating this position, but I think it's worth thinking through.

Note, this would not affect the schema, only the data structures used in the python implementation to parse it.

Why do we use pandas?

Initially, the arguments in favor of parsing annotations as dataframes were as follows (listed at #8, but conflated with schema refactoring):

  1. Getting at the data was previously difficult; JamsFrame gives a unified interface
  2. Modifying objects (eg, timing) was difficult. JamsFrame exposes array operations cleanly.
  3. Mixed data type storage
  4. Querying, slicing, missing data support, etc
  5. Explicit time-type checking / validation
  6. Easy conversion to numpy arrays
  7. Easy export as row-major (sparse) or column-major (dense) formats (sketched just below this list).
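
For reference, the two layouts for a pair of observations look roughly like this (a sketch of the idea, not the exact schema output):

```python
# Row-major ("sparse"): one record per observation.
sparse = [
    {"time": 0.00, "duration": 0.5, "value": "foo", "confidence": None},
    {"time": 0.25, "duration": 0.9, "value": "bar", "confidence": None},
]

# Column-major ("dense"): one array per field, aligned by index.
dense = {
    "time":       [0.00, 0.25],
    "duration":   [0.5, 0.9],
    "value":      ["foo", "bar"],
    "confidence": [None, None],
}
```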

Is it worth it?

It's been a couple of years at this point, so where do things currently stand?

  1. The jamsframe interface definitely makes it easier to probe data in a consistent format. However, I think this owes more to the unified schema / namespace abstraction than the dataframe interface itself.
  2. Modifying annotation values / times is easy, thanks to the exposed math operations. But more complex (and common) operations like slicing or concatenation can be quite cumbersome. It appears that #147 is the result of stricter requirements on what you can do with in-place operations, and it may be that the only way to add a new observation row is to make a full copy of the existing frame. This is bad news.
  3. DataFrames are a big win for mixed type storage.
  4. DataFrames are technically a win for querying/slicing/etc, but I can't say it comes up very often. I had originally envisioned an interface that would make it easy to join annotations along a unified time grid, but that's never quite materialized, mostly due to lack of need. All of our viz / sonification / evaluation needs go through exporting to labeled intervals / event sequences and into mir_eval, so this use-case is seeming less compelling.
  5. The timedelta thing has been useful for validation, but it also adds a lot of complexity that doesn't seem strictly necessary. It seems to me we could get by with floats and range checks (see the sketch after this list).
  6. Conversion to numpy arrays happens (I think) exclusively through to_interval_values(), which is kind of a chore to use. However, I don't think there's a need for anything fancier than that.
  7. Sparse or dense export has been a win, but it also requires some rather delicate logic / extension of DataFrame to JamsFrame to work consistently.
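
On point 5, the kind of check I have in mind is nothing fancier than this (a hypothetical helper, not existing code):

```python
import math

def validate_timing(time, duration):
    # Hypothetical replacement for the timedelta machinery: timing fields
    # are plain floats, validated with simple range checks.
    for name, value in (('time', time), ('duration', duration)):
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            raise ValueError('{} must be a finite number, got {!r}'.format(name, value))
        if value < 0:
            raise ValueError('{} must be non-negative, got {}'.format(name, value))
```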

More generally, because the JamsFrame class extends a class from a different package, which is occasionally subject to change, it's a long-term maintenance liability. It exists for two basic reasons:

  1. To ensure data type consistency when serializing to json. This includes column ordering, which is not necessarily preserved by json dictionaries (see the sketch after this list).
  2. To encapsulate sparse/dense conversion and some helper routines (interval_values, trim, slice, etc).
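
For what it's worth, point 1 gets easier without pandas too: any container with a fixed field order (a namedtuple, say) preserves column ordering through serialization for free. A quick sketch:

```python
import json
from collections import namedtuple

Observation = namedtuple('Observation', ['time', 'duration', 'value', 'confidence'])

obs = Observation(0.0, 0.5, 'foo', None)

# _asdict() preserves the declared field order, so every serialized record
# comes out as time, duration, value, confidence.
print(json.dumps(obs._asdict()))
# {"time": 0.0, "duration": 0.5, "value": "foo", "confidence": null}
```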

Note that JamsFrames are much more restricted than DataFrames: they only ever have the same four columns, two of which have fixed type. This makes me think we could implement something much simpler that still fills our needs but without the headache of pandas.

What would a replacement for jamsframe need?

I figure the last two of these (trim and slice) can be readily accomplished by a filter applied to the row iterator and fed back into the constructor. In particular, if the storage is sorted, this can be done efficiently with early stopping, so we don't need to filter the whole dataframe to get a small slice.
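
Roughly the pattern I have in mind (all names hypothetical):

```python
def sliced(observations, start, end):
    # Filter the row iterator and feed the survivors back into the
    # constructor. `observations` is assumed sorted by time, so we can
    # stop early instead of scanning the whole collection.
    out = []
    for obs in observations:
        if obs.time >= end:
            break  # early stopping: every later row starts after the window
        if obs.time >= start:
            out.append(obs)
    return out  # feed this into whatever container replaces JamsFrame
```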

If we go forward on this, it might cause some headaches for upstream projects like muda, msaf, pumpp, etc. Probably a major version increment would make sense here.

bmcfee commented 7 years ago

After a little hacking, I think namedtuple + SortedListWithKey gets us 80% of the way there:

```python
In [1]: from sortedcontainers import SortedList, SortedListWithKey

In [2]: from collections import namedtuple

In [3]: Observation = namedtuple('Observation', ['time', 'duration', 'value', 'confidence'])

In [7]: x = Observation(0, 0.5, 'foo', None)

In [8]: L = SortedListWithKey(key=lambda d: d.time)

In [9]: L.append(x)

In [10]: y = Observation(0.25, 0.9, 'bar', None)

In [11]: L.append(y)

In [12]: L
Out[12]: SortedListWithKey([Observation(time=0, duration=0.5, value='foo', confidence=None), Observation(time=0.25, duration=0.9, value='bar', confidence=None)], key=<function <lambda> at 0x7f668879ad90>, load=1000)

...

In [20]: L.bisect_key_left(0)
Out[20]: 0

In [21]: L.bisect_key_left(-1)
Out[21]: 0

In [22]: L.bisect_key_left(1)
Out[22]: 2

In [23]: L.bisect_key_left(0.1)
Out[23]: 1

In [24]: L.bisect_key_right(0.1)
Out[24]: 1
```
So time index bisection does most of the heavy lifting for us. A thin wrapper on top of this could handle most of the existing interface.
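
For concreteness, the thin wrapper could look something like this (method names are placeholders, not a final interface):

```python
from collections import namedtuple
from sortedcontainers import SortedListWithKey

Observation = namedtuple('Observation', ['time', 'duration', 'value', 'confidence'])

class AnnotationData(object):
    def __init__(self):
        # Storage stays sorted by time on every insertion.
        self._data = SortedListWithKey(key=lambda obs: obs.time)

    def add(self, time, duration, value=None, confidence=None):
        self._data.add(Observation(time, duration, value, confidence))

    def __iter__(self):
        return iter(self._data)

    def to_interval_values(self):
        # Same export shape as the JamsFrame method: intervals and values.
        intervals = [(obs.time, obs.time + obs.duration) for obs in self._data]
        values = [obs.value for obs in self._data]
        return intervals, values

    def slice(self, start, end):
        # Bisection on the time index finds the window without a full scan.
        lo = self._data.bisect_key_left(start)
        hi = self._data.bisect_key_left(end)
        out = AnnotationData()
        for obs in self._data[lo:hi]:
            out.add(*obs)
        return out
```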

justinsalamon commented 7 years ago

My first question would be: what's the motivation for dumping pandas in the first place? Is it primarily the fact that we'll have to keep updating JAMS as pandas evolves, or are there any other serious considerations? From your comments it seems like maintenance is the main reason?

On the one hand, I'm all for minimizing dependencies. On the other, we should consider the potential cons to making this switch:

- Core functions that make use of the pandas API such as trim and slice (and elsewhere?) will have to be updated. Just something to be aware of.
- Other libraries that are built on top of JAMS, such as scaper, would break (e.g. scaper/core.py). Do we have an idea of how many other projects out there this could break?
- Having the data in a pandas dataframe is extremely useful when you want to do something that's not supported by the JAMS API and need to code it up yourself, because of the rich set of features already supported by the pandas API and because everyone's already familiar with it.

Anyway, that's just off the top of my head. I guess another point is that if we replace the dataframe with a simple ad-hoc data structure, and then realize down the line that we need more functionality, we risk finding ourselves re-implementing features that pandas already provides.

So, what do people think? Right now I'm not certain the pros of switching outweigh the cons, but could be convinced otherwise.

bmcfee commented 7 years ago

> My first question would be: what's the motivation for dumping pandas in the first place? Is it primarily the fact that we'll have to keep updating JAMS as pandas evolves, or are there any other serious considerations? From your comments it seems like maintenance is the main reason?

Yes, the main issue is maintenance. But a secondary issue is overall design.

The JamsFrame object inherits from DataFrame, but adds restrictions (allowed fields, etc). I'm increasingly convinced that this was a bad idea. (Sorry!) In general, when dealing with objects, inheritance is good for adding functionality, not removing it, and the reverse is true for containment. It might have been a better design if Annotation had a private pd.DataFrame object within it that stored data, with all access mediated through the Annotation API. Alas, we didn't think of this.
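
Schematically, the difference between the two designs (heavily simplified, not real code):

```python
import pandas as pd

# What we have: inheritance. JamsFrame takes on the entire (and evolving)
# DataFrame API and then has to restrict it.
class JamsFrame(pd.DataFrame):
    pass

# What containment would look like: the DataFrame is a private detail, and
# all access is mediated through the Annotation API.
class Annotation(object):
    def __init__(self):
        self._data = pd.DataFrame(
            columns=['time', 'duration', 'value', 'confidence'])

    def append(self, time, duration, value=None, confidence=None):
        # pandas usage is localized here, so upstream API changes only
        # touch this class.
        self._data.loc[len(self._data)] = [time, duration, value, confidence]
```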

The proposed change can be seen in two stages. 1) remove the inheritance of a remote object (DataFrame) in favor of a containment structure; 2) replace the DataFrame container with something simpler.

(1) is a design optimization, (2) is a maintenance optimization.

> On the one hand, I'm all for minimizing dependencies.

As mentioned in #149 , we still have a pandas dependency by way of i/o, and I'm fine with that. However, it shouldn't be required for the core object hierarchy.

> Core functions that make use of the pandas API such as trim and slice (and elsewhere?) will have to be updated. Just something to be aware of.

Yup, these were easy to change in #149.

> Other libraries that are built on top of JAMS, such as scaper, would break (e.g. scaper/core.py). Do we have an idea of how many other projects out there this could break?

I suspect muda, scaper, and pumpp are the only ones. Everything else I know of doesn't dig too deep into the object hierarchy.

> Having the data in a pandas dataframe is extremely useful when you want to do something that's not supported by the JAMS API and need to code it up yourself, because of the rich set of features already supported by the pandas API and because everyone's already familiar with it.

Is it really? I can't think of many examples of this, outside of data augmentation. And even in that case, I think it's better if we put more constraints on what can be done with and to an annotation, eg, by enforcing that observations are immutable and requiring users to create new annotations if they're going to modify things.
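
A namedtuple-based Observation would give us that constraint for free: "modifying" an observation means building a replacement. Sketch:

```python
from collections import namedtuple

Observation = namedtuple('Observation', ['time', 'duration', 'value', 'confidence'])

obs = Observation(0.0, 0.5, 'foo', None)

# In-place edits fail loudly:
#   obs.time = 1.0  -> AttributeError

# So a modification, eg a time shift for augmentation, is an explicit copy:
shifted = obs._replace(time=obs.time + 0.5)
```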

ejhumphrey commented 7 years ago

two quick thoughts:

- There's a decent chance a move away from pandas will break a number of things that use it. I anticipate these will be trivial changes if we easily support transforming AnnotationData -> DataFrame, but I know I personally use to_interval_values() from JamsFrame all over the place. We may ruffle more feathers than expected, and I'd upvote thinking carefully about this.

- Personally, I often find the 97% of pandas awesomeness outweighed by the 3% of frustrating non-intuitive design, and would be happy to wave it goodbye.

bmcfee commented 7 years ago

Agreed, but better to do it sooner rather than later. Also, anecdotally, I constantly find myself expecting to_interval_values to be an annotation method, not a jamsframe method.

Maybe we should add a wrapper at the Annotation level, and put a deprecation notice in the JamsFrame method to give folks a chance to adapt?
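
Something like this, perhaps (class bodies abbreviated, names illustrative):

```python
import warnings
import pandas as pd

class JamsFrame(pd.DataFrame):
    def to_interval_values(self):
        # Deprecated entry point: warn, then delegate to the shared logic.
        warnings.warn('JamsFrame.to_interval_values is deprecated; '
                      'use Annotation.to_interval_values instead',
                      DeprecationWarning, stacklevel=2)
        return self._to_interval_values()

    def _to_interval_values(self):
        pass  # existing export logic moves here unchanged

class Annotation(object):
    def __init__(self, data):
        self.data = data  # a JamsFrame, for now

    def to_interval_values(self):
        # New, annotation-level entry point; no warning here.
        return self.data._to_interval_values()
```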

bmcfee commented 7 years ago

Closed by merging #149