After a little hacking, I think namedtuple + SortedListWithKey gets us 80% of the way there:
In [1]: from sortedcontainers import SortedList, SortedListWithKey
In [2]: from collections import namedtuple
In [3]: Observation = namedtuple('Observation', ['time', 'duration', 'value', 'confidence'])
In [7]: x = Observation(0, 0.5, 'foo', None)
In [8]: L = SortedListWithKey(key=lambda d: d.time)
In [9]: L.append(x)
In [10]: y = Observation(0.25, 0.9, 'bar', None)
In [11]: L.append(y)
In [12]: L
Out[12]: SortedListWithKey([Observation(time=0, duration=0.5, value='foo', confidence=None), Observation(time=0.25, duration=0.9, value='bar', confidence=None)], key=<function <lambda> at 0x7f668879ad90>, load=1000)
...
In [20]: L.bisect_key_left(0)
Out[20]: 0
In [21]: L.bisect_key_left(-1)
Out[21]: 0
In [22]: L.bisect_key_left(1)
Out[22]: 2
In [23]: L.bisect_key_left(0.1)
Out[23]: 1
In [24]: L.bisect_key_right(0.1)
Out[24]: 1
So time index bisection does most of the heavy lifting for us. A thin wrapper on top of this could handle most of the existing interface.
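For concreteness, here is a minimal sketch of what such a wrapper might look like. The class and method names (AnnotationData, add, slice) are hypothetical, not a settled API:

```python
from collections import namedtuple

from sortedcontainers import SortedListWithKey

# Same observation record as in the session above
Observation = namedtuple('Observation',
                         ['time', 'duration', 'value', 'confidence'])


class AnnotationData(object):
    """Thin container keeping observations sorted by onset time."""

    def __init__(self, observations=None):
        self._obs = SortedListWithKey(observations or [],
                                      key=lambda obs: obs.time)

    def add(self, time, duration, value=None, confidence=None):
        self._obs.add(Observation(time, duration, value, confidence))

    def __iter__(self):
        return iter(self._obs)

    def __len__(self):
        return len(self._obs)

    def slice(self, start, end):
        # Bisection on the time index does the heavy lifting
        lo = self._obs.bisect_key_left(start)
        hi = self._obs.bisect_key_right(end)
        return AnnotationData(self._obs[lo:hi])
```

Most of the existing interface could presumably be layered on top of methods like these.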
My first question would be: what's the motivation for dumping pandas in the first place? Is it primarily the fact that we'll have to keep updating JAMS as pandas evolves, or are there any other serious considerations? From your comments it seems like maintenance is the main reason?
On the one hand, I'm all for minimizing dependencies. On the other, we should consider the potential cons of making this switch:
- Core functions that make use of the pandas API such as trim and slice (and elsewhere?) will have to be updated. Just something to be aware of.
- Other libraries that are built on top of JAMS, such as scaper, would break (e.g. scaper/core.py). Do we have an idea of how many other projects out there this could break?
- Having the data in a pandas dataframe is extremely useful when you want to do something that's not supported by the JAMS API and need to code it up yourself, because of the rich set of features already supported by the pandas API and because everyone's already familiar with it.
Anyway, that's just off the top of my head. I guess another point is that if we replace the dataframe with a simple ad-hoc data structure, and then realize down the line that we need more functionality, we risk finding ourselves re-implementing features that pandas already provides.
So, what do people think? Right now I'm not certain the pros of switching outweigh the cons, but could be convinced otherwise.
> My first question would be: what's the motivation for dumping pandas in the first place? Is it primarily the fact that we'll have to keep updating JAMS as pandas evolves, or are there any other serious considerations? From your comments it seems like maintenance is the main reason?
Yes, the main issue is maintenance. But a secondary issue is overall design.
The JamsFrame object inherits from DataFrame, but adds restrictions (allowed fields, etc). I'm increasingly convinced that this was a bad idea. (Sorry!) In general, when dealing with objects, inheritance is good for adding functionality, not removing it, and the reverse is true for containment. It might have been a better design if Annotation had a private pd.DataFrame object within it that stored data, and then all access could be mediated through the Annotation API. Alas, we didn't think of this.
The proposed change can be seen in two stages: (1) remove the inheritance of a remote object (DataFrame) in favor of a containment structure; (2) replace the DataFrame container with something simpler.
(1) is a design optimization, (2) is a maintenance optimization.
> On the one hand, I'm all for minimizing dependencies.
As mentioned in #149, we still have a pandas dependency by way of I/O, and I'm fine with that. However, it shouldn't be required for the core object hierarchy.
> Core functions that make use of the pandas API such as trim and slice (and elsewhere?) will have to be updated. Just something to be aware of.
Yup, these were easy to change in #149.
> Other libraries that are built on top of JAMS, such as scaper, would break (e.g. scaper/core.py). Do we have an idea of how many other projects out there this could break?
I suspect muda, scaper, and pumpp are the only ones. Everything else I know of doesn't dig too deep into the object hierarchy.
> Having the data in a pandas dataframe is extremely useful when you want to do something that's not supported by the JAMS API and need to code it up yourself, because of the rich set of features already supported by the pandas API and because everyone's already familiar with it.
Is it really? I can't think of many examples of this, outside of data augmentation. And even in that case, I think it's better if we put more constraints on what can be done with and to an annotation, e.g., by enforcing that observations are immutable and requiring users to create new annotations if they're going to modify things.
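One nice side effect of the namedtuple approach sketched above is that this immutability comes essentially for free:

```python
from collections import namedtuple

Observation = namedtuple('Observation',
                         ['time', 'duration', 'value', 'confidence'])

obs = Observation(time=0.0, duration=0.5, value='foo', confidence=None)

# In-place mutation is impossible on a namedtuple:
# obs.value = 'bar'   # AttributeError: can't set attribute

# Any "modification" produces a brand-new Observation, leaving the
# original record (and the annotation that owns it) untouched:
shifted = obs._replace(time=obs.time + 1.0)
```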
two quick thoughts:
- There's a decent chance a move away from pandas will break a number of things that use it. I anticipate these will be trivial changes if we easily support transforming AnnotationData -> DataFrame, but I know personally I use to_interval_values() from JamsFrame all over the place. We may ruffle more feathers than expected, and I'd upvote thinking carefully about this.
- Personally, I often find the 97% of pandas awesomeness outweighed by the 3% of frustrating non-intuitive design, and would be happy to wave it goodbye.
Agreed, but better to do it sooner than later. Also, anecdotally, I constantly find myself expecting to_interval_values to be an annotation method, not a jamsframe method.
Maybe we should add a wrapper at the Annotation level, and put a deprecation notice in the JamsFrame method to give folks a chance to adapt?
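Something along these lines would do it; this is just a sketch, with placeholder bodies standing in for the real implementations:

```python
import warnings


class JamsFrame(object):
    # Stand-in for the real JamsFrame; only the deprecation shim is shown.
    def to_interval_values(self):
        warnings.warn('JamsFrame.to_interval_values is deprecated; '
                      'use Annotation.to_interval_values instead',
                      DeprecationWarning, stacklevel=2)
        intervals, values = [], []
        # ... the existing conversion logic would remain here ...
        return intervals, values


class Annotation(object):
    def __init__(self, data):
        self.data = data  # the underlying JamsFrame

    def to_interval_values(self):
        # New Annotation-level entry point.  A real refactor would move the
        # conversion logic here (or into a shared helper) so this path
        # doesn't trip the deprecation warning itself.
        return self.data.to_interval_values()
```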
Closed by merging #149
Opening this up for discussion again: what do people think about dropping pandas from jams? I'm not yet advocating this position, but I think it's worth thinking through.
Note, this would not affect the schema, only the data structures used in the python implementation to parse it.
Why do we use pandas?
Initially, the arguments in favor of parsing annotations as dataframes were laid out in #8 (conflated there with the schema refactoring).
Is it worth it?
It's been a couple of years at this point, so where do things currently stand?
... to_interval_values(), which is kind of a chore to use. However, I don't think there's a need for anything fancier than that.
More generally, because the JamsFrame class extends a class from a different package, which is occasionally subject to change, it's a long-term maintenance liability. It exists for two basic reasons: ...
Note that JamsFrames are much more restricted than DataFrames: they only ever have the same four columns, two of which have fixed type. This makes me think we could implement something much simpler that still fills our needs, but without the headache of pandas.
What would a replacement for jamsframe need?
... value and confidence fields retaining the same index structure. This may take some thought.
... Annotation object, since it has access to the namespace. By default, I would serialize row-major (sparse).
I figure the last two can be readily accomplished by a filter applied to the row iterator and fed back into the constructor. In particular, if the storage is sorted, this can be done efficiently with early stopping, so we don't need to filter on the whole dataframe to get a small slice.
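As a rough illustration of the filter-and-rebuild idea (ignoring the duration clipping a real trim would also need; the function and container names are hypothetical):

```python
def slice_rows(observations, start, end):
    """Yield observations whose onset lies in [start, end).

    Assumes the input is iterated in time-sorted order, so we can stop
    as soon as we pass `end` instead of scanning the whole collection.
    """
    for obs in observations:
        if obs.time >= end:
            break          # early stop: everything later is out of range
        if obs.time >= start:
            yield obs


# The filtered rows feed straight back into the container constructor,
# e.g. the hypothetical AnnotationData sketched earlier:
# sliced = AnnotationData(slice_rows(old_data, 5.0, 10.0))
```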
If we go forward on this, it might cause some headaches for upstream projects like muda, msaf, pumpp, etc. Probably a major version increment would make sense here.