Closed bmcfee closed 7 years ago
Okay, this PR is now functional in that all tests pass again. So if you never touched a JamsFrame
object directly in application code (and there are few reasons to do so), then everything should just work.
That said, the design is up for discussion though, particularly seeking input from @justinsalamon and @ejhumphrey.
The new implementation replaces the JamsFrame
object by an AnnotationData
object, which at the schema level covers both DenseObservation
and SparseObservationList
, though not using JObject
s because I don't think we should have sandbox support in the observation list.
Under the hood, AnnotationData
encapsulates the observations with an ordered (by time
) list of sparse observations. Observations are implemented as named tuples (time, duration, value, confidence)
, again not using JObject so that we can preserve slots and field ordering.
This namedtuple implementation also makes individual observations immutable, which I see as a Good Thing. If you want to modify observations, it's best to make a new annotation object and re-insert them after modification. Note that this is the reverse of how the DataFrame
implementation worked: the individual observations there were mutable (it's easy to modify entries in-place), but the container itself was much more rigid (adding and deleting observations was difficult). I think the mutable-container/immutable-observation model makes a lot more sense for the common cases (e.g., adding observations to an existing annotation).
You can iterate over observations in the usual way:
for obs in ann.data:
print(obs.time, obs.duration, obs.value, obs.confidence)
@justinsalamon and I kicked around the idea of having an export-to-dataframe
method, which should be pretty trivial to cook up. Similarly, we might want to consider adding other mir_eval-style converters, such as to_event_values
for instantaneous observations.
Pretty-printing is the last major hurdle, but I think we can get over that fairly easily.
well, first off, 💯 for a PR / RFC that deletes more code than it adds.
I do actually like that this gets us away from pandas (maybe as an optional dependency?), but we only trade dependencies. Is the development velocity on sortedcontainers
lower than pandas, and that's why it's a win?
question: didn't pandas provide some neat wins for muda? shifting annotations / observations in time and whatnot?
I do actually like that this gets us away from pandas (maybe as an optional dependency?), but we only trade dependencies. Is the development velocity on sortedcontainers lower than pandas, and that's why it's a win?
I think we still need pandas as a dependency for loading lab/csv files (utils, import scripts, etc). But it's good to get the hacky dataframe dependencies out from under the jams object hierarchy.
Sortedcontainers is a much simpler project, and seems reasonably mature at this point.
question: didn't pandas provide some neat wins for muda? shifting annotations / observations in time and whatnot?
It did, but that's not a huge deal. It's better IMO to make observations immutable and put a little more friction into annotation manipulation. Muda will need a little retrofitting, but it's probably about an afternoon's worth of work.
Note: coverage is down here because I added html rendering for AnnotationData
. I don't think unit tests are critical on that function.
Note: coverage is down here because I added html rendering for AnnotationData.
Would it be a good/bad idea, since we still have the pandas dependency anyhow, to do viz by exporting to a dataframe and then using the pandas built in viz? This could be nice in that you could use the variety of viz options offered by the pandas API for dataframes (e.g. peak(), head(), etc.)
Would it be a good/bad idea, since we still have the pandas dependency anyhow, to do viz by exporting to a dataframe and then using the pandas built in viz?
I don't think we should rely on it, but it's always an option. The dataframe converter is done anyway, so you can do
df = ann.data.to_dataframe()
df.head(10)
etc.
I thought some more about this PR last night, and I think I'm convinced that we should drop the separate AnnotationData
class, and promote most of its functionality up to the AnnotationData
level. The data
field will then be a SortedListWithKey
object, and the Annotation
object will handle all (de)serialization and management of observations.
As noted in the review threads, this would break some API for folks that expect the ann.data
object to behave in certain ways, but we're already doing that anyway. The result should be a slightly simpler API, eg:
ann.to_interval_values()
instead of
ann.data.to_interval_values()
etc.
Is there any reason not to do this within this PR?
This will likely break more things in libs that build on jams, but as you say if we break things might as well go all in if it's worth it.
If the data structure is completely hidden and only accesible via the API, we should make sure the API supports common operations e.g. checking if the annotation is empty.
The one possible caveat to this change might be that since an annotation also has other fields (metada, time, duration, sandbox), the lack of an explicit data object could get a tad confusing in certain cases, e.g. len(ann)=? print(ann)=? etc
If the data structure is completely hidden and only accesible via the API, we should make sure the API supports operations for e.g. checking if the annotation is empty.
Well, the data
field is specified in the schema as a list/array (actually, either a denseobservationlist or a sparseobservationlist), so we shouldn't make it entirely private.
The one possible caveat to this change might be that since an annotation also has other fields (metada, time, duration, sandbox), the lack of an explicit data object could get a tad confusing in certain cases, e.g. len(ann)=? print(ann)=? etc
I think it's reasonable to treat Annotation
as a collection of observations with some metadata on the side. Then we can define len(ann)
as the number of observations, iter(ann)
iterates over observations, and __contains__
tests if an observation belongs to the list (not that we really need that, but it doesn't hurt).
repr(ann)
currently returns a JObject-style repr string like <Annotation: data, annotation_metadata, ...>
, so we wouldn't necessarily need to change anything there. I wouldn't mind changing JObject's repr format overall, since it's not terribly informative for distinguishing between different objects of the same type, but that's a separate issue.
Amending my previous statement: I don't think we can consistently make Annotation
implement the Collection ABC because it would change the semantics inherited from JObject
for __len__
and __contains__
, which both apply to the object's __dict__
. OTOH there's no problem in just implementing __iter__
.
@ejhumphrey I think this one's stable for eyeballs again.
Coverage drop is primarily due to html rendering not being covered, but I'd prefer to push that to #93 and implement it across the board.
Couple thoughts, minor apologies if / when relevance is a stretch:
for idx, obs in ann.data.iterrows()
... is this a common pattern (or am I weird); and if it is common, are we throwing deprecation warnings for this usage? or is it gone outright?Not concerned with the coverage drop; does coveralls have a setting that says "don't worry about drops if the total coverage is over a certain level" (say, 98?)
I'd prefer to keep the strict coverage check in place, just so that we're aware of it at all times.
It has since occurred to me that all of my other projects using jams have things like for idx, obs in ann.data.iterrows() ... is this a common pattern (or am I weird); and if it is common, are we throwing deprecation warnings for this usage? or is it gone outright?
Gone outright: any previous access to the data
field is not guaranteed to work in general. We throw deprecation warnings for some things in 0.2.3, but it was too much of a mess to do it in full generality without also flagging internal access that was going to change anyway.
The idx, obs
pattern can be recuvered by
for idx, obs in enumerate(ann):
...
though there's almost never a good reason to want the index itself.
Hashed out in person with @ejhumphrey -- this one's good to go.
This is work in progress, but implements an alternate backend for annotation data.
TODO: