marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License
186 stars 26 forks source link

alternate AnnotationData backend #149

Closed bmcfee closed 7 years ago

bmcfee commented 7 years ago

This is work in progress, but implements an alternate backend for annotation data.

TODO:

bmcfee commented 7 years ago

Okay, this PR is now functional in that all tests pass again. So if you never touched a JamsFrame object directly in application code (and there are few reasons to do so), then everything should just work. That said, the design is up for discussion though, particularly seeking input from @justinsalamon and @ejhumphrey.

The new implementation replaces the JamsFrame object by an AnnotationData object, which at the schema level covers both DenseObservation and SparseObservationList, though not using JObjects because I don't think we should have sandbox support in the observation list.

Under the hood, AnnotationData encapsulates the observations with an ordered (by time) list of sparse observations. Observations are implemented as named tuples (time, duration, value, confidence), again not using JObject so that we can preserve slots and field ordering.

This namedtuple implementation also makes individual observations immutable, which I see as a Good Thing. If you want to modify observations, it's best to make a new annotation object and re-insert them after modification. Note that this is the reverse of how the DataFrame implementation worked: the individual observations there were mutable (it's easy to modify entries in-place), but the container itself was much more rigid (adding and deleting observations was difficult). I think the mutable-container/immutable-observation model makes a lot more sense for the common cases (e.g., adding observations to an existing annotation).

You can iterate over observations in the usual way:

for obs in ann.data:
    print(obs.time, obs.duration, obs.value, obs.confidence)

@justinsalamon and I kicked around the idea of having an export-to-dataframe method, which should be pretty trivial to cook up. Similarly, we might want to consider adding other mir_eval-style converters, such as to_event_values for instantaneous observations.

Pretty-printing is the last major hurdle, but I think we can get over that fairly easily.

ejhumphrey commented 7 years ago

well, first off, 💯 for a PR / RFC that deletes more code than it adds.

I do actually like that this gets us away from pandas (maybe as an optional dependency?), but we only trade dependencies. Is the development velocity on sortedcontainers lower than pandas, and that's why it's a win?

question: didn't pandas provide some neat wins for muda? shifting annotations / observations in time and whatnot?

bmcfee commented 7 years ago

I do actually like that this gets us away from pandas (maybe as an optional dependency?), but we only trade dependencies. Is the development velocity on sortedcontainers lower than pandas, and that's why it's a win?

I think we still need pandas as a dependency for loading lab/csv files (utils, import scripts, etc). But it's good to get the hacky dataframe dependencies out from under the jams object hierarchy.

Sortedcontainers is a much simpler project, and seems reasonably mature at this point.

question: didn't pandas provide some neat wins for muda? shifting annotations / observations in time and whatnot?

It did, but that's not a huge deal. It's better IMO to make observations immutable and put a little more friction into annotation manipulation. Muda will need a little retrofitting, but it's probably about an afternoon's worth of work.

bmcfee commented 7 years ago

Note: coverage is down here because I added html rendering for AnnotationData. I don't think unit tests are critical on that function.

justinsalamon commented 7 years ago

Note: coverage is down here because I added html rendering for AnnotationData.

Would it be a good/bad idea, since we still have the pandas dependency anyhow, to do viz by exporting to a dataframe and then using the pandas built in viz? This could be nice in that you could use the variety of viz options offered by the pandas API for dataframes (e.g. peak(), head(), etc.)

bmcfee commented 7 years ago

Would it be a good/bad idea, since we still have the pandas dependency anyhow, to do viz by exporting to a dataframe and then using the pandas built in viz?

I don't think we should rely on it, but it's always an option. The dataframe converter is done anyway, so you can do

df = ann.data.to_dataframe()
df.head(10)

etc.

bmcfee commented 7 years ago

I thought some more about this PR last night, and I think I'm convinced that we should drop the separate AnnotationData class, and promote most of its functionality up to the AnnotationData level. The data field will then be a SortedListWithKey object, and the Annotation object will handle all (de)serialization and management of observations.

As noted in the review threads, this would break some API for folks that expect the ann.data object to behave in certain ways, but we're already doing that anyway. The result should be a slightly simpler API, eg:

ann.to_interval_values()

instead of

ann.data.to_interval_values()

etc.

Is there any reason not to do this within this PR?

justinsalamon commented 7 years ago

This will likely break more things in libs that build on jams, but as you say if we break things might as well go all in if it's worth it.

If the data structure is completely hidden and only accesible via the API, we should make sure the API supports common operations e.g. checking if the annotation is empty.

The one possible caveat to this change might be that since an annotation also has other fields (metada, time, duration, sandbox), the lack of an explicit data object could get a tad confusing in certain cases, e.g. len(ann)=? print(ann)=? etc

bmcfee commented 7 years ago

If the data structure is completely hidden and only accesible via the API, we should make sure the API supports operations for e.g. checking if the annotation is empty.

Well, the data field is specified in the schema as a list/array (actually, either a denseobservationlist or a sparseobservationlist), so we shouldn't make it entirely private.

The one possible caveat to this change might be that since an annotation also has other fields (metada, time, duration, sandbox), the lack of an explicit data object could get a tad confusing in certain cases, e.g. len(ann)=? print(ann)=? etc

I think it's reasonable to treat Annotation as a collection of observations with some metadata on the side. Then we can define len(ann) as the number of observations, iter(ann) iterates over observations, and __contains__ tests if an observation belongs to the list (not that we really need that, but it doesn't hurt).

repr(ann) currently returns a JObject-style repr string like <Annotation: data, annotation_metadata, ...>, so we wouldn't necessarily need to change anything there. I wouldn't mind changing JObject's repr format overall, since it's not terribly informative for distinguishing between different objects of the same type, but that's a separate issue.

bmcfee commented 7 years ago

Amending my previous statement: I don't think we can consistently make Annotation implement the Collection ABC because it would change the semantics inherited from JObject for __len__ and __contains__, which both apply to the object's __dict__. OTOH there's no problem in just implementing __iter__.

bmcfee commented 7 years ago

@ejhumphrey I think this one's stable for eyeballs again.

Coverage drop is primarily due to html rendering not being covered, but I'd prefer to push that to #93 and implement it across the board.

ejhumphrey commented 7 years ago

Couple thoughts, minor apologies if / when relevance is a stretch:

bmcfee commented 7 years ago

Not concerned with the coverage drop; does coveralls have a setting that says "don't worry about drops if the total coverage is over a certain level" (say, 98?)

I'd prefer to keep the strict coverage check in place, just so that we're aware of it at all times.

It has since occurred to me that all of my other projects using jams have things like for idx, obs in ann.data.iterrows() ... is this a common pattern (or am I weird); and if it is common, are we throwing deprecation warnings for this usage? or is it gone outright?

Gone outright: any previous access to the data field is not guaranteed to work in general. We throw deprecation warnings for some things in 0.2.3, but it was too much of a mess to do it in full generality without also flagging internal access that was going to change anyway.

The idx, obs pattern can be recuvered by

for idx, obs in enumerate(ann):
    ...

though there's almost never a good reason to want the index itself.

bmcfee commented 7 years ago

Hashed out in person with @ejhumphrey -- this one's good to go.