marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License

JAMS beyond music? #24

Status: Open · bmcfee opened this issue 9 years ago

bmcfee commented 9 years ago

Just opening up a separate thread here (rather than the already bloated #13): is it worth considering designing JAMS to be extensible into domains outside of music/time-series annotation?

I think the general architecture is flexible enough to make this possible with roughly zero overhead, and it might be a good idea.

From what I can tell, all we'd have to do is restructure the schema a little so that "*Observation" is slightly more generic. We currently define two (arguably redundant) observation types that both encode tuples of (time, duration, value, confidence). It wouldn't be hard to extend this into multiple observation forms: for images with bounding-box annotations, say, we would have (x, x_extent, y, y_extent, value, confidence); for video, (x, x_extent, y, y_extent, t, duration, value, confidence); etc.
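
For concreteness, here's a rough sketch of what a single observation might look like under each form, written as plain Python dicts (field names and values are just illustrative, not a schema proposal):

```python
# Sketch only: one observation record per proposed form, as Python dicts.

# Time-based observation (what JAMS encodes today):
time_obs = {"time": 12.5, "duration": 3.0, "value": "A:min", "confidence": 0.9}

# Hypothetical bounding-box observation for images:
image_obs = {
    "x": 40, "x_extent": 128,   # horizontal offset and width
    "y": 60, "y_extent": 96,    # vertical offset and height
    "value": "guitarist",
    "confidence": 0.8,
}

# Hypothetical spatio-temporal observation for video:
video_obs = {
    "x": 40, "x_extent": 128,
    "y": 60, "y_extent": 96,
    "t": 12.5, "duration": 3.0,
    "value": "guitarist",
    "confidence": 0.8,
}
```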

Within the schema, nothing would really change, except that we'd rename "DenseObservation" to "DenseTimeObservation" (and analogously for Sparse), and then, some time down the road, allow other observation schemas to be added.

I don't think we need to tackle this for the immediate (next) release, except insofar as we should design now so that it can be supported later in a backwards-compatible way.

Opinions?

urinieto commented 9 years ago

Yes, I thought of the "image" application of JAMS as well. We should do a bit of research into what people in image/video processing use to annotate their datasets. But I like the idea of extending this, at least for a future release after the first "official" one.

justinsalamon commented 9 years ago

I think a lower-hanging fruit would be non-music audio datasets (e.g. environmental sounds). I'm probably biased, but I feel this is an area where the need for annotated datasets is growing rapidly and would require minimal (or zero?) additional work to accommodate, right? Oh, and there's speech too...

ejhumphrey commented 9 years ago

Forgive overlap with other issues that escape my memory, but it seems lyrics fall into this conversation too, no?

bmcfee commented 9 years ago

I think a lower-hanging fruit would be non-music audio datasets (e.g. environmental sounds).

It depends on what the annotations look like, but I would expect most of this data to look like tag_* annotations, just like we already support. The point I was getting at originally is not so much the domain of the data, but the way in which the extent of an annotation is encoded.
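
For example, an environmental sound event would already fit the existing observation tuple as-is (values made up):

```python
# Sketch only: an environmental-sound event using the existing
# (time, duration, value, confidence) tuple, tag_*-style.
sound_event = {"time": 4.2, "duration": 1.7, "value": "car_horn", "confidence": 1.0}
```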

Forgive overlap with other issues that escape my memory, but it seems lyrics fall into this conversation too, no?

We already support lyrics with the current schema... afaik, nothing needs to change?

bmcfee commented 7 years ago

I was thinking about this today while talking to some folks working on speech / general audio. One of the issues there is that our metadata schema might not be appropriate for non-music annotations.

I think this issue could actually be merged with #98 / a schema refactor that promotes all jams classes to top-level definitions.

The reasoning here is that if we move FileMetadata up a level, we can then have it as a base class that's inherited by things like MusicMetadata, SpeechMetadata, etc. The JAMS schema would then allow an annotation to have metadata belonging to any of those particular formats. This is a pretty minimal change; it would be backward-compatible and would open JAMS up to a broader class of applications.
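
Something like the following could work at the schema level (just a sketch; the specific metadata fields are placeholders, not a concrete proposal), with FileMetadata as the base definition and the variants extending it via allOf:

```python
# Sketch only: top-level metadata definitions with FileMetadata as a base,
# extended by domain-specific variants; field names are placeholders.
FILE_METADATA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "duration": {"type": "number"},
    },
}

MUSIC_METADATA = {
    "allOf": [
        {"$ref": "#/definitions/FileMetadata"},
        {"properties": {"artist": {"type": "string"},
                        "release": {"type": "string"}}},
    ]
}

SPEECH_METADATA = {
    "allOf": [
        {"$ref": "#/definitions/FileMetadata"},
        {"properties": {"speaker": {"type": "string"},
                        "language": {"type": "string"}}},
    ]
}

SCHEMA_FRAGMENT = {
    "definitions": {
        "FileMetadata": FILE_METADATA,
        "MusicMetadata": MUSIC_METADATA,
        "SpeechMetadata": SPEECH_METADATA,
    },
    "properties": {
        "file_metadata": {
            "oneOf": [
                {"$ref": "#/definitions/MusicMetadata"},
                {"$ref": "#/definitions/SpeechMetadata"},
            ]
        }
    },
}
```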

Similarly, we could abstract the Observation type into things like Observation1D and Observation2D, which would have (time, duration) fields for temporal localization and (x, x_extent, y, y_extent) fields for spatial localization, respectively. This again would broaden the utility of JAMS beyond music/audio, and make it applicable to things like images and video, without much effort on our end.
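
Roughly like this (again just a sketch, not the actual schema; the constraints are illustrative):

```python
# Sketch only: temporally- vs spatially-localized observation definitions,
# with annotation data accepting either form.
OBSERVATION_1D = {
    "type": "object",
    "properties": {
        "time": {"type": "number", "minimum": 0},
        "duration": {"type": "number", "minimum": 0},
        "value": {},
        "confidence": {},
    },
    "required": ["time", "duration", "value", "confidence"],
}

OBSERVATION_2D = {
    "type": "object",
    "properties": {
        "x": {"type": "number"}, "x_extent": {"type": "number", "minimum": 0},
        "y": {"type": "number"}, "y_extent": {"type": "number", "minimum": 0},
        "value": {},
        "confidence": {},
    },
    "required": ["x", "x_extent", "y", "y_extent", "value", "confidence"],
}

ANNOTATION_DATA = {
    "type": "array",
    "items": {"oneOf": [OBSERVATION_1D, OBSERVATION_2D]},
}
```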

What do folks think? @ejhumphrey @justinsalamon? EDIT: tagging @stevemclaugh

bmcfee commented 7 years ago

Thinking about this more: a complication here is that dynamically reconstructing the corresponding jams class for alternate metadata schemas could get tricky.

We get around this (oneOf types) in annotation (dense vs sparse observation) by using the same internal data store for both types (so it doesn't matter when loading), and by having an extra field in the namespace definition that determines which class to use when saving. I'd like to avoid generalizing this kind of hack to bigger class definitions; maybe there's a way to probe the schema validator to know which part of the schema matched when the input is validated on load?
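
One crude way to probe would be to validate the incoming dict against each candidate sub-schema and dispatch on the first match (sketch only; the stand-in schemas below are not real jams definitions):

```python
# Sketch only: dispatch on whichever candidate sub-schema validates the input,
# rather than storing an extra type field in the file.
from jsonschema import Draft4Validator

# Minimal stand-in schemas; the real candidates would be the metadata
# (or observation) definitions above.
MUSIC_SCHEMA = {"type": "object", "required": ["artist"]}
SPEECH_SCHEMA = {"type": "object", "required": ["speaker"]}

CANDIDATES = [("MusicMetadata", MUSIC_SCHEMA), ("SpeechMetadata", SPEECH_SCHEMA)]

def resolve_schema(raw):
    """Return the name of the first candidate schema that validates `raw`."""
    for name, schema in CANDIDATES:
        if Draft4Validator(schema).is_valid(raw):
            return name
    raise ValueError("no candidate schema matched the input")

print(resolve_schema({"artist": "..."}))   # -> MusicMetadata
print(resolve_schema({"speaker": "..."}))  # -> SpeechMetadata
```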

bmcfee commented 7 years ago

The above might be resolved if we specify a type mapping for all schema objects: https://python-jsonschema.readthedocs.io/en/latest/validate/#validating-with-additional-types
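
Something along these lines, if I'm reading that page right (sketch only; the `types` keyword shown here is version-dependent, and newer jsonschema releases replace it with a TypeChecker):

```python
# Sketch only: tell the validator that a jams-side class counts as a JSON
# "object", so already-constructed objects can be validated directly.
# NB: the `types` kwarg is deprecated/removed in newer jsonschema versions.
from jsonschema import Draft4Validator

class SparseObservation(object):
    """Hypothetical stand-in for a jams class backing a schema 'object'."""
    pass

schema = {"type": "object"}

validator = Draft4Validator(schema, types={"object": (dict, SparseObservation)})
print(validator.is_valid(SparseObservation()))  # -> True
```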

ejhumphrey commented 7 years ago

I very much agree about generalization, and have wondered about this since my days hacking away at OMR (which was almost jamsy). I wonder if something like CrowdFlower would be interested in collab'ing on their image annotator...