Schema refactor - Githubissues

bmcfee commented 8 years ago

Rehashing #40 after a conversation with @ejhumphrey

There are good arguments for splitting the JAMS schema into smaller pieces that can be shared and repurposed. Specifically, a database (e.g., a mongodb key-value store) for managing jams collections could be more reasonably structured (and more easily searchable) if the database contains individual annotation objects (indexed by track id) rather than full JAMS objects.

I propose that we refactor the jams schema so that annotations can exist independently of the JAMS file format. Of course, the JAMS file format will still use annotation definitions, so there should be no observable difference in the way JAMS files work*; put another way, the API for JAMS files stays the same, and all the changes would be under the hood.
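As a rough illustration of the per-annotation storage idea, here is a minimal sketch that breaks one JAMS-like dict into standalone annotation records and groups them by track. The `track_id` field, the helper names, and the in-memory index standing in for a database are all assumptions made for this example; none of them is part of the current JAMS schema or API.

```python
from collections import defaultdict


def split_annotations(jam, track_id):
    """Turn one JAMS-like dict into standalone annotation records."""
    records = []
    for ann in jam.get("annotations", []):
        record = dict(ann)             # shallow copy; the input is not modified
        record["track_id"] = track_id  # hypothetical foreign key, not in the schema
        records.append(record)
    return records


def index_by_track(records):
    """Group records by track_id, standing in for a database index."""
    index = defaultdict(list)
    for record in records:
        index[record["track_id"]].append(record)
    return dict(index)


# A stripped-down JAMS-like object with placeholder contents.
jam = {
    "file_metadata": {"title": "Example track", "duration": 180.0},
    "annotations": [
        {"namespace": "beat", "data": []},
        {"namespace": "chord", "data": []},
    ],
    "sandbox": {},
}

records = split_annotations(jam, track_id="track-0001")
index = index_by_track(records)
```

Each record can then be stored and queried on its own, which is the case the refactored schema would need to validate without a surrounding JAMS file.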

Digging in a bit more, the current schema looks like:

jams_schema
`- JAMS
   `- FileMetadata
   |  `- [more stuff]
   `- Annotations
   |  `- [more stuff]
   `- Sandbox

and the refactored schema might look like:

jams_common
`- Sandbox

jams_annotation
`- Annotations
   `- [more stuff]

jams_metadata
`- FileMetadata
   `- [more stuff]

jams_file
`- JAMS
   `- jams_metadata.FileMetadata
   `- jams_annotation.Annotations
   `- jams_common.Sandbox

What do folks think?

To make this happen, we'd have to get a better handle on json-schema inheritance, but I think it's totally possible.

We might have to tweak the schema ids, which might require a slight modification to the spec. Not sure about this yet.
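To make the proposed split concrete, here is a minimal sketch of four schema documents in which the file schema only points at definitions owned by the other three. The schema bodies are placeholders, the ids are assumed, and the small resolver is a simplified stand-in written for this example; real json-schema tooling resolves `$ref` against URI-style ids, which is where the id tweaks mentioned above would come in.

```python
# Four hypothetical schema documents, keyed by an assumed schema id.
SCHEMAS = {
    "jams_common": {"definitions": {"Sandbox": {"type": "object"}}},
    "jams_annotation": {
        "definitions": {
            "Annotations": {"type": "array", "items": {"type": "object"}}
        }
    },
    "jams_metadata": {"definitions": {"FileMetadata": {"type": "object"}}},
    "jams_file": {
        "type": "object",
        "properties": {
            "file_metadata": {"$ref": "jams_metadata#/definitions/FileMetadata"},
            "annotations": {"$ref": "jams_annotation#/definitions/Annotations"},
            "sandbox": {"$ref": "jams_common#/definitions/Sandbox"},
        },
    },
}


def resolve(ref, schemas=SCHEMAS):
    """Follow a '<schema id>#/<path>' reference into the registry."""
    schema_id, _, pointer = ref.partition("#")
    node = schemas[schema_id]
    for key in filter(None, pointer.split("/")):
        node = node[key]
    return node


def inline_refs(node, schemas=SCHEMAS):
    """Return a copy of node with every $ref replaced by its target."""
    if isinstance(node, dict):
        if "$ref" in node:
            return inline_refs(resolve(node["$ref"], schemas), schemas)
        return {key: inline_refs(value, schemas) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(value, schemas) for value in node]
    return node


# The flattened file schema is equivalent to today's single-document layout.
flat = inline_refs(SCHEMAS["jams_file"])
```

With this layout, `jams_annotation` can be used by itself to validate a single annotation record, while `jams_file` keeps describing the same overall structure as before.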

ejhumphrey commented 8 years ago

More related to this than worth spawning a new issue: I'd like to revisit / upvote a conversation about how identifiers / named entities are referenced in JAMS. For example, I'd like to tag a single annotation as being produced by some unique identifier, such that I can search a collection for all annotations performed by the same entity (human or algorithm). We've got the annotator dict, but it's a little too unconstrained to encourage any convention.

bmcfee commented 8 years ago

I'm not sure that fits under the scope of JAMS per se; remember the headaches about filenames in #5? We eventually decided that that's better handled at the application level -- for better or worse. I suspect that indexing annotation sources will have similar difficulties.

OTOH, if we do want to add support for foreign-key indexing (for tracks, annotators, etc), maybe it's worth reopening that discussion?

urinieto commented 8 years ago

Could we simply add a new identifier field in the annotator dictionary that is basically a unique hash produced by the annotator name, email, affiliation, etc?
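A minimal sketch of such a hash, assuming the identifier is derived from the name, email, and affiliation after trimming whitespace and lowercasing. The function name, the choice of SHA-256, and the normalization rules are assumptions for illustration, not an agreed JAMS convention.

```python
import hashlib


def annotator_id(name, email="", affiliation=""):
    """Derive a stable hex identifier from annotator fields (hypothetical helper)."""
    fields = [name, email, affiliation]
    # Normalize so trivial differences in case or spacing give the same id.
    normalized = "\n".join(field.strip().lower() for field in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Example values are made up.
first = annotator_id("Jane Doe", "jane@example.org", "Example Lab")
second = annotator_id(" jane doe ", "JANE@example.org", "example lab")
```

One caveat with this design: the id changes whenever any input field changes (a new email or affiliation, say), so it identifies a particular set of metadata rather than a person.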

ejhumphrey commented 8 years ago

I don't want to necessarily tell users what the namespace should be, but I think we could benefit from some standardization.

bmcfee commented 8 years ago

Maybe go rosetta-style? Let identifiers be a list of strings of the form id_space:id_string?

That will at least validate for syntax. If you want semantic validation, that's up to a separate indexing structure that should live outside of jams.

For example, the SALAMI annotators could be identified by salami:0001 or somesuch. Similarly for annotation tools (org:software:version -> qmul:sonic-visualiser:1.2, qmul:tony:2.0, jku:madmom:0.14.1, etc), and filenames could just be standard URLs.
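A syntax-only check for identifiers of this form might look like the sketch below. The allowed character set (letters, digits, underscore, dot, hyphen) and the function names are assumptions for illustration; URL-style filenames would need a separate rule, and semantic validation stays outside jams as described above.

```python
import re

# Two or more colon-separated segments; the character set is an assumption.
IDENTIFIER = re.compile(r"[A-Za-z0-9_.-]+(?::[A-Za-z0-9_.-]+)+")


def is_valid_identifier(identifier):
    """Check syntax only; whether the id refers to anything is not checked."""
    return IDENTIFIER.fullmatch(identifier) is not None


def split_identifier(identifier):
    """Split into (id_space, id_string) at the first colon."""
    if not is_valid_identifier(identifier):
        raise ValueError("not of the form id_space:id_string: %r" % identifier)
    id_space, _, id_string = identifier.partition(":")
    return id_space, id_string


identifiers = ["salami:0001", "qmul:sonic-visualiser:1.2", "jku:madmom:0.14.1"]
parsed = [split_identifier(identifier) for identifier in identifiers]
```

Splitting at the first colon keeps multi-part ids such as org:software:version intact as a single id_string under the org id_space.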

marl / jams

Schema refactor #92