w3c / mnx

Music Notation CG next-generation music markup proposal.

Define audio synchronization in absence of graphical score #67

Closed joeberkovitz closed 6 years ago

joeberkovitz commented 6 years ago

@adrianholovaty raised the issue at the Anaheim CG meeting Jan. 26, 2018, that it would be nice to leverage GMNX's musical time synchronization features for audio performances even in the absence of an SVG graphical score. An application could use the sync info to display cursors in relation to a CWMNX rendering, for instance.

It does seem possible to do this by more-or-less allowing GMNX to omit the graphics but still include performance content. That is a bit ugly, and perhaps this spurs some re-examination of whether performance content should be packaged in the same MNX score body as graphics -- if the chunks were separable, then performance content could be standalone in a cleaner way.

notator commented 6 years ago

The written minutes of the meeting say:

Adrian Holovaty asked how we could get performance synchronization points from GMNX into CWMNX? The synchronization feature in GMNX would be useful for applications that do know the semantics of music notation. Joe asked Adrian to file an issue so we can address this.

Adrian seems to be assuming that the SVG can contain no music-semantical information. That's not the case. I'm proposing that the SVG elements in GMNX should be strongly classed, and that there should be a standard container structure for each defined notation type. For example, the CWMN flavour of GMNX would have containers that are SVG <g> elements having class="system", class="staff" etc. Given that information, it should be possible to parse the SVG into a CWMN editor, or some other application that understands the defined classes.

If, as is also possible, the SVG contains standardized cursor information, then that can also be used by the importing application.

Hope that helps.

adrianholovaty commented 6 years ago

@notator Such a system would work well...if you have SVG! :-) For my own application, I have synchronization data and symbolic music notation — and SVG isn't involved anywhere. I don't want to have to build a whole SVG-generation engine merely to be able to export synchronization data; that would be a non-starter.

It would be nice for the spec to support synchronization data for symbolic notation (CWMNX), rather than requiring GMNX.

Nothing complex is necessary. The format Soundslice uses is a simple array of syncpoints, where each syncpoint is in this format:

[bar_index, timecode, percent_into_bar]

See Soundslice API documentation here for more details.
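
For illustration only (these values are invented here, not taken from the Soundslice docs), a short list of syncpoints in that format might look like:

[
  [0, 1.43, 0],
  [1, 2.43, 0],
  [1, 2.98, 0.5],
  [2, 3.53, 0]
]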

notator commented 6 years ago

Comment deleted 05.04.2018

notator commented 6 years ago

@adrianholovaty The synchronizing feature works in my Après un rêve example because it is synchronizing particular performances (instantiations) with a particular score (instantiation). Exact synchronization like that can't be done without having instantiations. The instantiations in this case are .mp3s and an .svg file. If your application deals with instantiations, then I think you could use a similar approach without having to convert everything to SVG.

Apropos converters: It should be possible to convert MusicXML to cwmnGMNX by loading the MusicXML into a CWMN music editor (to create an instantiation) and then exporting cwmnGMNX. It should also be possible to do that in a simpler, standard converter that uses standard instantiation values.

joeberkovitz commented 6 years ago

@adrianholovaty Here's an initial suggestion to kick off the active review of this issue, which the chairs think is an important one.

First, let's acknowledge that we should never assume that all applications will use SVG as a graphical engine. Even on the web platform itself, many web applications use Canvas 2D, which is an equally legitimate web graphics API with its own strengths. (Since I designed and wrote the SVG-based layout engine for Noteflight, this is hardly a partisan argument!)

Second, the audio/performance dimension of MNX-Generic is just as legitimate an "instance" as the graphical dimension. Put humorously, the "G" in "GMNX" didn't necessarily stand for "Graphics".

Taking both into account, here's a starter proposal: Make the <score-view> element of MNX-Generic optional. This permits MNX-Generic instances to omit graphics altogether if they wish. (It also means that MNX-Generic documents lacking graphics would not be renderable in a standalone fashion, a possible downside.)

Note that the <performance-audio> and <performance-data> elements already establish a piecewise linear mapping from performance time to "notated time" via the <performance-tempo> feature. If notated time is expressed in CWMN note values (this decision is left up to the encoder), applications like Soundslice can directly sync a displayed score with external media or performance data, by figuring out where in the score a given notated time occurs. There would not be any reference to "bar numbers" as in Soundslice's preferred data structure, but, hey, we're trying to keep MNX-Generic generic :-)
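
As an illustration of what this piecewise linear mapping means for a consuming application, here is a minimal sketch (not part of any spec; it assumes the mapping has already been extracted into (performance time, notated time) pairs sorted by performance time, with sample values borrowed from an example later in this thread):

from bisect import bisect_right

def notated_time_at(sync, t):
    """Linearly interpolate notated time for performance time t."""
    times = [p[0] for p in sync]
    i = bisect_right(times, t) - 1
    i = max(0, min(i, len(sync) - 2))  # clamp to the outermost segments
    (t0, n0), (t1, n1) = sync[i], sync[i + 1]
    return n0 + (n1 - n0) * (t - t0) / (t1 - t0)

# e.g. notated_time_at([(1.234, 0.0), (1.723, 0.25), (1.913, 1.0)], 1.8) ~= 0.55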

adrianholovaty commented 6 years ago

@joeberkovitz Thanks for picking the discussion back up. Honestly, I think this proposal is too heavyweight and abstract for my own purposes with Soundslice. I couldn't see us using it.

In addition, I worry that removing <score-view> would severely water down the usefulness of MNX-Generic. As it stands, it's conceptually clear and concise — "a wrapper format linking SVG with audio/performance data" (if I'm indeed understanding it correctly). If we made the SVG part optional, that would make it much more complex to reason about. Please don't do this on my account! :-)

I believe the particular problem of "how to sync performances to symbolic notation" would best be addressed in some other way.

joeberkovitz commented 6 years ago

Thanks @adrianholovaty -- this is helpful. Do you have a sense of what this "other way" might look like?

adrianholovaty commented 6 years ago

@joeberkovitz Sure, a good starting point would be the format we use for Soundslice (see "Syncpoint data format" in the docs here).

The main trickiness is that a performance has "expanded" bars, whereas symbolic notation has unexpanded bars. For example, in a two-bar score with a repeat barline at the end, there are two unexpanded bars and four expanded bars — a performer plays four bars total, whereas the score displays only two bars. Hence there's an opportunity for a distinct syncpoint for each of the four expanded bars. Syncpoints operate in the realm of expanded bars.
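
A tiny invented illustration of that distinction (not from the Soundslice docs): syncpoints index expanded bars, and each expanded bar maps back to the unexpanded bar the score displays.

# Two unexpanded bars with a repeat -> four expanded bars.
expanded_to_unexpanded = [0, 1, 0, 1]  # expanded bar index -> displayed bar index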

But before I go overboard with more on the finer points of syncpoints... To be clear, I think we're talking about a completely separate thing from MNX-Generic. It would be MN-SYNC or something. :-) Is this something worth considering, or is it out of scope?

joeberkovitz commented 6 years ago

@adrianholovaty To be honest, I'm first looking to see whether there is an opportunity to make MNX-Generic actually work for this case, in a way that's not painful for you and others with similar use cases. I don't want to start out with a perspective that it can't work. It might not work, of course, but before considering any scope expansion, I feel like it's important to explore the MNX-Generic angle, as long as it's a natural extension of the "instantiation" concept and not something forced.

I've read through your syncpoint data format docs and will try to respond shortly with a proposal. Then we can see where it falls short.

joeberkovitz commented 6 years ago

Here's one way to make this work, by adding one new element to MNX-Generic.

First, as previously suggested, make the graphics in MNX-Generic optional.

Then, define a new element <performance-sync start="{realtime}" notated="{notated-time}" semantics="{element-id}"/>. This can occur anywhere within the <performance-mapping> element, so it's associated with a recording, of which there can be any number. In the case of CWMN, element-id can refer to either a measure or an event symbol in the associated MNX-Common encoding. So you can have something like this:

<performance-audio>
  <performance-audio-media url="recording.mp4"/>
  <performance-mapping>
    <performance-sync start="1.234" notated="0.00" semantics="measure1"/>
    <performance-sync start="1.723" notated="0.25"/>
    <performance-sync start="1.913" notated="1" semantics="measure2"/>
    <performance-sync start="2.611" notated="2" semantics="measure3"/>
   ...
  </performance-mapping>
</performance-audio>

In the case of CWMN, notated time would be in whole notes from the start of the score. This is to avoid depending on the CWMN-centric notion of measures (and the spec just speaks of an abstract "notated time" dimension with no particular units). But the above still allows one to do intra-measure sync points, because one's only interested in the difference between successive time values, not so much in their absolute values. In fact, the references to measures above are somewhat optional, since the offset into the score is enough to completely determine the measure.

I didn't include any sync points to events, but those are easy enough to imagine: semantics just points to an event rather than a measure.

Obviously, it's also the case that this doesn't require any changes to MNX-Common.

notator commented 6 years ago

Maybe I'm wrong, but if semantics pointed to an event, then I think you could leave out the notated attribute. That would completely remove any CWMN-centricity from MNX-Generic. :-)

joeberkovitz commented 6 years ago

There's really nothing CWMN-centric about "notated time" in MNX-Generic. It's just a generic time dimension, in whatever units the notational system prefers. It could be South Indian tala or Ikuta time-units. It could even be seconds, for a score notated as a pure time graph (in which case the mapping to notated is still useful, since even time graph scores are not performed with perfect accuracy). For audio that links back to an MNX-Common score, though, it makes sense to use MNX-Common's unit of a whole note, and treat this as an offset into the score. In fact, that makes semantics optional, at least for measures (although not for events).

clnoel commented 6 years ago

@joeberkovitz In general, as long as an encoder can find SOME set of numbers that can be used to correlate two audio performances in a piece-wise linear fashion, it doesn't matter what the units are. I think that the "number of whole notes into the piece" for the units of the notated attribute does make sense for MNX-Common pieces.

The use of the semantics attribute is only necessary if we are trying to broaden this so that it will work with a visual MNX-Common score as well. If we are doing that, keep in mind (as has been pointed out in other locations) that there is a many-to-many mapping between performance and events, so having a single event-id in the semantics attribute is insufficient, if only because each part has its own set of events. Measure # as a semantic reference is certainly inadequate, given pick-up measures, time signature changes, and multi-metric pieces.

notator commented 6 years ago

Sorry, I was confusing the hypothetical performance-sync with performance-region.

@adrianholovaty said

To be clear, I think we're talking about a completely separate thing from MNX-Generic. It would be MN-SYNC or something. :-)

I think this thread may be a bit confused, because I was (until a couple of weeks ago) trying to redefine GMNX (which has now become MNX-Generic), and was using the term in connection with my Après un rêve demo. So @adrianholovaty may originally have been using the term in my sense rather than the chairs'. I'm no longer trying to redefine MNX-Generic, but am instead proposing a new profile for MNX-Common (in #95). That would, I think, be a better solution to @adrianholovaty's problem than trying to extend MNX-Generic. Sorry for the confusion.

clnoel commented 6 years ago

I think I have been confused by the title of this topic. Are we talking about NO graphical score, where we are trying to find the links between two audio files, or about trying to link an MNX-Generic file to a semantic score of some flavor? Although there might be similar solutions, those are very different use cases.

joeberkovitz commented 6 years ago

@clnoel We're talking about linking any number of audio files (in an MNX-Generic document) to a semantic score (in an MNX-Common document). The point being that an application may be using dynamic rendering of MNX-Common in conjunction with audio playback -- really a pretty important use case.

As a side effect of doing that, one also can discover linkage between any of the audio files, but that's not the motivation behind this issue.

joeberkovitz commented 6 years ago

The co-chairs discussed this today and we're aligned on the approach of supporting this through new encoding elements in MNX-Common (not MNX-Generic) which map real time ranges (whose endpoints are clock time in the performance) to notated time ranges (whose endpoints are measure IDs or numbers, with beat offsets). The encoding itself still needs definition, but it's fairly simple and runs along the lines of what @adrianholovaty originally suggested.

This only needs to be supported for syncing recorded audio, since an application can generate synthesized audio directly from MNX-Common (which in turn can include any desired performance data via <interpret>), and the sync points of synthesized audio are obviously already known by the app doing the synthesis.

Note that such synchronization need not reference events in any way, since the measure and time offsets of any event in an MNX-Common document are already known.

(Edit: I should have mentioned, these mappings would of course have to be per-recording, not global to the score)

joeberkovitz commented 6 years ago

Here is a more concrete proposal. The plan is to make a pull request after an opportunity for more feedback.

  1. Modify MNX-Common to introduce a new element, <score-audio>, which specifies both a set of audio media files and their temporal synchronization with the score. The element is a child of a <global> element, which defines the mensural scheme to which these media are synchronized.

  2. One or more <score-audio-media> children of <score-audio>. Each element defines the audio media file that makes up one "logical track" of the recording. An optional part='...' attribute can specify the ID of a <part> element which is the part audible in a given logical track. All logical tracks are synchronized with each other. (This feature satisfies a requirement to be able to control a mix of separate recorded parts -- multichannel audio doesn't work well for this since it's geared to channel layouts like mono, stereo, 5.1, etc.)

  3. One or more <score-audio-sync> children of <score-audio>. Each element defines a correspondence between a time attribute (offset into the audio in seconds), a measure attribute (a measure index within the parent <global>), and a position attribute (metrical position within the measure, in whole-note units). These correspondences are essentially the same information as described by @adrianholovaty in https://github.com/w3c/mnx/issues/67#issuecomment-361926742 . At least two <score-audio-sync> elements must be present. These may be present with any desired degree of granularity to describe changes in tempo; these changes are instantaneous, but for visual sync purposes such an approximation works very well, even with a fairly coarse time quantum.

A simple example with only a single logical track:


<mnx-common>
  <global>
    <score-audio>
       <score-audio-media src="recording.mp4"/>
       <score-audio-sync time="1.43" measure="1" position="0"/>
       <score-audio-sync time-"2.43" measure="2" position="0"/>
       <score-audio-sync time-"2.98" measure="2" position="0.5"/>   ...slowing down...
       <score-audio-sync time-"3.53" measure="3" position="0"/>
       ...et cetera...
    </score-audio>
    <measure>...</measure>
    <measure>...</measure>
    <measure>...</measure>
     ...
  </global>
  <part>...</part>
</mnx-common>
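
Item 2 of the proposal also allows several synchronized logical tracks per recording. A hedged sketch of what that might look like (the part IDs and file names here are invented):

<score-audio>
   <score-audio-media src="violin.mp4" part="p1"/>
   <score-audio-media src="piano.mp4" part="p2"/>
   <score-audio-sync time="1.43" measure="1" position="0"/>
   ...
</score-audio>
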
clnoel commented 6 years ago

I don't like having these be a child of <global>, since that makes it really hard to sync multi-metric pieces, or other pieces with more than one <global> element. It should be a child of <mnx-common> directly.

If we want to sync using measure counts rather than total whole-note count into the piece, we should indicate which <global> we are using as our reference in either the <score-audio> or (preferably) the <score-audio-sync> elements, using an id. The id is optional if there is only one <global>. Since we haven't really defined how we intend to sync semantics through more than one <global>, I can't go farther into describing how we'll sync audio.

Also, explicitly, there should be allowed to be more than one <score-audio> element in the <mnx-common> element, each for a different recording. (I like having several <score-audio-media> elements to represent different tracks of the same "recording". Good thought!)

joeberkovitz commented 6 years ago

@clnoel: given that we are identifying measures and offsets for sync points, isn't it necessary to decide which <global> element's measure definitions will apply? Having the score audio material be a child of <global> is just a way to force this decision. One could also do it with an element ID, but why is that better? I'm having trouble seeing why syncing multimetric music is any harder as a result of being forced to use sync points in only one of the <global> metrical schemes. Any given sync point expressed in one of the metrical schemes, can also be expressed in terms of any of the others.

BTW, in the above proposal, it is permissible to have any number of <score-audio> elements as you requested. If we moved them to <mnx-common>, this would still make sense.

adrianholovaty commented 6 years ago

The <score-audio> proposal sounds mostly good to me. Some responses:

  1. I'd suggest being explicit in the spec that <score-audio-sync> elements must be in chronological order. Lazy developers will likely assume that anyway — so guaranteeing chronological order would remove the risk of these lazy developers parsing a document and neglecting to reorder manually.

  2. I believe position should be a percentage into the bar, instead of whole-note units. Two reasons:

    • Using whole-note units introduces the possibility that a position lies outside the bar (e.g., position=1.25 in a 4/4 bar). This is much harder to validate than a simple percentage (which can only be between 0 and 1). To validate whole-note positions, you need access to the notation data. (Note that if we decide to use percentages, we'd have to point out that "100% into bar 1" is the same thing as "0% into bar 2" — an edge case developers will need to deal with.)
    • If a notation author changes the time signature from 3/4 to 6/8, the <score-audio> positions (as whole-note units) would all need to be recalculated. If they were percentages, they wouldn't need to be recalculated. (Not a strong reason, to be sure; more of a secondary reason.)

    These are the reasons I opted for percentages in my own implementation of syncpoints for Soundslice, which has served us well for just over four years now. (A small sketch of the validation difference appears after this list.)

  3. How does this proposal handle the concept of "expanded bars" as mentioned in my comment above? This is crucial to handle, and it's non-trivial because some repeat situations may not have consistent behavior across applications. An example: "During the second pass of a D.C. al Coda, if you encounter a 'Da Coda' within repeat bars, do you honor the repeat bars, or do you take the 'Da Coda' jump the first time through?"

  4. Finally, if we want to get super abstract and truly handle any musical situation, we should consider enabling separate syncpoints for every performer within a single recording. There are subtle differences in timing between the members of an ensemble: for example, a bass player could be playing ahead of the beat, a vocalist could be playing behind the beat, etc. The concept of "The Beat" could be considered a weighted average of all the musicians' individual sense of the beat. You could imagine a synced playback UI that highlights each part's notes differently, according to the instrument-specific syncing (and this is indeed something I've considered for Soundslice). Of course, I believe this is highly impractical and over-engineered for 99% of use cases.
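
As a minimal sketch of the validation difference argued in point 2 above (the helper names are invented, and it assumes a whole-note position must lie inside its measure):

from fractions import Fraction

def valid_percent(position: float) -> bool:
    # Self-contained: no notation data required.
    return 0.0 <= position <= 1.0

def valid_whole_note(position: Fraction, beats: int, beat_unit: int) -> bool:
    # Requires the measure's time signature: a 4/4 measure is one whole
    # note long, a 3/4 measure is 3/4 of a whole note, 6/8 is 6/8 (= 3/4), etc.
    return Fraction(0) <= position < Fraction(beats, beat_unit)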

joeberkovitz commented 6 years ago

@adrianholovaty thanks for the points! Coming back to this (I've been on vacation for a bit) I have a few responses and questions for you if that's OK -- would love to wrap this up...

  1. Yes, chronological order sounds like a good constraint.

  2. You are arguing that percentages make validation easier since one doesn't need the notation data, but they make interpretation harder for exactly the same reason: one needs the same notation data that validation no longer requires. We already have many quantities whose validation requires a measure length, so using a common one like position (used in all directions among other places) still seems better to me than making a special deviation and using percentage/proportion in this case. As an illustration of my point, would we want to see all directions placed in the measure using percentages instead of whole-note fractional offsets?

  3. I agree of course that form instructions are not honored consistently across performances and applications. But rather than trying to complicate this issue with the form concerns that are already raised in #99, there's a simpler solution that doesn't interact with form at all. Let's require that for every point where there is a jump in the performance (where "jump" means a transition to a measure that is not the immediate notated successor of the previous one in the score, ignoring all form instructions), a syncpoint must exist to reference the start of the measure that's being jumped to. This allows syncing to independently specify the musical form a particular recording follows, without invoking all the machinery of repeats, DC, DS, etc. And in truth many recordings (jazz in particular) freely mess around with form in a way that doesn't lend itself to anything but a literal description of what was played. (A sketch of such jump syncpoints follows after this list.)

  4. I agree on substance, but I think we could add this on by permitting multiple syncing schemes per recording in the future without complicating what we have now.
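
To make point 3 concrete, here is an invented sketch (all values hypothetical): the performance repeats measures 1 and 2 before continuing, so the syncpoint at time 4.50 marks the jump back to measure 1.

<score-audio-sync time="0.50" measure="1" position="0"/>
<score-audio-sync time="2.50" measure="2" position="0"/>
<score-audio-sync time="4.50" measure="1" position="0"/> <!-- jump: not the notated successor -->
<score-audio-sync time="6.50" measure="2" position="0"/>
<score-audio-sync time="8.50" measure="3" position="0"/>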

joeberkovitz commented 6 years ago

This should have been retagged as an MNX-Common issue following https://github.com/w3c/mnx/issues/67#issuecomment-385465230, my apologies for any confusion.

clnoel commented 6 years ago

@adrianholovaty , addressing Point 4: I think that should not be part of v1, but it should be relatively simple to add later by creating multiple sync lists and having them reference particular parts instead of the entire global.

More immediately (Point 2), I am inclined toward whole-note units, but I don't understand one of your objections. I've looked at it several times, but I can't figure out why a switch from 3/4 time to 6/8 time would require refiguring the whole-note values. Aren't the eighth notes in each measure of both of those time signatures at 0, .125, .25, .375, .5, and .625 either way? After all, we're not talking beat-units, but rather whole-note units.

About Point 3, and syncing non-linear scores: part of the problem is that in order to track properly, we need to know when we are jumping. This does NOT necessarily happen at measure boundaries when we are talking about matching to recordings. We definitely need mandatory sync points that define jumps, but we might need to think about making it a bit more complicated. My preference is a "jump-from" parameter set of some kind, like jump-from-measure="2" jump-from-position="1". This will let programs that are not just highlighting notes, but are instead using a cursor of some kind, smoothly slide the cursor from the last sync point before the jump up to the jump-from point, before jumping to the jump-to point. There are other ways to do this same thing, like allowing two sync points to have the same time to indicate a discontinuity, and I'm open to those kinds of options as well.
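
One possible reading of that "jump-from" suggestion, using the attribute names from the comment above (purely hypothetical, not part of any proposal):

<score-audio-sync time="4.50" measure="1" position="0"
                  jump-from-measure="2" jump-from-position="1"/>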

adrianholovaty commented 6 years ago

@clnoel Regarding Point 2 — yes, you're absolutely right! I disavow that reason. :-) But I still stand by the other reason (easier validation).

joeberkovitz commented 6 years ago

@adrianholovaty I asked a while back if you could explain why percentages don't make interpretation harder, to exactly the same extent that they make validation easier... can you respond? Thanks!

joeberkovitz commented 6 years ago

Having waited a fair bit for comments to emerge, I've created a pull request (see above) detailing a specific fix.

As usual, getting into the meat of creating a PR exposed some issues with the earlier proposal. I made the following changes:

cecilios commented 6 years ago

> Measure location syntax itself has been changed to use measure indices, since there's been an evident desire in the CG to make more use of indices both in this case and generally.

There is a problem using indices. A measure id is unique within the score, but a measure index is not, and therefore it is also necessary to specify the score-part. Normally, score-part will be irrelevant, as all parts will have the same number of measures, but this is not the case in multimetric scores. The measure location definition should be valid for any score type/profile, unless it has a different definition for each profile (IMO not desirable).

clnoel commented 6 years ago

@cecilios After having looked at the actual changes, this objection has been handled. All the existing places where measure location are used are within <global> or <part> and the redefinition to use indices explicitly states that the index is for the current Measure content region... that is, the current <global> or <part>.

The only place where it got a little dicey is the new use of measure location for the start and end points of the <score-audio-region>, since @joeberkovitz has pulled <score-audio> out of the globals and directly into <mnx-common>. Even this is handled by specifying a system parameter of <score-audio> which specifies which global is being used to define the measure indices. Does that answer your concern, @cecilios?

cecilios commented 6 years ago

@clnoel The current definition is in the Notational syntaxes section, that is, among the common definitions for any MNX-Common profile, and nothing in this definition restricts the use of measure location outside of <global> or <part>.

Although current uses of measure location do not present problems, my concerns are more abstract: I am thinking about the evolution of the current spec and about future uses of measure location not yet taken into account. Once a definition is accepted and widely used, there is a risk of using the concept in new places and forgetting to review its applicability to marginal cases, such as multimetric scores. So, to avoid future problems with this, I would suggest adding a reminder paragraph to the definition of measure location.

clnoel commented 6 years ago

@cecilios You're right that it is not explicitly stated; however, in:

https://w3c.github.io/mnx/specification/#measure-locations

Point two ends with: "The identified measure must belong to the same measure content as the element in which the measure location is given."

@joeberkovitz Perhaps we should make this more explicit?

joeberkovitz commented 6 years ago

Sounds like a clarification would definitely be helpful. Please watch the PR for the fix.

cecilios commented 6 years ago

Thank you. I think this is clearer now.