Closed btsimonh closed 8 months ago
That sounds feasible: does it need a specification change, or just an example?
I'm not familiar enough with TTML timing. Basically, for DAPT, I'd like to go with div as the block timed object. Then we'd need to define that begin in span is relative to the begin in div. If you propose something legal, I can run over it and we can discuss? One question would be whether we need begin AND end - if there is no end, then the span is 'presented' until the end of the div, which is not the intent - the intent is that the span is 'presented' (spoken) within a small time period. BUT... then a TTML textual presentation would be a bunch of flashing individual words!
yep. more thought required!. br, S
I'd like to go with div as the block timed object
Please could you explain more about why you'd like this?
It feels to me like each utterance (single person talking) is a separate thing - i.e. not necessarily presented in the same 'region'. div feels semantically like it's at that level. And also, I'd like for DAPT to be 'very well defined' - like IMSC-rosetta. It should be TTML compliant, but MUST be usable, unambiguous, and only done in one way, not be subject to XML anomalies. I will go so far as to say publicly that I'd fix namespace prefixes and other 'fluff' to ensure that it actually gets used, rather than just saying 'must be valid XML' - which helps no-one in the real world.
Let's not digress with XML issues. If you have comments on XML, please raise another issue.
Getting back to the initial issue:
In TTAL, this is supported through a separate char offset->time table in each event.
I assume you refer to the concept of "Segments" in TTAL. Segments are a way to provide inner timing for subparts of an event. We have seen it used in some dubbing workflows. It provides indication during voice recording of how long some words have to be spoken, to vary speed. When we proposed TTAL to become DAPT, we initially thought it was not a major use case for V1 and did not push for it but we are happy to reconsider.
@btsimonh can you clarify the use case and if you'd like this to be covered in v1 or if it can be deferred to v2?
Use case:
DAPT states its goal as a format for representing dubbing information.
If so, it must be able to represent 'Adaptation' information (the timing of utterances, probably down to Syllable resolution), otherwise it can't represent lipsync dubbing.
During a lipsync dubbing workflow, both the translation text and the Adaptation of that text are absolutely key.
The translation must fit the correct number and rough timing of the lip movements (IF the actor is on screen and close enough to see lips!).
The adaptation of the text then modifies the presented timing of the text such that a dubbing artist can be guided in the spoken timing (presented as wipe or rythmo band).
The adaptation is one of the more intensive tasks in dubbing, and as such time consuming and costly. To have value, DAPT must be able to represent the part of the job which carries value.
I don't think it's hard to include adaptation information in DAPT, but I do think it will result in very 'wordy' constructs. The approach both we and Netflix have taken is effectively to timestamp points in the text, i.e. 'this character should be presented at Xs', and interpolate between these points. I believe Netflix will have adopted this style on the advice of another Dubbing software vendor, and so the same style has been adopted at least twice independently. I believe this approach would be served by a construct like:
<p>
<span>The </span>
<span begin="0.3s">words add </span>
<span begin="0.9s">on</span>
</p>
But... as stated above, semantically it should possibly be:
<p>
<span begin="0s" end="0.3s">The </span>
<span begin="0.3s" end="0.9s">words add </span>
<span begin="0.9s" end="1.1s">on</span>
</p>
However, I prefer without end, for the selfish reason that it fits with extant formats, and so will be easier to implement/adopt.
In my opinion, DAPT should have this in V1.
p.s. one thing we should strive for is supporting increased workflow efficiency. If dubbing into multiple languages, it may make sense to 'Adapt' the original text, such that the adaptation can be reflected into the translated text more easily (i.e. adapt once in the original, then the adaptor's job becomes more guided and more efficient, so reducing the work for the specific language adaptation). As such, this indicates that any storage of the original text should be 'adaptable' (personally, for us, this would normally be a separate file, but we could adapt to it being in the same file, pun intended!).
I agree that a construct such as the ones @btsimonh suggested would work. Not sure which is better, but it's worth noting that they are semantically different.
If you were presenting the text visually, then in the first example you would see:
Time | Visible text |
---|---|
0s | The |
0.3s | The words add |
0.9s | The words add on |
Whereas in the second example you would have:
Time | Visible text |
---|---|
0s | The |
0.3s | words add |
0.9s | on |
1.1s |
However in an audio presentation the two should (probably?) be equivalent, and we should explicitly state that in a case like this the intent is not to repeat text that was already spoken.
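To make the difference between the two examples concrete, here is a minimal Python sketch. It assumes parallel ("par") timing, where a span with no begin starts at 0 and a span with no end stays active until its parent ends. The function name `visible_text`, the tuple layout and the 2.0s parent end are illustrative choices only, not part of TTML or DAPT.

```python
from typing import List, Optional, Tuple

# (text, begin, end) in seconds relative to the parent; None means "not specified".
Span = Tuple[str, Optional[float], Optional[float]]

def visible_text(spans: List[Span], t: float, parent_end: float) -> str:
    """Return the concatenated text of all spans active at time t."""
    out = []
    for text, begin, end in spans:
        b = 0.0 if begin is None else begin
        e = parent_end if end is None else end
        if b <= t < e:
            out.append(text)
    return "".join(out)

# First example: begin only. Second example: begin and end.
begin_only = [("The ", None, None), ("words add ", 0.3, None), ("on", 0.9, None)]
begin_and_end = [("The ", 0.0, 0.3), ("words add ", 0.3, 0.9), ("on", 0.9, 1.1)]

for t in (0.0, 0.3, 0.9, 1.1):
    print(t, repr(visible_text(begin_only, t, 2.0)),
          repr(visible_text(begin_and_end, t, 2.0)))
```

An audio renderer would presumably use the same span timings but speak each span once, when it becomes active, rather than re-rendering everything that is still active.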
just to be clear on the eventual visual representation in a real-world application of DAPT:
A possible 'render' of text during adaptation. Red markers represent places where 'begin' has been added. Effectively in the below there are three spans, two with begin (the last span contains multiple 'opportunities' for adaptation, but the adaptor has not taken those opportunities).
During presentation to the voice talent, this may be a rythmo band. In the below, the text is moving to the left past a green marker indicating current time.
Or it may be a wipe. In the below, the red progresses through the text, the speed of progression dependent upon the adapted timing.
More advanced presentation can be imagined, including prosody other than speed (e.g. volume, tone, both probably interpolated).
Note that in terms of TTS, many prosody features are not (yet) available with the degree of control needed for lip-sync dubbing. However, technology moves quickly. If such features were to become part of DAPT, my preference would be as span attributes to keep parsing simple.
We recently added an example in the introduction. We could also add a note in the Text section to indicate that the text can be structured with timing, for the purpose of adaptation. @btsimonh would that be sufficient, or do you think we should put constraints on how inner timing should be specified?
I am a believer in explicit specification to simplify implementation :). The above example uses the 'this span starts at x seconds relative to the div' approach, and I think this is both simple and implementable. (i.e. begins only. But please, only begin values in (float) seconds). br, Simon
Constraining the format of time expressions should be a separate issue.
Constraining to begin only is an interesting one. What if it's useful adaptation information to declare with precision not only when a particular word should begin being spoken, but also when a particular word should have ended being spoken? Would you accept end then, @btsimonh ?
For the two implementations I know of (ours and Netflix TTAL), a single time 'marker' is used at a text position. Yella do it (effectively) with begin in span, TTAL does it with a separate array of (charposn, time). The use of end is not hard, but is it required?...
examples:
<div begin="100s" end="102.8s">
<p><span begin="1.5" end="1.8">word</span><span begin="1.8" end="1.9"> </span><span begin="1.9" end="2.8">Note the space</span></p>
</div>
is equivalent to:
<div begin="100s" end="102.8s">
<p><span begin="1.5">word</span><span begin="1.8"> </span><span begin="1.9">Note the space</span></p>
</div>
is equivalent to (pseudo TTAL):
{
times:[ { t:1.5, posn:0 }, { t:1.8, posn:4 }, { t:1.9, posn:5 }, ],
text: "word Note the space",
}
Note in the above that the first (with end) presentation may or may not include <span begin="1.8" end="1.9"> </span>. Personally I would say it's required to preserve the space in the text for text extraction?
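The equivalence between the begin-only span form and the pseudo TTAL table can be sketched in Python as follows. This is only an illustration: `spans_to_time_table` is a hypothetical helper, and the `t`/`posn` keys simply mirror the pseudo TTAL above rather than the real TTAL schema.

```python
from typing import Dict, List, Tuple

def spans_to_time_table(spans: List[Tuple[float, str]]) -> Dict[str, object]:
    """Flatten (begin, text) spans into one string plus {t, posn} markers."""
    times = []
    text = ""
    for begin, chunk in spans:
        # The marker position is the character offset where this span starts.
        times.append({"t": begin, "posn": len(text)})
        text += chunk
    return {"times": times, "text": text}

table = spans_to_time_table([(1.5, "word"), (1.8, " "), (1.9, "Note the space")])
print(table)
```

Dropping the whitespace-only span would shift every later `posn` and lose the space from the extracted text, which is one argument for keeping it.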
When you think about it, the time points in dubbing are put there to provide timing points which should match the video - i.e. lip movements. You want to add as few as practical.
The other thing to consider is what happens during preparation/edit. Generally, the text could be presented over a time period (e.g. wiped over or rythmo band style). At first, no time points are added; you just have the text, and the text spreads to fill the time (div) (the way the text spreads over time is application dependent - think monospaced vs proportional font - this would affect word 'timing' greatly). Then a time point is added - for a specific lip movement at a syllable - and adjusted to be positioned where that lip movement starts. The rest of the text stretches or compresses to fit in with the 'modified' position of the syllable.
So, in the simplest case, one begin is added as a time point to hit, and is all that is required. If we encourage the use of end, then we may find documents which by default have begin and end for every word, which although 'correct' would be a right royal pain to edit.
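The stretch-and-compress behaviour can be sketched like this, assuming the simplest possible spreading rule: every character gets an equal share of time between neighbouring time points (roughly the monospaced case; as noted, a real application may spread text differently). `char_time` and its parameters are hypothetical names for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

def char_time(markers: List[Tuple[int, float]], text_len: int,
              duration: float, posn: int) -> float:
    """Linearly interpolate the time (relative to the div) of character posn.

    markers: (posn, time) points added by the adaptor.
    The start (0, 0.0) and end (text_len, duration) of the div act as
    implicit anchors unless a marker already sits at that position.
    """
    points = {0: 0.0, text_len: duration}
    points.update(markers)
    anchors = sorted(points.items())
    positions = [p for p, _ in anchors]
    i = bisect_right(positions, posn) - 1
    i = min(i, len(anchors) - 2)  # always keep a right-hand neighbour
    (p0, t0), (p1, t1) = anchors[i], anchors[i + 1]
    return t0 + (t1 - t0) * (posn - p0) / (p1 - p0)

text = "fred eats chicken sandwiches"
# No time points: "chicken" (character 10) lands wherever even spreading puts it.
print(char_time([], len(text), 2.8, 10))
# One time point pinning "chicken" to 1.8s: the text on both sides re-spreads.
print(char_time([(10, 1.8)], len(text), 2.8, 5))
print(char_time([(10, 1.8)], len(text), 2.8, 19))
```

With this model a single begin is enough to move one syllable, and the rest of the text follows without needing an end on every word.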
BUT, having written this up, I see a much larger issue regarding adaptation. Consider:
<div begin="100s" end="102.8s">
<p><span>fred </span><span style="sbold">eats </span><span begin="1.8">chicken </span><span style="sbold">sandwiches</span></p>
</div>
The first span will start with the div, so when does the 2nd span start? The third span starts at div+1.8s. When does the fourth start? I'm guessing that from TTML rules, all except chicken would be on screen at the start of the div, and this is not desired.
So I am not convinced that using 'normal' ttml constructs for this timing aspect is advisable.
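That guess can be checked with a small Python sketch that resolves each span's absolute begin under parallel timing, where a child's begin is an offset from its parent's begin and a missing begin means 0. The helpers `parse_seconds` and `span_begins` are hypothetical, and the parser only handles plain second offsets such as "1.8" or "100s", not the full TTML time expression syntax.

```python
import xml.etree.ElementTree as ET

def parse_seconds(value):
    """Parse offsets like "1.8" or "100s"; a missing attribute means 0."""
    return float(value.rstrip("s")) if value else 0.0

def span_begins(elem, parent_begin=0.0, out=None):
    """Collect (text, absolute begin) for every span that directly holds text."""
    out = [] if out is None else out
    begin = parent_begin + parse_seconds(elem.get("begin"))
    if elem.tag == "span" and elem.text:
        out.append((elem.text, begin))
    for child in elem:
        span_begins(child, begin, out)
    return out

doc = ET.fromstring(
    '<div begin="100s" end="102.8s"><p>'
    '<span>fred </span><span style="sbold">eats </span>'
    '<span begin="1.8">chicken </span><span style="sbold">sandwiches</span>'
    '</p></div>'
)
for text, begin in span_begins(doc):
    print(repr(text), begin)
```

Under these assumptions only "chicken " is delayed; "sandwiches" resolves to the div's begin, which is the undesired result described above.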
Must we:
<div begin="100s" end="102.8s">
<p><span><span>fred </span><span style="sbold">eats </span></span><span begin="1.8"><span>chicken </span><span style="sbold">sandwiches</span></span></p>
</div>
or should we introduce a NEW attribute to specifically state 'this is the time point expected to be hit by the voice actor' - which has nothing to do with TTML?
Or maybe better, just do it with a foreign element, e.g.
<div begin="100s" end="102.8s">
<p><span><dapt:tp offs="1.5s"/>word<dapt:tp offs="1.8s"/> <dapt:tp offs="1.9s"/>Note the space</span></p>
</div>
and if we did this, I'd like to see it specified for easy interpretation, e.g. 'must be in the root of P':
<div begin="100s" end="102.8s">
<p><dapt:tp offs="1.5s"/><span>word</span><dapt:tp offs="1.8s"/><span> </span><dapt:tp offs="1.9s"/><span>Note the space</span></p>
</div>
sorry for twisting it again :(.
Thanks for the deep thoughts @btsimonh . I'm not sure I explained my question about end attributes very well.
What I was trying to get to is something like this:
In other words, why is begin more important than end in this scenario? Particularly if there's a pause before the next word in the sentence?
Your analysis is right that the begin time of each span is relative to its parent element's begin time, and this is true regardless of the timing of the previous sibling, or the order of siblings.
That's true in DAPT now because we only allow timeContainer="par" (the default value). If sibling-relative timing is a useful feature, we could allow timeContainer="seq", which relates the first child's begin time to its parent's begin time, and makes each subsequent sibling's begin time relative to the previous sibling's end time.
So:
<div begin="100s" end="112s">
<p timeContainer="seq">
<span begin="5s" dur="2s">105s until 107s</span>
<span dur="2s"> 107s until 109s</span>
<span begin="1.5s"> 110.5s until 112s</span>
</p>
</div>
If that makes sense. This adds some implementation complexity, and I assumed nobody would want that until now.
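As a rough illustration of the extra work for an implementation, here is a Python sketch of resolving children of a timeContainer="seq" parent. It is a simplified model: `resolve_seq` is a hypothetical helper, only begin and dur offsets in seconds are handled, and a child with no dur is assumed to run until the parent ends.

```python
from typing import List, Optional, Tuple

def resolve_seq(parent_begin: float, parent_end: float,
                children: List[Tuple[Optional[float], Optional[float]]]
                ) -> List[Tuple[float, float]]:
    """Resolve absolute (begin, end) for children of a seq time container.

    children: (begin, dur) offsets in seconds; None means "not specified".
    """
    resolved = []
    cursor = parent_begin  # the first child is relative to the parent's begin
    for begin, dur in children:
        start = cursor + (begin or 0.0)
        end = parent_end if dur is None else min(start + dur, parent_end)
        resolved.append((start, end))
        cursor = end  # the next sibling is relative to this child's end
    return resolved

# The three spans from the example above.
print(resolve_seq(100.0, 112.0, [(5.0, 2.0), (None, 2.0), (1.5, None)]))
```

The main difference from the par case is the running cursor: editing one span's timing shifts every sibling after it.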
I'd be extremely reluctant to introduce new timing semantics - I think what TTML has is plenty.
Reviewed 2023-02-09 - Example 9 shows some adaptation info, but §4.4 Text needs to explain something about the children of <p> elements, specifically <span>s with times.
Per the previous comment, I suggest closing this issue.
@cconcolato I'm confused - the previous comment was mine, in which I said I think we need to say something about <span>s with times, not close the issue. And your previous comment seemed to say something similar.
Discussed during TTWG call 2023-06-08: agreed that we should do something about a somewhat obscure part of example 10 here. Suggest we add mention of <span> to §4.4 Text, which already has a Note about it, but be more explicit that:
- <p> element and its descendent <span>s
- <span>s can be used to add specific timing as well as styling and metadata
- times on <span>s are relative to the parent element's computed begin time
Questions:
- Are we prohibiting timing on <p> elements?
- Timing on <p> - how should a processor deal with a document that had such timing? I'd rather not prohibit it, but I'd also like to know if there's any use case for including it. One possibility is that translations have different durations and so need different start times in order to have the same centre point, timing-wise. In particular that could be relevant for AD where there isn't a lip-sync alignment point.
@cconcolato this is assigned to you - are you happy to make the edits, and does all this make sense to you?
The use of timing in Text objects is intended to be used to indicate the timing of the audio rendering of the relevant section of text
I would say "rendering" or "recording"
Are we prohibiting timing on <p> elements?
I would say yes. Given that a Text object is 1 <p> and that Text objects in an Event can only differ by language, I don't see a reason to have timing on <p>. I would be ok saying "SHOULD NOT".
Does the Data Model diagram need any adjustments to reflect this?
The diagram is about model entities. If we start adding spans to the diagram, then we should have a model entity for it. I don't think that's necessary.
@cconcolato this is assigned to you - are you happy to make the edits, and does all this make sense to you?
I think it was assigned to me when the example needed to be added. Feel free to reassign it to you.
Step PS2d generally modifies the word or syllable timing of the text. In TTAL, this is supported through a separate char offset->time table in each event. In .srtdub, it is supported by interspersing <begin start=(time offset)/> inside the text.
In TTML, it could be represented by <span begin=(time offset)>text