Preserving timed text (and pagination issue?)

pietrop commented 3 years ago

Working on this PR https://github.com/pietrop/slate-transcript-editor/pull/30 I ran into an issue with figuring out the right logic to paginate the transcript.

The issue

TL;DR: When the user corrects the text, they might delete, substitute or insert words. These operations tend to lose the time-codes originally associated with each word. The alignment module currently in use loses performance for transcripts over 1 hour. So we are considering pagination as a ~quick~ fix.

If you truly want the TL;DR version skip to the Pagination heading. Otherwise read on for more context.

### Context

Some ~~quick~~ background for those new to the project.

`slate-transcript-editor` builds on top of the lessons learned from developing [@bbc/react-transcript-editor](https://github.com/bbc/react-transcript-editor) (based on [draftJs](https://draftjs.org/)). As the name suggests, `slate-transcript-editor` is built on top of [slateJs](https://slatejs.org), augmenting it with transcript editing domain specific functionalities. For more on "draftjs vs slatejs" for this use case, see [these notes](https://github.com/pietrop/slate-transcript-editor/blob/master/docs/notes/draftjs-vs-slatejs.md).

It is a react transcript editor component that allows users to correct automated transcriptions of audio or video generated by speech to text services. It is used in use cases such as [autoEdit](https://www.autoedit.io), an app to edit audio/video interviews, as well as other situations where users might need to correct transcriptions.

The ambition is to have a component that takes in timed text (eg a list of words with start times), allows the user to correct the text (providing some convenience features, such as pause while typing, and keeping some kind of correspondence between the text and audio/video) and on save returns timed text in the same json format (referred to, for convenience, as dpe format, after the digital paper edit project where it was first formalized).

```js
{
  "words": [
    {
      "end": 0.46, // in seconds
      "start": 0,
      "text": "Hello"
    },
    {
      "end": 1.02,
      "start": 0.46,
      "text": "World"
    },
    ...
  ],
  "paragraphs": [
    {
      "speaker": "SPEAKER_A",
      "start": 0,
      "end": 3
    },
    {
      "speaker": "SPEAKER_B",
      "start": 3,
      "end": 19.2
    },
    ...
  ]
}
```

As part of `slate-transcript-editor` this dpe format is then converted into the [slateJs](https://www.slatejs.org/) data model.

[see the storybook demo for the `slate-transcript-editor` react component in practice](https://pietropassarelli.com/slate-transcript-editor)
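To make the dpe to slateJs conversion a bit more concrete, here is a minimal sketch. It is not the actual `convertDpeToSlate` implementation: the function name `dpeToSlateSketch` is made up for illustration, and the rule "a word belongs to the paragraph whose time range contains its start time" is an assumption.

```javascript
// Hypothetical helper (not the real convertDpeToSlate): groups dpe words under
// their paragraph and returns Slate-style "timedText" blocks.
function dpeToSlateSketch(transcript) {
  return transcript.paragraphs.map((paragraph) => {
    // Assumption: a word belongs to a paragraph if it starts inside
    // [paragraph.start, paragraph.end).
    const words = transcript.words.filter(
      (word) => word.start >= paragraph.start && word.start < paragraph.end
    );
    return {
      type: 'timedText',
      speaker: paragraph.speaker,
      start: paragraph.start,
      // The text is rebuilt by joining the words with single spaces.
      children: [{ text: words.map((word) => word.text).join(' '), words }],
    };
  });
}

const exampleTranscript = {
  words: [
    { start: 0, end: 0.46, text: 'Hello' },
    { start: 0.46, end: 1.02, text: 'World' },
    { start: 3.1, end: 3.6, text: 'Hi' },
  ],
  paragraphs: [
    { speaker: 'SPEAKER_A', start: 0, end: 3 },
    { speaker: 'SPEAKER_B', start: 3, end: 19.2 },
  ],
};

const exampleBlocks = dpeToSlateSketch(exampleTranscript);
console.log(exampleBlocks[0].children[0].text); // "Hello World"
```

The filter inside the map scans every word for every paragraph, which is fine for a sketch but would probably want a single pass over the words for long transcripts.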

Over time in this domain folks have tried a variety of approaches to solve this problem.

#### compute the timings

Listening to char insertion, deletion and detecting word boundaries, you could estimate the time-codes. This is a very fiddly approach, as there are a lot of edge cases to handle. Eg what if a user deletes a whole paragraph? And over time the accuracy of the time-codes slowly fades (if there's a lot of correction done to the text, eg if the STT is not very accurate).

#### alignment - server side - Aeneas

Some folks have had some success running server side alignment. For example in [pietrop/fact2_transcription_editor](https://github.com/pietrop/fact2_transcription_editor) the editor was one giant content editable div, and on save it would send a plain text version to the server (literally using `.innerText`). @frisch1 would then align it server side against the original media using the [aeneas aligner](https://github.com/readbeyond/aeneas) by @pettarin.

Aeneas converts the text into speech (TTS) and then compares that waveform against the original media to very quickly produce the alignment, restoring time-codes either at word or line level depending on your preferences. Aeneas uses dynamic time warping over Mel-frequency cepstral coefficients (MFCCs) (🤯). You can read more about how Aeneas works in the [How Does This Thing Work?](https://github.com/readbeyond/aeneas/blob/4d200a050690903b30b3d885b44714fecb23f18a/wiki/HOWITWORKS.md) section of their docs.

This approach for [fact2_transcription_editor](https://github.com/pietrop/fact2_transcription_editor) was somewhat successful, and Aeneas is very fast. However

- the alignment is only done on save to the database.
- If a user continues to edit the page, over time more and more of the time-codes will disappear until they refresh the page and the "last saved and aligned" transcript gets fetched from the db.
- And to set this up as "a reusable component" you'd always have to pair it with a server side module to do the alignment.
- Aeneas is great but in its current form does not exist as an npm module (as far as I am aware?). It's written in python and has some system dependencies such as ffmpeg, a TTS engine etc.

side note on word level time-codes and clickable words

I should mention that in [fact2_transcription_editor](https://github.com/pietrop/fact2_transcription_editor) you could click on individual words to jump to the corresponding point in the media. With something equivalent to

```html
Hello ...
```

A pattern I had first come across in [hyperaud.io's blog description of "hypertranscripts"](https://hyperaud.io/blog/hypertranscripts/) by @maboa & @gridinoc

#### STT based alignment - Gentle

Some folks have also used [Gentle](https://github.com/lowerquality/gentle), by @maxhawkins, a forced aligner based on Kaldi, as a way to get alignment info. I've personally [used it for autoEdit2](https://autoedit.gitbook.io/user-manual/setup-stt-apis/setup-stt-apis-gentle) as an open source offline option for users to get transcriptions. But I haven't used it for alignment, as STT based alignment is slower than the TTS one.

#### alignment - client side - option 1 (stt-align)

Another option is to run the alignment client side, by doing a diff between the human corrected (accurate) text and the timed text from the STT engine, and transposing the time-codes from the second to the first.

some more background and info on this solution

This solution was first introduced by @chrisbaume in [bbc/dialogger](https://github.com/bbc/dialogger) ([presented at textAV 2017](https://textav.gitbook.io/textav-event/projects/bbc-dialogger)). It modified [CKEditor](https://ckeditor.com) (at the time draftJS was not around yet) and ran the alignment server side in a custom python module, [sttalign.py](https://github.com/pietrop/stt-align-node/blob/master/docs/python-version/sttalign.py).

With @chrisbaume's help I converted the python code into a node module, [stt-align-node](https://github.com/pietrop/stt-align-node), which is used in [@bbc/react-transcript-editor](https://github.com/bbc/react-transcript-editor) and [slate-transcript-editor](https://github.com/pietrop/slate-transcript-editor).

One issue in converting from python to [the node version](https://github.com/pietrop/stt-align-node/blob/master/src/align/index.js) is that for diffing python uses [difflib](https://github.com/pietrop/stt-align-node/blob/master/docs/python-version/sttalign.py#L31), which is [part of the core library](https://docs.python.org/3/library/difflib.html), while in the node module [we use](https://github.com/pietrop/stt-align-node/blob/master/src/index.js#L27) [difflib.js](https://github.com/qiao/difflib.js), which might not be as performant (❓ 🤷‍♂️ ).

When a word is inserted (eg it was not recognized by the STT service and the user adds it manually), in this type of alignment there are no time-codes for it. Via interpolation of the time-codes of neighboring words, we add some time-codes back.

In the python version the time-codes interpolation is done via [numpy](https://numpy.org) to [linearly interpolate the missing times](https://github.com/pietrop/stt-align-node/blob/master/docs/python-version/sttalign.py#L3-L16). In the [node version the interpolation](https://github.com/pietrop/stt-align-node/blob/master/src/align/index.js#L61-L95) is done via the [everpolate](http://borischumichev.github.io/everpolate/#linear) module and again it might not be as performant as the python version (❓ 🤷‍♂️ ).
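To illustrate the interpolation step, here is a dependency-free sketch of linearly interpolating missing start times by word position. It is not the `stt-align-node` code (which uses everpolate): the function name `interpolateStarts` is hypothetical, and for brevity it only fills in `start`, copying the nearest known value when a gap sits at the very beginning or end.

```javascript
// Hypothetical helper: fill in missing `start` times by linear interpolation
// between the nearest words (by position) that still have a time-code.
function interpolateStarts(words) {
  const result = words.map((word) => ({ ...word }));
  // Indexes of the words that already carry a numeric start time.
  const timedIndexes = [];
  words.forEach((word, index) => {
    if (typeof word.start === 'number') timedIndexes.push(index);
  });
  if (timedIndexes.length === 0) return result; // nothing to interpolate from

  for (let i = 0; i < result.length; i++) {
    if (typeof words[i].start === 'number') continue;
    let prev = -1;
    let next = -1;
    for (const timedIndex of timedIndexes) {
      if (timedIndex < i) prev = timedIndex;
      if (timedIndex > i) {
        next = timedIndex;
        break;
      }
    }
    if (prev === -1) {
      // Gap at the very start: reuse the first known time.
      result[i].start = words[next].start;
    } else if (next === -1) {
      // Gap at the very end: reuse the last known time.
      result[i].start = words[prev].start;
    } else {
      const ratio = (i - prev) / (next - prev);
      result[i].start = words[prev].start + ratio * (words[next].start - words[prev].start);
    }
  }
  return result;
}

const interpolated = interpolateStarts([
  { text: 'So', start: 1 },
  { text: 'tell' }, // inserted by the user, no time-code
  { text: 'me', start: 2 },
]);
console.log(interpolated.map((word) => word.start)); // [ 1, 1.5, 2 ]
```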

However in [@bbc/react-transcript-editor](https://github.com/bbc/react-transcript-editor) and [slate-transcript-editor](https://github.com/pietrop/slate-transcript-editor) initially every time the user stopped typing for longer than a few seconds, we'd trigger a save, which was preceded by an alignment. This became very un-performant, especially for long transcriptions (eg approximately over 1 hour), because whether you change a paragraph or just one word, it would run the alignment across the whole text. Which turned out to be a pretty expensive operation. This led to removing user facing word level time-codes in the slateJs version to improve performance on long transcriptions, and to removing auto save. However, on long transcriptions, even with manual save, sometimes the `stt-align-node` module can temporarily freeze the UI for a few seconds 😬 or in the worst case scenario sometimes even crash the page 😓 ☠️

more on retaining speaker labels after alignment

There is also a workaround for handling retaining speaker labels at paragraph level when using this module to run the alignment. The module itself only aligns the words. To re-introduce the speakers, you just compare the aligned words with the paragraphs with speaker info. [Example of converting into slateJs format](https://github.com/pietrop/slate-transcript-editor/blob/master/src/util/update-timestamps/index.js#L15-L47) or into [dpe format from slateJs](https://github.com/pietrop/slate-transcript-editor/blob/pagination/src/util/export-adapters/slate-to-dpe/index.js#L14-L40)
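Going the other way (from slateJs blocks that carry a speaker and a list of words back to dpe) could look roughly like this. This is a sketch and not the linked `slate-to-dpe` code: `slateToDpeSketch` is a made-up name, and it assumes each block keeps its words in `children[0].words` as in the slateJs examples in this thread.

```javascript
// Hypothetical helper (not the real slate-to-dpe adapter): flattens Slate-style
// blocks back into dpe `words` and `paragraphs`, keeping the speaker labels.
function slateToDpeSketch(blocks) {
  const words = [];
  const paragraphs = [];
  blocks.forEach((block) => {
    const blockWords = block.children[0].words;
    if (blockWords.length === 0) return; // skip blocks with no timed words
    words.push(...blockWords);
    paragraphs.push({
      speaker: block.speaker,
      // Paragraph boundaries are derived from the first and last word.
      start: blockWords[0].start,
      end: blockWords[blockWords.length - 1].end,
    });
  });
  return { words, paragraphs };
}

const exportedDpe = slateToDpeSketch([
  {
    type: 'timedText',
    speaker: 'SPEAKER_A',
    children: [
      {
        text: 'Hello World',
        words: [
          { start: 0, end: 0.46, text: 'Hello' },
          { start: 0.46, end: 1.02, text: 'World' },
        ],
      },
    ],
  },
]);
console.log(exportedDpe.paragraphs); // [ { speaker: 'SPEAKER_A', start: 0, end: 1.02 } ]
```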

Which is why in PR https://github.com/pietrop/slate-transcript-editor/pull/30 we are considering pagination. But before a closer look into that, let's consider one more option.

#### alignment - client side - option 2 (web-aligner)

Another option explored by @chrisbaume at textAV 2017 was to make a [webaligner](https://github.com/chrisbaume/webaligner) ([example here](http://pietropassarelli.com/webaligner-example/index.html) [and code of the example here](https://github.com/chrisbaume/webaligner-example)) to create a ~simple~ lightweight client-side forced aligner for timed text, leveraging the browser audio API ([AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext)) and doing computation similar to Aeneas (? not sure about this last sentence). This option is promising, but was never fully fleshed out to a usable state. It might also only work when aligning small sentences due to browser limitations(?).

#### 5. Overtyper

Before considering pagination, a completely different approach to the UX problem of correcting text is [overtyper](https://github.com/alexnorton/overtyper) by @alexnorton & @maboa from textAV 2017. Here you follow along a range of words being highlighted as the media plays. To correct, you start typing from the last correct word you heard until the next correct one, so that the system can adjust/replace/insert all the ones in between. This makes the alignment problem a lot more narrow, and new word timings can be more easily computed. This is promising, but unfortunately as far as I know there hasn't been a lot of user testing to validate this approach.

Pagination

For slate-transcript-editor we've been using (option 3) client side alignment with stt-align-node to restore time-codes on user's save.

However, because of the performance issue on large transcriptions, we've been considering pagination - PR https://github.com/pietrop/slate-transcript-editor/pull/30 - but ran into a few issues.

For now we can assume the transcription comes as one payload from the server. And I've been splitting it into one hour chunks.

The idea is that the slateJs editor can be responsible for the text editing part, while alignment, save and export in various formats can be done in the parent component to provide a cohesive interface that, for example, merges all the pages into one doc before exporting but only updates the current chunk when saving.
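As a rough sketch of that idea (not the code in the PR), paragraphs could be bucketed into pages by their start time, with the parent component swapping in the edited page on save and flattening all pages on export. The names `splitIntoPages`, `replacePage` and `mergePages` are hypothetical, and blocks are assumed to carry a `start` in seconds.

```javascript
const ONE_HOUR_IN_SECONDS = 3600;

// Hypothetical helper: bucket paragraph blocks into pages by their start time.
function splitIntoPages(blocks, pageLengthSeconds = ONE_HOUR_IN_SECONDS) {
  const pages = [];
  blocks.forEach((block) => {
    const pageIndex = Math.floor(block.start / pageLengthSeconds);
    if (!pages[pageIndex]) pages[pageIndex] = [];
    pages[pageIndex].push(block);
  });
  // Drop the holes left by hours that contain no paragraphs.
  return pages.filter(Boolean);
}

// On save: swap in the edited page without touching the others.
function replacePage(pages, pageIndex, editedPage) {
  return pages.map((page, index) => (index === pageIndex ? editedPage : page));
}

// On export: merge all the pages back into one document.
function mergePages(pages) {
  return pages.flat();
}

const examplePages = splitIntoPages([
  { start: 10 },
  { start: 3599 },
  { start: 3600 },
  { start: 8000 },
]);
console.log(examplePages.map((page) => page.length)); // [ 2, 1, 1 ]
console.log(mergePages(examplePages).length); // 4
```

Keeping `replacePage` non-mutating means the parent can hold the pages in React state and only the edited page gets a new reference, though whether that avoids the performance issues mentioned in the questions is untested.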

questions

Should these chunks be stored in the state of the parent component or is there a performance issue in doing that in react?
Should you loop through the chunks in the render method and only display the current one? Is this a good pattern to use or is there a better one?
Should the state of the slateJS editor be held in the parent component? (this seemed to cause a performance issue)
on change of the slateJS editor, do we just update the current chunk or also the array of chunks? (this seemed to cause a performance issue)

I am going to continue to try a few other things here, but any thoughts, ideas 💡 or examples of react best practice for paginating text editors are much appreciated.

Quick disclaimer: Last but not least this is my best effort to collect info on this topic in order to frame the problem and hopefully get closer to a solution, if some of these are not as accurate as they should be, feel free to let me know in the comments.

pietrop commented 3 years ago

also relevant via @xshy216 https://github.com/pietrop/slate-transcript-editor/issues/10#issuecomment-722846904

pietrop commented 3 years ago

An update on the latest thinking, and a chance to recap some of the current progress.

Deferring pagination exploration

After talking to @rememberlenny I decided to defer trying out pagination in favor of an approach that tries to single out paragraphs that have changed and align only those.

Options for aligning only the paragraphs that changed

There's two ways in which you could do that,

one is to use slateJs api for onKeyDown and/or onChange and keep some sort of list that keeps track where the changes in the doc have been made, based on user cursor and selection. For now this seems laborious.
The other is to compare the paragraphs and single out those that have changed, and only run the alignment for those (using pietrop/stt-align-node) .

Word level timings and clickable words

Slightly unrelated, but relevant, similar to the DraftJs approach of using entities in @bbc/react-transcript-editor, but somehow way more performant, we can bring back clickable words, by adding them as an attribute to the text child node, along side the text attribute.

example

```js
[
  {
    "type": "timedText",
    "speaker": "James Jacoby",
    "start": 1.41,
    "previousTimings": "0",
    "startTimecode": "00:00:01",
    "children": [
      {
        "text": "So tell me, let’s start at the beginning.",
        "words": [
          { "end": 1.63, "start": 1.41, "text": "So" },
          { "end": 2.175, "start": 1.63, "text": "tell" },
          { "end": 2.72, "start": 2.175, "text": "me," },
          { "end": 2.9, "start": 2.72, "text": "let’s" },
          { "end": 3.14, "start": 2.9, "text": "start" },
          { "end": 3.21, "start": 3.14, "text": "at" },
          { "start": 3.21, "end": 3.28, "text": "the" },
          { "end": 4.88, "start": 4.346666666666666, "text": "beginning." }
        ]
      }
    ]
  },
  ...
```

We can add onDoubleClick to the renderLeaf component.

 onDoubleClick={handleTimedTextClick}

And use a getSelectionNodes helper function that uses the slateJS selection/cursor position to return the timecode of the current word. Assuming the text has not been edited, comparing the selection offset against the running char count of the text in the list of word objects gives you the start time of the word being clicked on (if that makes sense?).
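A minimal sketch of that offset to word lookup, assuming the block text is exactly the words joined by single spaces and has not been edited since the last alignment. `wordAtOffset` is a hypothetical name, not the actual getSelectionNodes helper.

```javascript
// Hypothetical helper: given the list of word objects for a paragraph and the
// cursor's character offset inside that paragraph, return the word under the cursor.
// Assumes words are separated by exactly one space and the text is unedited.
function wordAtOffset(words, offset) {
  let position = 0; // char index where the current word starts
  for (const word of words) {
    const wordEnd = position + word.text.length;
    if (offset <= wordEnd) return word;
    position = wordEnd + 1; // +1 for the space after the word
  }
  // Offset past the end of the text: fall back to the last word.
  return words[words.length - 1];
}

const clickableWords = [
  { end: 1.63, start: 1.41, text: 'So' },
  { end: 2.175, start: 1.63, text: 'tell' },
  { end: 2.72, start: 2.175, text: 'me,' },
];
// "So tell me," with the cursor at offset 4, inside "tell"
console.log(wordAtOffset(clickableWords, 4).start); // 1.63
```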

Paragraph changes

Option 2 assumes that paragraphs are not changing, eg splitting or merging a paragraph. OR that this is being handled separately from the alignment process.

For now I've disabled splitting and merging paragraphs via the Enter and Backspace keys (eg if Backspace is pressed at the beginning of a paragraph). However you can still delete multiple words within one paragraph.

example

```js
// TODO: revisit logic for
// - splitting paragraph via enter key
// - merging paragraph via delete
// - merging paragraphs via deleting across paragraphs
const handleOnKeyDown = (event) => {
  console.log('event.key', event.key);
  if (event.key === 'Enter') {
    // intercept Enter
    event.preventDefault();
    console.log('For now disabling enter key to split a paragraph, while figuring out the alignment issue');
    return;
  }
  if (event.key === 'Backspace') {
    const selection = editor.selection;
    console.log('selection', selection);
    // editor.selection is null when the editor has no selection
    if (!selection) {
      return;
    }
    console.log(selection.anchor.path[0], selection.focus.path[0]);
    // across paragraph
    if (selection.anchor.path[0] !== selection.focus.path[0]) {
      console.log('For now cannot merge paragraph via delete across paragraphs, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
    // beginning of a paragraph
    if (selection.anchor.offset === 0 && selection.focus.offset === 0) {
      console.log('For now cannot merge paragraph via delete, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
  }
};
```

option 2. identify paragraphs that have changed

One idea from @rememberlenny is that if you don't run the alignment on every keystroke or when the user stops typing (which are both possible optimizations to consider - via @gridinoc) then you need to find which paragraphs have changed, and only align those.

I found that lodash differenceWith is pretty snappy. And you can specify a comparator function. Which allows you to for example only compare the text attribute of the child node, as opposed to the whole paragraph block.
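The `comparator` itself is not shown in this thread, so here is a guess at what one could look like, comparing only the text of the first child while ignoring whitespace differences. A function with this shape can be passed as the third argument to `_.differenceWith(current, original, comparator)`. Note that `differenceWith` checks each current paragraph against every original paragraph rather than the one at the same index, so a paragraph edited to read exactly like another one would not be flagged; the index-based `changedParagraphIndexes` below is a hypothetical dependency-free alternative.

```javascript
// Collapse runs of whitespace so that spacing-only edits are not treated as changes.
const normaliseText = (text) => text.replace(/\s+/g, ' ').trim();

// Assumed comparator: two paragraphs are "the same" if their text matches.
const textComparator = (paragraphA, paragraphB) =>
  normaliseText(paragraphA.children[0].text) === normaliseText(paragraphB.children[0].text);

// Hypothetical index-based alternative to _.differenceWith: returns the indexes
// of the paragraphs whose text differs from the original at the same position.
function changedParagraphIndexes(current, original) {
  return current.reduce((indexes, paragraph, index) => {
    const before = original[index];
    if (!before || !textComparator(paragraph, before)) indexes.push(index);
    return indexes;
  }, []);
}

const originalParagraphs = [
  { children: [{ text: 'So tell me' }] },
  { children: [{ text: 'start at the beginning' }] },
];
const editedParagraphs = [
  { children: [{ text: 'So  tell me ' }] }, // whitespace only, not a change
  { children: [{ text: 'start from the beginning' }] },
];
console.log(changedParagraphIndexes(editedParagraphs, originalParagraphs)); // [ 1 ]
```

This only holds while paragraphs cannot be split or merged, since it relies on indexes lining up between the two lists.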

example

```js
/**
 * Update timestamps using stt-align module
 * @param {*} currentContent - slate js value
 * @param {*} dpeTranscript - original transcript in dpe format (stt words and paragraphs)
 * @return slateJS value
 */
// TODO: do optimization mentions in TODOS below and try out on 5 hours long to see if UI Still freezes.
// TODO: in stt-align-node if all the words are completely diff, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed

  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);

  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph, to be able to update the words attribute
  // in the paragraph and easily replace that paragraph in the slate editor content.
  // Obv this wouldn't work, if re-enable the edge cases, disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });

  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
  // This gives you a list of paragraphs that have changed, and because we added indexes via ids,
  // we can easily and quickly identify them and run alignment on individual paragraphs.
  // (the function continues in the next snippet)
```

option 2. align individual paragraphs that have changed

Once you have the individual paragraphs that need aligning you can run alignSTT on each and replace them in the slateJs editor current content value list of paragraphs.

example

```js
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);

  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });

  return newCurrentContent;
};
```

full example

```js
// Assumes `_` (lodash), `alignSTT` (stt-align-node), `convertDpeToSlate`, `shortTimecode`
// and `comparator` are imported or defined elsewhere in the module.

// TODO: do optimization mentions in TODOS below and try out on 5 hours long to see if UI Still freezes.
// TODO: in stt-align-node if all the words are completely diff, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed

  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);

  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph, to be able to update the words attribute
  // in the paragraph and easily replace that paragraph in the slate editor content.
  // Obv this wouldn't work, if re-enable the edge cases, disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });

  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);

  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });

  return newCurrentContent;
};
```

up next.

See latest commit of the PR https://github.com/pietrop/slate-transcript-editor/pull/36 for more details on this.

[x] Handle split paragraph via Enter . Eg split associated list of words objects in the two new paragraphs
[x] Handle merge paragraphs via Backspace. Eg merge the lists of words from the two old paragraphs
[x] handle regular delete within a paragraph

Refactor/clean up

[ ] see if can remove the need for cloneDeep
[ ] see if can remove convertDpeToSlate for comparison. Eg save in state slateJs pre last changed(?)
[ ] if optimizing to run it on char change or on stop typing. could pass current paragraph, and skip the differenceWith computation step. (Altho would need to figure out how to handle if corrects one paragraph, then go to the next one quickly eg without triggering an alignment in between)

Also

[x] ~consider what happens if hit Enter with selection that spans across multiple paragraphs. Do you need to remove those stt words list from the paragraph block or should keep this disabled for now ?~ for now intercepted and disabled it instead
[x] consider what happens if hit Backspace with selection that spans across multiple paragraphs. Do you need to remove those stt words list from the paragraph block or should keep this disabled for now ?

And

[ ] figure out if, instead of the ♻️ alignment btn, we should add/bring back some logic (similar to @gridinoc's suggestion) to run align programmatically, eg on every keystroke or when the user stops typing. This would need to be debounced, and could make use of requestIdleCallback to make it more efficient.
[ ] add an option to insert new text to replace and re-align current one, since multi paragraph delete is now disabled.

pietrop commented 3 years ago

Some thoughts after recent refactor https://github.com/pietrop/slate-transcript-editor/pull/36

[x] Handle split paragraph via Enter . Eg split associated list of words objects in the two new paragraphs
[x] Handle merge paragraphs via Backspace. Eg merge the lists of words from the two old paragraphs
[x] handle regular delete within a paragraph
[x] ~consider what happens if hit Enter with selection that spans across multiple paragraphs. Do you need to remove those stt words list from the paragraph block or should keep this disabled for now ?~ for now intercepted and disabled it instead
[x] ~consider what happens if hit Backspace with selection that spans across multiple paragraphs. Do you need to remove those stt words list from the paragraph block or should keep this disabled for now ?~ for now intercepted and disabled it instead

on 💡 ~You are not allowed to completely delete a paragraph?~ as it could make things easier for alignment, as a paragraph will always have timed words associated with it.

[x] But 💡 As you are not creating new empty paragraphs (enter only works within a paragraph to split) .  And since delete now also merges and preserves timecode.  Then when we run alignment  could we just compare the timecode in the words attribute with the text of the block? And align those if the text is different from the text in the words? 
- eg if the word count is the same but the text is different, only replace the text of the words and keep the time-codes etc..
- If word count diff, then runs sttAlignNode? etc...?

This would mean that you are running the STT align against the most recent re-alignment, as opposed to the original STT data. But it would give flexibility to handle changing paragraphs, as well as skip alignment of paragraphs that might not need it.
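A sketch of that per paragraph decision, under the assumption that words are split on whitespace. `reconcileParagraph` is a hypothetical name, and the `align` argument stands in for a call into `stt-align-node` (it is injected here so the sketch stays self-contained; the call shape mirrors the `alignSTT(children[0], text)` usage earlier in this thread).

```javascript
// Hypothetical helper: decide per paragraph whether to skip, patch the word text
// in place, or fall back to a full alignment of that paragraph.
function reconcileParagraph(block, align) {
  const child = block.children[0];
  const textTokens = child.text.trim().split(/\s+/).filter(Boolean);
  const unchanged =
    textTokens.length === child.words.length &&
    textTokens.every((token, index) => token === child.words[index].text);

  // Text matches the words list: nothing to do for this paragraph.
  if (unchanged) return block;

  let words;
  if (textTokens.length === child.words.length) {
    // Same word count: keep the time-codes and only replace the text.
    words = child.words.map((word, index) => ({ ...word, text: textTokens[index] }));
  } else {
    // Different word count: words were inserted or deleted, so re-align.
    words = align(child, child.text);
  }
  return { ...block, children: [{ ...child, words }] };
}

const substitutedBlock = {
  speaker: 'James Jacoby',
  children: [
    {
      text: 'So told me,',
      words: [
        { end: 1.63, start: 1.41, text: 'So' },
        { end: 2.175, start: 1.63, text: 'tell' },
        { end: 2.72, start: 2.175, text: 'me,' },
      ],
    },
  ],
};
// The aligner is never called here because the word count did not change.
const reconciled = reconcileParagraph(substitutedBlock, () => {
  throw new Error('align should not be needed');
});
console.log(reconciled.children[0].words[1]); // { end: 2.175, start: 1.63, text: 'told' }
```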

Still unsure of frequency of the alignment, def on save, but not sure if it should happen on pause typing, maybe not for now. Need to check performance against longer file (1 to 5 hours example)

pietrop commented 3 years ago

Updated storybook demo https://pietropassarelli.com/slate-transcript-editor/ to reflect this PR https://github.com/pietrop/slate-transcript-editor/pull/36

to recap

[x] double clicking on a word takes you to that point in the media (as opposed to before where it was paragraph level only)
[x] still no word level highlight by design, to keep it performant, but open to add it if there's some good 💡
[x] handles split of a paragraph (and split corresponding words list associated with paragraph using cursor char offset)
[x] handle delete at beginning of a paragraph to merge two paragraphs (+recombine words list into new paragraph and move cursor /selection)
[x] disable split of a paragraph while selecting text (for now?)
[x] disable delete text across paragraphs
[x] handles delete text within a paragraph
[x] alignment btn / restore timecodes by comparing slateJs words list and text in blocks/paragraphs for changes in text, while ignoring white spaces. This way only align with stt-align-node the paragraphs that have changed
[x] there's a flag to check if the text has been modified; if it has not, alignment is skipped when saving or exporting from the editor, as an optimization.
[x] save btn save runs alignment
[x] export btn runs alignment
[x] refactored to use material UI for ease of theming and portability.
[x] as a side effect, localized at paragraph level alignment means you can run alignment also in live use case with interim results populating the editor - see http://localhost:6006/?path=/story/live--editable
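For the split and merge items in the recap above, the words list handling could be sketched like this. These are not the functions from the PR: `splitWordsAtOffset` and `mergeWords` are hypothetical names, words are assumed to be separated by single spaces, and a word is kept whole, moving to the second paragraph if the cursor is in the middle of it.

```javascript
// Hypothetical helper: split a paragraph's words list at the cursor's char offset.
// A word stays in the first paragraph only if it ends at or before the cursor.
function splitWordsAtOffset(words, offset) {
  let position = 0; // char index where the current word starts
  let splitIndex = words.length;
  for (let i = 0; i < words.length; i++) {
    const wordEnd = position + words[i].text.length;
    if (wordEnd > offset) {
      splitIndex = i;
      break;
    }
    position = wordEnd + 1; // +1 for the space after the word
  }
  return [words.slice(0, splitIndex), words.slice(splitIndex)];
}

// Hypothetical helper: merging two paragraphs just concatenates their words lists.
function mergeWords(firstWords, secondWords) {
  return [...firstWords, ...secondWords];
}

const paragraphWords = [
  { end: 1.63, start: 1.41, text: 'So' },
  { end: 2.175, start: 1.63, text: 'tell' },
  { end: 2.72, start: 2.175, text: 'me,' },
];
// "So tell me," with Enter pressed at offset 3, just before "tell"
const [firstHalf, secondHalf] = splitWordsAtOffset(paragraphWords, 3);
console.log(firstHalf.length, secondHalf.length); // 1 2
console.log(secondHalf[0].start); // 1.63, a candidate start time for the new paragraph
console.log(mergeWords(firstHalf, secondHalf).length); // 3
```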

Some things I am not sure about

[ ] I think it would be neat to use a timer, or some kind of debounce, to bring back auto save. But I am not sure if I am doing it right. I have one in place for the optional "pause while typing", but it seems like introducing a timer that way in on key down might introduce performance issues? 🤷‍♂️ any thoughts or 💡 ❓
[ ] auto save could also run auto alignment with the same logic when the user stops typing if there have been any changes - if it doesn't affect performance.
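For reference, a debounce that only fires once the user has stopped typing for a while can be sketched in a few lines. This is a generic sketch, not the code used in the editor (lodash also ships a `_.debounce` that could be used instead). The `timers` parameter is an assumption added here so the behaviour can be demonstrated without waiting on real timers.

```javascript
// Minimal debounce sketch: `fn` runs only after `waitMs` have passed without
// another call. `timers` defaults to the real setTimeout/clearTimeout and is
// injectable purely to make the behaviour easy to demonstrate and test.
function debounce(
  fn,
  waitMs,
  timers = {
    set: (callback, ms) => setTimeout(callback, ms),
    clear: (handle) => clearTimeout(handle),
  }
) {
  let handle = null;
  return (...args) => {
    // Every new call cancels the pending one, so only the last call survives.
    if (handle !== null) timers.clear(handle);
    handle = timers.set(() => {
      handle = null;
      fn(...args);
    }, waitMs);
  };
}

// Demonstration with manual timers standing in for the browser clock.
let pendingCallback = null;
const manualTimers = {
  set: (callback) => {
    pendingCallback = callback;
    return 1;
  },
  clear: () => {
    pendingCallback = null;
  },
};
let alignmentRuns = 0;
const alignWhenIdle = debounce(() => {
  alignmentRuns += 1;
}, 3000, manualTimers);

alignWhenIdle(); // keystroke
alignWhenIdle(); // keystroke
alignWhenIdle(); // keystroke
console.log(alignmentRuns); // 0, still "typing"
pendingCallback(); // simulate 3 seconds of inactivity
console.log(alignmentRuns); // 1
```

In the editor this would be called from the on change or key down handler with the default timers; the callback itself is only invoked after the pause, so the per keystroke cost is one clearTimeout and one setTimeout.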

extra / stretch goal

[x] one thing that I found myself using for certain projects was selecting the whole text and replacing it with an accurate transcription (without speakers) in order to use the editor to re-align it and export a time-coded version. This wouldn't work without the possibility of bulk deleting or replacing paragraphs. So I added a dedicated btn with a prompt where you can paste the new text, and it runs alignment and replaces the slateJs content, while preserving the slateJs paragraph breaks. (altho not sure if that's good. might revisit that, it might be better if it does the paragraph breaks based on line breaks of the input text. or maybe it's not an issue for now 🤷‍♂️ ) This would probably still freeze the UI for a long transcript well over 1 hour.

pietrop commented 3 years ago

PR https://github.com/pietrop/slate-transcript-editor/pull/36 recap

[x] Change pause while typing to use debounce instead of timer
[ ] got debounce working for alignment when the user stops typing, but commented it out for now coz I cannot properly assess if it affects performance
[ ] can consider adding auto save, as part of the debounce alignment, if it doesn't affect performance.
[x] inserting text + enter, does alignment before the split
[x] deleting text before merging two paragraphs does alignment before the merge

pietrop commented 3 years ago

this has been merged to master and deployed as alpha releases to test it out and make it easier to revert back if needed. Will bump up the version when there's more confidence that it was a successful refactor that didn't introduce 🐞

closing this for now.
