pietrop / slate-transcript-editor

A React component to make correcting automated transcriptions of audio and video easier and faster. Using the SlateJs editor.
https://pietrop.github.io/slate-transcript-editor

Preserving timed text (and pagination issue?) #32

Closed pietrop closed 3 years ago

pietrop commented 3 years ago

Working on this PR https://github.com/pietrop/slate-transcript-editor/pull/30 I ran into an issue figuring out the right logic to paginate the transcript.

The issue

TL;DR: When the user corrects the text, they might delete, substitute or insert words. These operations tend to lose the time-codes originally associated with each word. The alignment module currently in use loses performance for transcripts over 1 hour. So we are considering pagination as a ~quick~ fix.

If you truly want the TL;DR version, skip to the Pagination heading. Otherwise read on for more context.

### Context

Some quick background for those new to the project. `slate-transcript-editor` builds on top of the lessons learned from developing [@bbc/react-transcript-editor](https://github.com/bbc/react-transcript-editor) (based on [draftJs](https://draftjs.org/)). As the name suggests, `slate-transcript-editor` is built on top of [slateJs](https://slatejs.org), augmenting it with transcript editing domain specific functionalities. For more on "draftjs vs slatejs" for this use case, see [these notes](https://github.com/pietrop/slate-transcript-editor/blob/master/docs/notes/draftjs-vs-slatejs.md).

It is a React transcript editor component that allows users to correct automated transcriptions of audio or video generated by speech to text services. It is used in apps such as [autoEdit](https://www.autoedit.io), an app to edit audio/video interviews, as well as in other situations where users might need to correct transcriptions.

The ambition is to have a component that takes in timed text (eg a list of words with start times), allows the user to correct the text (providing some convenience features, such as pause while typing, and keeping some kind of correspondence between the text and audio/video) and on save returns timed text in the same json format (referred to, for convenience, as dpe format, after the digital paper edit project where it was first formalized).

```js
{
  "words": [
    {
      "end": 0.46, // in seconds
      "start": 0,
      "text": "Hello"
    },
    {
      "end": 1.02,
      "start": 0.46,
      "text": "World"
    },
    ...
  ],
  "paragraphs": [
    {
      "speaker": "SPEAKER_A",
      "start": 0,
      "end": 3
    },
    {
      "speaker": "SPEAKER_B",
      "start": 3,
      "end": 19.2
    },
    ...
  ]
}
```

As part of `slate-transcript-editor` this dpe format is then converted into the [slateJs](https://www.slatejs.org/) data model.

[See the storybook demo to see the `slate-transcript-editor` React component in practice](https://pietropassarelli.com/slate-transcript-editor)
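To make the dpe to slateJs conversion a bit more concrete, here is a minimal sketch of how words could be grouped into paragraph nodes. It is not the project's actual converter: the function name `dpeToSlateSketch` is made up for illustration, and the node shape (a `timedText` block with a `words` attribute on the text child) is assumed from the examples that appear later in this thread.

```js
// Hypothetical sketch: group dpe words into Slate-like paragraph nodes.
// `dpeToSlateSketch` is an illustrative name, not a function from the project.
function dpeToSlateSketch({ words, paragraphs }) {
  return paragraphs.map((paragraph) => {
    // A word belongs to a paragraph if it starts within the paragraph's time range.
    const paragraphWords = words.filter(
      (word) => word.start >= paragraph.start && word.start < paragraph.end
    );
    return {
      type: 'timedText',
      speaker: paragraph.speaker,
      start: paragraph.start,
      children: [
        {
          // The editable text is the words joined by single spaces.
          text: paragraphWords.map((word) => word.text).join(' '),
          words: paragraphWords,
        },
      ],
    };
  });
}

const slateValue = dpeToSlateSketch({
  words: [
    { start: 0, end: 0.46, text: 'Hello' },
    { start: 0.46, end: 1.02, text: 'World' },
    { start: 3, end: 3.5, text: 'Hi' },
  ],
  paragraphs: [
    { speaker: 'SPEAKER_A', start: 0, end: 3 },
    { speaker: 'SPEAKER_B', start: 3, end: 19.2 },
  ],
});
console.log(slateValue[0].children[0].text); // prints "Hello World"
```

The filter inside the map is quadratic in the worst case, which is fine for a sketch but would probably need a single pass over sorted words for long transcripts.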
Over time, folks in this domain have tried a variety of approaches to solve this problem.

#### compute the timings

By listening to char insertion and deletion and detecting word boundaries, you could estimate the time-codes. This is a very fiddly approach, as there are a lot of edge cases to handle. Eg what if a user deletes a whole paragraph? And over time the accuracy of the time-codes slowly fades (if a lot of corrections are made to the text, eg if the STT is not very accurate).

#### alignment - server side - Aeneas

Some folks have had some success running server side alignment. For example in [pietrop/fact2_transcription_editor](https://github.com/pietrop/fact2_transcription_editor) the editor was one giant content editable div, and on save it would send a plain text version to the server (literally using `.innerText`).

@frisch1 then, server side, would align it against the original media using the [aeneas aligner](https://github.com/readbeyond/aeneas) by @pettarin.

Aeneas converts the text into speech (TTS) and then compares that waveform against the original media to very quickly produce the alignment, restoring time-codes either at word or line level depending on your preferences.

Aeneas uses dynamic time warping (DTW) over Mel-frequency cepstral coefficients (MFCCs) (🤯). You can read more about how Aeneas works in the [How Does This Thing Work?](https://github.com/readbeyond/aeneas/blob/4d200a050690903b30b3d885b44714fecb23f18a/wiki/HOWITWORKS.md) section of their docs.

This approach for [fact2_transcription_editor](https://github.com/pietrop/fact2_transcription_editor) was somewhat successful, Aeneas is very fast. However

- the alignment is only done on save to the database.
- If a user continues to edit the page, over time more and more of the time-codes will disappear until they refresh the page and the "last saved and aligned" transcript gets fetched from the db.
- And to set this up as "a reusable component" you'd always have to pair it with a server side module to do the alignment.
- Aeneas is great but in its current form does not exist as an npm module (as far as I am aware?). It's written in python and has some system dependencies such as ffmpeg, a TTS engine etc.
side note on word level time-codes and clickable words

I should mention that in [fact2_transcription_editor](https://github.com/pietrop/fact2_transcription_editor) you could click on individual words to jump to the corresponding point in the media. With something equivalent to

```html
Hello ...
```

A pattern I had first come across in [hyperaud.io's blog description of "hypertranscripts"](https://hyperaud.io/blog/hypertranscripts/) by @maboa & @gridinoc
#### STT based alignment - Gentle

Some folks have also used [Gentle](https://github.com/lowerquality/gentle), by @maxhawkins, a forced aligner based on Kaldi, as a way to get alignment info. I've personally [used it for autoEdit2](https://autoedit.gitbook.io/user-manual/setup-stt-apis/setup-stt-apis-gentle) as an open source offline option for users to get transcriptions. But I haven't used it for alignment, as STT based alignment is slower than the TTS one.

#### alignment - client side - option 1 (stt-align)

Another option is to run the alignment client side, by doing a diff between the human corrected (accurate) text and the timed text from the STT engine, and transposing the time-codes from the second to the first.
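To illustrate the idea of transposing time-codes via a diff, here is a rough sketch. It is not how `stt-align-node` is implemented: it swaps in a plain longest common subsequence (LCS) table as the diff, and the names `normalise` and `transposeTimecodes` are made up for this example. Words that match keep the STT timings, anything else is left as `null`.

```js
// Hypothetical sketch of diff-based alignment, using an LCS table as the diff.
// Lower-case and strip basic punctuation so "fox." still matches "fox".
const normalise = (text) => text.toLowerCase().replace(/[.,!?;:]/g, '');

function transposeTimecodes(sttWords, correctedText) {
  const corrected = correctedText.trim().split(/\s+/).filter(Boolean);
  const a = sttWords.map((word) => normalise(word.text));
  const b = corrected.map(normalise);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table: matched words inherit the STT timings, the rest stay null.
  const result = corrected.map((text) => ({ text, start: null, end: null }));
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result[j].start = sttWords[i].start;
      result[j].end = sttWords[i].end;
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      i++; // STT word was deleted or substituted by the user
    } else {
      j++; // word was typed in by the user, no timing available
    }
  }
  return result;
}

const aligned = transposeTimecodes(
  [
    { text: 'the', start: 0, end: 0.2 },
    { text: 'quick', start: 0.2, end: 0.5 },
    { text: 'brown', start: 0.5, end: 0.9 },
    { text: 'fox', start: 0.9, end: 1.2 },
  ],
  'The quick red fox.'
);
console.log(aligned);
```

The table grows with the product of the two word counts, which gives a feel for why aligning a whole long transcript in one go can get expensive.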
some more background and info on this solution

This solution was first introduced by @chrisbaume in [bbc/dialogger](https://github.com/bbc/dialogger) ([presented at textAV 2017](https://textav.gitbook.io/textav-event/projects/bbc-dialogger)). It modified [CKEditor](https://ckeditor.com) (at the time draftJS was not around yet) and ran the alignment server side in a custom python module, [sttalign.py](https://github.com/pietrop/stt-align-node/blob/master/docs/python-version/sttalign.py).

With @chrisbaume's help I converted the python code into a node module, [stt-align-node](https://github.com/pietrop/stt-align-node), which is used in [@bbc/react-transcript-editor](https://github.com/bbc/react-transcript-editor) and [slate-transcript-editor](https://github.com/pietrop/slate-transcript-editor).

One issue in converting from python to [the node version](https://github.com/pietrop/stt-align-node/blob/master/src/align/index.js) is that for diffing python uses [difflib](https://github.com/pietrop/stt-align-node/blob/master/docs/python-version/sttalign.py#L31), which is [part of the core library](https://docs.python.org/3/library/difflib.html), while in the node module [we use](https://github.com/pietrop/stt-align-node/blob/master/src/index.js#L27) [difflib.js](https://github.com/qiao/difflib.js), which might not be as performant (❓ 🤷‍♂️ ).

When a word is inserted (eg it was not recognized by the STT service and the user adds it manually), in this type of alignment there are no time-codes for it. Via interpolation of the time-codes of neighboring words, we add some time-codes back.

In the python version the time-codes interpolation is done via [numpy](https://numpy.org) to [linearly interpolate the missing times](https://github.com/pietrop/stt-align-node/blob/master/docs/python-version/sttalign.py#L3-L16). In the [node version the interpolation](https://github.com/pietrop/stt-align-node/blob/master/src/align/index.js#L61-L95) is done via the [everpolate](http://borischumichev.github.io/everpolate/#linear) module, and again it might not be as performant as the python version (❓ 🤷‍♂️ ).
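As a rough illustration of the interpolation step, the sketch below fills in missing timings using plain JavaScript, with the word index as the x axis. It does not use everpolate or numpy, and the helper names `interpolateKey` and `interpolateTimecodes` are assumptions made for this example rather than functions from either module.

```js
// Hypothetical sketch: linearly interpolate missing (null) timings from neighbours.
function interpolateKey(words, key) {
  // Indexes of the words that still have a value for this key.
  const known = words.map((word, i) => (word[key] === null ? -1 : i)).filter((i) => i >= 0);
  return words.map((word, index) => {
    if (word[key] !== null || known.length === 0) return word[key];
    const prev = known.filter((k) => k < index).pop();
    const next = known.find((k) => k > index);
    // At the edges there is only one timed neighbour, so reuse its value.
    if (prev === undefined) return words[next][key];
    if (next === undefined) return words[prev][key];
    const ratio = (index - prev) / (next - prev);
    return words[prev][key] + ratio * (words[next][key] - words[prev][key]);
  });
}

function interpolateTimecodes(words) {
  const starts = interpolateKey(words, 'start');
  const ends = interpolateKey(words, 'end');
  return words.map((word, index) => ({ ...word, start: starts[index], end: ends[index] }));
}

const interpolated = interpolateTimecodes([
  { text: 'quick', start: 1, end: 2 },
  { text: 'red', start: null, end: null }, // inserted by the user, no timings
  { text: 'fox', start: 3, end: 4 },
]);
console.log(interpolated[1]); // prints { text: 'red', start: 2, end: 3 }
```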
However in [@bbc/react-transcript-editor](https://github.com/bbc/react-transcript-editor) and [slate-transcript-editor](https://github.com/pietrop/slate-transcript-editor), initially every time the user stopped typing for longer than a few seconds, we'd trigger a save, which was preceded by an alignment. This became very un-performant, especially for long transcriptions (eg approximately over 1 hour), because whether you changed a paragraph or just one word, it would run the alignment across the whole text, which turned out to be a pretty expensive operation.

This led to removing user facing word level time-codes in the slateJs version, to improve performance on long transcriptions, and to removing auto save.

However, on long transcriptions, even with manual save, sometimes the `stt-align-node` module can temporarily freeze the UI for a few seconds 😬 or in the worst case scenario sometimes even crash the page 😓 ☠️
more on retaining speaker labels after alignment

There is also a workaround for retaining speaker labels at paragraph level when using this module to run the alignment. The module itself only aligns the words. To re-introduce the speakers, you just compare the aligned words with the paragraphs that carry the speaker info. [Example of converting into slateJs format](https://github.com/pietrop/slate-transcript-editor/blob/master/src/util/update-timestamps/index.js#L15-L47) or into [dpe format from slateJs](https://github.com/pietrop/slate-transcript-editor/blob/pagination/src/util/export-adapters/slate-to-dpe/index.js#L14-L40)
Which is why in PR https://github.com/pietrop/slate-transcript-editor/pull/30 we are considering pagination. But before a closer look into that, let's consider one more option.

#### alignment - client side - option 2 (web-aligner)

Another option explored by @chrisbaume at textAV 2017 was to make a [webaligner](https://github.com/chrisbaume/webaligner) ([example here](http://pietropassarelli.com/webaligner-example/index.html) [and code of the example here](https://github.com/chrisbaume/webaligner-example)), a ~simple~ lightweight client-side forced aligner for timed text, leveraging the browser audio API ([AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext)) and doing computation similar to Aeneas (? not sure about this last sentence?).

This option is promising, but was never fully fleshed out to a usable state. It might also only work when aligning short sentences due to browser limitations (?).

#### 5. Overtyper

Before considering pagination, a completely different approach to the UX problem of correcting text is [overtyper](https://github.com/alexnorton/overtyper) by @alexnorton & @maboa from textAV 2017, where you follow along a range of words being highlighted as the media plays. To correct, you start typing from the last correct word you heard until the next correct one, so that the system can adjust/replace/insert all the ones in between. This narrows the alignment problem a lot, and new word timings can be computed more easily.

This is promising, but unfortunately as far as I know there hasn't been a lot of user testing to validate this approach.

Pagination

For slate-transcript-editor we've been using (option 3) client side alignment with stt-align-node to restore time-codes on user's save.

However because of the performance issue on long transcriptions, we've been considering pagination - PR https://github.com/pietrop/slate-transcript-editor/pull/30 - but ran into a few issues.

For now we can assume the transcription comes as one payload from the server. And I've been splitting it into one hour chunks.

The idea is that the slateJs editor can be responsible for the text editing part, while alignment, save, and export in various formats can be done in the parent component to provide a cohesive interface that, for example, merges all the pages into one doc before exporting but only updates the current chunk when saving.
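A minimal sketch of that chunking idea, in plain JavaScript and independent of React or slateJs. The helper names (`splitIntoChunks`, `updateChunk`, `mergeChunks`) are hypothetical, and it only deals with the list of words, leaving paragraphs aside, so treat it as an illustration rather than what the PR does.

```js
// Hypothetical sketch: split words into one hour pages, update one page, merge for export.
const ONE_HOUR_IN_SECONDS = 60 * 60;

function splitIntoChunks(words, chunkDuration = ONE_HOUR_IN_SECONDS) {
  const chunks = [];
  words.forEach((word) => {
    // The chunk a word belongs to is decided by its start time.
    const chunkIndex = Math.floor(word.start / chunkDuration);
    if (!chunks[chunkIndex]) chunks[chunkIndex] = [];
    chunks[chunkIndex].push(word);
  });
  // Replace holes (hours with no speech) with empty chunks so indexes map to hours.
  return Array.from(chunks, (chunk) => chunk || []);
}

// On save, only the current chunk is replaced; the other chunks keep their identity.
function updateChunk(chunks, chunkIndex, newWords) {
  return chunks.map((chunk, index) => (index === chunkIndex ? newWords : chunk));
}

// On export, all the pages are merged back into one list of words.
function mergeChunks(chunks) {
  return chunks.flat();
}

const chunks = splitIntoChunks([
  { text: 'first', start: 10, end: 10.5 },
  { text: 'hour', start: 3599, end: 3599.5 },
  { text: 'second', start: 3600, end: 3600.4 },
  { text: 'third', start: 7300, end: 7300.6 },
]);
const updated = updateChunk(chunks, 1, [{ text: 'corrected', start: 3600, end: 3600.4 }]);
console.log(mergeChunks(updated).map((word) => word.text).join(' '));
// prints "first hour corrected third"
```

Because untouched chunks keep the same array reference, a parent component could in principle use reference equality to skip re-rendering or re-aligning them.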

questions
  1. Should these chunks be stored in the state of the parent component, or is there a performance issue in doing that in React?
  2. Should you loop through the chunks in the render method and only display the current one? Is this a good pattern to use or is there a better one?
  3. Should the state of the slateJS editor be held in the parent component? (this seemed to cause a performance issue)
  4. on change of the slateJS editor, do we just update the current chunk or also the array of chunks? (this seemed to cause a performance issue)

I am going to continue to try a few other things here, but any thoughts, ideas 💡 or examples of React best practices for paginating text editors are much appreciated.

Quick disclaimer: Last but not least this is my best effort to collect info on this topic in order to frame the problem and hopefully get closer to a solution, if some of these are not as accurate as they should be, feel free to let me know in the comments.

pietrop commented 3 years ago

also relevant via @xshy216 https://github.com/pietrop/slate-transcript-editor/issues/10#issuecomment-722846904

pietrop commented 3 years ago

An update on the latest thinking, and a chance to recap some of the current progress.

Deferring pagination exploration

After talking to @rememberlenny I decided to defer trying out pagination in favor of an approach that tries to single out paragraphs that have changed and align only those.

Options for aligning only the paragraphs that changed

There are two ways in which you could do that:

  1. One is to use the slateJs api for onKeyDown and/or onChange and keep some sort of list that keeps track of where the changes in the doc have been made, based on user cursor and selection. For now this seems laborious.
  2. The other is to compare the paragraphs and single out those that have changed, and only run the alignment for those (using pietrop/stt-align-node).

Word level timings and clickable words

Slightly unrelated, but relevant: similar to the DraftJs approach of using entities in @bbc/react-transcript-editor, but somehow way more performant, we can bring back clickable words by adding the words as an attribute of the text child node, alongside the text attribute.

example

```js
[
  {
    "type": "timedText",
    "speaker": "James Jacoby",
    "start": 1.41,
    "previousTimings": "0",
    "startTimecode": "00:00:01",
    "children": [
      {
        "text": "So tell me, let’s start at the beginning.",
        "words": [
          { "end": 1.63, "start": 1.41, "text": "So" },
          { "end": 2.175, "start": 1.63, "text": "tell" },
          { "end": 2.72, "start": 2.175, "text": "me," },
          { "end": 2.9, "start": 2.72, "text": "let’s" },
          { "end": 3.14, "start": 2.9, "text": "start" },
          { "end": 3.21, "start": 3.14, "text": "at" },
          { "start": 3.21, "end": 3.28, "text": "the" },
          { "end": 4.88, "start": 4.346666666666666, "text": "beginning." }
        ]
      }
    ]
  },
  ...
```

We can add onDoubleClick to the renderLeaf component.

 onDoubleClick={handleTimedTextClick}

And use a getSelectionNodes helper function that uses the slateJS selection/cursor position to return the timecode of the current word. Assuming the text has not been edited, comparing the selection offset against the cumulative character count of the words list gives you the start time of the word being clicked on (if that makes sense?).
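The offset to word lookup could look something like the sketch below. This is not the actual `getSelectionNodes` helper: `findWordAtOffset` is a made-up name, and it assumes the text is exactly the words joined by single spaces (ie it has not been edited since the last alignment).

```js
// Hypothetical sketch: find the word under the cursor from a character offset.
// Assumes the paragraph text equals the words joined by single spaces.
function findWordAtOffset(words, offset) {
  let wordStartChar = 0;
  for (const word of words) {
    const wordEndChar = wordStartChar + word.text.length;
    // The cursor is inside this word, or right at its end.
    if (offset <= wordEndChar) return word;
    wordStartChar = wordEndChar + 1; // + 1 for the space between words
  }
  // Offset past the end of the text: fall back to the last word (undefined if no words).
  return words[words.length - 1];
}

const words = [
  { end: 1.63, start: 1.41, text: 'So' },
  { end: 2.175, start: 1.63, text: 'tell' },
  { end: 2.72, start: 2.175, text: 'me,' },
];
// In "So tell me," offset 4 falls inside "tell".
console.log(findWordAtOffset(words, 4).start); // prints 1.63
```

In the double click handler, the offset would come from the editor selection (eg `editor.selection.anchor.offset`) and the words from the `words` attribute of the text node that was clicked.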

Paragraph changes

Option 2 assumes that paragraphs are not changing, eg splitting or merging a paragraph. OR that this is being handled separately from the alignment process.

For now I've disabled splitting and merging paragraphs via the Enter and Backspace keys (eg if Backspace is pressed at the beginning of a paragraph). However you can still delete multiple words within one paragraph.

example

```js
// TODO: revisit logic for
// - splitting paragraph via enter key
// - merging paragraph via delete
// - merging paragraphs via deleting across paragraphs
const handleOnKeyDown = (event) => {
  console.log('event.key', event.key);
  if (event.key === 'Enter') {
    // intercept Enter
    event.preventDefault();
    console.log('For now disabling enter key to split a paragraph, while figuring out the alignment issue');
    return;
  }
  if (event.key === 'Backspace') {
    const selection = editor.selection;
    console.log('selection', selection);
    console.log(selection.anchor.path[0], selection.focus.path[0]);
    // across paragraphs
    if (selection.anchor.path[0] !== selection.focus.path[0]) {
      console.log('For now cannot merge paragraph via delete across paragraphs, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
    // beginning of a paragraph
    if (selection.anchor.offset === 0 && selection.focus.offset === 0) {
      console.log('For now cannot merge paragraph via delete, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
  }
};
```

option 2. identify paragraphs that have changed

One idea from @rememberlenny: if you don't run the alignment on every keystroke or whenever the user stops typing (both are possible optimizations to consider - via @gridinoc), then you need to find which paragraphs have changed, and only align those.

I found that lodash differenceWith is pretty snappy, and you can specify a comparator function, which allows you to, for example, compare only the text attribute of the child node, as opposed to the whole paragraph block.

example

```js
/**
 * Update timestamps using stt-align module
 * @param {*} currentContent - slate js value
 * @param {*} dpeTranscript - dpe transcript, with the list of stt words
 * @return slateJS value
 */
// TODO: do optimization mentions in TODOS below and try out on 5 hours long to see if UI Still freezes.
// TODO: in stt-align-node if all the words are completely diff, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed

  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);

  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph, to be able to update the words attribute
  // in the paragraph and easily replace that paragraph in the slate editor content.
  // Obv this wouldn't work, if re-enable the edge cases, disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
  // This gives you a list of paragraphs that have changed, and because we added indexes via ids,
  // we can easily and quickly identify them and run alignment on individual paragraphs.
```
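The `comparator` used above is not shown in the snippet, so here is a guess at what a text-only comparator might look like. To keep the sketch self-contained it uses a small stand-in for lodash's `_.differenceWith(array, values, comparator)` rather than importing lodash; `changedByIndex` is a made-up alternative that relies on paragraphs keeping their position.

```js
// Hypothetical comparator: two paragraphs are "equal" if their first child has the same text.
const comparator = (a, b) => a.children[0].text === b.children[0].text;

// Minimal stand-in for lodash's _.differenceWith(array, values, comparator):
// keeps the items of `array` that match nothing in `values`.
const differenceWith = (array, values, cmp) =>
  array.filter((item) => !values.some((value) => cmp(item, value)));

// Alternative that only compares paragraphs at the same position. This assumes
// paragraphs cannot be split, merged or deleted, as is currently the case.
const changedByIndex = (current, original) =>
  current.filter((paragraph, index) => !original[index] || !comparator(paragraph, original[index]));

const original = [
  { children: [{ text: 'So tell me' }] },
  { children: [{ text: 'Well it started' }] },
  { children: [{ text: 'And then' }] },
];
const current = [
  { id: 0, children: [{ text: 'So tell me' }] },
  { id: 1, children: [{ text: 'Well, it all started' }] }, // edited by the user
  { id: 2, children: [{ text: 'And then' }] },
];

const diffParagraphs = differenceWith(current, original, comparator);
console.log(diffParagraphs.map((paragraph) => paragraph.id)); // prints [ 1 ]
```

One caveat with the `differenceWith` semantics: each current paragraph is compared against every original paragraph, so a paragraph edited to read exactly like some other original paragraph would not be flagged. The index based variant avoids that and does less work, at the cost of depending on paragraph positions staying stable.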

option 2. align individual paragraphs that have changed

Once you have the individual paragraphs that need aligning you can run alignSTT on each and replace them in the slateJs editor current content value list of paragraphs.

example

```js
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });
  return newCurrentContent;
};
```
full example

```js
// TODO: do optimization mentions in TODOS below and try out on 5 hours long to see if UI Still freezes.
// TODO: in stt-align-node if all the words are completely diff, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed

  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);

  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph, to be able to update the words attribute
  // in the paragraph and easily replace that paragraph in the slate editor content.
  // Obv this wouldn't work, if re-enable the edge cases, disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });
  return newCurrentContent;
};
```

up next.

See latest commit of the PR https://github.com/pietrop/slate-transcript-editor/pull/36 for more details on this.

Refactor/clean up

Also

And

pietrop commented 3 years ago

Some thoughts after recent refactor https://github.com/pietrop/slate-transcript-editor/pull/36

on 💡 ~You are not allowed to completely delete a paragraph?~ as it could make things easier for alignment, as a paragraph will always have timed words associated with it.

This would mean that you are running the STT align against the most recent re-alignment, as opposed to the original STT data. But it would give flexibility to handle changing paragraphs, as well as to skip alignment of paragraphs that might not need it.
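If every paragraph always carries its own timed words, one way to decide whether a paragraph can skip alignment is to check its text against its own words. A small sketch of that check follows; `needsAlignment` is a hypothetical helper, and it assumes the paragraph shape from the earlier example and that an untouched paragraph's text is its words joined by single spaces.

```js
// Hypothetical sketch: a paragraph needs re-aligning only if its text no longer
// matches the words from its most recent alignment.
function needsAlignment(paragraph) {
  const [child] = paragraph.children;
  const textFromWords = child.words.map((word) => word.text).join(' ');
  return child.text.trim() !== textFromWords;
}

const untouched = {
  children: [
    {
      text: 'So tell me,',
      words: [
        { start: 1.41, end: 1.63, text: 'So' },
        { start: 1.63, end: 2.175, text: 'tell' },
        { start: 2.175, end: 2.72, text: 'me,' },
      ],
    },
  ],
};
// Same words, but the user changed the text.
const edited = {
  children: [{ ...untouched.children[0], text: 'So please tell me,' }],
};

console.log(needsAlignment(untouched)); // prints false
console.log(needsAlignment(edited)); // prints true
```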

Still unsure of the frequency of the alignment: def on save, but not sure if it should happen on pause typing, maybe not for now. Need to check performance against a longer file (1 to 5 hours example).

pietrop commented 3 years ago

Updated storybook demo https://pietropassarelli.com/slate-transcript-editor/ to reflect this PR https://github.com/pietrop/slate-transcript-editor/pull/36

[Screenshot: Screen Shot 2021-02-18 at 12 16 37 AM]

to recap

Some things I am not sure about

extra / stretch goal

pietrop commented 3 years ago

PR https://github.com/pietrop/slate-transcript-editor/pull/36 recap

pietrop commented 3 years ago

This has been merged to master and deployed as alpha releases, to test it out and make it easier to revert back if needed. Will bump up the version when there's more confidence that it was a successful refactor that didn't introduce 🐞

closing this for now.