Combined theirstory changes

jshearer commented 3 years ago

We have been working on adding some features to the transcript editor that were requested by some of our customers, as well as fixing any bugs we find. This PR contains the past few weeks of work. Importantly, there's also a pull request in align-diarized-text to significantly improve some of the performance issues we were seeing in longer transcripts, so I left out the commits in here bumping that dependency as I imagine you'll do it when you merge :)

Add VTT with speakers export option to include speaker info in the regular .vtt export, in the correct vtt syntax
Add VTT with speakers and paragraphs to generate an export in .vtt format, but instead of splitting by max characters on screen, we split by paragraph.
Add Word (OHMS) export option to generate a .docx file in the format expected by OHMS
Fix bug in subtitle exporter where multiple spaces anywhere in a transcript would throw off the word count resulting in a crash
New: Run align-diarized-text in a background task to prevent hanging the browser. This should work by itself, but we paired it with the webpack worker-plugin which works nicely.
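
To illustrate two of the items above, here is a minimal sketch of a paragraph-based VTT export with speakers, plus a word count that tolerates repeated spaces. It is not the code from this PR: `toVttTimestamp`, `countWords`, `paragraphsToVtt` and the `{ speaker, start, end, text }` paragraph shape are hypothetical names chosen for the example. The speaker is written as a WebVTT voice span (`<v Name>`).

```javascript
// Hypothetical helpers for illustration; not the exporter shipped in this PR.

// Format a time in seconds as a WebVTT timestamp (hh:mm:ss.mmm).
function toVttTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Count words by splitting on runs of whitespace, so "a  b" is two words
// rather than three (a plain split(' ') would yield an empty string).
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Emit one cue per paragraph, with the speaker in a voice span.
// Assumption: speaker and text contain no '<', '>' or '&' (no escaping here).
function paragraphsToVtt(paragraphs) {
  const cues = paragraphs.map(({ speaker, start, end, text }) => {
    const cleanText = text.trim().split(/\s+/).join(' ');
    return `${toVttTimestamp(start)} --> ${toVttTimestamp(end)}\n<v ${speaker}>${cleanText}`;
  });
  return ['WEBVTT', ...cues].join('\n\n') + '\n';
}
```

Splitting by paragraph means cue length follows the transcript's own structure instead of a characters-per-line limit, which is probably better suited to reading than to on-screen captions.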

If you'd like to review/merge these individually I'm happy to make separate PRs for each feature/fix, it was just easier to package it as one :)

pietrop commented 3 years ago

leaving a note here for integration with pietrop/digital-paper-edit-client and pietrop/digital-paper-edit-electron about using Node.js features in Electron's Web Workers

pietrop commented 3 years ago

Thanks for this @jshearer !

I had one issue when I tried the storybook locally: I went to export a Word doc with OHMS in one window and a VTT with speakers in another, to check out the output. The cursor started spinning, and I didn't get any output (waited a few minutes).

I was able to export plain text and plain text + speakers, but not plain text + timecodes.

This was after upgrading to the new module

- "align-diarized-text": "^1.0.8",
+"align-diarized-text": "^1.0.9",

I also reverted back to 1.0.8 and got the same issue

Let me know if you have any ideas on what could be causing this, and whether you get the same issue on your end?

btw, I had removed node_modules and package-lock.json before running npm install.

pietrop commented 3 years ago

I also don't fully understand what insertTimecodesInline (from inline-interval-timecodes) does?

jshearer commented 3 years ago

I'm wondering if you saw any errors in the console when exporting? I just tested and am able to export OHMS, vtt, plaintext+timecodes etc. FWIW this is all running on transcripts that came from Google with speaker diarization and were run through gcp-to-dpe v2.

insertTimecodesInline is admittedly kind of a weird feature: the OHMS export wants timecodes inserted in the middle of the text at every interval (30s in this case). Here's an example, notice the [00:00:30], [00:01:00] etc. [screenshot]
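
A minimal sketch of the idea, assuming words arrive as `{ text, start }` objects sorted by start time in seconds. This is not the actual inline-interval-timecodes implementation; `formatTimecode` and `insertTimecodesAtIntervals` are hypothetical names, and skipping markers for intervals in which nobody speaks is an assumption of this sketch.

```javascript
// Hypothetical sketch; NOT the inline-interval-timecodes implementation.

// Format whole seconds as a bracketed [hh:mm:ss] marker.
function formatTimecode(seconds) {
  const total = Math.floor(seconds);
  const pad = (n) => String(n).padStart(2, '0');
  const h = Math.floor(total / 3600);
  const m = Math.floor(total / 60) % 60;
  const s = total % 60;
  return `[${pad(h)}:${pad(m)}:${pad(s)}]`;
}

// words: [{ text, start }] sorted by start time (seconds).
// Inserts a marker before the first word at or after each interval boundary.
function insertTimecodesAtIntervals(words, intervalSeconds = 30) {
  const out = [];
  let nextMark = intervalSeconds;
  for (const word of words) {
    if (word.start >= nextMark) {
      // Use the latest boundary at or before this word, so a long silence
      // produces one marker rather than a run of consecutive markers.
      const boundary = Math.floor(word.start / intervalSeconds) * intervalSeconds;
      out.push(formatTimecode(boundary));
      nextMark = boundary + intervalSeconds;
    }
    out.push(word.text);
  }
  return out.join(' ');
}
```

With a 30-second interval, a word starting at 31s is preceded by `[00:00:30]`, and a word starting at 95s after a gap is preceded by `[00:01:30]`.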

pietrop commented 3 years ago

ok, cool. No, I didn't see anything significant in the console 🤷‍♂️.

In theory, once the transcript is converted to dpe format, it shouldn't make too much of a difference where it came from originally, unless there are bugs in the converters.

Did you try it in the storybook locally as well?

http://localhost:6006/?path=/story/slatetranscripteditor--demo

pietrop commented 3 years ago

ok, I am not sure why, but I think I might have figured it out 🤔 🥳

Something is not quite right about convertSlateToDpeAsync, not sure what exactly though. But if I change restoreTimecodes in src/util/restore-timecodes to use converSlateToDpe instead of convertSlateToDpeAsync, then I am able to export:

[x] plain txt with timecodes
[x] word document with timecodes
[x] docx word (OHMS)
[x] ♻️ alignment button works

It actually seems quite snappy at restoring timecodes (and I was still on align-diarized-text v1.0.8, so go figure; I will have to retry with v1.0.9, which might be even faster 🎉)

import convertDpeToSlate from '../dpe-to-slate';
-import { convertSlateToDpeAsync } from '../export-adapters/slate-to-dpe/index.js';
+import converSlateToDpe, { convertSlateToDpeAsync } from '../export-adapters/slate-to-dpe/index.js';

const restoreTimecodes = async ({ slateValue, transcriptData }) => {
  console.log('restoreTimecodes', slateValue, transcriptData);
-  const aligneDpeData = await convertSlateToDpeAsync(slateValue, transcriptData);
+  const aligneDpeData = await converSlateToDpe(slateValue, transcriptData);
  const alignedSlateData = convertDpeToSlate(aligneDpeData);
  return alignedSlateData;
};

export default restoreTimecodes;

I was not able to export VTT and other caption files though; I'd need to look more closely at that.
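
Since the async path stalled without anything in the console, one way to make such a stall visible while debugging is to race the call against a timeout. This is only a sketch: `withTimeout` is a hypothetical helper, not part of this repo. It only helps if the promise never settles while the main thread stays free (for example, a background task that never reports back); it cannot fire if the main thread itself is blocked.

```javascript
// Hypothetical debugging helper; not part of slate-transcript-editor.
// Rejects with a labelled error if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Possible usage inside restoreTimecodes (the 10000 ms limit is an arbitrary choice):
// const aligneDpeData = await withTimeout(
//   convertSlateToDpeAsync(slateValue, transcriptData),
//   10000,
//   'convertSlateToDpeAsync'
// );
```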

jshearer commented 3 years ago

Okay, I'll look more into this tomorrow (today? :p)

We also did notice a bug where sometimes if you try to bulk-change a speaker name while an export is happening, the browser will hang like before, and also that bulk-changing a speaker name more than once doesn't seem to work, so those are also on my list here.

Are the transcripts you were using to cause these issues public/somewhere I can see them to try and reproduce myself?

pietrop commented 3 years ago

Yeah, so to reproduce you can run

npm start

That starts the storybook locally at

http://localhost:6006/?path=/story/slatetranscripteditor--demo

It should be the same as pietropassarelli.com/slate-transcript-editor but with the local changes, obviously.

You can see the various stories here slate-transcript-editor/src/components/1-SlateTranscriptEditor.stories.js#L30-L46

They are meant to exemplify various initializations as well as edge cases, e.g. long transcripts.

For transcripts, I think I am mostly using this one, soleio-dpe, and its video, originally from the PBS Frontline transparency project on YouTube.

Let me know if you got any questions :)

pietrop commented 3 years ago

ok, yeah, captions export wasn't working for me for the same reason as the other exports: using convertSlateToDpeAsync instead of converSlateToDpe in getEditorContent in src/components/index.js

pietrop commented 3 years ago

Removed the service worker part and merged the progress so far. We can do a separate PR for the service worker if you get that to work.

pietrop / slate-transcript-editor

Combined theirstory changes #16