pietrop / slate-transcript-editor

A React component to make correcting automated transcriptions of audio and video easier and faster. Using the SlateJs editor.
https://pietrop.github.io/slate-transcript-editor

After saved, new words created, duplicates. #10

Closed xshy216 closed 3 years ago

xshy216 commented 3 years ago

Describe the bug

"slate-transcript-editor": "0.0.15",

3cbfe40c-99a9-463a-b58e-39d14e02cdcb.wav.dpe.json.zip

I am using the Slate editor. The export function has an issue: when I export with timecodes, the transcript is changed and words are duplicated.

{"words":[
{"start":4.11,"confidence":0.779731,"end":4.5,"text":"Yeah.","id":0,"index":0},
{"start":5.03,"confidence":0.852385,"end":5.48,"text":"Okay.","id":1,"index":1},
{"start":6.18,"confidence":0.793509,"end":6.5,"text":"Now,","id":2,"index":2},
{"start":6.5,"confidence":0.821012,"end":6.82,"text":"so","id":3,"index":3},
{"start":6.86,"confidence":0.754611,"end":7.24,"text":"probably","id":4,"index":4},
{"start":7.81,"confidence":0.870976,"end":8.05,"text":"would","id":5,"index":5},
....],
"paragraphs":[
{"id":0,"start":4.11,"end":4.5,"speaker":2},
{"id":1,"start":5.03,"end":5.48,"speaker":1},
{"id":2,"start":6.18,"end":15.08,"speaker":2},
{"id":3,"start":15.37,"end":15.98,"speaker":1},
...]}'

Changed to:

{"words":[
{"text":"Okay.","start":3.9399999999999977,"end":4.259999999999998},
{"text":"Now,","start":4.259999999999998,"end":4.579999999999998},
{"text":"Okay.","start":4.579999999999998,"end":4.899999999999999},
{"text":"Now,","start":4.899999999999999,"end":5.219999999999999},
{"text":"Okay.","start":5.219999999999999,"end":5.539999999999999},
{"text":"Now,","start":5.539999999999999,"end":5.859999999999999},
{"end":6.18,"start":5.859999999999999,"text":"Okay."},
{"end":6.5,"start":6.18,"text":"Now,"},
{"end":6.86,"start":6.5,"text":"So"},
{"end":7.81,"start":6.86,"text":"probably"},
{"end":8.05,"start":7.81,"text":"would"},
...],
"paragraphs":[
{"speaker":"2","start":4.11,"end":4.259999999999998,"id":"0"},
{"speaker":"1","start":5.859999999999999,"end":6.5,"id":"1"},
{"speaker":"2","start":6.5,"end":10.754999999999999,"id":"2"},
{"speaker":"Speaker A","start":10.754999999999999,"end":15.37,"id":"3"},
...]}'
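
As a rough way to spot this kind of duplication in an exported file, the sketch below scans the `words` array for a run of words that is immediately repeated (such as "Okay. Now," followed by "Okay. Now," above). The `Word` interface and the `findRepeatedRuns` helper are hypothetical names used only for illustration; they are not part of slate-transcript-editor.

```typescript
// Hypothetical shape of a DPE word, limited to the fields shown in this issue.
interface Word {
  text: string;
  start: number;
  end: number;
}

// Returns every start index where a run of `runLength` words is
// immediately followed by the same run of words (compared by text only).
function findRepeatedRuns(words: Word[], runLength: number): number[] {
  const hits: number[] = [];
  for (let i = 0; i + 2 * runLength <= words.length; i++) {
    let same = true;
    for (let j = 0; j < runLength; j++) {
      if (words[i + j].text !== words[i + runLength + j].text) {
        same = false;
        break;
      }
    }
    if (same) {
      hits.push(i);
    }
  }
  return hits;
}
```

A legitimate transcript can also contain repeated phrases, so a non-empty result is only a hint that the export is worth inspecting, not proof of the bug.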
pietrop commented 3 years ago

👋 Thanks for flagging this. Does this happen only when exporting with timecodes? And does it happen for all timecode export options or just some?

xshy216 commented 3 years ago

It happens when saving as well, and all timecode export options have the same problem.

pietrop commented 3 years ago

Ok, that makes sense: when it saves, it runs timecode re-alignment.

What do you use for speech-to-text before converting to DPE format?

pietrop commented 3 years ago

I can’t seem to reproduce this in Storybook. What version of slate-transcript-editor are you on?

Update: sorry, I see you said 0.0.15 in your first post.

xshy216 commented 3 years ago

I am using Azure, Xfyun, etc. This file is from Azure; I converted it to DPE.

overZellis133 commented 3 years ago

@pietrop, we are seeing this, and we had a student today at American University have their transcript rendered pretty unusable when some words were replicated thousands of times across different portions of their transcript. We are using Google STT before converting to DPE. We are seeing the issue sometimes upon saving.

pietrop commented 3 years ago

Thanks for flagging this @overZellis133, it would be good to take a close look at the sample data.

pietrop commented 3 years ago

To recap our convo

I am not sure if this is caused by the conversion of the data provided to SlateJs. It takes DPE format.

For GCP, I made a converter, pietrop/gcp-to-dpe. In the latest v2 it has been refactored (removing the intermediate draftJs conversion, as it originally came from @bbc/react-transcript-editor), and it needs/uses GCP speaker diarization to break paragraphs on speaker change.

So it's worth trying that to see if the issue still persists.

xshy216 commented 3 years ago

Thanks. I will try this GCP converter sometime later.

To make @bbc/react-transcript-editor work better for files that are hours long, I customized it to paginate the transcript: only a portion is loaded into the editor, so the browser doesn't hang. It works for my project.

I also added a paragraph mode, color highlighting of changed text, and a way to retrieve the original transcript and align against it to color the changes (for when text is copied and pasted instead of edited).

[image: image.png]

pietrop commented 3 years ago

Ah, that's interesting. I wasn't able to see the image though.

Are you using @pietrop/slate-transcript-editor or the earlier @bbc/react-transcript-editor?

xshy216 commented 3 years ago

Earlier one.

pietrop commented 3 years ago

Cool, yeah, it would be good to see what the pagination looks like, if you are able to share it as a PR in @bbc/react-transcript-editor, since that's a problem we are still trying to solve for that project.

xshy216 commented 3 years ago

Ok, I will try to make a PR there.

I hadn't coded for 20 years and only came back recently for a project. I did not build it cleanly as a separate component; I just made it work, so I need some time to turn it into a PR.

For your reference, here is a screenshot: ScreenHunter 72

pietrop commented 3 years ago

Yeah, no rush, and no worries if the code isn't perfect. It would just be interesting to see the code/PR to understand the concept/idea behind the pagination in draftJS 😊

xshy216 commented 3 years ago

Hm, I did it outside of draftJS: the pagination goes into the editor component, not into draftJS. I added one more prop to the editor to pass in the whole transcript, but only one page is handed to the editor (DPE, then draftJS) for editing. When the page changes, the current page is saved into memory (a slice of the array) and the requested page is loaded into the editor. When the user chooses Save, the whole transcript is saved locally or to the database.
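
A minimal sketch of the pagination idea described above, assuming the transcript is paged by word count. `Word`, `PagedTranscript`, and the method names are hypothetical and are not taken from @bbc/react-transcript-editor or from the customized code; the sketch only shows the bookkeeping around the editor, not the editor integration itself.

```typescript
// Hypothetical shape of a DPE word, limited to the fields shown in this issue.
interface Word {
  text: string;
  start: number;
  end: number;
}

// Keeps the whole transcript in memory and hands the editor one page at a time.
class PagedTranscript {
  private pages: Word[][] = [];

  constructor(words: Word[], pageSize: number) {
    if (pageSize < 1) {
      throw new Error("pageSize must be at least 1");
    }
    // Split the full word list into fixed-size slices, one per page.
    for (let i = 0; i < words.length; i += pageSize) {
      this.pages.push(words.slice(i, i + pageSize));
    }
  }

  get pageCount(): number {
    return this.pages.length;
  }

  // Copy of the words to load into the editor for a 0-based page index.
  getPage(page: number): Word[] {
    const current = this.pages[page];
    return current ? current.slice() : [];
  }

  // Store the edited page; it may hold more or fewer words than before.
  savePage(page: number, edited: Word[]): void {
    if (page < 0 || page >= this.pages.length) {
      throw new RangeError("page out of range: " + page);
    }
    this.pages[page] = edited.slice();
  }

  // Whole transcript, with all saved edits, ready to persist on Save.
  toWords(): Word[] {
    return this.pages.reduce<Word[]>((all, p) => all.concat(p), []);
  }
}
```

Storing each page as its own array, rather than splicing into one flat array, means an edit that changes the word count on one page cannot shift or overwrite the words of its neighbours.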

pietrop commented 3 years ago

Closing in favor of this:
https://github.com/pietrop/digital-paper-edit-electron/issues/74#issuecomment-844404845 but feel free to raise another issue if you run into it again. Please provide as much information as possible, as well as detailed steps to reproduce the issue, including sample JSON etc.