kaixxx / noScribe

Cutting-edge AI technology for automated audio transcription. A nice GUI for OpenAI's Whisper and pyannote (speaker identification)
GNU General Public License v3.0

Feature Request: SRT, WebVTT, and Plain Text Exports #25

Closed candideu closed 5 months ago

candideu commented 1 year ago

Hi there! Thank you for making this app. I found out about it through David Mbugua's review, and wanted to try it for myself.

I really like the Word+macro export, but also wanted to suggest having .srt, .webvtt, and plain .txt export options as well.

Here is an example from Audapolis:

image

image

kaixxx commented 1 year ago

Thank you for the suggestion (and for pointing me to David Mbugua's nice review, which I didn't know about). Plain txt output is definitely on my list for the next version. Maybe I can also add .srt and .webvtt, we will see.

awagner-mainz commented 6 months ago

I would be interested in feeding transcriptions into dedicated tools/platforms for multimedia corpora like exmaralda or oral-history.digital or Oral History Metadata Synchronizer (OHMS).

I have to do some more research on file formats myself and try to suggest more concrete feature requests. For the time being, I just wanted to mention this audience and use case, because it is somewhat different from the QDA one.

Maybe it turns out that something like srt or even whisper's own json format is already sufficient if you have a good workflow around it.

Is the (whisper) json format accessible in some way?

berndmoos commented 6 months ago

The latest EXMARaLDA previews (https://exmaralda.org/en/preview-version/) have an import option for Whisper's JSON. (https://exmaralda.org/en/2023/01/15/exmaralda-and-automatic-speech-recognition-asr/) There may be some remaining issues with robustness towards Whisper's sometimes unclean output, but in principle, this is ready-to-use. I am actively working on improving this and also observing noScribe development, so I'd be happy to participate in this issue :-)

kaixxx commented 6 months ago

@awagner-mainz & @berndmoos Thank you for the suggestions. I would be happy to support this. Outputting the JSON from Whisper is not a big problem. But this would then be without speaker separation, just a continuous text. If you also need speaker information, SRT or VTT would be the better format, I guess. The latest EXMARaLDA preview should support this too, as far as I understand.

berndmoos commented 6 months ago

Yes, there is both an SRT and a VTT import in EXMARaLDA. Whatever noScribe can output (I don't have it installed myself), I can try to import. If / as soon as someone can provide me with a set of example files (no audio required), I can put some work into this.

https://github.com/Exmaralda-Org/exmaralda/issues/463

kaixxx commented 6 months ago

@berndmoos I think instead of adding the particular output format of noScribe to EXMARaLDA, the better approach would be to improve the support of noScribe for standardized output formats like SRT. This was on my list anyways. I can have a look next week, ok?

kaixxx commented 5 months ago

I have now added plain text and WebVTT output to noScribe. It's already in the source code but I did not make an official release yet (installer etc.).

@berndmoos Can you test if you can import it into EXMARaLDA? I've attached an example file. Ein Gespräch mit Heikedine Körting.zip

It's the first 2 minutes of this interview: https://www.youtube.com/watch?v=ap_xtj1kPF0 I often use this as a test case because it has a lot of back and forth between the speakers and some overlapping speech as well. Yes, there are some errors in the transcript, e.g. "Sehjungfrau" instead of "Seejungfrau", a typical error for an AI transcript.

I have implemented the WebVTT export in such a way that every segment has a voice-tag attached, identifying the speaker (Speaker 0 or <v S00> in this case):

4
00:00:30.120 --> 00:00:30.940
<v S00>Ist das herrlich.

This should allow you to import the segments into different speaker-tracks in EXMARaLDA.

Note that the voice tag can also be missing, e.g. if the user chooses to turn off speaker identification in noScribe:

4
00:00:30.120 --> 00:00:30.940
Ist das herrlich.
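For anyone consuming these files, here is a minimal sketch of how a cue payload could be split into speaker and text, covering both the tagged and the untagged case shown above. It is not code from noScribe or EXMARaLDA; the function name `split_voice_tag` and the regular expression are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Matches a leading WebVTT voice tag such as "<v S00>".
# The annotation after "v" is treated as the speaker label.
VOICE_TAG = re.compile(r"^<v(?:\.[^\s>]+)?\s+([^>]+)>")


def split_voice_tag(payload: str) -> Tuple[Optional[str], str]:
    """Return (speaker, text); speaker is None if the cue has no voice tag."""
    match = VOICE_TAG.match(payload)
    if match is None:
        return None, payload.strip()
    # Drop the optional closing tag, which WebVTT allows but does not require.
    text = payload[match.end():].replace("</v>", "")
    return match.group(1).strip(), text.strip()


print(split_voice_tag("<v S00>Ist das herrlich."))
print(split_voice_tag("Ist das herrlich."))
```

Returning `None` for the speaker keeps the untagged case explicit, so an importer can decide for itself whether to put such cues on a default track.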

There is also some metadata in the header of the VTT file.

In order to change the output to VTT or plain text in noScribe, you have to choose this file format in the file dialog for the output file. (But as I said, for now it's only in the source code, not yet released.)

BTW: This is how the same transcript looks in Word:

HeidedineWord

...and as plain text:

HeidedineTxt

berndmoos commented 5 months ago

That looks good. EXMARaLDA will import timestamps and text right away. It currently ignores the voice tags, but I'll take care of that. Won't be long :-)

image

berndmoos commented 5 months ago

The latest EXMARaLDA Windows preview will now import VTT and assign tiers to speakers according to the voice tags:

image

This includes fixes to the VTT library: https://github.com/noophq/subtitle/pull/29

kaixxx commented 5 months ago

Very nice. EXMARaLDA is also a great tool for me to check how precisely noScribe works. I can already see that I could improve the timecodes in cases of overlapping speech. What about also importing the path to the audio from the vtt? Right now, it is in the info block at the top. But parsing it from there is not ideal since the label ("Audiodatei: " in this case) might change depending on the language of the UI. I could add a dedicated line with a label that will never change.

berndmoos commented 5 months ago

There doesn't seem to be a straightforward mechanism in VTT to embed a link to a video / audio. We could agree to have something like:

NOTE
media: c:\documents\interview.wav

However, since I am parsing the VTT with an external library, that would be a bit awkward to implement. I think I'd rather try:

1) Check if there is an audio or video file with the same name in the same directory. If yes, take that.
2) If not, ask the user right after import if they want to link an audio/video file.
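The first of these two steps could be sketched roughly as follows. This is only an illustration in Python, not EXMARaLDA code; the function name `find_sibling_media` and the list of extensions are assumptions.

```python
from pathlib import Path
from typing import Optional

# Assumed set of media extensions to probe; adjust as needed.
MEDIA_EXTENSIONS = (".wav", ".mp3", ".m4a", ".mp4", ".ogg", ".flac")


def find_sibling_media(vtt_path: str) -> Optional[Path]:
    """Look for a media file with the same base name next to the VTT file."""
    vtt = Path(vtt_path)
    for ext in MEDIA_EXTENSIONS:
        candidate = vtt.with_suffix(ext)
        if candidate.is_file():
            return candidate
    # Nothing found: the caller would fall back to asking the user.
    return None
```

A `None` result corresponds to the second step, where the user is asked to link a file manually.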

kaixxx commented 5 months ago

Ok, I've now included the following two additional lines in the vtt output:

<empty line>
NOTE media: path/to/media/file.wav

You can decide if you want to use it or not. Note that I always use the forward slash / in paths, since this works in macOS, Linux and Windows as well.
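As a rough illustration of how this line could be written and read back, here is a small Python sketch. It is not the actual noScribe implementation; the helper names `media_note` and `read_media_path` are hypothetical, and only the `NOTE media:` label and the forward-slash convention come from the description above.

```python
from typing import Optional

MEDIA_PREFIX = "NOTE media:"


def media_note(path: str) -> str:
    """Build the NOTE line, converting backslashes to forward slashes."""
    return f"{MEDIA_PREFIX} {path.replace(chr(92), '/')}"


def read_media_path(vtt_text: str) -> Optional[str]:
    """Return the path from the first 'NOTE media:' line, or None if absent."""
    for line in vtt_text.splitlines():
        if line.startswith(MEDIA_PREFIX):
            return line[len(MEDIA_PREFIX):].strip() or None
    return None
```

Because the label is fixed and independent of the UI language, a plain prefix match is enough here; no knowledge of the localized info block is needed.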

New short example file: vtt2.zip

I'm going to close this now. A new release will be ready in a couple of days.