stevenwaterman / Lexoral

The Transcription Tool You Wish Existed
https://lexoral.com
GNU General Public License v3.0

Cannot extract audio from non-streamable video codecs #66

Closed stevenwaterman closed 2 years ago

stevenwaterman commented 2 years ago

We currently re-encode the uploaded files using ffmpeg on Google Cloud Functions. To avoid storing the entire file in memory, we use read and write streams to do the encoding. However, this doesn't work for all videos. Specifically, the issue seems to occur when the audio track is Stream #0:1 instead of Stream #0:0 according to ffprobe.
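
For reference, our current approach looks roughly like this (a simplified sketch; the bucket names and `objectName` are illustrative, and the real code differs):

```ts
import ffmpeg from "fluent-ffmpeg";
import { Storage } from "@google-cloud/storage";

// Simplified sketch of the streaming transcode: pipe the upload straight
// through ffmpeg without ever holding the whole file in memory.
const storage = new Storage();
const objectName = "example-upload.mp4"; // illustrative; really comes from the storage trigger

const input = storage.bucket("lexoral-uploads").file(objectName).createReadStream();
const output = storage.bucket("lexoral-audio").file(`${objectName}.flac`).createWriteStream();

ffmpeg(input)
  .noVideo()            // drop any video track, keep the audio
  .audioCodec("flac")
  .format("flac")
  .on("error", err => console.error("transcode failed", err))
  .pipe(output, { end: true });
```

It's this streaming setup that falls over on the problematic files.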

stevenwaterman commented 2 years ago

Essentially there's header data at the end of the file that we need but can't read when streaming: https://github.com/fluent-ffmpeg/node-fluent-ffmpeg/issues/823

The only way to really solve this is to have the whole video available to us. That means either transcoding in the browser, or swapping to something like Cloud Run and downloading the file to disk.

stevenwaterman commented 2 years ago

Using the Web Audio API to transcode isn't really viable at the moment because there's no way to prevent it resampling the audio, which makes it harder to transcribe. We can maybe just live with that for now, and then once https://github.com/WebAudio/web-audio-api/issues/30 is fixed we can change it behind the scenes.
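
For context, the problem is that decodeAudioData always resamples to the context's sample rate, so the file's native rate is lost. A minimal illustration:

```ts
// decodeAudioData resamples everything to the AudioContext's sample rate,
// with no option to keep the file's native rate.
async function decodedSampleRate(file: File): Promise<number> {
  const ctx = new AudioContext();                    // typically 44100 or 48000 Hz
  const decoded = await ctx.decodeAudioData(await file.arrayBuffer());
  return decoded.sampleRate;                         // always ctx.sampleRate, whatever the file was
}
```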

guest271314 commented 2 years ago

If I understand the issue correctly, you are trying to extract audio from a recording?

stevenwaterman commented 2 years ago

Note that this is more of a personal bug tracker for Lexoral than a place to report issues with any of the underlying APIs.

I'm trying to get the raw audio data from a local video file (eventually just any local media file) so that it can be uploaded to a server. I'm currently in the process of switching over to the new Web Codecs API, which should solve my issues.

guest271314 commented 2 years ago

Locally you can utilize mkvmerge. The last time I checked, Firefox and Chromium set the audio and video tracks differently; see

https://github.com/guest271314/native-messaging-mkvmerge/blob/d506fae2d8423999ec092830b1df3d4c331be6ee/app/native-messaging-mkvmerge.js#L216-L229

I'm currently in the process of switching over to the new Web Codecs API which should solve my issues

I would be hesitant to make such a proclamation. The WebCodecs API has issues itself. Although claims are made in the Web Audio issues that WebCodecs solves this or that problem, that is simply not the case; e.g., when the Opus codec is used, WebCodecs never resamples back to the original input sample rate: https://bugs.chromium.org/p/chromium/issues/detail?id=1260519.

stevenwaterman commented 2 years ago

I need something that can run in the browser; this is a public-facing app. I've had to rule out using the Web Audio API because it has poor support for both streaming audio and custom sample rates (especially both at the same time), which I need. There's no way for me to get the raw audio from a 4-hour file at an 8000 Hz sampling rate without crashing the browser. If the Web Codecs API produces more issues then I should at least be able to work around them, and the Web Audio API maintainers seem pretty clear that dealing with raw audio data means you should be looking at the Web Codecs API instead.

guest271314 commented 2 years ago

Yes, claims are made about the WebCodecs API that have not been vetted. Try it for yourself and see. "WebCodecs solves ..." is an incomplete claim in and of itself.

You can certainly stream audio with AudioWorklet or MediaStreamAudioDestinationNode. I have not looked at your code, though it appears interesting.

Technically you can serve a file using ServiceWorker and stream to HTMLMediaElement.

Re

can run in the browser, this is a public-facing app

Native Messaging provides the capability to run any local shell scripts or native applications on the local machine and get the result in the browser.

There's no way for me to get the raw audio from a 4 hour file at 8000Hz sampling rate without crashing the browser.

Yes, you can. Serve the data in chunks and stream using MediaStreamTrackGenerator.

I will help if I am able. What precisely is the requirement?

stevenwaterman commented 2 years ago

I'm not sure which of us is confused re: your suggestion of streaming to a MediaStreamAudioDestinationNode - I'm not trying to play the audio back locally, but rather to upload it to a server. If I've misunderstood what you mean then please do correct me. I can already load the audio into an HTMLAudioElement just by creating an object URL and using that as the source.

run any local shell scripts or native applications

When I say external users I mean customers on their personal computers at home where I can make no guarantees about what programs they have installed. I'm restricted to browser APIs and anything I can load as WASM.

MediaStreamTrackGenerator

I've actually used MediaStreamTrackProcessor a little already, and in my head it was part of the Web Codecs API. However, as far as I can see there's no polyfill available for it, so I'd have to rule it out on the basis that it's only supported in Chrome.
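
For reference, what I had was roughly this (Chrome-only, and from memory, so treat it as a sketch):

```ts
// Chrome-only sketch: play the file in a media element, capture the stream,
// and read raw AudioData chunks off the audio track as it plays.
async function readAudioChunks(file: File, onChunk: (chunk: AudioData) => void) {
  const video = document.createElement("video");
  video.src = URL.createObjectURL(file);
  await video.play();

  const [track] = (video as any).captureStream().getAudioTracks();
  const processor = new MediaStreamTrackProcessor({ track });
  const reader = processor.readable.getReader();

  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    onChunk(value);   // raw samples, one small chunk at a time
    value.close();
  }
}
```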

I've put the requirements below, and I'm grateful for any direction you can provide. It does seem like the Web Codecs API is the solution, but maybe your different perspective will come up with something else.


Requirements:

At the highest level, the requirement is that a user comes to the website with a media file to upload (it could be audio or video, any codec, any sample rate, any number of channels; we frankly know nothing about it). We can make no guarantees* about which browser they are using or what they have installed locally; the site is for third-party users. Optionally we do some kind of processing on the file before upload. They upload it to our servers. Optionally we do some kind of processing on the file after upload. At the end, we have an audio file on our servers.

* I'm not trying to support every browser - I don't care about IE, for example - but an up-to-date version of Firefox should be able to upload files even if it's a slightly degraded experience.

Non-functional requirements are:

guest271314 commented 2 years ago

When I say external users I mean customers on their personal computers at home where I can make no guarantees about what programs they have installed. I'm restricted to browser APIs and anything I can load as WASM.

You can provide the applications that will be used for Native Messaging.

I've actually used MediaStreamTrackProcessor a little already, and in my head it was part of the Web Codecs API. However as far as I can see there's no polyfill available for it, so I'd have to rule it out on the basis of only being supported in Chrome.

This is the proposal https://github.com/alvestrand/mediacapture-transform.

What part of the requirements are you having issues with?

guest271314 commented 2 years ago

Are you essentially providing an STT (Speech To Text) service, or extraction of lyrics, etc.?

stevenwaterman commented 2 years ago

Yes, Lexoral is an STT transcription service at its core. My issue with the requirements is that they rule out a lot of possible avenues. The only one left that I can see is to transcode the local file to FLAC in the browser. I had this working with OfflineAudioContext.decodeAudioData followed by libflacjs, but it meant that the whole file was stored in memory (multiple times!). My search for a stream-based replacement for decodeAudioData led me to the Web Codecs API, so I was in the process of swapping over to that.
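
The rough shape of what I'm moving to is below. It's only a sketch: WebCodecs doesn't parse containers, so pulling the EncodedAudioChunks out of the MP4 still needs a demuxer (e.g. mp4box.js), and the codec string and track details here are placeholders.

```ts
// Sketch: decode one chunk at a time with WebCodecs instead of holding the
// whole decoded file in memory like decodeAudioData does.
const decoder = new AudioDecoder({
  output: (audio: AudioData) => {
    // Copy out each decoded chunk and hand it to the FLAC encoder / uploader.
    for (let ch = 0; ch < audio.numberOfChannels; ch++) {
      const plane = new Float32Array(audio.numberOfFrames);
      audio.copyTo(plane, { planeIndex: ch, format: "f32-planar" });
      // ...feed `plane` into libflacjs here
    }
    audio.close();
  },
  error: (e) => console.error(e),
});

decoder.configure({
  codec: "mp4a.40.2",   // placeholder: really comes from the demuxer's track info
  sampleRate: 44100,
  numberOfChannels: 2,
});

// For each encoded sample the demuxer produces:
//   decoder.decode(new EncodedAudioChunk({ type, timestamp, data }));
// and finally:
//   await decoder.flush();
```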

guest271314 commented 2 years ago

You can certainly create one or more WAV files in the browser without WASM, then if necessary reassemble the WAV files into a single file on the server.

stevenwaterman commented 2 years ago

Yes, that would work. How would you propose I create multiple WAV files from a single local file without decoding (and storing in memory) the whole file at once? The hard part isn't the uploading, or even the encoding; it's performing the decoding with streams.

guest271314 commented 2 years ago

See https://plnkr.co/edit/1yQd8ozGXlV9bwK6?preview. Every N seconds, send the chunk to the server. You essentially wind up with S16LE or just raw floats on the server.
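
In outline, something like this (a sketch; the upload endpoint is made up):

```ts
// Sketch: convert a Float32 chunk of N seconds of audio to S16LE and POST it.
function floatToS16LE(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length); // typed arrays are little-endian on all common platforms
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

async function sendChunk(chunk: Float32Array, index: number): Promise<void> {
  await fetch(`/upload?chunk=${index}`, {     // made-up endpoint
    method: "POST",
    body: floatToS16LE(chunk).buffer,
  });
}
```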

stevenwaterman commented 2 years ago

This doesn't really help, because the hard part of getting the raw audio data is done with MediaStreamTrackProcessor, which is only available in Chrome, and there are no polyfills I can use.

guest271314 commented 2 years ago

See the WavAudioEncoder class. That issue is about WebCodecs not decoding back to the original input sample rate, which is important for these types of applications (TTS, STT), as we generally want to preserve content without the ability of someone in the chain saying "it's different".

If you want to support Chrome and Firefox, use AudioWorklet.
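
E.g. a minimal processor that just forwards the raw input samples to the main thread (a sketch):

```ts
// capture-processor.ts - registered via audioContext.audioWorklet.addModule(...).
// Forwards each block of raw input samples to the main thread over the port.
class CaptureProcessor extends AudioWorkletProcessor {
  process(inputs: Float32Array[][]): boolean {
    const input = inputs[0];
    if (input && input.length > 0) {
      // Copy the channel data: the underlying buffers are reused between calls.
      this.port.postMessage(input.map(channel => channel.slice(0)));
    }
    return true; // keep the processor alive
  }
}
registerProcessor("capture-processor", CaptureProcessor);
```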

From your description the user already has the file, thus there is not really any encoding to be done. The user enters the site with a WAV file; if it is not a WAV, you create one or more, then upload.

guest271314 commented 2 years ago

Your application only deals with files, right, not live streaming input?

guest271314 commented 2 years ago

FYI you could probably just use PocketSphinx https://github.com/syl22-00/pocketsphinx.js entirely in the browser.

stevenwaterman commented 2 years ago

Yes, it's just local files. Eventually it'd be nice to support livestreaming input, but that's not a concern at the minute. I don't really see how WavAudioEncoder solves the problem - it looks like it's just getting fed data from the MediaStreamTrackProcessor/Generator classes, which is the problem I'm trying to solve.

AudioWorklet would be pretty straightforward, but how can I actually get the Web Audio API to call it? I can't call decodeAudioData, and I can't see any way to use a media source in the OfflineAudioContext.

I'm currently using the Google Cloud Speech API's enhanced model, and even that is barely good enough. I'm not willing to consider any regression in transcription quality like PocketSphinx; that defeats the point of Lexoral existing.

guest271314 commented 2 years ago

A WAV file after the first 44 bytes can be represented by just a series of TypedArrays. Simply slice the data and send it to the server using fetch(). AudioWorklet is more for live-streaming, to read/send data to the server in "real-time". WavAudioEncoder can be used if your server requires WAV files, though your server should support raw PCM.
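
For reference, that 44-byte header is just the following (a sketch for 16-bit PCM):

```ts
// Build the canonical 44-byte header for a 16-bit PCM WAV file; everything
// after it is raw little-endian samples, which is why it can be sliced freely.
function wavHeader(sampleRate: number, channels: number, dataBytes: number): ArrayBuffer {
  const buf = new ArrayBuffer(44);
  const v = new DataView(buf);
  const str = (off: number, s: string) =>
    [...s].forEach((c, i) => v.setUint8(off + i, c.charCodeAt(0)));
  str(0, "RIFF");
  v.setUint32(4, 36 + dataBytes, true);               // RIFF chunk size
  str(8, "WAVE");
  str(12, "fmt ");
  v.setUint32(16, 16, true);                          // fmt chunk size
  v.setUint16(20, 1, true);                           // audio format: PCM
  v.setUint16(22, channels, true);
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, sampleRate * channels * 2, true);   // byte rate
  v.setUint16(32, channels * 2, true);                // block align
  v.setUint16(34, 16, true);                          // bits per sample
  str(36, "data");
  v.setUint32(40, dataBytes, true);                   // data chunk size
  return buf;
}
```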

guest271314 commented 2 years ago

Technically, if you want to compress, you can use opusenc and opusdec - which do decompress to the original sample rate.

guest271314 commented 2 years ago

Have you compared PocketSphinx to Google services?

stevenwaterman commented 2 years ago

Simply slice the data and send it to the server

You keep avoiding the bit I am actually having issues with - how can I get the raw audio data without using the MediaStreamTrack classes? Everything else I've already had working in the past.

guest271314 commented 2 years ago

If you slice off the first 44 bytes of the WAV file, you can slice the rest into any number of discrete chunks of raw PCM and send them to the server.

stevenwaterman commented 2 years ago

I have an MP4 file, not raw PCM. I can't just stick a WAV header on an MP4 and have it work

stevenwaterman commented 2 years ago

In this whole pipeline, the only bit I need help with is the [any media file, e.g. MP4] -> PCM step.

guest271314 commented 2 years ago

Why is MP4 used if you are only transcribing audio? Make sure the user only uploads WAV files. You can convert MP4 to WAV in the browser.

guest271314 commented 2 years ago

There is an ffmpeg.wasm. I prefer using Native Messaging to avoid the WASM memory usage.

guest271314 commented 2 years ago

I have to go to a gig. I'll be around later.

stevenwaterman commented 2 years ago

You can convert MP4 to WAV in the browser

How? This is the bit I need help with.

I was looking at ffmpeg.wasm, but it just seems like a worse version of the Web Codecs API (with polyfills that fall back to a WASM transcoding lib).

stevenwaterman commented 2 years ago

Why is MP4 used if you are only transcribing audio? Make sure the user only uploads WAV files

The users are not technical, and it's a lot easier for them to just click to upload a file and have it be transcribed, even if it's a video. Asking users to download Audacity and extract the audio from a video themselves will turn them away.

guest271314 commented 2 years ago

See https://github.com/ffmpegwasm/ffmpeg.wasm
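
E.g. something along these lines with the 0.x API (a sketch; the file names and flags are illustrative):

```ts
import { createFFmpeg, fetchFile } from "@ffmpeg/ffmpeg";

// Sketch: extract the audio track from any container to a mono 8 kHz WAV,
// entirely in the browser.
const ffmpeg = createFFmpeg({ log: true });

async function extractAudio(file: File): Promise<Uint8Array> {
  if (!ffmpeg.isLoaded()) await ffmpeg.load();
  ffmpeg.FS("writeFile", "input.mp4", await fetchFile(file));
  await ffmpeg.run("-i", "input.mp4", "-vn", "-ac", "1", "-ar", "8000", "output.wav");
  return ffmpeg.FS("readFile", "output.wav");
}
```

Note the input file sits in the WASM filesystem, i.e. in memory - that is the memory usage I mentioned above.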

guest271314 commented 2 years ago

I was looking at ffmpeg.wasm, but it just seems like a worse version of the Web Codecs API (with polyfills that fall back to a WASM transcoding lib)

Not sure what you mean by "seems"? Have you tested?

guest271314 commented 2 years ago

You just need to extract the audio to raw PCM, correct? Codecs are not really applicable.

stevenwaterman commented 2 years ago

"Seems like a worse version" because if I use the WebCodecs API then I get native performance when it's supported, and worse WASM performance when it's not supported. Using ffmpeg.wasm I get WASM performance everywhere. On top of that, WebCodecs API support will only get better over time.

Codecs are applicable, because the audio is encoded using a codec - I need a decoder.

guest271314 commented 2 years ago

On top of that, WebCodecs API support will only get better over time.

I am doubtful of that, because I have tested the API.

Tests verify everything.

I suggest you test before making conclusions.

stevenwaterman commented 2 years ago

I tested it with the WebCodecs API. It works. I appreciate you trying to help but your advice just made my life harder. I'm going to be locking this issue now.