Closed stevenwaterman closed 2 years ago
Essentially there's header data at the end of the file that we need but can't read https://github.com/fluent-ffmpeg/node-fluent-ffmpeg/issues/823
The only way to really solve this is to have the whole video available to us. That means either transcoding in the browser, or swapping to something like Cloud Run and downloading the file to disk.
Using Web Audio API to transcode isn't really viable atm because there's no way to prevent it resampling the audio which makes it harder to transcribe. We can maybe just live with that for now, and then once https://github.com/WebAudio/web-audio-api/issues/30 is fixed we can change it behind the scenes
If I understand the issue, you are trying to extract audio from a recording?
Note that this is more of a personal bug tracker for Lexoral, rather than issues with any of the underlying APIs
I'm trying to get the raw audio data from a local video file (eventually just any local media file) so that it can be uploaded to a server. I'm currently in the process of switching over to the new Web Codecs API which should solve my issues
Locally you can use `mkvmerge`. The last time I checked, Firefox and Chromium set the audio and video tracks differently.
I'm currently in the process of switching over to the new Web Codecs API which should solve my issues
I would be hesitant to make such a proclamation. The WebCodecs API has issues of its own. Claims that WebCodecs solves this or that Web Audio issue are simply not the case in general; e.g., when the Opus codec is used, WebCodecs never resamples back to the original input sample rate: https://bugs.chromium.org/p/chromium/issues/detail?id=1260519.
I need something that can run in the browser; this is a public-facing app. I've had to rule out the Web Audio API because it has poor support for both streaming audio and custom sample rates (especially both at the same time), which I need. There's no way for me to get the raw audio from a 4 hour file at 8000Hz sampling rate without crashing the browser. If WebCodecs produces more issues then I should at least be able to work around them, and the Web Audio API maintainers seem pretty clear that if you're dealing with raw audio data, you should be looking at the WebCodecs API instead.
Yes, claims are made about the WebCodecs API that have not been vetted. Try it for yourself and see. "WebCodecs solves ..." is an incomplete claim in and of itself.
You can certainly stream audio with `AudioWorklet` or `MediaStreamAudioDestinationNode`. I have not looked at your code, though it appears interesting. Technically you can serve a file using a `ServiceWorker` and stream it to an `HTMLMediaElement`.
Re:

> can run in the browser, this is a public-facing app
Native Messaging provides the capability to run any local shell scripts or native applications on the local machine and get the result in the browser.
> There's no way for me to get the raw audio from a 4 hour file at 8000Hz sampling rate without crashing the browser.
Yes, you can. Serve the data in chunks and stream it using `MediaStreamTrackGenerator`.
I will help if I am able. What precisely is the requirement?
I'm not sure which of us is confused re: you suggesting streaming to a `MediaStreamAudioDestinationNode` - I'm not trying to play the audio back locally, but rather to upload it to a server. If I've misunderstood what you mean then please do correct me. I can already load the audio into an `HTMLAudioElement` just by creating an object URL and using that as the source.
> run any local shell scripts or native applications
When I say external users I mean customers on their personal computers at home where I can make no guarantees about what programs they have installed. I'm restricted to browser APIs and anything I can load as WASM.
> MediaStreamTrackGenerator

I've actually used `MediaStreamTrackProcessor` a little already, and in my head it was part of the WebCodecs API. However, as far as I can see there's no polyfill available for it, so I'd have to rule it out on the basis of only being supported in Chrome.
I've put the requirements below, and would be grateful for any direction you can provide. It does seem like the WebCodecs API is the solution, but maybe your different perspective will come up with something else.
Requirements:
At the highest level, the requirement is that a user comes to the website with a media file to upload: could be audio, could be video - any codec, any sample rate, any number of channels; we frankly know nothing about it. We can make no guarantees* about which browser they are using or what they have installed locally, since the site is for 3rd-party users. Optionally, we do some kind of processing on the file before upload. They upload it to our servers. Optionally, we do some kind of processing on the file after upload. Now on the servers we have an audio file.
* I'm not trying to support every browser - I don't care about IE, for example - but an up-to-date version of Firefox should be able to upload files, even if it's a slightly degraded experience.
Non-functional requirements are:
> When I say external users I mean customers on their personal computers at home where I can make no guarantees about what programs they have installed. I'm restricted to browser APIs and anything I can load as WASM.
You can provide the applications that will be used for Native Messaging.
> I've actually used MediaStreamTrackProcessor a little already, and in my head it was part of the Web Codecs API. However as far as I can see there's no polyfill available for it, so I'd have to rule it out on the basis of only being supported in Chrome.
This is the proposal https://github.com/alvestrand/mediacapture-transform.
What part of the requirements are you having issues with?
Are you essentially providing a STT (Speech To Text) or extraction of lyrics, etc. service?
Yes, Lexoral is an STT transcription service at its core. My issue with the requirements is that they just rule out a lot of possible avenues. The only one left that I see is to transcode the local file to FLAC in the browser. I had this working with `OfflineAudioContext.decodeAudioData` followed by libflacjs, but it meant that the whole file was stored in memory (multiple times!). My search for a stream-based replacement for `decodeAudioData` led me to the WebCodecs API, so I was in the process of swapping over to that.
You can certainly create one or more WAV files in the browser without WASM, then if necessary reassemble the WAV files into a single file on the server.
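Creating a WAV in the browser without WASM mostly amounts to prepending a 44-byte RIFF header to raw PCM. A minimal sketch, assuming 16-bit little-endian PCM; the function name is my own, not from any library:

```javascript
// Sketch: build a minimal 44-byte PCM WAV header for 16-bit
// little-endian samples. Field layout follows the canonical RIFF/WAVE format.
function wavHeader(sampleRate, numChannels, dataLength) {
  const bytesPerSample = 2; // 16-bit PCM
  const header = new DataView(new ArrayBuffer(44));
  const writeStr = (offset, s) =>
    [...s].forEach((c, i) => header.setUint8(offset + i, c.charCodeAt(0)));
  writeStr(0, "RIFF");
  header.setUint32(4, 36 + dataLength, true);      // file size minus 8
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  header.setUint32(16, 16, true);                  // fmt chunk size
  header.setUint16(20, 1, true);                   // audio format: PCM
  header.setUint16(22, numChannels, true);
  header.setUint32(24, sampleRate, true);
  header.setUint32(28, sampleRate * numChannels * bytesPerSample, true); // byte rate
  header.setUint16(32, numChannels * bytesPerSample, true);              // block align
  header.setUint16(34, 16, true);                  // bits per sample
  writeStr(36, "data");
  header.setUint32(40, dataLength, true);
  return new Uint8Array(header.buffer);
}
```

Concatenate this header with the PCM bytes (e.g. via `Blob`) and you have a playable WAV, no encoder library required.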
Yes, that would work. How would you propose I create multiple WAV files from a single local file without decoding (and storing in memory) the whole file at once? The hard part isn't the uploading, or even the encoding; it's performing the decoding with streams.
See https://plnkr.co/edit/1yQd8ozGXlV9bwK6?preview. For every N seconds, send to the server. You essentially wind up with S16LE or just raw floats on the server.
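For completeness, converting the Float32 samples that Web Audio hands you into S16LE is only a few lines; a sketch with a helper name of my own:

```javascript
// Sketch: convert Web Audio Float32 samples (range -1..1) into
// interleaved signed 16-bit little-endian PCM (S16LE).
function floatToS16LE(float32Samples) {
  const out = new DataView(new ArrayBuffer(float32Samples.length * 2));
  float32Samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    // Scale asymmetrically so -1 maps to -32768 and +1 to 32767.
    out.setInt16(i * 2, clamped < 0 ? clamped * 32768 : clamped * 32767, true);
  });
  return new Uint8Array(out.buffer);
}
```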
This doesn't really help, because the hard part of getting the raw audio data is done with `MediaStreamTrackProcessor`, which is only available in Chrome, and there are no polyfills I can use.
See the `WavAudioEncoder` class. That issue is about WebCodecs not decoding back to the original input sample rate, which is important for these types of applications (TTS, STT), as we generally want to preserve content without the ability of someone in the chain saying "it's different".
If you want to support Chrome and Firefox, use `AudioWorklet`.
From your description the user already has the file, thus there is no encoding to really be done. The user enters the site with a WAV file; if it's not a WAV, you create one or more, then upload.
Your application only deals with files, right, not live streaming input?
FYI you could probably just use PocketSphinx https://github.com/syl22-00/pocketsphinx.js entirely in the browser.
Yes, it's just local files. Eventually it'd be nice to support livestreaming input, but that's not a concern at the minute. I don't really see how `WavAudioEncoder` solves the problem - it looks like it's just being fed data from the MediaStreamTrackProcessor/Generator classes, which is the problem I'm trying to solve.
`AudioWorklet` would be pretty straightforward, but how can I actually get the Web Audio API to call it? I can't call `decodeAudioData`, and I can't see any way to use a media source in the `OfflineAudioContext`.
I'm currently using the Google Cloud Speech API's enhanced model, and even that is barely good enough. I'm not willing to consider any regression in transcription quality, like PocketSphinx; that defeats the point of Lexoral existing.
A WAV file, after the first 44 bytes, can be represented as just a series of TypedArrays. Simply slice the data and send it to the server using `fetch()`. `AudioWorklet` is more for live streaming, reading/sending data to the server in "real time". `WavAudioEncoder` can be used if your server requires WAV files, though your server should support raw PCM.
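Assuming the browser already holds (or incrementally produces) WAV bytes, the slicing step might look like the following; the `/upload` endpoint and function name are hypothetical:

```javascript
// Sketch: strip the 44-byte WAV header and split the remaining PCM
// into fixed-size chunks suitable for sequential upload.
function pcmChunks(wavBytes, chunkSize) {
  const pcm = wavBytes.subarray(44); // skip the RIFF/WAVE header
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += chunkSize) {
    // subarray() is a view, not a copy, so this is cheap.
    chunks.push(pcm.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

// Usage (browser): upload each chunk in sequence.
// for (const chunk of pcmChunks(bytes, 1 << 16)) {
//   await fetch("/upload", { method: "POST", body: chunk });
// }
```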
Technically, if you want to compress, you can use `opusenc` and `opusdec`, which decompresses to the original sample rate.
Have you compared PocketSphinx to Google services?
> Simple slice the data and send to server
You keep avoiding the bit I am actually having issues with: how can I get the raw audio data without using the MediaStreamTrack classes? Everything else I've already had working in the past.
If you slice off the first 44 bytes of the WAV file, you can split the rest into discrete chunks - N slices of raw PCM - and send them to the server.
I have an MP4 file, not raw PCM. I can't just stick a WAV header on an MP4 and have it work. In this whole pipeline, the only bit I need help with is the [any media file, e.g. MP4] -> PCM step.
Why is MP4 used if you are only transcribing audio? Make sure the user only uploads WAV files. You can convert MP4 to WAV in the browser.
There is an ffmpeg.wasm. I prefer using Native Messaging to avoid the WASM memory usage.
I have to go to a gig. I'll be around later.
> You can convert MP4 to WAV in the browser
How? This is the bit I need help with. I was looking at `ffmpeg.wasm`, but it just seems like a worse version of the WebCodecs API (with polyfills to fall back to a WASM transcoding lib).
> Why is MP4 used if you are only transcribing audio? Make sure the user only uploads WAV files
The users are not technical and it's a lot easier for them to just click to upload a file and have it be transcribed, even if it's a video. Asking the users to download audacity and extract the audio from a video will turn them away
> I was looking at ffmpeg.wasm, but it just seems like a worse version of the audio codec api (with polyfills to fall back to a wasm transcoding lib)
Not sure what you mean by "seems" - have you tested? You just need to extract the audio to raw PCM, correct? Codecs are not really applicable.
"Seems like a worse version" because if I use the WebCodecs API then I get native performance where it's supported, and worse WASM performance where it's not. Using `ffmpeg.wasm`, I get WASM performance everywhere. On top of that, WebCodecs API support will only get better over time.
Codecs are applicable, because the audio is encoded using a codec - I need a decoder.
> On top of that, WebCodecs API support will only get better over time.
I am doubtful of that, because I have tested the API. Tests verify everything. I suggest you test before making conclusions.
I tested it with the WebCodecs API. It works. I appreciate you trying to help but your advice just made my life harder. I'm going to be locking this issue now.
We currently re-encode the uploaded files using ffmpeg on Google Cloud Functions. To prevent having to store the entire file in memory, we use read and write streams to do the encoding. However, this doesn't work for all videos. Specifically, the issue seems to arise when the audio track is Stream #0:1 instead of #0:0, according to ffprobe.
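One server-side mitigation worth trying is to select the audio stream explicitly rather than relying on ffmpeg's default stream selection; a sketch, with placeholder filenames (note that if the real problem is the moov atom sitting at the end of a piped MP4, per the linked issue, this only helps once ffmpeg can actually read the metadata):

```shell
# Explicitly map the first audio stream (whatever its index within the
# container) to the output, instead of letting ffmpeg pick by position.
ffmpeg -i input.mp4 -map 0:a:0 output.wav
```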