w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

How would Web Codecs support extracting PCM data for a specific time range? #28

Closed JohnWeisz closed 4 years ago

JohnWeisz commented 5 years ago

Hey,

As you are likely aware, the Web Audio API has a huge and painful limitation: accessing only a specific time range of audio sample data is not possible in any remotely feasible fashion without jamming the entire audio file into memory.

We are looking forward to finally putting this limitation behind us using Web Codecs. Are there any plans for somehow supporting extracting raw PCM audio data from a specific time range, say from 5 seconds to 10 seconds? (obviously given an audio file no shorter than 10 seconds for this specific example)

guest271314 commented 5 years ago

Is the requirement to extract a specific time slice from the file without reading the file?

JohnWeisz commented 5 years ago

@guest271314 Without reading the entire file, i.e. the requirement to be able to extract a short slice anywhere from an audio file that's possibly even several hours long.

guest271314 commented 5 years ago

@JohnWeisz Have considered and proposed similar functionality for video; in brief, see https://github.com/w3c/mediacapture-record/issues/166, et al.

If the structure of the file is the same throughout, it should be possible to estimate where in the file 5 seconds is and where 10 seconds is. That is, given an ArrayBuffer, subarray chunks can be extracted, then OfflineAudioContext() and AudioBufferSourceNode can be utilized to refine the result with AudioBufferSourceNode.start([when][, offset][, duration]).
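
A minimal sketch of that approach, assuming blob holds the source audio file (note that decodeAudioData() still decodes the entire input, which is exactly the cost discussed below):

const audioCtx = new AudioContext();
const arrayBuffer = await blob.arrayBuffer();
const decoded = await audioCtx.decodeAudioData(arrayBuffer);
// render only seconds 5-10 of the decoded buffer, faster than realtime
const offline = new OfflineAudioContext(decoded.numberOfChannels, 5 * decoded.sampleRate, decoded.sampleRate);
const source = offline.createBufferSource();
source.buffer = decoded;
source.connect(offline.destination);
source.start(0, 5, 5); // when = 0, offset = 5 s, duration = 5 s
const rendered = await offline.startRendering();
const pcm = rendered.getChannelData(0); // Float32Array of samples for channel 0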

JohnWeisz commented 5 years ago

@guest271314 Thanks, but unfortunately, I fail to understand how that solves the issue.

First, ArrayBuffer is already a problem, as it's in-memory storage, so you can't load the entire file, not even with a FileReader. This is part of the problem we are trying to solve, after all.

Second, since you currently have no other option than decodeAudioData, you have to load the entire file into memory to process it through OfflineAudioContext.

Technically, playing the file back through <audio> and recording a portion of it is a solution, but it doesn't support faster-than-realtime conversion, so it's ultra-slow.

guest271314 commented 5 years ago

Another option is to perform the task of creating time slices (in seconds or bytes) exactly once; then you will have the ability to serve and merge any portion of the media thereafter.

guest271314 commented 5 years ago

The entire file does not need to be loaded into memory for decodeAudioData(). You can estimate where a given time range is within the media.

The simplest approach (though it does require loading the entire file) would be to use an <audio> element with the media fragment identifier #t=5,10 to record and get a Blob of the exact time slice. Or, again, perform that task exactly once with OfflineAudioContext.startRendering() and AudioBufferSourceNode.start(), slicing the file into N second chunks which you can concatenate or merge in any arrangement required, as audio buffers can be concatenated, streamed, downloaded, etc.
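
A rough sketch of the media fragment recording approach, assuming a Chromium-based browser (captureStream() on media elements is not universally available) and blob as the source audio; note that this records in real time:

const audio = new Audio(URL.createObjectURL(blob) + "#t=5,10"); // play only seconds 5-10
const chunks = [];
let recorder;
audio.onplaying = () => {
  recorder = new MediaRecorder(audio.captureStream());
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const slice = new Blob(chunks, { type: recorder.mimeType }); // the recorded 5 second slice
  };
  recorder.start();
};
audio.onpause = () => recorder && recorder.stop(); // fragment playback pauses at t=10
audio.play();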

guest271314 commented 5 years ago

Yes, you can read the entire file using fetch() with ReadableStream(). Have requested, read and downloaded a 180MB JSON file using the same. During the read, as the bytes are analyzed, the 5 second time slice can be recognized or estimated, and the fetch can be aborted when the 10 second time slice is reached, resulting in the required 5 second time slice.
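
A sketch of that streaming read, assuming url points at the media and estimateTime(bytesRead) is a placeholder for whatever byte-offset-to-seconds estimate the format allows:

const response = await fetch(url);
const reader = response.body.getReader();
const kept = [];
let bytesRead = 0;
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  bytesRead += value.byteLength;
  const t = estimateTime(bytesRead); // hypothetical byte-offset -> seconds estimate
  if (t >= 5) kept.push(value);      // start keeping chunks around the 5 second mark
  if (t >= 10) {                     // stop reading once the 10 second mark is reached
    await reader.cancel();
    break;
  }
}
const sliceBytes = new Blob(kept);   // roughly the bytes covering seconds 5-10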

guest271314 commented 5 years ago

How do you propose to determine where a specific time slice is without reading the entire file?

By only reading the metadata parts, which could be anywhere in the file depending on which application encoded the media? Encoder parameters are not necessarily consistent between applications; e.g., the Chromium and Firefox implementations of MediaRecorder are not consistent with each other.

JohnWeisz commented 5 years ago

Yes, we precisely want to avoid reading the entire file, and instead read only some chunks from it. This is currently only possible (somewhat) using an <audio> element, but it's way too slow for general use. Hence I'm wondering whether Web Codecs has anything planned for a call like:

let pcmData = getRawPCMSamples(blob, 5, 10)

This example demonstrates a theoretical API for extracting audio PCM samples from 5 seconds to 10 seconds from a source Blob, even if the Blob is backed by a 30-hour-long audio file.

Now I understand many formats don't support fully random access behavior, and have to be at least scanned through before knowing what byte offset to even start looking at. However, that doesn't mean the entire file has to be read into memory, and especially not entirely at once.

The implementation of my proposal could handle all of this behind the scenes.

JohnWeisz commented 5 years ago

That said though, there is definitely going to be a solution if this proposal ends up supporting uncompressed output formats. You could simply convert to, say, a WAV, and then easily read chunks from that.
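
For an uncompressed WAV, that chunked read is simple arithmetic; a sketch assuming a canonical 44-byte header, 16-bit samples, and known sampleRate and channels for wavBlob:

const HEADER_BYTES = 44;                           // canonical PCM WAV header (assumed)
const bytesPerSecond = sampleRate * channels * 2;  // 2 bytes per 16-bit sample
const start = HEADER_BYTES + 5 * bytesPerSecond;
const end = HEADER_BYTES + 10 * bytesPerSecond;
const sliceBuffer = await wavBlob.slice(start, end).arrayBuffer();
const samples = new Int16Array(sliceBuffer);       // interleaved samples for seconds 5-10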

guest271314 commented 5 years ago

At

let pcmData = getRawPCMSamples(blob, 5, 10)

blob appears to be the entire file, correct?

In that case you can still utilize the Media Fragments URI specification

let slice = URL.createObjectURL(blob) + "#t=5,10"

fetch the Blob URL, convert to ArrayBuffer, pass to OfflineAudioContext().

guest271314 commented 5 years ago

How do you get blob? From a server or created locally client side?

JohnWeisz commented 5 years ago

@guest271314

Either from an <input type="file" ... /> element, or local IndexedDB, so basically client side.

guest271314 commented 5 years ago

Then the application would not know the container beforehand. The entire file would need to be read to get the metadata and/or extract the underlying audio and/or video for specific time slices of the media, unless WebCodecs develops a parser for each container that could potentially be used to extract and re-encode the required time slices of media, e.g., mkvmerge --split. However, if each time slice is encoded individually, for example in 1 second or 5 second time slices, then MediaSource

if (chunks.length) {
  const chunk = chunks.shift();
  sourceBuffer.changeType(mimeCodec); // can be the same or a different mimeCodec
  sourceBuffer.appendBuffer(chunk);
}

(or Media Fragments URI) could be used for playback (and re-recording if necessary using MediaRecorder). Note: currently in Chromium, if the resolution of the video track changes while the media is recorded using captureStream(), the tab crashes.

guest271314 commented 5 years ago

See How to use Blob URL, MediaSource or other methods to play concatenated Blobs of media fragments?

JohnWeisz commented 5 years ago

@guest271314 It's obvious, at least for most compressed formats, that the file has to be read. However, if you also want sample-level access, currently you do this by loading the entire thing into memory, which is inefficient.

If the Web Codecs proposal could read the file efficiently in its native implementation, by streaming through the file, we could get proper streaming access to PCM sample data without having to keep the entire file in memory.

JohnWeisz commented 5 years ago

So, what I propose, implementation-wise, is that:

  1. the Web Codecs API provides a streaming-based API for reading PCM audio samples from any part of a file, asynchronously, without requiring the entire file to be loaded in memory (as is currently only possible through the Web Audio API); see the sketch after this list
  2. the Web Codecs API implementations read through the file in an efficient, streaming-based approach in advance to "discover" everything required to facilitate the functionality in the first point above (such as chunks, headers, etc., depending on format)
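
Purely as an illustration of point 1 (none of these names exist in any spec), such an API could plausibly take the shape of an async iterator over small decoded chunks:

// hypothetical decodeAudioRange(blob, startSec, endSec): yields Float32Array chunks
// covering [5 s, 10 s) while the implementation demuxes/decodes natively, so only one
// small chunk at a time needs to exist in JS-visible memory
for await (const samples of decodeAudioRange(blob, 5, 10)) {
  drawWaveformSegment(samples); // hypothetical consumer, e.g. waveform visualization
}
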
guest271314 commented 5 years ago

without requiring the entire file loaded in memory

How would that be possible?

Is the requirement to only play back the media fragments? Or to also offer the individual media fragments extracted for download?

JohnWeisz commented 5 years ago

Is the requirement to only play back the media fragments? Or to also offer the individual media fragments extracted for download?

This is an interesting question. For me personally, simply having access to sample data in a Float32Array would be sufficient, but I can see other possible uses here.

For the record, my use case is accessing waveform data for navigable visualization purposes (i.e. drawing parts of a waveform).

JohnWeisz commented 5 years ago

How would that be possible?

The exact same way as e.g. playing back audio from a specific timestamp is possible, only in this case, the decoded audio data is offered in JS-compatible container objects, instead of being written to the output buffer directly.

guest271314 commented 5 years ago

Technically subarray() can be utilized

const chunk = ab.subarray(start, start + trackOffsets[index]);
sourceBuffer.appendBuffer(chunk);
start += trackOffsets[index];
++index;
// ...
if (start < ab.byteLength) {
  console.log(start, index);
  const chunk = ab.subarray(start, start + trackOffsets[index]);
  sourceBuffer.appendBuffer(chunk);
  start += trackOffsets[index];
  ++index;
}

Though how do you know where you are in the file without metadata?

The exact same way e.g. playing back audio from a specific timestamp is possible

The file metadata is required for the ability to seek. You could use range requests, though metadata is still needed, and this is client-side.

guest271314 commented 5 years ago

You could test and estimate the total duration and where you want to extract from if you are only dealing with Float32Array and use subarray(). Though does the media also contain video?

JohnWeisz commented 5 years ago

The file metadata is required for the ability to seek. You could use range requests, though still metadata is needed and this is client-side.

Yes, however, the metadata can be mapped without requiring the entire file to be loaded in memory.

That said though, I think we are somewhat misunderstanding each other; the problem here is a missing web API. We want to get random access to audio sample data, and we don't want to load the entire audio into memory.

guest271314 commented 5 years ago

You can use subarray() to get only portions of the file. Again, test the extraction points to determine where you are relative to the start time and end time of the extracted portions.

JohnWeisz commented 5 years ago

The problem is that subarray is a method on typed-array views over an ArrayBuffer, which by definition is the in-memory container for the file. So the entire file is already in memory by the time you can even use subarray, at which point you are not really gaining anything.

JohnWeisz commented 5 years ago

In case it isn't clear why this is a problem:

Imagine you have a 20 hour long audio file, and you want to get 2 minutes of sample data from it somewhere in the middle. Currently, you can either:

  1. use AudioContext.decodeAudioData to load the entire 20 hours decoded into an AudioBuffer, most likely crashing the runtime because not many devices have enough RAM to handle this monster (roughly 25 GB for 44.1 kHz stereo float32: 20 h × 3600 s/h × 44,100 samples/s × 2 channels × 4 bytes ≈ 25.4 GB, all in RAM)
  2. use HTMLAudioElement to run the playback for 2 minutes and record the playback output

Neither of these is convenient, or even feasible in most cases.

guest271314 commented 5 years ago

The restriction of

without jamming the entire audio file into memory

is not technically possible, particularly with a Blob, which could already be stored in memory (see this answer at Where is Blob binary data stored?) or on hard disk, meaning in either case the entire file is already accessible at random points.

If the media is finite, which a 20 hour media file is, and the encoded media is consistent then the exact points which need to be accessed can be calculated mathematically.

The rudimentary algorithm would be to divide the total duration by the number of total frames. Once the partitions are known you can extract any given part of the set mathematically. If the encoded media is variable the exact portions can still be extracted by compensating for the differences between discrete variable rate encoded media.

Thus, the concept of

we precisely want to avoid reading the entire file

is not viable in the first instance as a Blob is already the entire file. Perhaps clarification of exactly what is meant by that term is necessary.

Yes, the simple solution would still be to use Media Fragments URI with an <audio> or <video> element, or AudioContext createBufferSource() with the appropriate values set at start(), and record the media fragments with MediaRecorder.

The alternative solution would be to calculate the total number of samples or frames within the finite set of samples or frames, then extract the required time slices mathematically.

For example,

const frames = [];
// metadata for the current set of current frames given known variable input frames
frames.push([{
  duration: video.currentTime,
  frameRate: 0,
  width: video.videoWidth,
  height: video.videoHeight
}]);
// current frame, 30 frames per second
// adjustable depending on expected resulting file size
// could be 60 frames per second
frames[frames.length - 1].push(canvas.toDataURL("image/webp"));
// get duration of the current set of variable frames
const currentFrames = frames[frames.length - 1];
const [frame] = currentFrames;
frame.duration = video.currentTime - frame.duration;
frame.frameRate = (frame.duration * 60) / currentFrames.length - 1;
// write the frames at the calculated frame duration
for (const frame of frames) {
  const [{
    duration, frameRate, width, height
  }] = frame;
  console.log(frameRate);
  const framesLength = frame.length;
  const frameDuration = Math.ceil((duration * 1000) / framesLength);
  for (let i = 1; i < framesLength; i++) {
    videoWriter.addFrame(frame[i], frameDuration, width, height);
  }
}

https://plnkr.co/edit/Inb676?p=info

Alternatively the frame duration can be calculated dynamically, instead of storing the frames separately, concept courtesy of @thenickdude

let firstFrame = false;
const inputSampleRate = 30;

let
  frameStartTime = 0,
  captureTimer = null;

const
  flushFrame = (video) => {
    let
      now = video.currentTime;

    if (frameStartTime) {
      WebmWriter.addVideoFrame(
        canvas.toDataURL("image/webp").split(",").pop()
      , (now - frameStartTime) * 1000
      );
    }

    frameStartTime = now;
  };
  // ...
  frameStartTime = 0;
  const captureFrame = () => {
    captureTimer = null;
    flushFrame(video);
    ctx.drawImage(video, 0, 0);
    captureTimer = setTimeout(captureFrame, 1000 / inputSampleRate);
  };
  captureFrame();

https://plnkr.co/edit/ThXd9MKYvEYq2kKyh8oc?p=preview

Each approach outputs similar results. One reads all the frames and calculates variable frame duration first then writes the file. One reads and writes the frames at the same time.

Given the current requirement you can calculate where

2 minutes of sample data from it somewhere in the middle

mathematically, set that estimated index as start and use the previously calculated frame or sample duration to determine the ending index. Set the included values to a separate ArrayBuffer, AudioBuffer, Blob, etc.

Using either example it is possible to get the total number of frames in the media, or the Float32Array. The total (variable) duration is already known. Given a constant frame or sample rate you can extract any part of the Blob, for example using blob.slice(start, end, contentType), as a Blob representation of a Float32Array will have the same size as the Float32Array's byteLength.

The Blob could be converted to an ArrayBuffer with Blob.arrayBuffer() then passed to AudioContext.decodeAudioData(). Which brings us back to the original starting point. The Blob could already be in memory or referencing a file on disk. Therefore using OfflineAudioContext() startRendering() with specific time slices passed to source buffer start() is a reasonable solution.

If what you are proposing is for an API to parse any file potentially containing any content and any possible variable sample rate or frame rate, the program has no way to determine what the file contains without reading the entire file.

The question then becomes what the most efficient means of extracting specific time slices of media from a media file containing unknown content is. "Efficient" is a difficult term to substantiate. All tests need to be performed on the same machine to have any significance. Online tests of "efficiency" are useless without the complete hardware and software listed. Even then the results could vary substantially due to technical limitations wholly unrelated to the program. "Efficient" would need to be clearly defined for the proposal, and exactly how "efficiency" is evaluated.

Unless you are suggesting that there is a means to extract specific time slices from a file containing unknown media content, though having a finite content length, without reading the entire file? If so, can you describe the algorithm that you suggest would achieve that requirement?

JohnWeisz commented 5 years ago

@guest271314

I appreciate your thoughtful response, but I think you are missing the point. A native implementation could easily stream through the file chunk by chunk to do whatever needs to be done with it, instead of first reading the entire file into memory and then operating on the in-memory data.

And since decoding/demuxing is already available in native implementations, this could be used to stream sample data through a JS-enabled API, precisely as the same audio data is already streamed to the output buffer when playing back audio with an HTMLAudioElement.

Or am I missing something here?

guest271314 commented 5 years ago

A native implementation could easily stream through the file chunk-by-chunk to work with whatever needs to be done with the file

That is already possible using implemented JavaScript APIs. If you prefer, you can use ReadableStream(), WritableStream(), or an async generator; close() the reader or break the async generator when conditions are met. You can now also use AudioWorkletNode to achieve the same requirement.

And since decoding/demuxing is already available in native implementations, this could be used to stream sample data through a JS-enabled API

That is already possible. MediaSource() provides an API to start playback at any point in the SourceBuffer using appendWindowStart.
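
A sketch of the append window approach, assuming audioEl is a media element, mimeCodec matches the encoded media, and encodedChunk is an ArrayBuffer of the encoded bytes:

const mediaSource = new MediaSource();
audioEl.src = URL.createObjectURL(mediaSource);
mediaSource.addEventListener("sourceopen", () => {
  const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
  // coded frames outside [5, 10) seconds are dropped during the append
  sourceBuffer.appendWindowStart = 5;
  sourceBuffer.appendWindowEnd = 10;
  sourceBuffer.appendBuffer(encodedChunk);
});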

Have you actually tried using MediaSource?

instead of first reading the entire file into memory and then operate on the in-memory data

That statement needs absolute clarity. Kindly define what you mean by "instead of reading the file into memory". If the file is a Blob, the file could already be in memory. Do you mean re-read the file?

It appears that you are expecting a JavaScript API to be able to extract specific time slices from a media file without reading or relying on the file metadata. Though you have not provided any formal algorithm reflecting how that will occur.

As described above, it is mathematically possible to extract any part of a "file" by determining the total number of samples, or frames, and the total duration.

JohnWeisz commented 5 years ago

That is already possible using implemented JavaScript APIs. If you prefer you can use ReadableStream() , WritableStream() or an async generator, close() the reader or break the async generator when conditions are met. You can now also use AudioWorkletNode to achieve the same requirement.

Yes, you are correct that this can be entirely done with JS, and it's not even particularly challenging with e.g. uncompressed WAV. You can slice up Blobs/Files and read the slices with async FileReader, then operate on the individual chunks, without requiring the entire file to be kept in memory.

In this case however, especially with more complex formats, you have to re-create complex decoders and demuxers in JS code, while they are already available natively in virtually every single browser; they are just unusable for the task at hand, because there are no JS-enabled APIs to use them.

JohnWeisz commented 5 years ago

That is already possible. MediaSource() provides an API to start playback at any point in the SourceBuffer using appendWindowStart.

It does, but you cannot access the decoded sample data (PCM) without recording the stream at a real-time speed, which is way too slow for many applications.

JohnWeisz commented 5 years ago

But I still feel we are going in circles here, so let me try to put the original title question into an alternative phrasing: does WebCodecs plan to offer a way to convert only a part of a source file, as ffmpeg allows, for example, without being forced to always convert the entire file?

If so, given sufficient performance, it could be used to perform a conversion of a chunk to an uncompressed format, and then acquire PCM data from that in a relatively easy way.

guest271314 commented 5 years ago

It does, but you cannot access the decoded sample data (PCM) without recording the stream at a real-time speed, which is way too slow for many applications.

Am not certain what you mean by "access the decoded sample data". That goes back to whether the requirement is to play back the media, or to also offer it for download or another purpose.

does WebCodecs plan to offer a way to convert a part of a source file? Like ffmpeg allows, for example, without being forced to always convert the entire file.

If the claim is that ffmpeg does provide a means to achieve what you are describing, would suggest posting that exact code here at this issue so that the procedure you are describing can be reproduced (man ffmpeg can elaborate on the corresponding details). From own limited experience experimenting, ffmpeg can currently be run from JavaScript in at least two different ways. Therefore it is not necessary to wait on this API to be specified and for implementers to adopt and deploy anything.

JohnWeisz commented 5 years ago

Am not certain what you mean by "access the decoded sample data". That goes back to whether the requirement is to playback or offer the media for download or other purpose.

There are various other uses for having access to sample data, including offline audio analysis and static audio visualization.

From own limited experience experimenting ffmpeg can currently be run from JavaScript at least two different ways.

Mind sharing what these ways are? Obviously, you can make a request to a server with the possibly several GB sized audio file from a web browser (not really efficient), or execute a command in elevated environments, such as electron (limited to web-based desktop applications only). Or alternatively, you can compile ffmpeg to JS and use that, but that's again inefficient and without sufficient changes, you are again operating on in-memory contents.

So I'm really interested in how this is possible.

If the claim is that ffmpeg does provide a means to achieve what you are describing, would suggest to post that exact code here at this issue so that the procedure you are describing can be reproduced in code without need for corresponding details which man ffmpeg can elaborate on.

I'm afraid this is beyond my current knowledge without actually digging into ffmpeg myself. I'm requesting/proposing an API, not an exact implementation. What I know from experience is that ffmpeg can convert very long chunks of audio without taking up several dozen gigabytes of RAM like virtually every single JS-based solution.

guest271314 commented 5 years ago

In reverse order of the points addressed in your previous comment

I'm afraid this is beyond my current knowledge without actually digging into ffmpeg myself.

does WebCodecs plan to offer a way to convert a part of a source file? Like ffmpeg allows, for example, without being forced to always convert the entire file.

is your claim. It is your responsibility to demonstrate in a minimal verifiable example, or provide the primary source basis for the claim if you have not reproduced the output yourself, that your claim is true and accurate and can be reproduced. At least attempts to produce the expected output - utilizing any approaches - are necessary, for your own edification and substantiation.

Else the basis for the claim itself must be pure speculation as to FFmpeg being capable of performing a specific operation - until proven otherwise.

Mind sharing what these ways are? Obviously, you can make a request to a server with the possibly several GB sized audio file from a web browser (not really efficient), or execute a command in elevated environments, such as electron (limited to web-based desktop applications only). Or alternatively, you can compile ffmpeg to JS and use that, but that's again inefficient and without sufficient changes, you are again operating on in-memory contents.

So I'm really interested in how this is possible.

You essentially covered two of the possibilities.

1) You can run a local server (PHP; nodejs; Python; etc.) to execute native commands; 2) Use FFmpeg compiled to JavaScript (where a Worker can be used to execute FFmpeg compiled to JavaScript)

Another solution is to use Native Messaging to execute a native command, optionally passing values to the command.

An example of such an approach is described and implemented at https://github.com/guest271314/native-messaging-mkvmerge.

--

Am currently considering experimenting with an approach using Native File System and inotifywait for the ability to omit using Native Messaging to achieve the same output; the basic conceptual algorithm is

  1. Context: Window, WorkerGlobalScope, WorkletGlobalScope
  2. let nativeScripts = await self.requestNativeScripts(<Map> [[<"scriptName">, "/path/to/local/directory/scriptName"]])
  3. let writeScript = await nativeScripts.get("scriptName").write("#!/bin/sh ... doStuff(<${outputFileName}>)")
  4. await writeScript.execute()
  5. let dir = await self.chooseFileSystemEntries({type: "openDirectory"})
  6. let result = dir.getFile(<outputFileName>)

which should provide the same result at the main thread, substituting some form of native script observing changes to a file or directory for the use of Native Messaging.

JohnWeisz commented 5 years ago

@guest271314

The Native Messaging approach is very interesting.

In the meantime I looked briefly into the ffmpeg matter, and while I haven't yet dug into the part of the implementation where streaming-based conversion is done, see https://stackoverflow.com/questions/7945747/how-can-you-only-extract-30-seconds-of-audio-using-ffmpeg, which at least explains how partial conversion is used.

The way ffmpeg accomplishes this is two-fold:

See https://ffmpeg.org/ffmpeg.html
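
For reference, the kind of invocation that answer describes combines a seek with a duration limit, e.g. ffmpeg -ss 300 -t 120 -i input.mp3 output.wav; placing -ss before -i makes ffmpeg seek within the input rather than decode and discard everything before the start point.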

This is very similar to how HTMLAudioElement does playback, with the major difference that playback is done faster than realtime and the decoded audio stream is written to a file.


About

You essentially covered two of the possibilities. (1) You can run a local server (PHP; nodejs; Python; etc.) to execute native commands; (2) Use FFmpeg compiled to JavaScript (where a Worker can be used to execute FFmpeg compiled to JavaScript)

Indeed, as I said this is technically possible, but either rather inefficient or extremely limiting. Compiling FFmpeg to JS duplicates logic that's already available natively, and requiring a local server does not allow a public web-based application to access this functionality (as I said, you can do this on your public server, but then you have to upload/download the conversion source and result, which is a huge payload).

guest271314 commented 5 years ago

The FFmpeg approach in the SO answer does not indicate that the program achieves the requirement

Without reading the entire file

The procedure resembles using OfflineAudioContext.startRendering() https://webaudio.github.io/web-audio-api/#OfflineAudioContext

OfflineAudioContext is a particular type of BaseAudioContext for rendering/mixing-down (potentially) faster than real-time. It does not render to the audio hardware, but instead renders as quickly as possible, fulfilling the returned promise with the rendered result as an AudioBuffer.

and AudioBufferSourceNode.start() https://webaudio.github.io/web-audio-api/#AudioBufferSourceNode-methods

start(when, offset, duration)

e.g.,

// ...
context = new OfflineAudioContext(2, len, 44100);
return Promise.all(data.map(function(buffer) {
    return audio.decodeAudioData(buffer)
      .then(function(bufferSource) {
        var source = context.createBufferSource();
        source.buffer = bufferSource;
        source.connect(context.destination);
        return source.start(); // set when, offset, duration here
      });
  }))
  .then(function() {
    return context.startRendering();
  })
  .then(function(renderedBuffer) {
    // do stuff
  });

at Mixing two audio buffers, put one on background of another by using web Audio Api.

chcunningham commented 4 years ago

To answer the original question, this would be achieved by decoding an EncodedAudioChunk corresponding to the time range you're interested in. The decoder would output an AudioFrame, which contains a timestamp and an AudioBuffer (type from Web Audio) containing planar float32. Some handy spec links:

https://wicg.github.io/web-codecs/#audiodecoder-interface
https://wicg.github.io/web-codecs/#ref-for-dom-audioframe-buffer
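
As an illustration only (the interface names were still evolving at the time; in the API as it eventually shipped, the decoder output is an AudioData whose samples are copied out with copyTo()), assuming the application has already demuxed encodedBytes covering the desired time range:

const decoder = new AudioDecoder({
  output: (audioData) => {
    // copy channel 0 into a Float32Array; audioData.timestamp is in microseconds
    const pcm = new Float32Array(audioData.numberOfFrames);
    audioData.copyTo(pcm, { planeIndex: 0, format: "f32-planar" });
    audioData.close();
    // ... use pcm, e.g. draw a waveform segment ...
  },
  error: (e) => console.error(e),
});
decoder.configure({ codec: "opus", sampleRate: 48000, numberOfChannels: 2 });
decoder.decode(new EncodedAudioChunk({
  type: "key",
  timestamp: 5_000_000, // 5 seconds, in microseconds
  data: encodedBytes,   // BufferSource produced by the application's demuxer
}));
await decoder.flush();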

This discussion covered a lot of topics, but I think this answers the main question. I'll go ahead and close this now. Feel free to file new issues with specific follow-up questions.