rerun-io / rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
https://rerun.io/
Apache License 2.0

Support for `audio` data based projects #2852

Open imflash217 opened 1 year ago

imflash217 commented 1 year ago

Is your feature request related to a problem? Please describe.

I primarily work with audio data, and it is particularly challenging to visualize the different stages of audio data, such as waveforms or spectrograms. It becomes more challenging if the data is multi-channel or very long audio. Currently I have to use a Jupyter notebook to display and play my audio, and the context switching is very tiring. It is also hard to relate the audio waveform at a particular timestamp to its corresponding spectrogram. This becomes worse when working on multimodal models like Automatic Speech Recognition (ASR) systems, which require text visualization alongside the corresponding audio.

Describe the solution you'd like

I am very impressed with the video support provided by the Rerun API. I would like to see similar first-class support for audio-based projects, with the following features:

  1. [important] play my audio as a time-series data
  2. [important] plot and visualize the changing spectrograms as the audio is playing, to precisely pinpoint a timestamp and its corresponding extracted features. Support for various spectral representations such as MFCCs would be extremely helpful.
  3. [important] ability to play individual channels separately or play multiple channels combined. This is essential for various tasks such as source separation and denoising.
  4. [important] For various tasks like Automatic Speech Recognition (ASR) we would want to see a correlation between the timestamp-window and the respective text produced by the ASR model. This would be scalable across waveform, power-spectrums and ASR text-output so we can comprehend everything at once.
  5. [nice-to-have] ability to apply various types of windows (e.g. Hanning, Hamming) and filters (e.g. low-pass, high-pass, band-pass) to an audio clip or a batch, to quickly experiment on the fly.
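
To make items 3 and 5 concrete, here is a minimal sketch of the operations involved, done outside the viewer with NumPy. The function names and the `(num_samples, num_channels)` array layout are assumptions for illustration only; none of this is a Rerun API.

```python
import numpy as np


def split_channels(audio: np.ndarray) -> list[np.ndarray]:
    """Return one 1-D array per channel.

    Assumes `audio` has shape (num_samples, num_channels).
    """
    return [audio[:, c] for c in range(audio.shape[1])]


def mix_down(audio: np.ndarray) -> np.ndarray:
    """Combine all channels into one mono signal by averaging them."""
    return audio.mean(axis=1)


def apply_window(frame: np.ndarray, kind: str = "hann") -> np.ndarray:
    """Multiply a frame by a Hann or Hamming window of the same length."""
    windows = {"hann": np.hanning, "hamming": np.hamming}
    return frame * windows[kind](len(frame))
```

For example, `mix_down(stereo)` yields the combined signal, while `split_channels(stereo)[0]` gives only the first channel for separate playback or plotting.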

Describe alternatives you've considered

As far as I know, there is no comprehensive tool that supports these features yet. I have to use Jupyter notebooks and librosa for most of my experimentation, and the biggest challenge is making sure that a timestamp in the audio is exactly the same as in the corresponding spectrogram.
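
The alignment problem comes down to agreeing on how a spectrogram frame index maps to a time in the audio. The following is a rough sketch using plain NumPy rather than librosa; the function name and default parameters are assumptions, and it uses unpadded frames whose timestamp is the frame center. Other libraries may use a different convention (for example padded, centered frames), which is exactly where mismatches tend to creep in.

```python
import numpy as np


def stft_with_times(signal, sample_rate, frame_length=1024, hop_length=256):
    """Return (magnitudes, center_times, freqs) for a 1-D signal.

    Frames are not padded, so frame i covers samples
    [i * hop_length, i * hop_length + frame_length).
    Requires len(signal) >= frame_length.
    """
    window = np.hanning(frame_length)
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = np.arange(num_frames) * hop_length
    frames = np.stack([signal[s:s + frame_length] for s in starts])
    # One row per frame, one column per frequency bin.
    magnitudes = np.abs(np.fft.rfft(frames * window, axis=1))
    # Timestamp of each frame, taken at the middle of the frame, in seconds.
    center_times = (starts + frame_length / 2) / sample_rate
    freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)
    return magnitudes, center_times, freqs
```

With this convention, row `i` of `magnitudes` belongs at `center_times[i]` seconds on the same time axis as the waveform.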

Additional context

emilk commented 1 year ago

One fundamental thing we need to implement before we start working on this is support for log events with a duration. Currently each log event is associated with a single instant in time (a video is just a set of frames, each logged individually). This won't work for audio: you'd like to log e.g. a two-second sound in one log call. We will also need this functionality when implementing proper video codecs.
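
As a purely hypothetical illustration of the idea (this is not Rerun's data model, and `DurationEvent` and its fields are made-up names), an event with a duration could be represented as a start time plus a length, and queried for whether it is active at a given time cursor:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationEvent:
    """Hypothetical log event that spans a time range instead of one instant."""
    entity_path: str
    start_s: float
    duration_s: float

    @property
    def end_s(self) -> float:
        return self.start_s + self.duration_s

    def is_active_at(self, t: float) -> bool:
        # Half-open interval: active from start_s up to, but excluding, end_s.
        return self.start_s <= t < self.end_s


def active_events(events, t):
    """Return every event whose time range contains the time cursor t."""
    return [e for e in events if e.is_active_at(t)]
```

A two-second sound logged at t = 1.0 s would then be found by any query with the cursor between 1.0 s and 3.0 s.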

lunixbochs commented 1 year ago

I'm very interested in logging and labeling realtime audio when tracing Talon with Rerun!

I'll note that Talon's audio is realtime/continuous/infinite, but it might make more sense efficiency-wise to log it in larger chunks than, say, 30 ms intervals. If we did that, I would want an easy way to backdate a longer chunk of streamed audio to the actual timestep/frame in which it originated during logging.
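
A rough sketch of the backdating arithmetic, assuming a gapless stream with a fixed sample rate: the start time of each chunk follows from the number of samples seen before it, not from when the chunk happens to be logged. `ChunkClock` is a hypothetical helper, not part of Talon or Rerun.

```python
class ChunkClock:
    """Hypothetical helper that backdates audio chunks by counting samples."""

    def __init__(self, stream_start_s: float, sample_rate: int):
        self.stream_start_s = stream_start_s
        self.sample_rate = sample_rate
        self.samples_seen = 0

    def stamp(self, num_samples: int) -> float:
        """Return the time at which this chunk began, then advance the clock."""
        start_s = self.stream_start_s + self.samples_seen / self.sample_rate
        self.samples_seen += num_samples
        return start_s
```

Each chunk would then be logged with the returned start time, regardless of how large the chunk is or how late it is flushed.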

I think plotting and visualizing audio features is very useful, but I don't want Rerun to calculate the features (spectrogram, windowing, filters, etc) for me. Those are labels / data processing I can ship with the audio signal and they're in my domain of expertise to make sure the data I'm sending you to render is exactly what I want.

Audio Timeline Space

I think I want a kind of "audio timeline" space, which looks sort of like an audacity track and maybe supports several audio channels (vertically stacked), and maybe supports other views of the same audio like spectrograms (which I'm happy to embed in the trace myself).

Annotations

Here's an extreme example of what duration annotations might look like in Audacity: [screenshot]

Spatial audio

I think about spatial audio as well, e.g. several audio tracks with distinct 3d positions that can change over time. I wouldn't worry about playing the audio back spatially at first, but being able to select an audio track and see it highlighted + move around in the 3d scene might be really useful.

emilk commented 1 year ago

This looks like a nice, simple audio library for rust:

CatalinVoss commented 11 months ago

Very interested in audio support as well. Would also love to be able to visualize it alongside 2D matrices where each row covers a fixed time window (may be a probability vector over an alphabet, a spectrogram entry, or similar).

+1 to text as well.

cboulay commented 6 months ago

+1 for spectrogram. I'm hoping to use a spectrogram to visualize streaming (unbounded / realtime) brain signals, not audio, but I think the solution will work equally well for either.

I don't think rerun should be responsible for doing the spectral transformation. This is too personal and domain specific. (Pre-Filtering? Windowing? Log-transform? FFT or Wavelets? Multi-taper? Frequency resolution? Window duration? Window step size?). It should be up to the user to do their spectral transformation then log their spectrum / spectra.

The Space view should be something like a mix of the Tensor view and TimeSeries view:

Until something like this is implemented, I might try plotting a scalar for every time x frequency, for only a single channel, and then coloring each scalar independently, probably with a SeriesPoint and square markers.
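
A minimal sketch of the data preparation for that workaround, assuming a single-channel spectrogram with shape `(num_times, num_freqs)`: each cell becomes one (time, frequency, color) point, with a grayscale color derived from the normalized magnitude. The function name is made up, and the actual Rerun logging calls are deliberately left out.

```python
import numpy as np


def spectrogram_to_points(magnitudes, times, freqs):
    """Flatten a spectrogram into (time, frequency, (r, g, b)) tuples.

    `magnitudes` has shape (len(times), len(freqs)). Colors are grayscale,
    scaled so the smallest magnitude is black and the largest is white.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    lo, hi = magnitudes.min(), magnitudes.max()
    if hi > lo:
        scaled = (magnitudes - lo) / (hi - lo)
    else:
        scaled = np.zeros_like(magnitudes)  # flat input: everything black
    gray = np.round(scaled * 255).astype(np.uint8)

    points = []
    for ti, t in enumerate(times):
        for fi, f in enumerate(freqs):
            g = int(gray[ti, fi])
            points.append((float(t), float(f), (g, g, g)))
    return points
```

Each tuple could then be logged as one scalar (the frequency) at the given time with the given color, at the cost of one point per time-frequency cell.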

emilk commented 3 weeks ago

For decoding audio (that is not simple PCM), we should be able to use ffmpeg over CLI, like we do for video (see https://github.com/rerun-io/rerun/pull/7962/)
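
For illustration only, here is a sketch of what decoding to raw PCM through the ffmpeg CLI can look like from Python. This is not how Rerun invokes ffmpeg; the function names, the 16 kHz mono defaults, and the float normalization are assumptions, and `ffmpeg` is assumed to be on the `PATH`.

```python
import subprocess

import numpy as np


def ffmpeg_decode_command(path, sample_rate=16000, channels=1):
    """Build an ffmpeg command that writes signed 16-bit little-endian PCM to stdout."""
    return [
        "ffmpeg", "-v", "error",
        "-i", path,
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", str(channels), "-ar", str(sample_rate),
        "pipe:1",
    ]


def pcm_bytes_to_array(raw: bytes, channels=1) -> np.ndarray:
    """Convert interleaved s16le bytes to floats in [-1, 1), shape (samples, channels)."""
    samples = np.frombuffer(raw, dtype="<i2")
    return samples.reshape(-1, channels).astype(np.float32) / 32768.0


def decode_audio(path, sample_rate=16000, channels=1) -> np.ndarray:
    """Run ffmpeg and return the decoded audio; raises if ffmpeg exits with an error."""
    result = subprocess.run(
        ffmpeg_decode_command(path, sample_rate, channels),
        stdout=subprocess.PIPE,
        check=True,
    )
    return pcm_bytes_to_array(result.stdout, channels)
```

Decoding to a fixed PCM format up front means everything downstream only has to handle one sample layout, whatever codec the input file used.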