Open imflash217 opened 1 year ago
One fundamental thing we need to implement before we start working on this is log events with a duration. Currently each log event is associated with a single instant (a video is just a set of frames, each logged individually). This won't work for audio: you'd like to log e.g. a two-second sound in one log call. We will also need this functionality when implementing proper video codecs.
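A minimal sketch of what a duration-carrying audio event could look like (the `AudioEvent` type and all of its fields are hypothetical, not an existing rerun API):

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class AudioEvent:
    """Hypothetical log event spanning a time range rather than a single instant."""
    start_s: float        # timeline position where the clip begins
    duration_s: float     # how long the event covers on the timeline
    sample_rate: int      # samples per second
    samples: np.ndarray   # shape (n_samples,) or (n_samples, n_channels)

# A two-second mono sine tone logged as ONE event instead of per-sample points.
sr = 16_000
t = np.arange(2 * sr) / sr
event = AudioEvent(start_s=10.0, duration_s=2.0, sample_rate=sr,
                   samples=np.sin(2 * np.pi * 440.0 * t))
```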
I'm very interested in logging and labeling realtime audio when tracing Talon with Rerun!
I'll note that Talon's audio is realtime/continuous/infinite, but it might make more sense efficiency-wise to log it in larger chunks than in, say, 30 ms intervals. If we did that, I would want an easy way to backdate a longer chunk of streamed audio to the actual timestep/frame in which it originated during logging.
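The backdating itself is simple bookkeeping: stamp each chunk with the capture time of its first sample, derived from the running sample count (a sketch; the sample rate and chunk sizes are illustrative):

```python
SAMPLE_RATE = 16_000  # Hz (illustrative)

def chunk_start_times(chunk_lengths: list[int], stream_start_s: float = 0.0) -> list[float]:
    """Timestamp each chunk at the moment its FIRST sample was captured.

    Even if a large chunk arrives hundreds of ms late, its timestamp points
    back to where it belongs on the timeline.
    """
    starts = []
    total = 0
    for n in chunk_lengths:
        starts.append(stream_start_s + total / SAMPLE_RATE)
        total += n
    return starts

# Three chunks of 0.5 s each (8000 samples at 16 kHz):
print(chunk_start_times([8000, 8000, 8000]))  # → [0.0, 0.5, 1.0]
```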
I think plotting and visualizing audio features is very useful, but I don't want Rerun to calculate the features (spectrogram, windowing, filters, etc) for me. Those are labels / data processing I can ship with the audio signal and they're in my domain of expertise to make sure the data I'm sending you to render is exactly what I want.
I think I want a kind of "audio timeline" space, which looks sort of like an Audacity track and maybe supports several audio channels (vertically stacked), and maybe supports other views of the same audio like spectrograms (which I'm happy to embed in the trace myself).
Here's an extreme example of what duration annotations might look like in Audacity:
I think about spatial audio as well, e.g. several audio tracks with distinct 3d positions that can change over time. I wouldn't worry about playing the audio back spatially at first, but being able to select an audio track and see it highlighted + move around in the 3d scene might be really useful.
This looks like a nice, simple audio library for rust:
Very interested in audio support as well. Would also love to be able to visualize alongside 2D matrices where each row covers a fixed time window (may be a probability vector over an alphabet, a spectrogram entry, or similar).
+1 to text as well.
+1 for spectrogram. I'm hoping to use a spectrogram to visualize streaming (unbounded / realtime) brain signals, not audio, but I think the solution will work equally well for either.
I don't think rerun should be responsible for doing the spectral transformation. This is too personal and domain specific. (Pre-Filtering? Windowing? Log-transform? FFT or Wavelets? Multi-taper? Frequency resolution? Window duration? Window step size?). It should be up to the user to do their spectral transformation then log their spectrum / spectra.
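As an illustration of keeping the transform on the user's side, a log-magnitude STFT is only a few lines of numpy, with the window, FFT size, and hop exactly the knobs the user should own (all values here are illustrative):

```python
import numpy as np

def spectrogram(signal: np.ndarray, n_fft: int = 512, hop: int = 256) -> np.ndarray:
    """User-side log-magnitude STFT, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)                      # the user's choice of window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=-1))      # per-frame magnitude spectrum
    return 20.0 * np.log10(mag + 1e-12)             # log-transform, also user's choice

sr = 16_000
t = np.arange(sr) / sr                              # one second of audio
spec = spectrogram(np.sin(2 * np.pi * 1000.0 * t))  # pure 1 kHz tone
# Each row of `spec` is one frame: exactly the 2D matrix one would log to rerun.
```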
The Space view should be something like a mix of the Tensor view and TimeSeries view:
Until something like this is implemented, I might try plotting a scalar for every time x frequency, for only a single channel, and then coloring each scalar independently, probably with a SeriesPoint and square markers.
For decoding audio (that is not simple PCM), we should be able to use ffmpeg over CLI, like we do for video (see https://github.com/rerun-io/rerun/pull/7962/)
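A sketch of what that could look like for audio (the file names, sample rate, and channel count are placeholders; the ffmpeg flags are standard CLI options for raw PCM output):

```python
import subprocess

def ffmpeg_decode_cmd(src: str, dst: str, rate: int = 48000, channels: int = 1) -> list[str]:
    """Build the ffmpeg argv that decodes `src` into raw signed 16-bit PCM."""
    return [
        "ffmpeg", "-i", src,
        "-f", "s16le",            # raw, headerless sample format
        "-acodec", "pcm_s16le",   # signed 16-bit little-endian PCM
        "-ar", str(rate),         # output sample rate
        "-ac", str(channels),     # channel count
        dst,
    ]

cmd = ffmpeg_decode_cmd("input.ogg", "out.pcm")
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is on PATH
```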
**Is your feature request related to a problem? Please describe.**

I primarily work with audio data and it is particularly challenging to visualize its different stages, like `waveforms` or `spectrograms`. It becomes harder still if the audio is multi-channel or very long. Currently I have to use `jupyter-notebook` to display and play my audio, and the context switching is very tiring. It is also difficult to relate the audio `waveform` at a particular timestamp to its corresponding `spectrogram`. This gets worse when working on multimodal models like Automatic Speech Recognition (ASR) systems, which require `text` visualization alongside the corresponding audio.

**Describe the solution you'd like**
I am very impressed with the `video` support provided by the `rerun` API. I would like to see similar first-class support for audio-based projects, with the following features:

- [important] play my `audio` as time-series data
- [important] plot and visualize the changing `spectrograms` as the audio plays, to precisely pinpoint a timestamp and its corresponding extracted features. Support for various power-spectrums like `MFCC` would be extremely helpful.
- [important] ability to play individual `channels` separately or multiple `channels` combined. This is essential for tasks such as `source-separation` and `denoising`.
- [important] for tasks like Automatic Speech Recognition (ASR), show the correlation between a `timestamp-window` and the respective `text` produced by the ASR model. This should scale across `waveform`, `power-spectrums`, and `ASR text-output` so we can comprehend everything at once.
- [nice-to-have] ability to apply various `windows` (e.g. `hanning`, `hamming`) and `filters` (e.g. `low-pass`, `high-pass`, `band-pass`) to an audio clip or a batch, to quickly experiment on the fly.

**Describe alternatives you've considered**
As far as I know, there is no comprehensive tool that supports these features yet. I have to use `Jupyter-notebook` and `librosa` for most of my experimentation, and the biggest challenge is making sure that the `timestamp` in the audio is exactly the same as in the `power-spectrums`.

**Additional context**
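On the nice-to-have windowing/filtering item above: until anything is built in, these stay easy to do on the user side with numpy before logging (illustrative values; the moving average is a crude stand-in for a properly designed low-pass FIR filter):

```python
import numpy as np

sr = 8_000
t = np.arange(sr) / sr
# 100 Hz fundamental plus an unwanted 3 kHz component:
audio = np.sin(2 * np.pi * 100.0 * t) + 0.5 * np.sin(2 * np.pi * 3000.0 * t)

# Window: taper one frame with a Hanning window before spectral analysis.
frame = audio[:1024] * np.hanning(1024)

# Filter: a crude low-pass via moving average (stand-in for a real FIR design);
# a length-8 average at 8 kHz strongly attenuates the 3 kHz component.
kernel = np.ones(8) / 8.0
lowpassed = np.convolve(audio, kernel, mode="same")
```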