go-audio / audio

Generic Go package designed to define a common interface to analyze and/or process audio data
Apache License 2.0
205 stars 11 forks

is this API appropriate, especially for real-time use #3

Open mattetti opened 7 years ago

mattetti commented 7 years ago

This discussion is a follow-up to this initial proposal. The two main arguments that were raised are:

@egonelbre @kisielk @nigeltao @taruti all brought up good points, and Egon is working on a counter-proposal focusing on smaller interfaces that are compatible with types commonly found in the wild (int16, float32).

As mentioned in the original proposal, I'd like this organization to act as a special interest group of people interested in doing more/better audio in Go. I have to admit my focus hasn't been real-time audio, and I very much appreciate the feedback. We all know this is a challenging problem that usually results in a lot of libraries doing things in very different ways. However, I do want to believe that we, as a community and with the support of the core team, can come up with a solid API for all Go audio projects.

egonelbre commented 7 years ago

The link to the alternate design: https://github.com/egonelbre/exp/tree/master/audio

Real-time audio programming 101: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing

mattetti commented 7 years ago

@egonelbre would you mind squashing your commits for the proposal, or maybe sending a PR? GitHub really makes it hard to comment on different parts of the code when they come from different commits :(

taruti commented 7 years ago

Typically when using audio my needs have been:

1) Read from the input source (typically system I/O + a slice of []int16 or []float32)
2) Filter, downsample, and convert to the preferred internal format (typically []float32)
3) Do all internal processing with that type (typically []float32)
4) Maintain as little latency as possible by keeping CPU and memory allocation (and with that, GC) in check
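
A rough sketch of steps 1–4; the read callback and the process body are stand-ins, not a proposed API:

// process is a placeholder for the internal float32 DSP chain (step 3).
func process(samples []float32) {}

// run drains an input source (step 1), converts each int16 block to
// float32 once (step 2), and reuses both buffers so the loop performs
// no allocations and adds no GC pressure (step 4).
func run(read func([]int16) int) {
    raw := make([]int16, 512)
    samples := make([]float32, 512)
    for {
        n := read(raw)
        if n == 0 {
            return
        }
        for i := 0; i < n; i++ {
            samples[i] = float32(raw[i]) / 32768 // normalize to [-1, 1)
        }
        process(samples[:n])
    }
}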

egonelbre commented 7 years ago

@mattetti sure no problem.

> Say you are designing a sample-based synthesizer (eg: Akai MPC) and your project has an audio pool it is working with. You'll want to be storing those samples in memory in the native format of your DSP path so you don't have to waste time doing conversions every time you are reading from your audio pool.

@kisielk sure, if you have such a sample-based synth you probably need to track which notes are playing, etc. anyway, so you would have a Synth node that produces float32/float64; i.e. you pay the conversion per synth, not per sample. It's not as good as no conversion, but it just means you can afford one less effect overall for the same performance.

egonelbre commented 7 years ago

@mattetti Here you go: https://github.com/egonelbre/exp/commit/81ba19e90fbcb31986c801838a17606c76dfd4d9

kisielk commented 7 years ago

Yes, but the "synth" is not going to be limited to one sample; usually you have some number of channels, say 8-16, and each one can choose any part of any sample to play at any time. In my opinion, processing audio in float64 is pretty niche, relegated to some high-precision or high-quality filters which aren't commonly used. Even in that case, the data can be converted to float64 just within that filter block; there's little reason to store it in anything but float32 otherwise. Even so, most DSP is performed using float32, even on powerful architectures like x86, the reason being that you can do twice as much with SIMD instructions.

Of course I'm totally fine with having float64 as an option for a buffer type when appropriate, but I believe that float32 should be on par. I feel like it would certainly be the primary format for any real-time applications. Even for batch processing you are likely to see performance gains from using it.

egonelbre commented 7 years ago

@kisielk Yes, also, for my own needs float32 would be completely sufficient.

Forums seem to agree that in most cases float64 isn't a significant improvement. However, if one of the intended targets is writing audio plugins, then many plugin APIs include a float64 version (e.g. VST3), and DAWs have an option to switch between float32 and float64.

I agree that, if only one should be chosen, then float32 seems more suitable. (Although I don't think I have deep enough knowledge of audio processing to say so definitively.) The main argument for float64 is that the math package works on float64, so using only float32 means there is a need for a math32 package.

mattetti commented 7 years ago

I agree that float32 is usually plenty, but as mentioned, my problem is that the Go math package is float64-only. Are we willing to reimplement the math functions we need? It might make sense if we start doing asm optimizations, but that's quite a lot of work.

kisielk commented 7 years ago

Again, I don't think it's a binary choice; I just think that both should have equal support within the API. And yes, if I were using Go for real-time processing of audio, I would definitely want a 32-bit version of the math package. I don't think the math package needs to dictate any limitations on a potential audio API.

mattetti commented 7 years ago

@kisielk sounds fair. Just to be clear, would you be interested in using Go for real-time processing, or at least giving it a try? You obviously do this for a living in C++, so your expertise would be invaluable.

egonelbre commented 7 years ago

> Are we willing to reimplement the math functions we need?

How many math functions are needed in practice? Initially the package could be a wrapper around math to make it more convenient, and then we could start optimizing the bottlenecks. I've never needed more than sin/cos/exp/abs/rand, but I've never done anything complicated either.

I suspect the first bottlenecks and candidates for "asm optimized" code will be []int16 -> []float32 conversion, buffer multiplication, and/or adding two buffers together.
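
To make the wrapper idea concrete, a math32 package could start as nothing more than this (a sketch; the package name and scope are assumptions):

// Package math32 sketch: thin wrappers over the float64 math package.
// Convenient to start with; hot spots could later be replaced by
// float32-native (possibly assembly) implementations.
package math32

import "math"

func Sin(x float32) float32 { return float32(math.Sin(float64(x))) }
func Cos(x float32) float32 { return float32(math.Cos(float64(x))) }
func Exp(x float32) float32 { return float32(math.Exp(float64(x))) }
func Abs(x float32) float32 { return float32(math.Abs(float64(x))) }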

kisielk commented 7 years ago

@mattetti that is something I'm definitely interested in. I'm not exactly a DSP expert, but I work enough with it day to day to be fairly familiar with the domain.

@egonelbre Gain is also a big one that benefits from optimization. (edit: maybe that's what you meant by buffer multiplication, or did you mean convolution?)
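
For reference, the loops being discussed are tiny, which is exactly what makes them good asm/SIMD targets (a sketch; the names are illustrative):

// Gain scales every sample in place; a natural SIMD candidate since it
// is a single multiply per element over contiguous memory.
func Gain(buf []float32, gain float32) {
    for i := range buf {
        buf[i] *= gain
    }
}

// Mix adds src into dst element-wise; src must not be longer than dst.
func Mix(dst, src []float32) {
    for i, s := range src {
        dst[i] += s
    }
}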

egonelbre commented 7 years ago

@kisielk yeah, I meant gain :), my brain's language unit seems to be severely malfunctioning today.

taruti commented 7 years ago

A math package (trigonometric, logarithmic, etc.) with float32 support and SIMD optimization for arbitrary data types are two different things. In many cases just mult/add/sub/div are needed, and for those package math is not needed.

I think that math32 and SIMD are best kept separate from this proposal.

If we are thinking of performance, then converting buffers without needing to allocate can be important. For example, have one input buffer and one output buffer for the conversion, instead of allocating a new output buffer each time.
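
In code, that pattern is simply a copy-style signature where the caller supplies both buffers (a hypothetical function, echoing the conversion loop sketched earlier):

// ConvertI16ToF32 fills dst with normalized float32 samples from src and
// reports how many were converted. The caller owns both buffers and
// reuses them across calls; nothing is allocated here.
func ConvertI16ToF32(dst []float32, src []int16) int {
    n := len(src)
    if len(dst) < n {
        n = len(dst)
    }
    for i := 0; i < n; i++ {
        dst[i] = float32(src[i]) / 32768
    }
    return n
}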

kisielk commented 7 years ago

@taruti +:100:

kisielk commented 7 years ago

Speaking of conversion between buffers, I think it's important that the API has a way to facilitate conversion between buffers of different data types and sizes without allocation (e.g. 2 channels to 1, etc.). The actual conversion method would be determined by the application, but at least the API should be able to facilitate this without too much additional complexity.
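
A 2-channels-to-1 conversion in that style might look like this (a sketch; simple averaging is just one possible downmix strategy):

// DownmixStereoToMono averages interleaved stereo frames from src into
// mono samples in dst and returns the number of frames written. As with
// the conversion above, the caller supplies both buffers.
func DownmixStereoToMono(dst, src []float32) int {
    n := len(src) / 2
    if len(dst) < n {
        n = len(dst)
    }
    for i := 0; i < n; i++ {
        dst[i] = (src[2*i] + src[2*i+1]) / 2
    }
    return n
}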

mattetti commented 7 years ago

Alright, here is my suggestion: I'll add you guys to the organization, and we can figure out an API for real-time processing and from there see how it works for offline use. Ideally I would love to end up with:

@rakyll and I also discussed adding wrappers for things like CoreAudio on Mac so we could have an end-to-end experience without having to rely on things like portaudio. This is outside the scope of what I have in mind, but I figured I should mention it.

I like designing APIs against real usage, so maybe a good first step is to define an example we would like to build, and from there define the components we need. Thoughts?

kisielk commented 7 years ago

That sounds like a good idea to me. However I would propose we limit the scope of the core audio package to the first two points (and perhaps a couple of very general utilities from point 3). I feel like the rest would be better suited for other packages. My main reasoning behind this is that I feel like the first two items can be achieved (relatively) objectively and there can be one canonical implementation. As you go down the list it becomes increasingly application-dependent.

mattetti commented 7 years ago

I think the audio API should be in its own package and each of those things in separate packages. For instance I have the wav and aiff packages isolated. That's another reason why having a GitHub organization is nice.

kisielk commented 7 years ago

Just noticed that when looking at the org page. Looks good to me 👍

nigeltao commented 7 years ago

There's the original proposal. @egonelbre has an alternative proposal. Here are a couple more (conflicting) API ideas for a Buffer type. I'm not saying that either of them is any good, but there might be a useful core in there somewhere. See also another API design in the github.com/azul3d/engine/audio package.

Reader/Writer-ish:

type Buffer interface {
    Format() Format

    // The ReadFrames and WriteFrames methods are roughly analogous to bulk
    // versions of the Image.At and Image.Set methods from the standard
    // library's image and image/draw packages.

    // ReadFrames converts that part of the buffer's data in the range [offset
    // : offset + n] to float32 samples in dst[:n], and returns n, the minimum
    // of length and the number of samples that dst can hold.
    //
    // offset, length and n count frames, not samples (slice elements). For
    // example, stereo audio might have two samples per frame. To convert
    // between a frame count and a sample count, multiply or divide by
    // Format().SamplesPerFrame().
    //
    // The offset is relative to the start of the buffer, which is not
    // necessarily the start of any underlying audio clip.
    //
    // The n returned is analogous to the built-in copy function, where
    // copy(dst, src) returns the minimum of len(dst) and len(src), except that
    // the methods here count frames, not samples (slice elements).
    //
    // Unlike the io.Reader interface, ReadFrames should read (i.e. convert) as
    // many frames as possible, rather than returning short. The conversion
    // presumably does not require any further I/O.
    //
    // TODO: make this return (int, error) instead of int, and split this into
    // audio.Reader and audio.Writer interfaces, analogous to io.Reader and
    // io.Writer, so that you could write "mp3.Decoder(anIOReader)" to get an
    // audio.Reader?
    ReadFrames(dst []float32, offset, length int) (n int)

    // WriteFrames is like ReadFrames except that it converts from src to this
    // Buffer, instead of converting from this Buffer to dst.
    WriteFrames(src []float32, offset, length int) (n int)
}

type BufferI16 struct {
    Fmt  Format
    Data []int16
}

type BufferF32 struct {
    Fmt  Format
    Data []float32
}

Have Buffer be a concrete type, not an interface type:

type Buffer struct {
    Format Format

    DataType DataType

    // The DataType field selects which slice field to use.
    U8  []uint8
    I16 []int16
    F32 []float32
    F64 []float64
}

type DataType uint8

const (
    DataTypeUnknown DataType = iota
    DataTypeU8_U8
    DataTypeU8_I16BE
    DataTypeU8_I16LE
    DataTypeU8_F32BE
    DataTypeU8_F32LE
    DataTypeI16
    DataTypeF32
    DataTypeF64
)

mattetti commented 7 years ago

In addition, here is another comment from @nigeltao about the math library:

> As for a math32 library, I'm not sure if it's necessary. It's slow to call (64-bit) math.Sin inside your inner loop. Instead, I'd expect to pre-compute a global sine table, such as "var sineTable = [4096]float32{ etc }". Compute that table at "go generate" time, and you don't need the math package (or a math32 package) at run time.

I really like this idea, which could also apply to log. It might come at an extra memory cost, but I am personally OK with that.

Let's try to summarize the pros and cons of these different approaches, and let's discuss what we value and the direction we want to take. I am now convinced that my initial proposal, while fitting my needs, doesn't work well in other scenarios and shouldn't be left as is.
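
A minimal sketch of the lookup-table idea (filled at init time here, rather than emitted as a literal by go generate, for brevity; sineAt and the table size are illustrative):

import "math"

// sineTable holds one period of a sine wave. The go-generate version
// would emit the literal, so neither math nor math32 is needed at run time.
var sineTable = func() (t [4096]float32) {
    for i := range t {
        t[i] = float32(math.Sin(2 * math.Pi * float64(i) / 4096))
    }
    return t
}()

// sineAt returns the sample for a phase in [0, 1).
func sineAt(phase float32) float32 {
    return sineTable[int(phase*4096)&4095]
}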

nigeltao commented 7 years ago

A broader point, re the proposal to add packages to the Go standard library or under golang.org/x, is that I think it is too early to say what the 'right' API should be just by looking at an interface definition. As rsc said on https://github.com/golang/go/issues/18497#issuecomment-270387898: "The right way to start is to create a package somewhere else (github.com/go-audio is great) and get people to use it. Once you have experience with the API being good, then it might make sense to promote to a subrepo or eventually the standard library (the same basic path context followed)." Emphasis added.

The right way might actually involve letting a hundred API flowers bloom, and trying a few different APIs before making a push for any particular flower.

I'd certainly like to see more experience with how audio codecs fit into any API proposal: how does the Buffer type (whatever it is) interact with sources (which can block on I/O, e.g. playing an mp3 stream over the network) and sinks (which you don't want to glitch)?

WAV and AIFF are a good start, but handling some sort of compressed audio would be even better. A full-blown mp3 decoder is a lot of work, but as far as kicking API tyres, it might suffice to write a decoder for a toy audio codec where "c3d1e3c1e2c2e4" decoded to "play a C sine wave for 3 seconds, D for 1 second, E for 3 seconds, etc", i.e. to play a really ugly version of "doe a deer".
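
The decoding half of such a toy codec really is small; a sketch (the note/frequency mapping is assumed, and synthesis, i.e. rendering each note as a sine wave into a Buffer, is the part that would actually exercise the API):

import "fmt"

// note pairs a frequency in Hz with a duration in seconds.
type note struct {
    freq float64
    secs float64
}

// decodeToy parses strings like "c3d1e3c1e2c2e4": a pitch letter (a-g)
// followed by a single-digit duration in seconds.
func decodeToy(src string) ([]note, error) {
    freqs := map[byte]float64{ // equal temperament around middle C
        'c': 261.63, 'd': 293.66, 'e': 329.63, 'f': 349.23,
        'g': 392.00, 'a': 440.00, 'b': 493.88,
    }
    if len(src)%2 != 0 {
        return nil, fmt.Errorf("truncated input %q", src)
    }
    var notes []note
    for i := 0; i < len(src); i += 2 {
        f, ok := freqs[src[i]]
        d := src[i+1]
        if !ok || d < '0' || d > '9' {
            return nil, fmt.Errorf("bad note %q", src[i:i+2])
        }
        notes = append(notes, note{freq: f, secs: float64(d - '0')})
    }
    return notes, nil
}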

nigeltao commented 7 years ago

Back on API design brainstorming and codecs, there might be some more inspiration in the golang.org/x/text/encoding/... and golang.org/x/text/transform packages, which let you e.g. convert between character encodings like Shift JIS, Windows 1252 and UTF-8.

Text encodings are far simpler than audio codecs, though, so it might not end up being relevant.

kisielk commented 7 years ago

Some more API inspiration, from C++:

https://www.juce.com/doc/classAudioBuffer
https://www.juce.com/doc/classAudioProcessor

JUCE is one of the most-used audio processing libraries out there.

kisielk commented 7 years ago

Obviously the API isn't very go-like since it's C++ (and has a fair amount of pre-C++11 legacy, though it is gradually being modernized), but it's worth taking a look at how they put things together.

mattetti commented 7 years ago

JUCE uses overloading quite heavily and, as mentioned, isn't very go-like (it's also a framework more than a suite of libraries, but it is well written and very popular). My hope is that we can come up with a more modern and accessible API instead of a port; I would really want audio in Go to be much easier for new developers. On a side note, I did port over some parts of JUCE, such as https://www.juce.com/doc/classValueTree, for better interop with audio plugins.

kisielk commented 7 years ago

I'm not suggesting porting it, but I think the concepts in the library are pretty well thought out and cover most of what you would want to do with audio processing. It's worth getting familiar with. I don't think the use of overloading really matters, it's pretty easy to do that in other ways with Go.

mattetti commented 7 years ago

@nigeltao I agree with rsc and to be honest my goal was more to get momentum than to get the proposal accepted. I'm very happy to have found a group of motivated people who are interested in tackling the same issue.

I'll open a couple issues to discuss code styling and "core values" of this project.

egonelbre commented 7 years ago

@nigeltao I think my design would also benefit from a Stream/Seeker (or similar) interface, but I'm not sure what the right approach is. I will try to implement some basic "remote streaming" to find out what is essential. I have a feeling that it could fit together with Buffer32 nicely.

mattetti commented 7 years ago

I really like @nigeltao's proposal here: https://github.com/go-audio/audio/issues/3#issuecomment-270553932 . I had something similar earlier: https://github.com/mattetti/audio/blob/master/pcm_buffer.go#L43 But I couldn't find a way to properly read/write the buffer; Nigel solves that nicely with:

  ReadFrames(dst []float32, offset, length int) (n int)
  WriteFrames(src []float32, offset, length int) (n int)

The part I don't understand is how you avoid the type conversion if you aren't working in float32. Let's say you want to stay in int16 or float64, what do you do? What if you worked in float32 and need to go to int16, what's the API for that?

egonelbre commented 7 years ago

I think we need to clarify the terms. Here's how I understand the core terms of audio:

  1. Buffer: uncompressed PCM buffer for processing (usually a []float32 array, with some additional info)
  2. Stream (...Writer/...Reader): often seekable, and can sometimes only be read/written in Frames.
  3. Frame: a chunk of audio data. The Frame format/size can differ from the Buffer's and may need conversion. The internals of a Frame depend highly on the Stream producing it. Frames can change mid-stream. (Buffers are equivalent to Frames in some cases.)
  4. Node/Processor: processes Buffers, can have internal mutable state, and potentially uses or can be a seekable/unseekable Stream
  5. Codec: takes an io.Reader/io.Writer and implements a Stream; also has ways of detecting whether some io.Reader is a valid encoding
  6. OutputDevice/InputDevice: implements an unseekable StreamReader/StreamWriter and needs to talk to the hardware; ideally the Buffer and Frame formats/sizes match.
  7. Metadata: Streams and Frames often carry more information than Buffers.

PS: these are tentative.
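
For illustration only, terms 1 and 4 might map onto Go declarations like these (field and method names are assumptions, loosely following egonelbre's design and the Process32 call shown further down):

// Format describes the sample layout (an assumed minimal shape).
type Format struct {
    Channels   int
    SampleRate int
}

// Buffer32 is an uncompressed PCM buffer for processing (term 1).
type Buffer32 struct {
    Format Format
    Data   []float32
}

// Node processes Buffers and may hold internal mutable state (term 4).
// Process32 reports how many samples it handled.
type Node interface {
    Process32(buf *Buffer32) (n int, err error)
}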

briansorahan commented 7 years ago

This is my 2 cents as well as shameless self-promotion. I created package sc specifically so that I could make synths with Go without having to worry about whether or not it is suitable for real-time audio. Admittedly, I hate sclang as a programming language, and my use case is pretty specific: I simply want to make synths that run on my laptop which I can wire up to my MIDI controller. I'm curious to hear why we want to do realtime audio in Go when there are so many other tools out there that do this very well (SuperCollider, ChucK, faust, Extempore, etc.). What are the use cases? If the goal is to experiment with writing dsp algorithms in your favorite programming language, then more power to ya. I'm definitely curious to see what kind of realtime audio processing the Go runtime can handle without glitching, and I'm all for hands-on learning. But if the goal is adoption among people who have a practical interest in realtime audio, I think absence of glitching is much more important than choice of programming language.

kisielk commented 7 years ago

Real-time audio processing isn't only about synthesis or effects for music/experimental purposes. There are lots of other practical applications, e.g. VoIP.

Other tools are good, but they also impose additional overhead on a project. If the rest of your application is in Go, you now need to have some way to interface with those. Building and distributing your software becomes more complicated. It's the same reason why it's preferable to have native Go code instead of linking to C libraries via cgo or calling out to external processes.

I don't see why you think glitching would be a big concern. As long as the application is able to keep up with the audio sample rate, there should be no glitches. As of Go 1.8, the typical worst-case GC pause is under 100 µs. If you avoid doing a lot of allocations, it should be even better.

egonelbre commented 7 years ago

> I'm curious to hear why we want to do realtime audio in Go when there are so many other tools out there that do this very well.

My main reason is writing games with dynamic audio in Go (e.g. the first thing I'm going to try is adding it to https://github.com/loov/zombies-on-ice; the hammer would have pitch based on its speed, and smashes would be panned based on player location). I could use existing audio libs, but that means a huge annoyance in compiling things.

> I'm definitely curious to see what kind of realtime audio processing the Go runtime can handle without glitching, and I'm all for hands-on learning.

The lowest I have gotten on Windows is 512 samples, ~11 ms latency. It was pretty much the first implementation and I haven't started extensive debugging --- quite likely I'm using the Windows API wrongly, or should be using WASAPI, or I have some stupid mistake in my Go code. (https://github.com/loov/synth)

> But if the goal is adoption among people who have a practical interest in realtime audio, I think absence of glitching is much more important than choice of programming language.

I agree. Any professional plugin will probably still be written in C/C++ (or whatever the state-of-the-art is).

I think the target demographic is "enthusiast level real-time audio". (This of course still means doing our best in implementing things and trying to be as "professional-level" as possible.)

mattetti commented 7 years ago

At work we do server-side processing and analysis; we are also planning on doing more on the desktop, and there is little reason not to use Go. We don't currently need real-time audio, but that might become a thing as we grow a good library. Go could also make for a great language to write instruments on something like a Pi, which should be plenty powerful enough to behave very well. Remember that there are DAWs written in Java ;)

nigeltao commented 7 years ago

> how do you avoid the type conversion if you aren't working in float32. Let's say you want to stay in int16 or float64

It'd be similar to the image.Image type in the standard library, and how the image/draw package type-switches for faster (or less allocate-y) code paths. An audio gain filter would take a Buffer argument (the interface type). It would type-switch on some well-known, common, concrete types, such as BufferF32 or BufferI16. For a BufferI16, it would read and write int16 values directly, without calling the ReadFrames or WriteFrames methods per se.

If none of the type switch cases match, it would fall back to ReadFrames and WriteFrames, possibly working with its own (possibly lazily allocated) [1024]float32 scratch space.
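
Sketched against the Buffer interface and concrete types proposed above (assuming the concrete types implement Buffer, and, for simplicity in the fallback, one sample per frame and that ReadFrames returns 0 past the end of the buffer):

// ApplyGain fast-paths the common concrete buffer types and falls back
// to the interface methods with a scratch buffer, mirroring how
// image/draw type-switches on concrete image.Image implementations.
func ApplyGain(b Buffer, gain float32) {
    switch b := b.(type) {
    case *BufferF32:
        for i := range b.Data {
            b.Data[i] *= gain
        }
    case *BufferI16:
        for i := range b.Data {
            b.Data[i] = int16(float32(b.Data[i]) * gain)
        }
    default:
        var scratch [1024]float32 // lazily allocating this is another option
        for off := 0; ; {
            n := b.ReadFrames(scratch[:], off, len(scratch))
            if n == 0 {
                return
            }
            for i := 0; i < n; i++ {
                scratch[i] *= gain
            }
            b.WriteFrames(scratch[:n], off, n)
            off += n
        }
    }
}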

mattetti commented 7 years ago

I really like this approach, here is the Draw function code: https://golang.org/src/image/draw/draw.go?s=2824:2903#L90

It keeps the API simple, avoids a lot of duplicated methods and offers a nice fallback.

egonelbre commented 7 years ago

@nigeltao first version of toy audio codec in my version:

egonelbre commented 7 years ago

And... after thinking about it, I realized that when you do not expose the Stream/Device internals, the code becomes much clearer. This of course doesn't mean that you couldn't still use the internals via type switching when you know them -- alternatively, Stream could have a "PreferredBuffer32"/"PreferredBuffer64" method.

kisielk commented 7 years ago

Do you think it's really necessary to return the number of samples processed? What's a situation where you'd want to process something smaller than the given buffer size? Also, does the number indicate the number of bytes, or the number of frames?

egonelbre commented 7 years ago

> Do you think it's really necessary to return the number of samples processed? What's a situation where you'd want to process something smaller than the given buffer size?

The main use case is converting existing audio with (large) buffers.

The alternative approach is to expose the number of frames still available, but that could get problematic with streams where you don't know the number in advance. Unfortunately, I don't know a good design that avoids it. The Frame approach could do it, but it makes some other code much more complicated.

The sample count is not necessary for real-time audio or audio effects. Also, the initial version didn't have it; I added it after I started working on the stream/conversion examples.

> Also, does the number indicate the number of bytes, or the number of frames?

The number of samples (== frames * number of channels). You use it like this:

n, err := node.Process32(buffer)
unprocessed := buffer.Data[n:]

egonelbre commented 7 years ago

PS: Process32 denotes a generic operation, read or write; e.g. it could mean reading data from a microphone.

kisielk commented 7 years ago

It seems like the buffer needs to have something equivalent to a len and capacity. Suppose you pass a buffer of size M to node1. It fills the buffer with N frames and returns n. Now you pass the buffer to node2. How does node2 know that it should only process N frames when the buffer is of size M?

taruti commented 7 years ago

I feel like we might be approaching this too much from the abstract API first.

There are many audio libraries in Go; would it help to look at them and see what actual minimal definitions would help, e.g. when switching from one library to another?

Easy things to unify could be:

- how to read samples from a library audio input source in the preferred format (the available formats of a source may vary) without copying
- how to write samples to an audio output library in their preferred format without copying
- how to read samples from a library audio input source with conversion
- how to write samples to an audio output library with conversion

egonelbre commented 7 years ago

> It seems like the buffer needs to have something equivalent to a len and capacity.

Yeah, I thought about it. You would need a ReadHead and a WriteHead; you already have the capacity in the Buffer. Playing around with cap/len would be possible, but it would make writing to the buffer more annoying.

> Suppose you pass a buffer of size M to node1. It fills the buffer with N frames and returns n. Now you pass the buffer to node2. How does node2 know that it should only process N frames when the buffer is of size M?

Currently I change the buffer outside the node and make a temporary "Buffer" with the same backing slice, but with a different head and len. See https://github.com/egonelbre/exp/blob/master/audio/copy.go#L6

However, I agree that it is error-prone, and not having to return the sample count would be cleaner. I have to experiment with that design.
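
The temporary-view trick looks roughly like this (reusing the hypothetical Buffer32 sketched earlier; the real version is in the linked copy.go):

// view returns a Buffer32 sharing buf's backing slice but exposing only
// n samples starting at head, so a downstream node cannot read past the
// samples that were actually produced.
func view(buf *Buffer32, head, n int) Buffer32 {
    return Buffer32{
        Format: buf.Format,
        Data:   buf.Data[head : head+n],
    }
}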

mattetti commented 7 years ago

I would like to focus on the buffer only for now and keep the API small. So far I like Nigel's proposal the best; we can remove the AsXxx methods from my proposal and get something similar to image.Image.

Once we are happy with that and have an implementation, we can test it further against other abstractions and existing libraries. (The game engine Nigel linked to is quite interesting.)

mewmew commented 7 years ago

> There are many audio libraries in Go; would it help to look at them and see what actual minimal definitions would help, e.g. when switching from one library to another?

Just to add to the list: there is a FLAC decoder written in Go at github.com/mewkiz/flac, with a front-end for the audio decoder interface defined by Azul3D at github.com/azul3d/engine/audio/flac.

My brother and I have begun implementing a front-end for the go-audio/audio interface, and it should not prove too difficult.

In general, I'd recommend that anyone who hasn't already take a look at the audio interface defined by Azul3D. It takes great inspiration from image.Image, and provides some inspiration for what an audio API in Go may look like.

@slimsag, @karlek Feel free to join the discussion here if you have any input : )

egonelbre commented 7 years ago

> I feel like we might be approaching this too much from the abstract API first.

Is there something in particular you would like to point out in the code?

> There are many audio libraries in Go; would it help to look at them and see what actual minimal definitions would help, e.g. when switching from one library to another?

Much of my design has involved going through multiple audio APIs and designs, although I haven't looked at Go audio libs in particular.

However, I think the approach of basing a new library on improving others doesn't yield good results. The approach of "implement first" and then seeing how it fits into/with other libs has given me better code. Of course, this doesn't mean everyone has to use the same approach -- i.e. different approaches surface different issues.

> how to read samples from a library audio input source in the preferred format (the available formats of a source may vary) without copying
> how to write samples to an audio output library in their preferred format without copying

See the Frame design and Nigel's approach. But not converting seems to add a big burden of managing all the different buffer formats.

Output/input devices can also benefit from a callback-based approach. (And many libraries have chosen that route.)

> how to read samples from a library audio input source with conversion
> how to write samples to an audio output library with conversion

See Process32/Process64.

mattetti commented 7 years ago

@mewmew https://godoc.org/azul3d.org/engine/audio#Buffer is interesting but the implementation bothers me a little, especially the custom types such as https://godoc.org/azul3d.org/engine/audio#Slice

It would be interesting to see if we can make our Buffer interface and implementations compatible with azul3d's audio lib. Take a look at https://github.com/go-audio/audio/issues/3#issuecomment-270553932, which is a simplified version of what I proposed. I believe it addresses the main issue of having AsFloatBuffer() *FloatBuffer etc. in the interface.

I don't see why we couldn't get azul's buffer to conform to this API too even if "our" buffer implementations would be simpler.