microbit-foundation / micropython-microbit-v2

Temporary home for MicroPython for micro:bit v2 as we stabilise it before pushing upstream

`AudioFrame` proposal: Reference external buffer #205

Closed: microbit-carlos closed this issue 4 weeks ago

microbit-carlos commented 5 months ago

A lot of the open issues discussing enhancements to the parts of the API that use AudioFrames could be resolved by using memoryview. However, a memoryview cannot be played or recorded into, because the relevant functions also need a sampling rate attached to the buffer.

So if AudioFrame behaved a bit more like memoryview, specifically when slicing, we could achieve a lot of the discussed functionality without unnecessary extra memory copies.

Proposal: AudioFrames to be able to reference external buffers

Disadvantages

It might come as a surprise to a user that modifying a slice also changes the original AudioFrame:

original_af = audio.AudioFrame(size=1024)
new_af = original_af[512:]
new_af[0] = 255    # This also changes original_af[512] to 255
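
For what it's worth, this is exactly how memoryview slices already behave in standard Python, so the aliasing would at least be consistent with existing behaviour. A minimal desktop-Python illustration:

data = bytearray(1024)
view = memoryview(data)[512:]   # a view, not a copy
view[0] = 255
print(data[512])                # 255: the original buffer was modified too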

Alternative

We could have a new class that is essentially a memoryview, but which also carries the rate of the original AudioFrame. This has the advantage of making it much more obvious that we are not dealing with a new AudioFrame with its own copy of the data.

Because getting a different class instance from a slice is a bit weird, we could use a method call rather than slices. For example:

audio_frame = audio.AudioFrame(size=1000)
first_half = audio_frame.track(end=500)
second_half = audio_frame.track(start=500)
middle_half = audio_frame.track(start=250, end=750)
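
For illustration, usage might then look like this (a sketch only, assuming audio.play() would accept the new track type and read the rate from it):

audio.play(first_half, wait=True)    # both tracks reference audio_frame's buffer,
audio.play(second_half, wait=True)   # so no data is copied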

AudioFrame nomenclature

As we consider an "AudioTrack" being created from an AudioFrame, it's becoming more obvious that the AudioFrame name doesn't quite fit the current implementation. As a "frame" is generally small, deriving a "track" out of it doesn't make that much sense. The original intent of grouping multiple frames to create longer audio makes more sense than the current implementation of having frames taking several seconds.

Perhaps we should leave AudioFrame as it was implemented in V1, and rename the current expanded version to something along the lines of "AudioRecording" (it could be named something different, maybe not directly related to recording from the microphone). It would then make more sense for it to contain multiple "tracks".

Use cases

Copying multiple chunks of data into a single AudioFrame

There is no slice assignment on AudioFrame, bytearray, or memoryview (in this port), and AudioFrame.copyfrom() always copies data into the AudioFrame starting at the beginning. So we have to go byte by byte:

Before

af = audio.AudioFrame(size=sum(len(c) for c in chunks))
i = 0
for chunk in chunks:
    for byte in chunk:
        af[i] = byte
        i += 1

After, new AudioFrame

This allows us to copy full chunks in one operation, instead of byte by byte.

af = audio.AudioFrame(size=sum(len(c) for c in chunks))
i = 0
for chunk in chunks:
    small_af = af[i:]
    small_af.copyfrom(chunk)
    i += len(chunk)

After, slice assignment

Slice assignment might not be that obvious to novice programmers, but could be an even more succinct option.

af = audio.AudioFrame(size=sum(len(c) for c in chunks))
i = 0
for chunk in chunks:
    af[i:i+len(chunk)] = chunk
    i += len(chunk)
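
For comparison, standard desktop Python already supports slice assignment on bytearray, so the semantics proposed here can be tried out there. A minimal sketch (plain bytearray, no micro:bit API):

chunks = [b"abc", b"defg", b"hi"]
buf = bytearray(sum(len(c) for c in chunks))
i = 0
for chunk in chunks:
    buf[i:i + len(chunk)] = chunk   # copies the whole chunk in one operation
    i += len(chunk)
print(bytes(buf))                   # b'abcdefghi'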

Break down AudioFrame into smaller chunks

The best method for this currently is to use a memoryview (we could also create a bytes object from the AudioFrame and slice it, but a memoryview saves copying the data):

Before

af = audio.AudioFrame(duration=1000)
m = memoryview(af)
for i in range(0, len(m), PACKET_SIZE):
    radio.send_bytes(m[i:i+PACKET_SIZE])

After

With this approach we could use slices directly on the AudioFrame without creating unnecessary copies:

af = audio.AudioFrame(duration=1000)
for i in range(0, len(af), PACKET_SIZE):
    radio.send_bytes(af[i:i+PACKET_SIZE])
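
For completeness, the receiving side could reassemble the packets into a plain bytearray with the existing radio API. This is only a sketch: it assumes the total size is known in advance, that packets arrive in order and none are lost, and it copies byte by byte since this port has no slice assignment:

buf = bytearray(5000)                # known total size of the incoming audio
offset = 0
while offset < len(buf):
    packet = radio.receive_bytes()   # existing API; returns None if nothing received
    if packet:
        for b in packet:
            buf[offset] = b
            offset += 1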

Playing an AudioFrame from an arbitrary position

As a memoryview cannot be played directly, and an AudioFrame is always played from the beginning, we need to create a new AudioFrame that starts at the point from which we'd like to play back.

Before

original_af = microphone.record(1000)
memoryview_af = memoryview(original_af)
shorter_af = audio.AudioFrame(duration=500)
shorter_af.copyfrom(memoryview_af[500:])
audio.play(shorter_af)

After

original_af = microphone.record(1000)
audio.play(original_af[500:])

Playing just a portion of the AudioFrame

This works fine in the current implementation. The only caveat is that the most common way of doing this would be to measure time with sleep() (rather than time.ticks_ms()), and the CODAL uBit.sleep() has a resolution of 4 ms plus any extra overhead from calling functions, so it might not be extremely accurate.

Before

af = microphone.record(2000)
audio.play(af, wait=False)
sleep(1000)
audio.stop()

After

This should accurately play for the specified time:

af = microphone.record(2000)
audio.play(af[:len(af) // 2])
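
The same idea works for an arbitrary duration: at the default 7812 Hz rate used elsewhere in this issue there is one byte per sample, so milliseconds convert directly to a byte offset. A sketch, assuming the sliceable AudioFrame from this proposal:

RATE = 7812                          # default sampling rate, one byte per sample
af = microphone.record(2000)         # 2 seconds of audio
ms = 500
audio.play(af[:RATE * ms // 1000])   # play only the first 500 ms
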
microbit-carlos commented 4 months ago

Conclusions from the call last week to discuss this proposal:

Updated proposal

Open questions

@jaustin @dpgeorge thoughts and comments are very welcome, especially on the questions above.

dpgeorge commented 3 months ago

Construction

With the above new proposal, the ways to construct something to record into are:

AudioRecording(duration, rate=7812)
AudioTrack(bytearray(size))
AudioTrack(AudioRecording(duration, rate=7812), rate=7812)

That seems a bit awkward: all these ways of constructing look different and it's not obvious which one to use when.

Would it be simpler instead to have a function that creates a bytearray with convenience arguments to specify duration? Eg:

audio.new_recording(*, size, duration, rate=7812) -> bytearray

That's still not great, because the rate is lost when it returns the bytearray, so you'd need to specify the rate again when creating the AudioTrack.

Then maybe the function can return an AudioTrack, eg:

def new_recording(*, size=None, duration=None, rate=7812):
    if duration:
        size = duration * rate // 1000
    return AudioTrack(bytearray(size), rate=rate)

That way there's only one main way to create a new buffer, via this new_recording() helper function.
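
For example, usage could then be as simple as this (a sketch of the proposed API, assuming a microphone.record_into() along the lines discussed in this issue):

track = new_recording(duration=2000)   # 2 s buffer at the default 7812 Hz
microphone.record_into(track)          # the track carries its own rate
audio.play(track)                      # played back at that same rate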

Slicing and Indexing

It makes sense that indexing uses bytes as the units for the index value. But that means bytes become the default set of units. For example the constructor should then default to take bytes as a positional argument, eg new_recording(size, *, ...). And then slicing should also be in units of bytes.

Then something like AudioRecording.track(start, end) may start to get confusing if start and end are measured in milliseconds.

As for slice assignment (eg track[:10] = bytearray(10)): yes this is possible to implement and I think we should implement it. It allows a convenient way to copy data into a buffer.
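
For illustration, a short sketch of the proposed slice assignment (AudioTrack and its slicing behaviour are the proposed API, not the current one):

track = AudioTrack(bytearray(1000))
chunk = b"\x80" * 32                 # 32 bytes of mid-range samples
track[100:100 + len(chunk)] = chunk  # copy the chunk into place, no byte-by-byte loop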

dpgeorge commented 3 months ago
  • Is a "writeable buffer-like" object something that MicroPython can easily identify internally?

Yes, that's easy to do.

  • Now that the new types are not expanding AudioFrame, would it be better to use the default 11K sampling rate that CODAL and MakeCode have been using?

Maybe... the issue is that audio.play() defaults to 7812Hz because the original AudioFrame doesn't have a rate associated with it.

microbit-carlos commented 3 months ago

Maybe... the issue is that audio.play() defaults to 7812Hz because the original AudioFrame doesn't have a rate associated with it.

But different channels in the pipeline have independent sampling rates, no? Or is everything set up to 7812Hz?

Edit: Ah, but this would likely use the same channel as AudioFrames, okay.

microbit-carlos commented 3 months ago

Okay, so we ultimately have three approaches to consider.

1) AudioRecording can have its own buffer, or, if sliced like a memoryview, it can contain a pointer to the buffer from the original source
2) AudioRecording contains its own buffer and AudioTrack points to an external buffer
3) An AudioTrack points to an external buffer (like a bytearray) and a "factory function" can be used to initialise it

I'll have a chat with the edu team next week to decide between these approaches.

AudioRecording to hold buffer or pointer

The AudioRecording constructor would have arguments using both time and byte units. As slices would have to be in bytes, it makes sense for the first positional argument to be in that unit as well.

AudioRecording(size, *, duration, rate=7812)

e.g.

AudioRecording(10_000, rate=5_000)        # 10K bytes to hold 2 seconds of sound
AudioRecording(duration=3_000)            # 3 seconds
AudioRecording(size=10000, duration=3000) # Error: incompatible arguments provided

Slicing is in bytes, e.g. my_audio_recording[1000:]

A function to slice in time units would need to be provided:

my_audio_recording.track(start_ms, end_ms)
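
Internally that method would presumably just convert milliseconds to byte offsets using the recording's rate (one byte per sample). A rough sketch of the conversion, with ms_to_bytes() being a hypothetical helper for illustration only:

def ms_to_bytes(ms, rate=7812):
    # One byte per sample at `rate` Hz, so milliseconds map to byte offsets.
    return ms * rate // 1000

# my_audio_recording.track(start_ms, end_ms) would then be roughly
# my_audio_recording[ms_to_bytes(start_ms):ms_to_bytes(end_ms)]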

record_into() can return a "shorter" AudioRecording pointing to the same buffer:

original_buffer = AudioRecording(duration=2_000)
exact_recording = microphone.record_into(original_buffer, wait=False)
sleep(1000)
microphone.stop_recording()
audio.play(exact_recording)   # Plays 1 second of recorded audio
audio.play(original_buffer)   # Plays 1 second of recorded audio followed by 1 blank second

Advantages:

Disadvantages:

AudioRecording & AudioTrack

The AudioRecording class contains the buffer internally, and an AudioTrack can be created from it and then sliced. An AudioTrack can also be created from other types of buffer.

AudioRecording(duration, rate=7812)
AudioTrack(buffer, rate=7812)
my_audio_recording = AudioRecording(1000)                 # Contains 1 second worth of data
my_audio_track = AudioTrack(my_audio_recording)[250:750]  # AudioTrack points to the buffer in my_audio_recording
my_track = my_audio_recording.track(start_ms=100, end_ms=200) # A 100ms track

To work in bytes instead of time, an AudioTrack can be created from a bytearray.

my_track = AudioTrack(bytearray(10_000))
for i in range(0, len(my_track), PACKET_SIZE):
    radio.send_bytes(my_track[i:i+PACKET_SIZE])

When using record_into it would save the data into an AudioRecording and return an AudioTrack with the exact length of the recording.

my_recording = AudioRecording(duration=2_000)
my_track = microphone.record_into(my_recording, wait=False)
sleep(1000)
microphone.stop_recording()
audio.play(my_track)        # Plays 1 second of recorded audio
audio.play(my_recording)    # Plays 1 second of recorded audio followed by 1 blank second

Advantages:

Disadvantages:

AudioRecording + factory function

As having multiple ways to initialise an AudioTrack and its buffer can be confusing, we could have a single class that behaves like an AudioTrack (it could be called AudioRecording, but for clarity in this section it is still called AudioTrack) and provide a factory function to create its buffer:

my_track = audio.new_track(duration=3_000)

which would have a very simple implementation:

def new_track(*, size=None, duration=None, rate=7812):
    if size and duration:
        raise Exception("Incompatible arguments")
    if duration:
        size = duration * rate // 1000
    return AudioTrack(bytearray(size), rate=rate)

So, in this case microphone.record_into() takes and returns an AudioTrack. And microphone.record() would return an AudioTrack from a buffer created by the microphone function.
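
Put together, a hedged sketch of what that could look like for a user (all of this is the proposed API, not the current one):

my_track = audio.new_track(duration=3_000)   # factory creates the buffer and keeps the rate
microphone.record_into(my_track)             # records into the externally created buffer
audio.play(my_track)                         # plays back at the rate stored in the track
clip = microphone.record(2_000)              # record() would also return an AudioTrack
audio.play(clip[:len(clip) // 2])            # slices reference the same buffer, no copies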

Advantages:

Disadvantages:

microbit-carlos commented 2 months ago

After discussing it with the edu team I think we should go with the AudioRecording + AudioTrack approach. They are all good enough options, and while this one does mean there are multiple ways to initialise a class that holds audio data, the cleaner API is worth it.

I'll update the docs PR, but @dpgeorge feel free to start the implementation when you have a chance.

dpgeorge commented 2 months ago

OK, I've now implemented the new AudioTrack / AudioRecording API. I've tested it but I can imagine there are some things that still need a bit of work.

microbit-carlos commented 4 weeks ago

This can be closed as completed in https://github.com/microbit-foundation/micropython-microbit-v2/pull/163