hrvach / talkberry

Low-bitrate open source speech codec for Raspberry Pi RP2040
GNU Lesser General Public License v2.1
34 stars 3 forks source link

:speech_balloon: Talkberry - A Speech Codec Library for RP2040

This project provides a speech codec library for the RP2040 microcontroller chip using the open-source Codec2 algorithm. It enables storing long segments of speech in the limited memory of an embedded system or transferring digital voice across narrow band channels such as HAM radio.

The RP2040 microcontroller chip is a powerful and affordable option for embedded systems, based on the ARM Cortex-M0+ architecture. However, it has no floating-point instructions which makes the existing implementation run considerably slower than realtime (3.5x), despite it being well-written.

This port is trying to speed up the execution on limited platforms like this by implementing the math with fixed-point arithmetic. Having a super-cheap, readily available IC opens up various possibilities for shortwave HAM radio, voice-over-IP (VoIP) systems, and voice storage. Imagine a FreeDV implementation on a 4$ board the size of a postage stamp that could be retrofitted to any radio!

Features

Audio Sample (... not hearing anything? unmute the player)

https://user-images.githubusercontent.com/7192749/222575368-ad214381-2826-42ff-9a80-032405458778.mp4

Getting Started

Pico wiring diagram

To use the speech codec library in your project, simply include the header file and link to the library file. The library interface is super simple and provides essential functions for encoding and decoding speech (well, just decoding for now). Here is the tl;dr version of how to use the library:

To quickly test it:

  1. Wire the audio like suggested on the diagram
  2. Download pico-hamradio.uf2 from the demos folder
  3. Power on the pico while holding the BOOTSEL button
  4. Pico will act like an USB drive
  5. Copy the file to it

It is a good idea to add a small resistor (22 or 47 Ω) to GP1 which is used as channel B PWM output. Permanently fixed at mid-point of the output voltage range, it uses inverted PWM polarity. These two hacks help reduce the noise level and produce a better (slightly less annoying) audio.


Building

... for PICO

mkdir build
cd build
PICO_SDK_PATH=/your/sdk/path cmake ..
make

... for RPI

mkdir build
cd build
cmake ..
make

Currently, data is a simple byte array in the header file data.h, read in 7-byte chunks. This wastes 4 bits since packets are 52 bits long. The solution would be to combine two packets in 13 bytes (initial support is provided in the unpack function).

Converting the audio to a suitable format

If you want to replace the provided audio with your own, it will need to be encoded with codec2.

Many Linux distros have Codec2 packaged, so you can easily install it (apt install codec2 on Debian-based distros) and then use c2enc to encode speech. Here is an example how you might approach encoding:

1. sox infile.wav -e signed -b 16 -c 1 - r 8000 raw_voice_data.raw
2. c2enc 1300 raw_voice_data.raw coded_data.bin
3. xxd -i coded_data.bin > /project/rootdir/header/data.h

This requires sox, codec2 and xxd to be installed.

Notes On Performance Optimisation

Encoding time for 1 second of audio on 1 core running at 125 MHz clock

Floating point ▏   3.542 ██████████████████████████████████████████████████████████
Fixed point    ▏   0.473 ████████                         

LUTs in flash are a performance penalty

Initially, the build uses ~4% memory and 18% flash.

Memory region         Used Size  Region Size  %age Used
           FLASH:      380284 B         2 MB     18.13%
             RAM:       11396 B       256 KB      4.35%
       SCRATCH_X:          0 GB         4 KB      0.00%
       SCRATCH_Y:          0 GB         4 KB      0.00%

If you move FFT LUTs to RAM (not declaring them constant should do it), you lose ~30% of RAM but gain ~9% speed and 12% flash.

Memory region         Used Size  Region Size  %age Used
           FLASH:      128684 B         2 MB      6.14%
             RAM:       88676 B       256 KB     33.83%
       SCRATCH_X:          0 GB         4 KB      0.00%
       SCRATCH_Y:          0 GB         4 KB      0.00%

Avoiding expensive operations

$$ sin(n \cdot x) = 2 \cdot sin[(n-1) \cdot x] \cdot cos(x) - sin[(n-2)\cdot x] $$

$$ cos(n \cdot x) = 2 \cdot cos[(n-1) \cdot x] \cdot cos(x) - cos[(n-2)\cdot x] $$

$$ z=max(z{0}, z{1})\, $$

$$ z{0}=\alpha{0} \cdot max(|re|, |im|) + \beta_{0} \cdot min(|re|, |im|) $$

$$ z{1}=\alpha{1} \cdot max(|re|, |im|) + \beta_{1} \cdot min(|re|, |im|) $$

where α₀=1, β₀=5/32, α₁=27/32, β₁=71/128.

$$ cos(\phi) = \frac{x}{\sqrt{x^{2}+y^{2}}}, \quad sin(\phi) = \frac{y}{\sqrt{x^{2}+y^{2}}} $$

we estimate the magnitude using alpha-max, beta-min algorithm to avoid the computationally expensive denominator. This approach only requires a division, which the RP2040 hardware division coprocessor can handle without any issues. By using this method, we can avoid using square roots and limit the number of calls to trigonometric functions. In fact, we only need a single sin/cos for the initial recursion step and 10 calls to cos to convert line spectral frequencies to line spectral pairs. To implement trigonometry using only additions and shifts, we use the CORDIC algorithm.

FAQ

Q: Where do I connect GND ?

A: You don't, it's just GP0 and GP1.

Q: This sounds terrible

A: Well, definitely not Hi-Fi ¯\_(ツ)_/¯ - it's compressed to the point where you only send a few lookup table indexes that describe a speech frame to rebuild that segment from scratch.

Q: With Cortex-M4 and Risc-V available, why bother?

A: It's very cheap, efficient, available to order (a big factor nowadays) and extremely popular. Also I wanted to learn DSP a bit. :-)

Q: Can this work on RPi?

A: Yes, I tried it on RPi 4 and it runs just fine. However, having a FPU and things like NEON SIMD instruction set makes floating point a better choice. If you're building for the RPi and expect performance, consider linking against the full CMSIS library which has different FFT implementation for CPUs with more advanced instruction sets.

Q: I'm getting linker error, there is not enough RAM

A: xxd generates the data array as unsigned char so it's not kept in flash. Add "const" in front of it or declare it like this:

#include <pico/platform.h>

const unsigned char __in_flash() coded_data [] = {

Issues

There are 🪲, dynamics are not quite right (frequency domain bin power calculation needs tweaking), postfilter is greatly simplified to avoid additional FFT. It would be convenient not having to perform that FFT but do the transform on the frequency domain directly or operate on the time domain. My limited n00b DSP skills are not there yet.

TODO

License

This library is free software. You can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation - either version 2.1 of the License, or (at your option) any later version. See the file LICENSE included with this distribution for more information.

Credits

Codec2 was skillfully architected and designed by mr. David Rowe.