roc-streaming / roc-toolkit

Real-time audio streaming over the network.
https://roc-streaming.org
Mozilla Public License 2.0

Implement IFrameEncoder / IFrameDecoder using libvorbis #758

Open gavv opened 3 months ago

gavv commented 3 months ago

We have two interfaces, IFrameEncoder and IFrameDecoder, that are used to encode audio samples into packet payloads and decode them back. Currently they're implemented only by PcmEncoder and PcmDecoder (for uncompressed PCM).

Now we need to add two more implementations, VorbisEncoder and VorbisDecoder, that will use the libvorbis library to do the job.

We also need to add the new encoder & decoder to the list of tested codecs in test_frame_encoder_decoder.cpp.

runei commented 3 months ago

Hi, I am interested in solving this issue.

gavv commented 3 months ago

Welcome, thank you!

gavv commented 3 months ago

Let me know if you need help/hints with the build system.

Also, if the libogg dependency can be disabled, I suppose we should do it, because all we need is the Vorbis encoder/decoder; we don't need to work with Ogg files.

runei commented 3 months ago

Hi @gavv ,

I tried without libogg, but it fails when installing libvorbis. According to the README, "You'll need libogg (distributed separately) to compile this library", which is why I tried to install it before libvorbis.

I managed to install libvorbis with the build script after installing libogg. But I'm having trouble installing libogg with the script, since I used "make install" in the terminal. Using "execute(ctx, 'make install')" in the script needs extra permissions on CI, as it creates files in /usr/local/lib. So how can I install libogg within the same folder and use the generated files for the libvorbis installation?

gavv commented 3 months ago

Oh, I see. I assumed that libogg would support using libvorbis as a codec, but instead libvorbis supports using libogg for muxing.

For RFC 5215, we'll need to disable Ogg muxing, which seems to be feasible. libvorbis encodes packets into a struct from libogg (ogg_packet), however this struct does not yet have Ogg headers, only opaque codec-specific data. Ogg headers are added later when you use the libogg API, and we can simply not use it and take the data from ogg_packet directly.


When build-3rdparty.py builds a package, it never runs make install. This script is for building dependencies locally, not installing them system-wide. Instead, at the end of building every package, there are calls to the install_tree() and install_files() functions that install headers and libraries into a local directory specific to this package.

For example, here is what is "installed" for json-c:

tree -L 2 build/3rdparty/x86_64-pc-linux-gnu/clang-14.0.6-release/json-c-0.12-20140410/
build/3rdparty/x86_64-pc-linux-gnu/clang-14.0.6-release/json-c-0.12-20140410/
├── include
│   ├── doc
│   ├── tests
│   ├── arraylist.h
│   ├── bits.h
│   ├── config.h
│   ├── debug.h
│   ├── json_config.h
│   ├── json_c_version.h
│   ├── json.h
│   ├── json_inttypes.h
│   ├── json_object.h
│   ├── json_object_iterator.h
│   ├── json_object_private.h
│   ├── json_tokener.h
│   ├── json_util.h
│   ├── linkhash.h
│   ├── math_compat.h
│   ├── printbuf.h
│   └── random_seed.h
├── lib
│   └── libjson-c.a
├── src
│   ├── json-c-json-c-0.12-20140410
│   └── json-c-0.12-20140410.tar.gz
├── build.log
└── commit

These include and lib directories are added to the include and library search path in scons, so we can use the dependency in our code.


OK, so what do we need to do to build libogg and then pass it as a dependency to libvorbis?

Here is an example of similar situation:

https://github.com/roc-streaming/roc-toolkit/blob/b102bb7c96856efb89151ae841a7ffe9498a3b52/3rdparty/SConscript#L267-L285

Here, ltdl and json-c are dependencies of pulseaudio. We first build the dependencies, and then pass them via the deps argument to BuildThirdParty(). BuildThirdParty, in turn, passes them to the build-3rdparty.py script, where those dependencies are stored in ctx.pkg_deps.

When we call format_flags(), it iterates over all those dependencies (if they're present) and adds their include and lib directories to the flags.

Summary:

gavv commented 3 months ago

Feel free to ping me if you have trouble with those scripts, I know it's not the funniest part :)

runei commented 2 months ago

Hi @gavv , I am working on the handling of Vorbis stream headers. When encoding with Vorbis, we have 3 headers: the identification, comments, and codebook headers. I have a few questions about how to deal with this:

Should we have a dedicated frame for headers, or should we split the header across multiple frames together with the data? Or do you have another idea?

My idea would be to create a function uint8_t* create_header_frame(); to handle the header separately, and execute this before begin_frame.

How should we handle cases where the header size exceeds the frame size? Could splitting the header across multiple frames cause any problems?

Does the encoded_byte_count function need to include the size of the header, or should it only count the encoded data?

I would need to adjust all the unit tests, as well as the packetizer, which uses IFrameEncoder. Does this affect any other part of the code?

gavv commented 1 month ago

Hi, sorry for the delay! I was busy relocating to another country. Here are my thoughts, after reading your comments and checking out the RFC and libvorbis.


Ogg stream

First of all, in your current PR you're using an Ogg stream and serializing Ogg pages. As mentioned above, we shouldn't do that: as specified in RFC 5215, we must send only Vorbis packets and not add Ogg encapsulation.

In my understanding, we'll need to use the ogg_packet struct, but not ogg_stream or ogg_page, and not any of the ogg_*() functions. Just send the bytes from ogg_packet.
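
A minimal sketch of what "just send the bytes from ogg_packet" could look like (the copy into a std::vector is only for illustration; real code would write into the packet's buffer):

  #include <vorbis/codec.h> // pulls in ogg/ogg.h for ogg_packet
  #include <cstring>
  #include <vector>

  // Copy the raw Vorbis packet payload out of an ogg_packet, without any
  // Ogg framing: ogg_packet::packet points to the codec data and
  // ogg_packet::bytes is its length.
  std::vector<unsigned char> copy_vorbis_payload(const ogg_packet& op) {
      std::vector<unsigned char> payload(op.bytes);
      std::memcpy(payload.data(), op.packet, op.bytes);
      return payload;
  }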


Configuration headers

Regarding the 3 headers, RFC 5215 states the following:

In order to set up an initial state for the client application, the configuration MUST be conveyed via the signalling channel used to set up the session. One example of such signalling is SDP [RFC4566] with the Offer/Answer Model [RFC3264]. Changes to the configuration MAY be communicated via a re-invite, conveying a new SDP, or sent in-band in the RTP channel. Implementations MUST support an in-band delivery of updated codebooks, and SHOULD support out-of-band codebook update using a new SDP file. The changes may be due to different codebooks as well as different bitrates of the RTP stream.

In other words, there are 3 requirements:

1. The initial configuration MUST be conveyed out-of-band, via the signalling channel used to set up the session (e.g. SDP).
2. In-band delivery of updated codebooks via the RTP channel MUST be supported.
3. Out-of-band codebook update using a new SDP SHOULD be supported.

So I see the algorithm like this:

Sender:

1. Creates encoder
2. Obtains configuration headers (3 ogg_packet structs) from encoder
3. Serializes them according to section 3.2.1
4. Adds serialized configuration to SDP
5. Passes samples to encoder, gets packets, and sends them

Receiver:

1. Creates decoder
2. Gets serialized configuration from SDP
3. Deserializes it into 3 ogg_packet structs
4. Passes headers to decoder
5. Receives packets, passes to decoder, and extracts samples
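
A rough sketch of the libvorbis side of this exchange: obtaining the three headers on the sender (step 2) and feeding them to the decoder on the receiver (step 4). SDP (de)serialization per section 3.2.1 and error handling are omitted, and the wrapper function names are just placeholders for this example:

  #include <vorbis/codec.h>

  // Sender side: ask an initialized encoder for the 3 configuration headers
  // (identification, comments, codebooks) as ogg_packet structs.
  void get_config_headers(vorbis_dsp_state& vd, vorbis_comment& vc,
                          ogg_packet& id_hdr, ogg_packet& comm_hdr, ogg_packet& code_hdr) {
      vorbis_analysis_headerout(&vd, &vc, &id_hdr, &comm_hdr, &code_hdr);
      // ...serialize the 3 packets per RFC 5215 section 3.2.1 and put into SDP...
  }

  // Receiver side: feed the 3 headers (deserialized from SDP) into the decoder
  // setup, before vorbis_synthesis_init() is called with the resulting vorbis_info.
  void set_config_headers(vorbis_info& vi, vorbis_comment& vc,
                          ogg_packet& id_hdr, ogg_packet& comm_hdr, ogg_packet& code_hdr) {
      vorbis_synthesis_headerin(&vi, &vc, &id_hdr);
      vorbis_synthesis_headerin(&vi, &vc, &comm_hdr);
      vorbis_synthesis_headerin(&vi, &vc, &code_hdr);
  }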

In addition, if a received packet is an in-band configuration header update (section 3.1.1), we should parse it and apply the configuration to the decoder. I'm not 100% sure how to do it, but likely vorbis_synthesis_idheader() and vorbis_synthesis_restart() may help here.

Note: our own sender won't actually produce in-band configuration header updates for now (we don't have a use case for it yet), but to conform to the RFC and be compatible with non-roc senders, we should still be able to handle them.
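
For the in-band case, a hedged sketch of how a received packet could be recognized as the start of a configuration update (what to do afterwards, e.g. vorbis_synthesis_restart(), is the open question above):

  #include <vorbis/codec.h>

  // Returns true if the packet parses as a Vorbis identification header,
  // which would indicate an in-band configuration header update.
  bool is_config_update(ogg_packet& op) {
      return vorbis_synthesis_idheader(&op) == 1;
  }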

The approach above is very close to your idea with get_headers_frame()/initialize_headers(), but instead of byte buffers, we need to work with SDP (since we need a codec-agnostic solution).

We can pass sdp::SessionDescription to IFrameEncoder, so that it can add its codec-specific configuration to the SDP. Then we can pass sdp::SessionDescription to IFrameDecoder, and it can read the configuration from it. Same as with your approach, this should be done before encoding/decoding the first packet.

Note: currently SDP is (partially) implemented in roc_sdp, but not actually used. In the release where we enable Vorbis, we'll also add CLI and API options to get/set the SDP offer/answer. In later releases we'll add automatic SDP exchange via SAP/SDP and RTSP.


Libvorbis flow

As far as I can tell from docs & sources, libvorbis works this way: we feed samples to the encoder, it buffers them internally and splits them into blocks on its own schedule, and for each block it may produce zero, one, or several ogg_packet structs of variable size, so we can't know a packet's size until it is actually encoded.

This design presents a few challenges which I missed before. (BTW, the changes that I describe below will be useful not just for Vorbis, but for many other codecs like Opus.)

Please let me know if I'm missing something. Below I assume that this is true.
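
For reference, a minimal, self-contained sketch of the libvorbis encoding flow as I understand it (VBR setup values are arbitrary; error handling, header generation, and integration with roc interfaces are omitted):

  #include <cstddef>
  #include <vorbis/vorbisenc.h>

  void encode_example(const float* left, const float* right, int num_samples) {
      vorbis_info vi;
      vorbis_info_init(&vi);
      vorbis_encode_init_vbr(&vi, 2, 44100, 0.5f); // 2 channels, 44.1 kHz, quality 0.5

      vorbis_dsp_state vd;
      vorbis_block vb;
      vorbis_analysis_init(&vd, &vi);
      vorbis_block_init(&vd, &vb);

      // Submit samples: libvorbis hands us its own buffer to fill.
      float** buf = vorbis_analysis_buffer(&vd, num_samples);
      for (int i = 0; i < num_samples; i++) {
          buf[0][i] = left[i];
          buf[1][i] = right[i];
      }
      vorbis_analysis_wrote(&vd, num_samples);

      // Drain blocks and packets: each block may yield 0..N packets,
      // and packet sizes are known only at this point.
      while (vorbis_analysis_blockout(&vd, &vb) == 1) {
          vorbis_analysis(&vb, NULL);
          vorbis_bitrate_addblock(&vb);

          ogg_packet op;
          while (vorbis_bitrate_flushpacket(&vd, &op) == 1) {
              // op.packet / op.bytes hold one encoded Vorbis packet
          }
      }

      vorbis_block_clear(&vb);
      vorbis_dsp_clear(&vd);
      vorbis_info_clear(&vi);
  }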


Variable packet size

encoded_byte_count() was intended to return the exact size of the encoded payload (i.e. everything after the RTP header, including Vorbis headers and compressed audio). It works fine for PCM, but apparently it doesn't fit codecs like Vorbis with variable-size packets. I guess we should rework the encoder and composer interfaces. With libvorbis, we can't know the packet size before we actually encode it.

Currently the algorithm is:

  1. Call IFrameEncoder::encoded_byte_count() to get payload byte size for given sample count.
  2. Call IComposer::prepare() with payload size.
  3. Encode payload into prepared packet's buffer.
  4. Probably call IComposer::pad() (see below).
  5. Later, at the end of the pipeline, call IComposer::compose().

We could change it to something like this:

  1. NEW: Call IFrameEncoder::max_payload_size() to get maximum possible payload byte size for given sample count.
  2. Call IComposer::prepare() with maximum payload size.
  3. Encode payload into prepared packet's buffer
  4. NEW: Ask encoder how many bytes were actually written.
  5. NEW: Call IComposer::shrink() to shrink packet's buffer from maximum payload size to actual payload size.
  6. Probably call IComposer::pad() (see below).
  7. Later, at the end of the pipeline, call IComposer::compose().

To make it work, we'll need a few changes: a new IFrameEncoder::max_payload_size() method, a way to ask the encoder how many bytes it actually wrote, and a new IComposer::shrink() method (a rough sketch is below).

I suppose we'll also need similar changes in IFrameDecoder.
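
A hedged sketch of what these additions could look like; only the proposed new methods are shown, the existing ones are omitted, and written_byte_count() is a placeholder name I made up for "ask encoder how many bytes were actually written":

  #include <cstddef>

  // Sketch only: proposed additions, not the current roc-toolkit interfaces.
  class IFrameEncoder {
  public:
      virtual ~IFrameEncoder() {}

      // Maximum possible payload size for the given number of samples;
      // Packetizer would use it to pre-allocate the packet buffer.
      virtual size_t max_payload_size(size_t num_samples) const = 0;

      // How many bytes the encoder actually wrote into the prepared buffer
      // (placeholder name).
      virtual size_t written_byte_count() const = 0;
  };

  class IComposer {
  public:
      virtual ~IComposer() {}

      // Shrink the previously prepared payload from its maximum size down
      // to the actual encoded size (the real method would presumably also
      // take the packet being composed).
      virtual bool shrink(size_t actual_payload_size) = 0;
  };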


Multiple packets per frame

Since libvorbis is allowed to produce multiple packets for one block, we should reflect that in the IFrameEncoder interface as well.

Current approach is:

  1. Pass packet's buffer to encoder.
  2. Write samples_per_packet to encoder.
  3. Tell encoder to finish packet.

We could use the following approach:

  1. Pass packet's buffer to encoder.
  2. Write samples to encoder until we write samples_per_packet samples, OR the encoder reports that no more samples can be written.
  3. Tell encoder to finish the packet and ask it how many bytes were actually written to the packet.
  4. Ask encoder if there are any more buffered packets. If yes, create a new packet, pass its buffer to encoder, and go to step (3).

Note: the reason why we pass the packet's buffer to the encoder as the first step is to allow simple buffer-less encoders like PCM to use the packet's buffer directly without doing an extra copy.
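
To make the loop concrete, here is a hedged sketch of how steps (1)-(4) above could look from the Packetizer side; every method name in it is a hypothetical placeholder, not an existing or agreed-upon API:

  #include <cstddef>

  // Sketch only: Encoder stands for the reworked IFrameEncoder and
  // PacketFactory for the packet allocation side.
  template <class Encoder, class PacketFactory>
  void packetize(Encoder& encoder, PacketFactory& factory,
                 const float* samples, size_t sample_count, size_t samples_per_packet) {
      size_t pos = 0;
      while (pos < sample_count) {
          // (1) pass a freshly allocated packet buffer to the encoder
          void* buf = factory.new_packet_buffer(encoder.max_payload_size(samples_per_packet));
          encoder.begin_packet(buf);

          // (2) write samples until samples_per_packet were written,
          // or the encoder reports it can't take more into this packet
          size_t in_packet = 0;
          while (in_packet < samples_per_packet && pos < sample_count) {
              const size_t n = encoder.write_samples(samples + pos, samples_per_packet - in_packet);
              if (n == 0) {
                  break;
              }
              in_packet += n;
              pos += n;
          }

          // (3) finish the packet and learn how many bytes were actually written
          factory.finish_packet(buf, encoder.end_packet());

          // (4) drain any extra packets buffered inside the codec
          while (encoder.has_more_packets()) {
              void* extra = factory.new_packet_buffer(encoder.max_payload_size(samples_per_packet));
              encoder.begin_packet(extra);
              factory.finish_packet(extra, encoder.end_packet());
          }
      }
  }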


Padding packets

Currently audio::Packetizer produces packets which all have exactly the same size (defined by encoded_byte_count(samples_per_packet)). If a packet is going to be shorter than this value because of a flush() call, the packetizer pads it to the needed size.

This is done because FECFRAME (the protocol which we use for loss repair) requires that all packets within a FEC block have the same size. This requirement conflicts with Vorbis (and similar codecs), where we don't control the packet size.

Unfortunately I don't see a clean solution here, so I suggest the following workaround:

  1. In Packetizer, continuously compute a moving maximum of the packet size, i.e. the maximum size of an encoded packet over e.g. the last 10'000 packets. (See the roc_stat module, it has classes for that.)
  2. To make this value more stable, round it e.g. to the nearest multiple of 64 bytes (see the sketch below).
  3. Pad every packet to have this size.
  4. If it happens that the moving maximum, and hence the packet size, changes in the middle of a FEC block, the receiver will disable loss repair until the next FEC block. This adjustment can be done later, outside of this PR.

We'll need to test this approach to see how stable the packet size would be in practice and how often it would cause the receiver to skip loss repair. At first glance it seems to be a rare event.
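
To illustrate the rounding in step (2) and the padding in step (3), a small self-contained sketch of the arithmetic (the moving maximum itself would come from a roc_stat class; target_payload_size and pad_bytes are just helper names for this example):

  #include <cstddef>

  // Round the observed moving maximum up to the nearest multiple of 64 bytes,
  // so the target packet size doesn't jitter with every small change.
  size_t target_payload_size(size_t moving_max) {
      const size_t align = 64;
      return (moving_max + align - 1) / align * align;
  }

  // How many padding bytes a packet of actual_size needs to reach the target.
  size_t pad_bytes(size_t actual_size, size_t target_size) {
      return actual_size < target_size ? target_size - actual_size : 0;
  }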

Also let me know if you have better ideas!


Affected parts of code

IFrameEncoder is used only from Packetizer, and IFrameDecoder only from Depacketizer. IFrameEncoder and IFrameDecoder are currently implemented only for PCM.

IComposer is used more widely and has a few implementations (rtp::Composer, fec::Composer, rtcp::Composer). Though, the changes to its interface and implementations are much smaller.

See also this page: https://roc-streaming.org/toolkit/docs/internals/packets_frames.html