savonet / liquidsoap

Liquidsoap is a statically typed, general-purpose scripting language with dedicated operators and backends for all things media: streaming, file generation, automation, HTTP backends and more.
http://liquidsoap.info
GNU General Public License v2.0

Memory leak with simple shuffle playout #4058

Closed cdhowie closed 3 weeks ago

cdhowie commented 1 month ago

Description

2.2.5 appears to have some kind of memory leak. I notice that the process starts by using about 500MB RAM (which is already a bit high IMO considering that our script doesn't do that much) and over the course of the next 24 hours grows to 1GB RAM usage. We've had to specify a hard memory limit for the container to keep it from affecting other services, which causes the container to restart about every 24h when the orchestrator notices that it's using too much RAM.

I rebuilt with jemalloc support to see if this helps and it does not.

Steps to reproduce

We are using the following script:

#!/usr/bin/env liquidsoap

def log_song(m)
  file.write(append=false, data="#{metadata.json.stringify(m)}\n", "/nowplaying/" ^ environment.get("RADIO_STATION") ^ ".json")
end

s = playlist("/music/" ^ environment.get("RADIO_STATION"), mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)

output.icecast(%mp3, mount="radio" ^ environment.get("RADIO_STATION"), password="...", fallible=true, s)

Expected behavior

Memory usage should remain stable.

Liquidsoap version

Liquidsoap 2.2.5+dev
Copyright (c) 2003-2024 Savonet team
Liquidsoap is open-source software, released under GNU General Public License.
See <http://liquidsoap.info> for more information.

Liquidsoap build config

* Liquidsoap version  : 2.2.5+dev

 * Compilation options
   - Release build       : false
   - Git SHA             : (none)
   - OCaml version       : 5.2.0
   - OS type             : Unix
   - Libs versions       : angstrom=0.16.0 bigstringaf=0.10.0 bytes=[distributed with OCaml 4.02 or above] camlp-streams camomile.lib=2.0 cry=1.0.3 curl=0.9.2 dtools=0.4.5 dune-build-info=3.16.0 dune-private-libs.dune-section=3.16.0 dune-site=3.16.0 dune-site.private=3.16.0 duppy=0.9.4 ffmpeg-av=1.1.11 ffmpeg-avcodec=1.1.11 ffmpeg-avdevice=1.1.11 ffmpeg-avfilter=1.1.11 ffmpeg-avutil=1.1.11 ffmpeg-swresample=1.1.11 ffmpeg-swscale=1.1.11 fileutils=0.6.4 gen=1.1 jemalloc lame=0.3.7 liquidsoap-lang=2.2.5 liquidsoap-lang.console=2.2.5 liquidsoap_builtins liquidsoap_core liquidsoap_ffmpeg liquidsoap_jemalloc liquidsoap_lame liquidsoap_mad liquidsoap_ogg liquidsoap_optionals liquidsoap_oss liquidsoap_runtime liquidsoap_samplerate liquidsoap_ssl liquidsoap_taglib liquidsoap_vorbis mad=0.5.3 magic-mime=1.3.1 menhirLib=20240715 metadata=0.3.0 mm=0.8.5 mm.audio=0.8.5 mm.base=0.8.5 mm.image=0.8.5 mm.midi=0.8.5 mm.video=0.8.5 ogg=0.7.4 ogg.decoder=0.7.4 pcre=7.5.0 ppx_string.runtime=v0.17.0 re=1.11.0 samplerate=0.1.6 sedlex=3.2 seq=[distributed with OCaml 4.07 or above] ssl=0.7.0 stdlib-shims=0.3.0 str=5.2.0 stringext=1.6.0 taglib=0.3.10 threads=5.2.0 unix=5.2.0 uri=4.4.0 vorbis=0.8.0 vorbis.decoder=0.8.0
   - architecture        : amd64
   - host                : x86_64-pc-linux-gnu
   - target              : x86_64-pc-linux-gnu
   - system              : linux
   - ocamlopt_cflags     : -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -pthread
   - native_c_compiler   : gcc -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -pthread -D_FILE_OFFSET_BITS=64
   - native_c_libraries  : -lm -lpthread

 * Configured paths
   - mode              : default
   - standard library  : (set by dune-site)
   - scripted binaries : (set by dune-site)
   - rundir            : (set by dune-site)
   - logdir            : (set by dune-site)
   - camomile files    : (set by dune-site)

 * Supported input formats
   - MP3               : yes
   - AAC               : no (requires faad)
   - Ffmpeg            : yes
   - Flac (native)     : no (requires flac)
   - Flac (ogg)        : no (requires ogg)
   - Opus              : no (requires opus)
   - Speex             : no (requires speex)
   - Theora            : no (requires theora)
   - Vorbis            : yes

 * Supported output formats
   - FDK-AAC           : no (requires fdkaac)
   - Ffmpeg            : yes
   - MP3               : yes
   - MP3 (fixed-point) : no (requires shine)
   - Flac (native)     : no (requires flac)
   - Flac (ogg)        : no (requires ogg)
   - Opus              : no (requires opus)
   - Speex             : no (requires speex)
   - Theora            : no (requires theora)
   - Vorbis            : yes

 * Tags
   - Taglib (ID3 tags) : yes
   - Vorbis            : yes

 * Input / output
   - ALSA              : no (requires alsa)
   - AO                : no (requires ao)
   - FFmpeg            : yes
   - GStreamer         : no (requires gstreamer)
   - JACK              : no (requires bjack)
   - OSS               : yes
   - Portaudio         : no (requires portaudio)
   - Pulseaudio        : no (requires pulseaudio)
   - SRT               : no (requires srt)

 * Audio manipulation
   - FFmpeg            : yes
   - LADSPA            : no (requires ladspa)
   - Lilv              : no (requires lilv)
   - Samplerate        : yes
   - SoundTouch        : no (requires soundtouch)
   - StereoTool        : no (requires ctypes-foreign)

 * Video manipulation
   - camlimages        : no (requires camlimages)
   - FFmpeg            : yes
   - frei0r            : no (requires frei0r)
   - ImageLib          : no (requires imagelib)
   - SDL               : no (requires tsdl-image & tsdl-ttf)

 * MIDI manipulation
   - DSSI              : no (requires dssi)

 * Visualization
   - GD                : no (requires gd)
   - Graphics          : no (requires graphics)
   - SDL               : no (requires tsdl-image & tsdl-ttf)

 * Additional libraries
   - FFmpeg filters    : yes
   - FFmpeg devices    : yes
   - inotify           : no (requires inotify)
   - irc               : no (requires irc-client-unix)
   - jemalloc          : yes
   - lastfm            : no (requires lastfm)
   - lo                : no (requires lo)
   - memtrace          : no (requires memtrace)
   - mem_usage         : no (requires mem_usage)
   - osc               : no (requires osc-unix)
   - ssl               : yes
   - tls               : no (requires tls-liquidsoap)
   - posix-time2       : no (requires posix)
   - windows service   : no (requires winsvc)
   - YAML support      : no (requires yaml)
   - XML playlists     : no (requires xmlplaylist)

 * Monitoring
   - Prometheus        : no (requires prometheus)

Installation method

From OPAM

Additional Info

The Dockerfile we built our image from is based on the one from this repository: https://github.com/pltnk/docker-liquidsoap

We had to modify it as it doesn't support 2.2.5 due to Ubuntu Focal having a too old version of OPAM. Here is the Dockerfile we are using to build:

FROM debian:bookworm-slim

LABEL maintainer="Kirill Plotnikov <init@pltnk.dev>" \
      github="https://github.com/pltnk/docker-liquidsoap"

ENV DEBIAN_FRONTEND=noninteractive

# install liquidsoap dependencies
RUN apt update && apt upgrade -y && \
    apt install -y \
    opam \
    gcc \
    libavcodec-dev \
    libavdevice-dev \
    libavfilter-dev \
    libavformat-dev \
    libavutil-dev \
    libjemalloc-dev \
    libswresample-dev \
    libswscale-dev \
    libcurl4-gnutls-dev \
    libmad0-dev \
    libmp3lame-dev \
    libogg-dev \
    libpcre3-dev \
    libsamplerate0-dev \
    libssl-dev \
    libtag1-dev \
    libvorbis-dev \
    m4 \
    make \
    pkg-config \
    zlib1g-dev && \
    apt autoremove && apt clean && \
    rm -rf /var/lib/apt/lists/*

ARG LIQUIDSOAP_VERSION
ARG OPAM_PACKAGES="liquidsoap${LIQUIDSOAP_VERSION:+.$LIQUIDSOAP_VERSION} taglib mad lame vorbis cry samplerate ssl ffmpeg jemalloc"

# add user for liquidsoap and create necessary directories
RUN groupadd -g 999 radio && \
    useradd -m -r -u 999 -s /bin/bash -g radio radio && \
    mkdir /etc/liquidsoap /music && \
    chown -R radio /etc/liquidsoap /music

USER radio

# setup opam
RUN opam init -a -y --disable-sandboxing

# install liquidsoap
RUN opam install -y ${OPAM_PACKAGES} && \
    eval $(opam env) && \
    opam clean -acryv --logs --unused-repositories

CMD eval $(opam env) && liquidsoap /etc/liquidsoap/script.liq
toots commented 1 month ago

Hi! Thanks for this report. I'll have a try at reproducing it very shortly.

Couple of notes:

cdhowie commented 1 month ago

We are running 4 streams using separate instances of liquidsoap, so we can certainly test 2.3.x on one of them and see how memory usage compares. Where can I obtain that Docker image, or (preferably) how can I modify my Dockerfile to build this version?

toots commented 1 month ago

Oh, one last thing: please do not test with OCaml 5.x. We have confirmed some serious memory issues with it: https://github.com/ocaml/ocaml/issues/13123. We currently recommend using the 4.14.x release branch.
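
For reference, a minimal way to select that compiler with opam (a sketch; pick whichever 4.14.x release is current):

opam switch create 4.14.2
eval $(opam env)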

cdhowie commented 1 month ago

Our Docker image is based on Debian Bookworm which has OCaml 4.13.1-4, so we're good there.

toots commented 1 month ago

We are running 4 streams using separate instances of liquidsoap, so we can certainly test 2.3.x on one of them and see how memory usage compares. Where can I obtain that Docker image, or (preferably) how can I modify my Dockerfile to build this version?

savonet/liquidsoap:rolling-release-v2.3.x is good to go for an image, otherwise, something like this:

# Needs ffmpeg 6 or above
FROM debian:trixie-slim

LABEL maintainer="Kirill Plotnikov <init@pltnk.dev>" \
      github="https://github.com/pltnk/docker-liquidsoap"

ENV DEBIAN_FRONTEND=noninteractive

# install liquidsoap dependencies
RUN apt update && apt upgrade -y && \
    apt install -y \
    opam \
    gcc \
    git \
    libavcodec-dev \
    libavdevice-dev \
    libavfilter-dev \
    libavformat-dev \
    libavutil-dev \
    libjemalloc-dev \
    libswresample-dev \
    libswscale-dev \
    libcurl4-gnutls-dev \
    libmad0-dev \
    libmp3lame-dev \
    libogg-dev \
    libpcre3-dev \
    libsamplerate0-dev \
    libssl-dev \
    libvorbis-dev \
    m4 \
    make \
    pkg-config \
    zlib1g-dev && \
    apt autoremove && apt clean && \
    rm -rf /var/lib/apt/lists/*

ARG LIQUIDSOAP_VERSION
# No jemalloc!
ARG OPAM_PACKAGES="liquidsoap${LIQUIDSOAP_VERSION:+.$LIQUIDSOAP_VERSION} mad lame vorbis cry samplerate ssl ffmpeg"

# add user for liquidsoap and create necessary directories
RUN groupadd -g 999 radio && \
    useradd -m -r -u 999 -s /bin/bash -g radio radio && \
    mkdir /etc/liquidsoap /music && \
    chown -R radio /etc/liquidsoap /music

USER radio

# setup opam with 4.14.2
RUN opam init -a -y --disable-sandboxing --comp 4.14.2

# Pin latest ocaml-ffmpeg.
RUN git clone https://github.com/savonet/ocaml-ffmpeg.git && \
         cd ocaml-ffmpeg && opam pin -ny .

# Pin latest liquidsoap
RUN git clone https://github.com/savonet/liquidsoap.git && \
         cd liquidsoap && opam pin -ny .

# install liquidsoap
RUN opam install -y ${OPAM_PACKAGES} && \
    eval $(opam env) && \
    opam clean -acryv --logs --unused-repositories

# Cache script
RUN eval $(opam env) && liquidsoap --cache-only /etc/liquidsoap/script.liq

CMD eval $(opam env) && liquidsoap /etc/liquidsoap/script.liq
toots commented 1 month ago

Our Docker image is based on Debian Bookworm which has OCaml 4.13.1-4, so we're good there.

That's not what the build config reports:

   - OCaml version       : 5.2.0
toots commented 1 month ago

You might need to repeat the pinning dance above a couple of times if you find other build failures related to modules being too old.

cdhowie commented 1 month ago

That's not what the build config reports:

Oh, that's interesting. I didn't realize that opam init actually installs OCaml! Let me try again with OCaml 4.14.2, staying with liquidsoap 2.2.5, and I'll report back after it's run long enough to see what the RAM usage growth trend is.

Thanks for the tips and feedback!

cdhowie commented 1 month ago

I got the image built and swapped 2 of the 4 stations to that image. The initial memory footprints based on the base OCaml version:

  1. OCaml 5.2.0: ~500MB
  2. OCaml 4.14.2: ~176MB

That's already a two-thirds reduction! I will let them run for about a day and report back what their memory usage is at that time.

cdhowie commented 1 month ago

After 38 hours, the 4 instances have the following memory usage:

  1. OCaml 4.14.2: 529MB
  2. OCaml 4.14.2: 585MB
  3. OCaml 5.2.0: 782MB
  4. OCaml 5.2.0: 976MB

It's not exactly clear to me why the third instance stopped at ~800MB. None of the instances seem to be growing rapidly in RAM consumption, but I can't tell yet if there is a slower long-term trend upwards.

If we look at the numbers as a ratio to the initial memory usage, the 4.14.2 instances quadrupled their memory usage while the 5.2.0 instances only doubled theirs, but the absolute growth is very close. It would be nice if the 4.14.2 instances could stay closer to their initial usage of 176MB. We observed this same pattern (starting at ~128MB and growing to ~512MB) on liquidsoap 2.0.0, which is what led me to pursue upgrading to 2.2.5 in the first place (to see if the overall memory usage would be reduced).

It seems the next step is to build an image of liquidsoap 2.3.x using OCaml 4.14.2, and replace the two OCaml 5.2.0 instances with that image. Then we can compare liquidsoap 2.2.5 to 2.3.x on the same OCaml version. I assume there's a good reason you omitted jemalloc in the modified Dockerfile you sent, so I should probably also rebuild 2.2.5 without it to ensure an apples-to-apples comparison.

toots commented 1 month ago

Yes, jemalloc increases memory usage at startup. Not sure of the long-term effect but if your focus is on minimizing memory usage, I'd recommend letting go of it for now.

Another thing you should add to a custom image is memtrace. This allows dumping OCaml memory allocations. All you have to do is add a MEMTRACE environment variable pointing to a local file path and retrieve that file afterwards. See: https://blog.janestreet.com/finding-memory-leaks-with-memtrace/
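
For example, with docker, something like this (a sketch: the image and container names are placeholders, and this assumes your liquidsoap build includes memtrace):

docker run -e MEMTRACE=/tmp/trace.ctf --name radio my-liquidsoap-image
# ...after it has run for a while, copy the trace out and inspect it:
docker cp radio:/tmp/trace.ctf .
memtrace-viewer trace.ctf  # web UI from the memtrace_viewer opam package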

toots commented 1 month ago

I did a quick docker compose:

Dockerfile:

FROM savonet/liquidsoap:rolling-release-v2.3.x

COPY audio /music/station

COPY main.liq /music/main.liq

USER root

RUN mkdir /nowplaying/

RUN chown -R liquidsoap:liquidsoap /music/station /nowplaying/

ENV RADIO_STATION=station

USER liquidsoap

RUN liquidsoap --cache-only /music/main.liq

ENTRYPOINT ["/usr/bin/liquidsoap", "/music/main.liq"]

main.liq:

#!/usr/bin/env liquidsoap

def log_song(m)
  file.write(append=false, data="#{metadata.json.stringify(m)}\n", "/nowplaying/" ^ environment.get("RADIO_STATION") ^ ".json")
end

s = playlist("/music/" ^ environment.get("RADIO_STATION"), mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)

output.icecast(%mp3, mount="radio" ^ environment.get("RADIO_STATION"), host="icecast", password="hackme", fallible=true, s)

docker-compose.yml:

version: "3.3"
services:
  liquidsoap-station:
    environment:
      MEMTRACE: /tmp/trace.ctf
    container_name: liquidsoap-station
    image: liquidsoap-station
    networks:
    - default

  icecast:
    container_name: icecast
    environment:
    - ICECAST_SOURCE_PASSWORD=hackme
    - ICECAST_ADMIN_PASSWORD=hackme
    - ICECAST_PASSWORD=hackme
    - ICECAST_RELAY_PASSWORD=hackme
    - ICECAST_HOSTNAME=icecast
    - ICECAST_MAX_SOURCES=10
    expose:
    - 8000
    image: libretime/icecast:latest
    ports:
    - 8000:8000
    networks:
    - default

Stats at startup:

CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
3e3694ad79bc   liquidsoap-station   12.18%    86.72MiB / 3.821GiB   2.22%     276kB / 2MB      5.51MB / 6.38MB   10
5bc0baaa5192   icecast              0.31%     2.965MiB / 3.821GiB   0.08%     1.98MB / 271kB   90.1kB / 0B       5
cdhowie commented 1 month ago

Small side note, I skipped the --cache-only step during build because we mount the script in when running the image, so there isn't any script present in the image at that time.

cdhowie commented 1 month ago

Yes, jemalloc increases memory usage at startup. Not sure of the long-term effect but if your focus is on minimizing memory usage, I'd recommend letting go of it for now.

FWIW, increasing the initial footprint is fine if it reduces the long-term peak. I don't know if it actually does as I haven't done a side-by-side comparison with vs. without it, but an extra 50MB at startup is not a huge deal if it reduces the maximum usage after it's been running awhile by the same or more.

toots commented 1 month ago

I would also recommend enabling the telnet server (add settings.server.telnet := true to your script) and doing the following on the running docker container:

% docker exec -ti d3fc3d601956 /bin/bash
# Might need to run as root and install telnet..
$ telnet localhost 1234
runtime.memory
Physical memory: 160.93 MB
Private memory: 80.04 MB
Managed memory: 57.29 MB
Swapped memory: 0 B
END
runtime.gc.compact
Done!
END
runtime.memory
Physical memory: 158.85 MB
Private memory: 77.93 MB
Managed memory: 49.81 MB
Swapped memory: 0 B
END

This will tell you whether the memory fluctuations you see are due to the OCaml GC having pending stuff to collect.

toots commented 1 month ago

As I mentioned earlier, memory reporting and tracking is complicated. In the case of docker, I have seen two runs of the same image initially report 200MB or 85MB.

Besides paging and shared memory, there seems to be another layer of abstraction in how docker reports memory.

Typically, in the logs above, you might notice that the physical memory of the process is about 160MB, yet docker stats reports between 80MB and 90MB. This is because it subtracts shared memory. The end result is more or less consistent with the private memory reported by liquidsoap's internals.

More links re: docker memory reporting:

I don't know if your script processes a lot of files, but the last link is about an investigation into a memory leak related to creating files. The suggested workaround, echo 3 | sudo tee /proc/sys/vm/drop_caches, seems to have an impact.

All that to say, for memory reporting I would suggest:

cdhowie commented 1 month ago

We are deploying in a Kubernetes cluster using the containerd CRI, so Docker is not in the picture. :) (I have also found docker stats to be unreliable in terms of how useful its memory usage information is.)

One of the 2.3.0 containers was restarted when liquidsoap segfaulted, so for now we probably need to switch everything back to 2.2.5. I will do one final test comparing with and without jemalloc to see if it makes any difference. (And I'll gather some stats from the telnet server as well just to be sure.)

toots commented 1 month ago

Well, most of the remarks above are related to memory management in general, not just docker. How do you get your memory usage reports? Do you know if those reports take shared memory into account?

Equally important: do you know if your container runners are able to re-use shared memory? In other terms: does each of your running containers load the shared libraries into memory, or are they able to re-use the same shared memory?

Shared memory in docker containers typically depends on the storage driver, and this is one of the less documented topics about it. See: https://stackoverflow.com/questions/35863608/shared-library-in-containers/40096194#40096194. It would be interesting to see if your container runners are behaving correctly in that regard as well.
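
One quick way to check from inside a running container (a sketch; this assumes /proc is mounted and pidof is available):

# Rss vs. the Shared_*/Private_* lines show how much of the footprint is shared pages
grep -E '^(Rss|Shared|Private)' /proc/$(pidof liquidsoap)/smaps_rollup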

Last, I would love a segfault stacktrace if you have one.. 🙂

cdhowie commented 1 month ago

I would expect any sharing of the libraries to happen during startup and thus contribute to the initial ~150MB of RAM, which is not really the usage that interests me -- it's the next ~350MB allocated that I find interesting. It's also worth noting that not all of the containers are guaranteed to run on the same physical host, so we must assume a worst-case scenario where no memory is shared at all when budgeting RAM. Still, I will gather some stats from the telnet interface to see how they compare to those reported by Kubernetes.

If you can tell me how to obtain the segfault stacktrace, I will happily provide it.

toots commented 1 month ago

Sounds good. Yes, I agree on the growth. It would be good if you can run that memory compaction command and also the kernel drop_cache call to see if the growth is due to some memory accumulating somewhere.

At face value, I suspect something in the dynamic behavior of the script. The code you have is really minimal and I would be surprised (though this has happened in the past) if there was a hard memory leak like lost memory allocations.

I would be tempted to look either at the OCaml GC not doing its job or something accumulating somewhere. Typically, I wonder if data can accumulate on the sending part, when sending data to icecast.

Oh, yes, also, using the MEMTRACE memory trace functionality would tell us if the memory is accumulating on the OCaml side. This is probably the fastest way to identify the source of the issue.

vitoyucepi commented 1 month ago

Hi @cdhowie, May I ask what kind of files you are playing? Could you collect the statistics?

  1. Analyze the music (this may take a long time):

     find . -type f -exec ffprobe -hide_banner -print_format json -show_format -show_streams '{}' \; 2>/dev/null > /tmp/music.json

  2. Extract the statistics:

     < /tmp/music.json jq -c '{format: .format.format_name, streams: [.streams[].codec_name]}' | sort | uniq -c

I expect the result to look like this

10 {"format":"mp3","streams":["mp3"]}
20 {"format":"mp3","streams":["mp3","mjpeg"]}
30 {"format":"mp3","streams":["mp3","png"]}
40 {"format":"ogg","streams":["vorbis"]}
toots commented 1 month ago

Things seem stable here with 2.3.x:

docker stats --no-stream
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
773290dca97b   icecast              0.27%     2.891MiB / 3.821GiB   0.07%     1.25GB / 171MB   2.25MB / 1.6MB   5
d3fc3d601956   liquidsoap-station   8.25%     84.24MiB / 3.821GiB   2.15%     173MB / 1.25GB   1.1GB / 641MB    10

The kernel cache flush does seem to have an effect, as well as compacting the GC memory.

toots commented 1 month ago

Things are stable here, I think I'll give it another 12/24h and will shut it down:

CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
773290dca97b   icecast              0.27%     2.891MiB / 3.821GiB   0.07%     1.99GB / 271MB   2.7MB / 2.48MB   5
d3fc3d601956   liquidsoap-station   10.54%    87.65MiB / 3.821GiB   2.24%     274MB / 1.98GB   1.51GB / 662MB   10
toots commented 1 month ago

Today I can see a slight increase in memory usage:

Physical memory: 181.65 MB
Private memory: 102.29 MB
Managed memory: 73.34 MB
Swapped memory: 0 B
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
773290dca97b   icecast              0.27%     2.891MiB / 3.821GiB   0.07%     2.74GB / 374MB   4.02MB / 3.38MB   5
d3fc3d601956   liquidsoap-station   11.78%    99.99MiB / 3.821GiB   2.56%     377MB / 2.73GB   1.97GB / 674MB    10

The memtrace report does not show anything on the OCaml side, but it only samples a small fraction of the allocations for performance reasons, so we might need to wait longer to get more conclusive results:

[Screenshot: memtrace report, 2024-07-29]
toots commented 1 month ago

Things seem stable here, but only after running a couple of memory compactions and cache flushes:

CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
773290dca97b   icecast              0.27%     2.891MiB / 3.821GiB   0.07%     4.38GB / 597MB   6.77MB / 5.39MB   5
d3fc3d601956   liquidsoap-station   11.71%    100.2MiB / 3.821GiB   2.56%     603MB / 4.37GB   2.96GB / 696MB    10
runtime.memory
Physical memory: 182.01 MB
Private memory: 102.47 MB
Managed memory: 73.34 MB
Swapped memory: 0 B
cdhowie commented 1 month ago

It would be good if you can run that memory compaction command and also the kernel drop_cache call to see if the growth is due to some memory accumulating somewhere.

I'm going to skip the drop_cache call because that evicts the I/O cache across the whole machine, and liquidsoap is not the only service running on these nodes. This could therefore affect performance of other services. (It also shouldn't change the memory used by each process, AIUI, but only the amount consumed by the kernel for various caches.)

After running for a few days (I was on vacation over the weekend so that gave it some time to run and have the RAM usage settle), here are my stats.

The services without jemalloc started around 100MB and those with jemalloc started around 187MB.

After 2 days, the services without jemalloc are both using around 470MB. The services with jemalloc are using around 520MB.

At least for this short of a test, jemalloc therefore seems to be harmful overall. Perhaps if running for months it would help with memory fragmentation, but at the moment I think it's better not to use it.

I tried to run the commands over the telnet interface but I missed your messages about building with memtrace so those commands aren't available. If you would like I can build another image and run these tests again.

@vitoyucepi All of the files we're using are MP3s.

365 {"format":"mp3","streams":["mp3","mjpeg"]}
toots commented 1 month ago

Thanks for the feedback! Flushing the kernel cache definitely has an influence on the memory reported via docker stats:

$ docker stats --no-stream
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
773290dca97b   icecast              0.26%     2.93MiB / 3.821GiB    0.07%     4.81GB / 656MB   7.22MB / 5.89MB   5
d3fc3d601956   liquidsoap-station   12.97%    108.6MiB / 3.821GiB   2.78%     663MB / 4.8GB    3.28GB / 703MB    10
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ docker stats --no-stream
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
773290dca97b   icecast              0.27%     2.891MiB / 3.821GiB   0.07%     4.81GB / 656MB   7.22MB / 5.9MB   5
d3fc3d601956   liquidsoap-station   12.12%    106.9MiB / 3.821GiB   2.73%     663MB / 4.8GB    3.28GB / 703MB   10

I'm still seeing a slight increase in memory here, so I'm going to keep investigating. However, it's nothing like what you're seeing.

I'm not sure where your issue comes from, TBH. I would suggest first trying to reproduce it outside of the docker/container abstraction.

vitoyucepi commented 1 month ago

@cdhowie, Previously we had #3504, since then I've been using

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]

to disable image decoding and use only ffmpeg. @toots, am I doing it right?

Is it possible for you to build and run liquidsoap without

I think you could use

But you have to adjust the encoder in your script. Instead of %mp3 use something like %ffmpeg(format="mp3", %audio(codec="libmp3lame")).

cdhowie commented 1 month ago

@vitoyucepi But it looks like this was fixed in 2.2.4, and we still see slow and steady growth up to ~500MB with 2.2.5, so we should not be hitting this exact issue, no?

vitoyucepi commented 1 month ago

should not be hitting this exact issue

You may never know. Enabling memtrace may reveal the cause of the problem.

toots commented 1 month ago

Memory was stubbornly stable on a plain process using macOS instrumentation to keep track of memory. I think next I'll investigate with a plain process on a Linux machine.

I do agree with @vitoyucepi, the metadata idea would be good to get out of the way. @cdhowie, do you know if your files have cover metadata in them?

cdhowie commented 1 month ago

Some of the files do have cover metadata, yes. I don't recall if we actually consume this data downstream; we might, so I'm hesitant to do anything that would disable extracting it at the moment.

I have been running with memtrace for about 40 hours. I notice that 2.2.5 doesn't seem to have the runtime.memory telnet command, so I can't run that. I don't know if the memtrace files would contain any potentially sensitive data, so I don't know how wise it would be to attach them directly to this issue.

cdhowie commented 1 month ago

I should say though that I could spin up a fifth instance with changes to disable cover metadata extraction, publishing to an endpoint that just won't have any listeners. From the perspective of liquidsoap it should look just the same, and then we could at least compare memory usage over time.

toots commented 1 month ago

Thanks. They should not, but just in case you can email them directly to me at toots@rastageeks.org

toots commented 1 month ago

Unfortunately, the OCaml memtrace logs are pretty flat:

[Screenshot: memtrace report, 2024-08-03]

However, memtrace only captures a fraction of the allocations, in order to be able to work against production systems.

Thank you for your patience with this @cdhowie. I suppose filtering cover metadata before writing your JSON file would be an easy test while I think of other ways to track this:

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
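
For the filtering approach, something like the sketch below could also work. The cover key names here are assumptions (what decoders actually emit for cover art varies), so inspect your real metadata first:

# Sketch: drop presumed cover-art keys before serializing metadata.
def strip_cover(m) =
  cover_keys = ["coverart", "cover", "apic", "pic", "metadata_block_picture"]
  list.filter(fun (el) -> not(list.mem(string.case(lower=true, fst(el)), cover_keys)), m)
end

def log_song(m) =
  file.write(append=false, data="#{metadata.json.stringify(strip_cover(m))}\n",
             "/nowplaying/" ^ environment.get("RADIO_STATION") ^ ".json")
end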
cdhowie commented 1 month ago

I've set up a fifth "hidden" station running with this config so we can compare memory usage. I'll let you know how its memory usage compares after it's been running for a while.

cdhowie commented 1 month ago

After running for 8 hours, the settings provided above have not meaningfully impacted memory use. Of the 5 instances, the lowest and highest both lack the settings modifications and are sitting at 276MB and 310MB; the instance with the settings modifications is at 306MB.

vitoyucepi commented 1 month ago

Hi @cdhowie, What's the script for the 5th instance, could you please share it?

cdhowie commented 1 month ago
settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]

def log_song(m)
  file.write(append=false, data="#{metadata.json.stringify(m)}\n", "/nowplaying/test" ^ environment.get("RADIO_STATION") ^ ".json")
end

s = playlist("/music/" ^ environment.get("RADIO_STATION"), mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)

output.icecast(%mp3, mount="radiotest" ^ environment.get("RADIO_STATION"), password="...", fallible=true, s)
vitoyucepi commented 1 month ago

Previously, I advised you to rebuild liquidsoap without extra dependencies in this message https://github.com/savonet/liquidsoap/issues/4058#issuecomment-2259146171. May I ask you to do that and also replace the %mp3 encoder with %ffmpeg(format="mp3", %audio(codec="libmp3lame"))?

The script should look similar to

settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]

def log_song(m)
  file.write(
    '/nowplaying/test#{environment.get("RADIO_STATION")}.json',
    data="#{metadata.json.stringify(m)}\n",
    append=false
  )
end

s = playlist('/music/#{environment.get("RADIO_STATION")}', mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)

enc = %ffmpeg(format="mp3", %audio(codec="libmp3lame"))
output.icecast(
  enc,
  s,
  mount='radiotest#{environment.get("RADIO_STATION")}',
  password="...",
  fallible=true
)
cdhowie commented 1 month ago

I've rebuilt the image without those dependencies and modified the test station to use that image.

After 13 hours, memory usage is lower but not as much as indicated on the issue about leaks related to processing cover art. It's sitting at 290MB where the min/max of the other 4 instances is 295-366MB. It's possible this is just random variance, and it's still climbed quite a bit from the initial usage of ~128MB. I'll let it run another 12 hours and let you know what memory usage looks like then.

Interestingly, CPU usage is about 15% lower on average than the other 4 instances. The ffmpeg-based encoder seems more efficient in this regard, so even if memory usage isn't significantly improved, it may be worth us switching to the ffmpeg encoder just for that.

vitoyucepi commented 1 month ago

I think the next step should be to remove crossfade, nrj and on_track.

settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]

s = playlist('/music/#{environment.get("RADIO_STATION")}', mode="random", reload=300)

enc = %ffmpeg(format="mp3", %audio(codec="libmp3lame"))
output.icecast(
  enc,
  s,
  mount='radiotest#{environment.get("RADIO_STATION")}',
  password="...",
  fallible=true
)

Perhaps the problem here is mode="random", which differs from the default randomize mode, combined with reload=300.

settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]

s = playlist('/music/#{environment.get("RADIO_STATION")}')

enc = %ffmpeg(format="mp3", %audio(codec="libmp3lame"))
output.icecast(
  enc,
  s,
  mount='radiotest#{environment.get("RADIO_STATION")}',
  password="...",
  fallible=true
)

After that, you'll probably want to disable the output.

settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true

settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]

s = playlist('/music/#{environment.get("RADIO_STATION")}')

output.dummy(s, fallible=true)
toots commented 1 month ago

@cdhowie I forgot to mention: you can set the OCaml GC to more aggressive settings to see if the memory is accumulating because the GC is too slow to reclaim it (a lower space_overhead makes the GC collect more eagerly; OCaml's default is 120):

runtime.gc.set(runtime.gc.get().{ space_overhead = 20 })
cdhowie commented 1 month ago

After 24h without crossfade/nrj/on_track the memory usage again is lower than the minimum of the other 4 instances but only by 10MB, so this is probably random variance.

Now trying with the playlist mode change.

@toots I can throw that in after running this round of tests; I don't want to change two things at once.

cdhowie commented 4 weeks ago

@vitoyucepi I tried each suggestion in order and after about 24 hours, none of them resulted in any significant change to memory usage.

I am going to revert to the original script and try @toots' suggestion next.

cdhowie commented 3 weeks ago

After doing a whole bunch of testing given the instructions/ideas above, I started digging into more granular memory usage statistics, which I probably should have done a lot sooner (per this comment), to find out that cgroup memory consumption information includes the page cache of the cgroup, which is the opposite of what I understood to be true before. (I expected the page cache to be global across the entire kernel and not attributed to individual cgroups, which is why I did not dig into this sooner. I'm not exactly sure where I acquired this misinformation... maybe this is how it worked with cgroups v1, or a different tool entirely.)

Watching the usage of individual processes and the cgroup as a whole more closely, I found that the cgroup's memory usage seems to peak around ~300MB during startup and then drops to around ~110MB almost immediately. Liquidsoap's RSS settles around 180MB of usage after it's been running for awhile. All of the other usage above 180MB reported by k8s is likely attributable to the page cache holding the audio files in memory.
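
For anyone else digging into this, the cgroup v2 accounting can be inspected directly (a sketch; exact paths depend on the container runtime and pod layout):

cat /sys/fs/cgroup/memory.current          # everything charged to the cgroup
grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat
# "anon" is process heap/stack; "file" is page cache charged to the cgroup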

Doing some further research, I found that a cgroup will begin evicting entries in the page cache once the total memory consumption of the cgroup approaches the cgroup's configured memory limit. All this time I've been testing without a memory limit, as the k8s documentation indicates that a process will be killed if it tries to allocate above the configured memory limit (which I did not want to happen), and I wasn't sure yet where usage would eventually settle. It seems that this backfired, and the absence of a memory limit encouraged the kernel to keep all of the music files on disk in the page cache for as long as possible, which counted against the container as memory usage. (Technically this would not really be a problem in a "traditional" deployment, but within k8s this contributes to memory pressure on the node where the container is deployed; in other words, limiting the page cache's consumption here is a good idea as it allows that memory to be used by other pods.)

I initially tried adding a 256MB memory limit but this failed immediately due to the ~300MB peak consumption during startup, so I raised the limit to 384MB and let it run for a few days, and the results are that the containers are indeed staying within the 384MB usage limit and the RSS of liquidsoap itself is within 180-190MB.
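
For reference, that limit is just the standard container resources stanza in the pod spec (sketch):

resources:
  limits:
    memory: 384Mi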

The conclusion is that there does not appear to be any issue with liquidsoap itself. I sincerely apologize for wasting everyone's time, and I very much appreciate all of the time that you all took to investigate this for me. In appreciation of the time you've spent both on this issue and in maintaining liquidsoap in general, my company will be making a donation to the project later today.


Further reading: https://github.com/kubernetes/kubernetes/issues/43916

toots commented 3 weeks ago

Thank you so much for your patience and for getting to the bottom of this. This is very valuable information, I've been waiting for this to settle to write that blog post about memory usage!

The memory peak that you observe at startup is due to the script being typechecked. If you are able to utilize the new caching functionality, this should make it possible to skip this step and set a tighter memory limit.

cdhowie commented 3 weeks ago

Thank you so much for your patience

I feel like I should be saying that to you! 🙂

The memory peak that you observe at startup is due to the script being typechecked. If you are able to utilize the new caching functionality, this should make it possible to skip this step and set a tighter memory limit.

Oh, that's very good to know. All of our stations use exactly the same script but with different supplied environment variables, so we should be able to create an emptyDir volume mounted at the path LIQ_CACHE_USER_DIR points to, and populate that cache once for all of the stations with an initContainer.
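
A sketch of what that pod spec fragment might look like (all names and paths here are hypothetical):

spec:
  volumes:
    - name: liq-cache
      emptyDir: {}
  initContainers:
    - name: warm-cache
      image: my-liquidsoap-image
      command: ["sh", "-c", "liquidsoap --cache-only /etc/liquidsoap/script.liq"]
      env:
        - name: LIQ_CACHE_USER_DIR
          value: /cache
      volumeMounts:
        - name: liq-cache
          mountPath: /cache
  containers:
    - name: station
      image: my-liquidsoap-image
      env:
        - name: LIQ_CACHE_USER_DIR
          value: /cache
      volumeMounts:
        - name: liq-cache
          mountPath: /cache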

However, we will probably need to wait until the 2.3 release to implement this, as the segfault we experienced trying the 2.3 branch makes me very hesitant to use it in production just yet.