Hi! Thanks for this report. I'll have a try at reproducing it very shortly.
A couple of notes:

2.3.x? The rolling-release-v2.3.x image just came out and should be ready for testing. We are focusing our efforts on fixing any issues there now, ahead of the release, so any work there would be more beneficial to everyone!

We are running 4 streams using separate instances of liquidsoap, so we can certainly test 2.3.x on one of them and see how memory usage compares. Where can I obtain that Docker image, or (preferably) how can I modify my Dockerfile to build this version?
Oh, one last thing: please do not test with OCaml 5.x. We have confirmed some serious memory issues with it: https://github.com/ocaml/ocaml/issues/13123. We currently recommend using the 4.14.x release branch.
Our Docker image is based on Debian Bookworm which has OCaml 4.13.1-4, so we're good there.
We are running 4 streams using separate instances of liquidsoap, so we can certainly test 2.3.x on one of them and see how memory usage compares. Where can I obtain that Docker image, or (preferably) how can I modify my Dockerfile to build this version?
savonet/liquidsoap:rolling-release-v2.3.x is good to go for an image; otherwise, something like this:
# Needs ffmpeg 6 or above
FROM debian:trixie-slim
LABEL maintainer="Kirill Plotnikov <init@pltnk.dev>" \
github="https://github.com/pltnk/docker-liquidsoap"
ENV DEBIAN_FRONTEND=noninteractive
# install liquidsoap dependencies
RUN apt update && apt upgrade -y && \
    apt install -y \
        opam \
        gcc \
        libavcodec-dev \
        libavdevice-dev \
        libavfilter-dev \
        libavformat-dev \
        libavutil-dev \
        libjemalloc-dev \
        libswresample-dev \
        libswscale-dev \
        libcurl4-gnutls-dev \
        libmad0-dev \
        libmp3lame-dev \
        libogg-dev \
        libpcre3-dev \
        libsamplerate0-dev \
        libssl-dev \
        libvorbis-dev \
        m4 \
        make \
        pkg-config \
        zlib1g-dev && \
    apt autoremove && apt clean && \
    rm -rf /var/lib/apt/lists/*
ARG LIQUIDSOAP_VERSION
# No jemalloc!
ARG OPAM_PACKAGES="liquidsoap${LIQUIDSOAP_VERSION:+.$LIQUIDSOAP_VERSION} mad lame vorbis cry samplerate ssl ffmpeg"
# add user for liquidsoap and create necessary directories
RUN groupadd -g 999 radio && \
useradd -m -r -u 999 -s /bin/bash -g radio radio && \
mkdir /etc/liquidsoap /music && \
chown -R radio /etc/liquidsoap /music
USER radio
# setup opam with 4.14.2
RUN opam init -a -y --disable-sandboxing --comp 4.14.2
# Pin latest ocaml-ffmpeg.
RUN git clone https://github.com/savonet/ocaml-ffmpeg.git && \
cd ocaml-ffmpeg && opam pin -ny .
# Pin latest liquidsoap
RUN git clone https://github.com/savonet/liquidsoap.git && \
cd liquidsoap && opam pin -ny .
# install liquidsoap
RUN opam install -y ${OPAM_PACKAGES} && \
eval $(opam env) && \
opam clean -acryv --logs --unused-repositories
# Cache script
RUN eval $(opam env) && liquidsoap --cache-only /etc/liquidsoap/script.liq
CMD eval $(opam env) && liquidsoap /etc/liquidsoap/script.liq
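For reference, a hedged sketch of how this image might be built and started; the tag, container name, and host paths below are placeholders, not from this thread. Note that the --cache-only step above expects /etc/liquidsoap/script.liq to exist at build time, so either copy the script into the build context or drop that step if you only mount the script at runtime:

# Build the image (LIQUIDSOAP_VERSION can optionally be passed with --build-arg
# to constrain the opam package; leaving it unset uses the git checkout pinned above).
docker build -t liquidsoap-custom .
# Run it, mounting the script and music library read-only.
docker run -d --name station1 \
    -v "$PWD/script.liq:/etc/liquidsoap/script.liq:ro" \
    -v "$PWD/music:/music:ro" \
    liquidsoap-custom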
Our Docker image is based on Debian Bookworm which has OCaml 4.13.1-4, so we're good there.
That's not what the build config reports:
- OCaml version : 5.2.0
You might need to repeat the pinning dance above a couple of times if you find other build failures related to modules being too old.
That's not what the build config reports:
Oh, that's interesting. I didn't realize that opam init actually installs OCaml! Let me try again with OCaml 4.14.2, staying with liquidsoap 2.2.5, and I'll report back after it's run long enough to see what the RAM usage growth trend is.
Thanks for the tips and feedback!
I got the image built and swapped 2 of the 4 stations to that image. The initial memory footprints based on the base OCaml version:
That's already a two-thirds reduction! I will let them run for about a day and report back what their memory usage is at that time.
After 38 hours, the 4 instances have the following memory usage:
It's not exactly clear to me why the third instance stopped at ~800MB. None of the instances seem to be growing rapidly in RAM consumption, but I can't tell yet if there is a slower long-term trend upwards.
If we look at the numbers in terms of a ratio to the initial memory usage, the 4.14.2 instances quadrupled their memory usage while the 5.2.0 instances only doubled theirs, but the absolute growth is very close. It would be nice if the 4.14.2 instances could stay closer to their initial usage of 176MB. We observed this same pattern (starting at ~128MB and growing to ~512MB) on liquidsoap 2.0.0, and it is what led me to pursue upgrading to 2.2.5 in the first place (to see if the overall memory usage would be reduced).
It seems the next step is to build an image of liquidsoap 2.3.x using OCaml 4.14.2, and replace the two OCaml 5.2.0 instances with that image. Then we can compare liquidsoap 2.2.5 to 2.3.x on the same OCaml version. I assume there's a good reason you omitted jemalloc in the modified Dockerfile you sent, so I should probably also rebuild 2.2.5 without it to ensure an apples-to-apples comparison.
Yes, jemalloc increases memory usage at startup. Not sure of the long-term effect but if your focus is on minimizing memory usage, I'd recommend letting go of it for now.
Another thing you should add to a custom image is memtrace. This allows dumping OCaml memory allocations. All you have to do is add a MEMTRACE environment variable pointing to a local file path and then retrieve that file. See: https://blog.janestreet.com/finding-memory-leaks-with-memtrace/
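A minimal sketch of wiring that up with plain docker, assuming the image was built with the memtrace opam package; the container name, image tag, and trace path are placeholders:

# MEMTRACE tells the memtrace-instrumented binary where to write its trace.
docker run -d --name station1 -e MEMTRACE=/tmp/trace.ctf liquidsoap-custom
# After it has run for a while, copy the trace out of the container for analysis.
docker cp station1:/tmp/trace.ctf ./trace.ctf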
I did a quick docker compose:
Dockerfile:
FROM savonet/liquidsoap:rolling-release-v2.3.x
COPY audio /music/station
COPY main.liq /music/main.liq
USER root
RUN mkdir /nowplaying/
RUN chown -R liquidsoap:liquidsoap /music/station /nowplaying/
ENV RADIO_STATION=station
USER liquidsoap
RUN liquidsoap --cache-only /music/main.liq
ENTRYPOINT ["/usr/bin/liquidsoap", "/music/main.liq"]
main.liq:
#!/usr/bin/env liquidsoap
def log_song(m)
file.write(append=false, data="#{metadata.json.stringify(m)}\n", "/nowplaying/" ^ environment.get("RADIO_STATION") ^ ".json")
end
s = playlist("/music/" ^ environment.get("RADIO_STATION"), mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)
output.icecast(%mp3, mount="radio" ^ environment.get("RADIO_STATION"), host="icecast", password="hackme", fallible=true, s)
docker-compose.yml:
version: "3.3"
services:
liquidsoap-station:
environment:
MEMTRACE: /tmp/trace.ctf
container_name: liquidsoap-station
image: liquidsoap-station
networks:
- default
icecast:
container_name: icecast
environment:
- ICECAST_SOURCE_PASSWORD=hackme
- ICECAST_ADMIN_PASSWORD=hackme
- ICECAST_PASSWORD=hackme
- ICECAST_RELAY_PASSWORD=hackme
- ICECAST_HOSTNAME=icecast
- ICECAST_MAX_SOURCES=10
expose:
- 8000
image: libretime/icecast:latest
ports:
- 8000:8000
networks:
- default
Stats at startup:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
3e3694ad79bc liquidsoap-station 12.18% 86.72MiB / 3.821GiB 2.22% 276kB / 2MB 5.51MB / 6.38MB 10
5bc0baaa5192 icecast 0.31% 2.965MiB / 3.821GiB 0.08% 1.98MB / 271kB 90.1kB / 0B 5
Small side note, I skipped the --cache-only step during build because we mount the script in when running the image, so there isn't any script present in the image at that time.
Yes, jemalloc increases memory usage at startup. Not sure of the long-term effect but if your focus is on minimizing memory usage, I'd recommend letting go of it for now.
FWIW, increasing the initial footprint is fine if it reduces the long-term peak. I don't know if it actually does as I haven't done a side-by-side comparison with vs. without it, but an extra 50MB at startup is not a huge deal if it reduces the maximum usage after it's been running awhile by the same or more.
I would also recommend enabling the telnet server (add settings.server.telnet := true to your script) and doing the following on the running docker container:
% docker exec -ti d3fc3d601956 /bin/bash
# Might need to run as root and install telnet..
$ telnet localhost 1234
runtime.memory
Physical memory: 160.93 MB
Private memory: 80.04 MB
Managed memory: 57.29 MB
Swapped memory: 0 B
END
runtime.gc.compact
Done!
END
runtime.memory
Physical memory: 158.85 MB
Private memory: 77.93 MB
Managed memory: 49.81 MB
Swapped memory: 0 B
END
This will tell you whether the memory fluctuations you see are due to the OCaml GC having pending data to collect.
As I mentioned earlier, memory reporting and tracking is complicated. In the case of docker, I have seen two runs of the same image initially report 200MB or 85MB.
Besides paging and shared memory, there seems to be another layer of abstraction in how docker reports memory.
Typically, in the logs above, you might notice that the physical memory of the process is about 160MB. Yet, docker stats reports between 80MB and 90MB. This is because it subtracts shared memory. The end result is more or less consistent with the private memory reported by liquidsoap's internals.
More links re: docker memory reporting:
I don't know if your script processes a lot of files, but the last link is about an investigation into a memory leak related to creating files. The suggested workaround, echo 3 | sudo tee /proc/sys/vm/drop_caches, seems to have an impact.
All that to say, for memory reporting I would suggest watching docker stats, and regularly running the GC compaction (runtime.gc.compact) and the kernel drop_cache to clear any memory waiting to be freed.

We are deploying in a Kubernetes cluster using the containerd CRI, so Docker is not in the picture. :) (I have also found docker stats to be unreliable in terms of how useful its memory usage information is.)
One of the 2.3.0 containers was restarted when liquidsoap segfaulted, so for now we probably need to switch everything back to 2.2.5. I will do one final test comparing with and without jemalloc to see if it makes any difference. (And I'll gather some stats from the telnet server as well just to be sure.)
Well, most of the remarks above are related to memory management in general, not just docker. How do you get your memory usage reports? Do you know if those reports take shared memory into account?
Equally important: do you know if your container runners are able to re-use shared memory? In other terms: does each of your running containers load the shared libraries into memory, or are they able to re-use the same shared memory?
Shared memory in docker containers typically depends on the storage driver, and this is one of the less documented topics about it. See: https://stackoverflow.com/questions/35863608/shared-library-in-containers/40096194#40096194. It would be interesting to see if your container runners are behaving correctly in that regard as well.
Last, I would love a segfault stacktrace if you have one.. 🙂
I would expect any sharing of the libraries to happen during startup and thus contribute to the initial ~150MB of RAM, which is not really the usage that interests me -- it's the next ~350MB allocated that I find interesting. It's also worth noting that not all of the containers are guaranteed to run on the same physical host, so we must assume a worst-case scenario where no memory is shared at all when budgeting RAM. Still, I will gather some stats from the telnet interface to see how they compare to those reported by Kubernetes.
If you can tell me how to obtain the segfault stacktrace, I will happily provide it.
Sounds good. Yes, I agree on the growth. It would be good if you can run that memory compaction command and also the kernel drop_cache call to see if the growth is due to some memory accumulating somewhere.
At face value, I suspect something in the dynamic behavior of the script. The code you have is really minimal and I would be surprised (though this has happened in the past) if there was a hard memory leak like lost memory allocations.
I would be tempted to look either at the OCaml GC not doing its job or something accumulating somewhere. Typically, I wonder if data can accumulate on the sending part, when sending data to icecast.
Oh, yes, also: using the MEMTRACE memory trace functionality would tell us if the memory is accumulating on the OCaml side. This is probably the fastest way to identify the source of the issue.
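For completeness, a sketch of how such a trace is typically inspected, assuming the memtrace_viewer package from the Jane Street post linked earlier; the trace path is a placeholder:

# Install the viewer and open the trace in its local web UI.
opam install memtrace_viewer
memtrace-viewer ./trace.ctf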
Hi @cdhowie, may I ask what kind of audio files you're using? Could you collect some statistics?
find . -type f -exec ffprobe -hide_banner -print_format json -show_format -show_streams '{}' \; 2>/dev/null > /tmp/music.json
< /tmp/music.json jq -c '{format: .format.format_name, streams: [.streams[].codec_name]}' | sort | uniq -c
I expect the result to look like this
10 {"format":"mp3","streams":["mp3"]}
20 {"format":"mp3","streams":["mp3","mjpeg"]}
30 {"format":"mp3","streams":["mp3","png"]}
40 {"format":"ogg","streams":["vorbis"]}
Things seem stable here with 2.3.x:
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
773290dca97b icecast 0.27% 2.891MiB / 3.821GiB 0.07% 1.25GB / 171MB 2.25MB / 1.6MB 5
d3fc3d601956 liquidsoap-station 8.25% 84.24MiB / 3.821GiB 2.15% 173MB / 1.25GB 1.1GB / 641MB 10
The kernel cache flush does seem to have an effect, as well as compacting the GC memory.
Things are stable here, I think I'll give it another 12/24h and will shut it down:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
773290dca97b icecast 0.27% 2.891MiB / 3.821GiB 0.07% 1.99GB / 271MB 2.7MB / 2.48MB 5
d3fc3d601956 liquidsoap-station 10.54% 87.65MiB / 3.821GiB 2.24% 274MB / 1.98GB 1.51GB / 662MB 10
Today I can see a slight increase in memory usage:
Physical memory: 181.65 MB
Private memory: 102.29 MB
Managed memory: 73.34 MB
Swapped memory: 0 B
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
773290dca97b icecast 0.27% 2.891MiB / 3.821GiB 0.07% 2.74GB / 374MB 4.02MB / 3.38MB 5
d3fc3d601956 liquidsoap-station 11.78% 99.99MiB / 3.821GiB 2.56% 377MB / 2.73GB 1.97GB / 674MB 10
The memtrace report does not show anything on the OCaml side, but it only samples a small fraction of the allocations for performance reasons, so we might need to wait longer to get more conclusive results.
Things seem stable here but only after running a couple of memory compaction and cache flush:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
773290dca97b icecast 0.27% 2.891MiB / 3.821GiB 0.07% 4.38GB / 597MB 6.77MB / 5.39MB 5
d3fc3d601956 liquidsoap-station 11.71% 100.2MiB / 3.821GiB 2.56% 603MB / 4.37GB 2.96GB / 696MB 10
runtime.memory
Physical memory: 182.01 MB
Private memory: 102.47 MB
Managed memory: 73.34 MB
Swapped memory: 0 B
It would be good if you can run that memory compaction command and also the kernel drop_cache call to see if the growth is due to some memory accumulating somewhere.
I'm going to skip the drop_cache call because that evicts the I/O cache across the whole machine, and liquidsoap is not the only service running on these nodes. This could therefore affect performance of other services. (It also shouldn't change the memory used by each process, AIUI, but only the amount consumed by the kernel for various caches.)
After running for a few days (I was on vacation over the weekend so that gave it some time to run and have the RAM usage settle), here are my stats.
The services without jemalloc started around 100MB and those with jemalloc started around 187MB.
After 2 days, the services without jemalloc are both using around 470MB. The services with jemalloc are using around 520MB.
At least for this short of a test, jemalloc therefore seems to be harmful overall. Perhaps if running for months it would help with memory fragmentation, but at the moment I think it's better not to use it.
I tried to run the commands over the telnet interface, but I missed your messages about building with memtrace, so those commands aren't available. If you would like, I can build another image and run these tests again.
@vitoyucepi All of the files we're using are MP3s.
365 {"format":"mp3","streams":["mp3","mjpeg"]}
Thanks for the feedback! Flushing the kernel cache definitely has an influence on the memory reported via docker stats:
$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
773290dca97b icecast 0.26% 2.93MiB / 3.821GiB 0.07% 4.81GB / 656MB 7.22MB / 5.89MB 5
d3fc3d601956 liquidsoap-station 12.97% 108.6MiB / 3.821GiB 2.78% 663MB / 4.8GB 3.28GB / 703MB 10
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
773290dca97b icecast 0.27% 2.891MiB / 3.821GiB 0.07% 4.81GB / 656MB 7.22MB / 5.9MB 5
d3fc3d601956 liquidsoap-station 12.12% 106.9MiB / 3.821GiB 2.73% 663MB / 4.8GB 3.28GB / 703MB 10
I'm still seeing a slight increase of memory here so I'm gonna keep on investigating. However, it's nothing like what you're seeing.
I'm not sure where your issue comes from, TBH. I would suggest trying to see if you can reproduce it outside of the docker/container abstraction first.
@cdhowie, previously we had #3504; since then I've been using
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
to disable image decoding and use only ffmpeg. @toots, am I doing it right?
Is it possible for you to build and run liquidsoap without taglib, mad, lame, vorbis, samplerate, ssl, and jemalloc? I think you could use ffmpeg, memtrace, mem_usage, posix-time2 (optional), and prometheus (optional).

But you have to adjust the encoder in your script: instead of %mp3, use something like %ffmpeg(format="mp3", %audio(codec="libmp3lame")).
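If you are building from the Dockerfile posted earlier in this thread, one hedged way to try this is to override its OPAM_PACKAGES build argument instead of editing the file; the exact package list below is an assumption based on the suggestion above:

# Rebuild with only the ffmpeg-based stack plus the memory-debugging packages.
docker build \
    --build-arg OPAM_PACKAGES="liquidsoap ffmpeg memtrace mem_usage" \
    -t liquidsoap-minimal .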
@vitoyucepi But it looks like this was fixed in 2.2.4, and we still see slow and steady growth up to ~500MB with 2.2.5, so we should not be hitting this exact issue, no?
should not be hitting this exact issue
You may never know. Enabling memtrace may reveal the cause of the problem.
Memory was stubbornly stable on a plain process using macOS instrumentation to keep track of memory. I think next I'll investigate with a plain process on a Linux machine.
I do agree with @vitoyucepi, the metadata idea would be good to get out of the way. @cdhowie, do you know if your files have cover metadata in them?
Some of the files do have cover metadata, yes. I don't recall if we actually consume this data downstream; we might, so I'm hesitant to do anything that would disable extracting it at the moment.
I have been running with memtrace for about 40 hours. I notice that 2.2.5 doesn't seem to have the runtime.memory telnet command, so I can't run that. I don't know if the memtrace files would contain any potentially sensitive data, so I don't know how wise it would be to attach them directly to this issue.
I should say though that I could spin up a fifth instance with changes to disable cover metadata extraction, publishing to an endpoint that just won't have any listeners. From the perspective of liquidsoap it should look just the same, and then we could at least compare memory usage over time.
Thanks. They should not, but just in case you can email them directly to me at toots@rastageeks.org
Unfortunately, the OCaml memtrace logs are pretty flat:
However, memtrace only captures a fraction of the allocations, in order to be able to work against production systems.
Thank you for your patience with this @cdhowie. I suppose filtering cover metadata before writing your JSON file would be an easy test while I think of other ways to track this:
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
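For the "filter before writing" idea (as opposed to disabling the image decoders as above), a minimal sketch of what log_song could do instead; it assumes metadata.cover.remove is available in your liquidsoap version, otherwise the cover-related keys would need to be filtered out of the metadata list by hand:

def log_song(m)
  # Strip cover art entries before serializing to JSON.
  m = metadata.cover.remove(m)
  file.write(append=false, data="#{metadata.json.stringify(m)}\n", "/nowplaying/" ^ environment.get("RADIO_STATION") ^ ".json")
end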
I've set up a fifth "hidden" station running with this config so we can compare memory usage. I'll let you know after it's been running for awhile how memory usage compares.
After running for 8 hours, the settings provided above do not appear to meaningfully impact memory use. Out of the 5 instances, the lowest and highest both do not have the settings modifications and are sitting at 276MB and 310MB memory usage. The instance with the settings modifications is at 306MB.
Hi @cdhowie, what's the script for the 5th instance? Could you please share it?
settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
def log_song(m)
file.write(append=false, data="#{metadata.json.stringify(m)}\n", "/nowplaying/test" ^ environment.get("RADIO_STATION") ^ ".json")
end
s = playlist("/music/" ^ environment.get("RADIO_STATION"), mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)
output.icecast(%mp3, mount="radiotest" ^ environment.get("RADIO_STATION"), password="...", fallible=true, s)
Previously, I advised you to rebuild liquidsoap without extra dependencies in this message: https://github.com/savonet/liquidsoap/issues/4058#issuecomment-2259146171.
May I ask you to do that and also replace the %mp3 encoder with %ffmpeg(format="mp3", %audio(codec="libmp3lame"))?
The script should look similar to:
settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
def log_song(m)
  file.write(
    '/nowplaying/test#{environment.get("RADIO_STATION")}.json',
    data="#{metadata.json.stringify(m)}\n",
    append=false
  )
end
s = playlist('/music/#{environment.get("RADIO_STATION")}', mode="random", reload=300)
s.on_track(log_song)
s = crossfade(fade_out=3., fade_in=3., duration=5., s)
s = nrj(s)
enc = %ffmpeg(format="mp3", %audio(codec="libmp3lame"))
output.icecast(
enc,
s,
mount='radiotest#{environment.get("RADIO_STATION")}',
password="...",
fallible=true
)
I've rebuilt the image without those dependencies and modified the test station to use that image.
After 13 hours, memory usage is lower but not as much as indicated on the issue about leaks related to processing cover art. It's sitting at 290MB where the min/max of the other 4 instances is 295-366MB. It's possible this is just random variance, and it's still climbed quite a bit from the initial usage of ~128MB. I'll let it run another 12 hours and let you know what memory usage looks like then.
Interestingly, CPU usage is about 15% lower on average than the other 4 instances. The ffmpeg-based encoder seems more efficient in this regard, so even if memory usage isn't significantly improved, it may be worth us switching to the ffmpeg encoder just for that.
I think the next step should be to remove crossfade, nrj and on_track.
settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
s = playlist('/music/#{environment.get("RADIO_STATION")}', mode="random", reload=300)
enc = %ffmpeg(format="mp3", %audio(codec="libmp3lame"))
output.icecast(
enc,
s,
mount='radiotest#{environment.get("RADIO_STATION")}',
password="...",
fallible=true
)
Perhaps the problem here is mode="random"
, which differs from the default randomize mode, combined with reload=300
.
settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
s = playlist('/music/#{environment.get("RADIO_STATION")}')
enc = %ffmpeg(format="mp3", %audio(codec="libmp3lame"))
output.icecast(
enc,
s,
mount='radiotest#{environment.get("RADIO_STATION")}',
password="...",
fallible=true
)
After that, you'll probably want to disable the output.
settings.server.telnet.port := int_of_string(environment.get("TELNET_PORT"))
settings.server.telnet := true
settings.decoder.mime_types.ffmpeg.images := []
settings.decoder.file_extensions.ffmpeg.images := []
settings.decoder.decoders := ["ffmpeg"]
settings.request.metadata_decoders := ["ffmpeg"]
s = playlist('/music/#{environment.get("RADIO_STATION")}')
output.dummy(s, fallible=true)
@cdhowie I forgot to mention: you can set the OCaml GC to more aggressive settings to see if the memory is accumulating because the GC is too slow to reclaim it:
runtime.gc.set(runtime.gc.get().{ space_overhead = 20 })
After 24h without crossfade/nrj/on_track the memory usage again is lower than the minimum of the other 4 instances but only by 10MB, so this is probably random variance.
Now trying with the playlist mode change.
@toots I can throw that in after running this round of tests; I don't want to change two things at once.
@vitoyucepi I tried each suggestion in order and after about 24 hours, none of them resulted in any significant change to memory usage.
I am going to revert to the original script and try @toots' suggestion next.
After doing a whole bunch of testing given the instructions/ideas above, I started digging into more granular memory usage statistics, which I probably should have done a lot sooner (per this comment). It turns out that cgroup memory consumption information includes the page cache of the cgroup, which is the opposite of what I understood to be true before. (I expected the page cache to be global across the entire kernel and not attributed to individual cgroups, which is why I did not dig into this sooner. I'm not exactly sure where I acquired this misinformation... maybe this is how it worked with cgroups v1, or a different tool entirely.)
Watching the usage of individual processes and the cgroup as a whole more closely, I found that the cgroup's memory usage seems to peak around ~300MB during startup and then drops to around ~110MB almost immediately. Liquidsoap's RSS settles around 180MB of usage after it's been running for awhile. All of the other usage above 180MB reported by k8s is likely attributable to the page cache holding the audio files in memory.
Doing some further research, I found that a cgroup will begin evicting entries in the page cache once the total memory consumption of the cgroup approaches the cgroup's configured memory limit. All this time I've been testing without a memory limit, as the k8s documentation indicates that a process will be killed if it tries to allocate above the configured memory limit (which I did not want to happen), and I wasn't sure yet where usage would eventually settle. It seems that this backfired, and the absence of a memory limit encouraged the kernel to keep all of the music files on disk in the page cache for as long as possible, which counted against the container as memory usage. (Technically this would not really be a problem in a "traditional" deployment, but within k8s this contributes to memory pressure on the node where the container is deployed; in other words, limiting the page cache's consumption here is a good idea as it allows that memory to be used by other pods.)
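For reference, one rough way to see this split on a running pod under cgroup v2 (the pod name is a placeholder, and the cgroup paths depend on the node and runtime setup):

# Total memory charged to the container's cgroup.
kubectl exec <pod> -- cat /sys/fs/cgroup/memory.current
# "anon" is process memory, "file" is page cache attributed to the cgroup.
kubectl exec <pod> -- grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat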
I initially tried adding a 256MB memory limit but this failed immediately due to the ~300MB peak consumption during startup, so I raised the limit to 384MB and let it run for a few days, and the results are that the containers are indeed staying within the 384MB usage limit and the RSS of liquidsoap itself is within 180-190MB.
The conclusion is that there does not appear to be any issue with liquidsoap itself. I sincerely apologize for wasting everyone's time, and I very much appreciate all of the time that you all took to investigate this for me. In appreciation of the time you've spent both on this issue and in maintaining liquidsoap in general, my company will be making a donation to the project later today.
Further reading: https://github.com/kubernetes/kubernetes/issues/43916
Thank you so much for your patience and for getting to the bottom of this. This is very valuable information; I've been waiting for this to settle before writing that blog post about memory usage!
The memory peak that you observe at startup is due to the script being typechecked. If you are able to utilize the new caching functionality, it should be possible to skip this step and set a tighter memory limit.
Thank you so much for your patience
I feel like I should be saying that to you! 🙂
The memory peak that you observe at startup is due to the script being typechecked. If you are able to utilize the new caching functionality, it should be possible to skip this step and set a tighter memory limit.
Oh, that's very good to know. All of our stations use exactly the same script but with different supplied environment variables, so we should be able to create an emptyDir volume to mount to LIQ_CACHE_USER_DIR, and populate that cache with an initContainer once for all of the stations.
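A minimal sketch of what that initContainer could run, assuming a shared emptyDir is mounted at /cache in both the init container and the station containers (the paths are placeholders):

# Warm the script cache once; the station containers then start with the same
# LIQ_CACHE_USER_DIR and skip the typechecking peak at startup.
export LIQ_CACHE_USER_DIR=/cache
liquidsoap --cache-only /music/main.liq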
However, we will probably need to wait until the 2.3 release to implement this as the segfault we experienced trying the 2.3 branch makes me very hesitant to use this in production just yet.
Finally wrote that blog post! Link: https://www.liquidsoap.info/blog/2024-10-26-a-faster-liquidsoap-part-deux/
Finally wrote that blog post! Link: https://www.liquidsoap.info/blog/2024-10-26-a-faster-liquidsoap-part-deux/
Nice in-depth blog post 👍🏻
Description
2.2.5 appears to have some kind of memory leak. I notice that the process starts by using about 500MB RAM (which is already a bit high IMO considering that our script doesn't do that much) and over the course of the next 24 hours grows to 1GB RAM usage. We've had to specify a hard memory limit for the container to keep it from affecting other services, which causes the container to restart about every 24h when the orchestrator notices that it's using too much RAM.
I rebuilt with jemalloc support to see if this helps and it does not.
Steps to reproduce
We are using the following script:
Expected behavior
Memory usage should remain stable.
Liquidsoap version
Liquidsoap build config
Installation method
From OPAM
Additional Info
The Dockerfile we built our image from is based on the one from this repository: https://github.com/pltnk/docker-liquidsoap
We had to modify it as it doesn't support 2.2.5 due to Ubuntu Focal having too old a version of OPAM. Here is the Dockerfile we are using to build: