savonet / liquidsoap

Liquidsoap is a statically typed, general-purpose scripting language with dedicated operators and backends for all things media: streaming, file generation, automation, HTTP backend and more.
http://liquidsoap.info
GNU General Public License v2.0

String functions UTF-8 issue #4109

Status: Closed (DigiBC closed this issue 3 weeks ago)

DigiBC commented 3 weeks ago

Description

The string functions seem to work correctly only with ASCII characters, not with non-ASCII Unicode characters encoded as UTF-8. Such characters are counted as two or three characters each (matching the number of bytes in their UTF-8 encoding), which leads to incorrect calculations.

Steps to reproduce

The issue is reproducible with the official package 2.2.5 (and earlier) and with the Rolling Release 2.3.x (tested with build 4980bc0).

Here are some examples using Liquidsoap's interactive interpreter:

ASCII (working correctly):

string.length("e");;
- : int = 1
string.length("o");;
- : int = 1
string.length("~");;
- : int = 1

UTF-8:

string.length("é");;
- : int = 2
string.length("ö");;
- : int = 2
string.length("€");;
- : int = 3
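The reported lengths match the number of bytes in each character's UTF-8 encoding. The following Python sketch (not Liquidsoap code; the helper names `byte_length` and `char_length` are made up for illustration) reproduces both ways of counting for the same six characters:

```python
# Illustration only: compare UTF-8 byte counts with Unicode code point counts.
# Non-ASCII characters are written as escapes so the source file encoding
# cannot affect the result.
SAMPLES = ["e", "o", "~", "\u00e9", "\u00f6", "\u20ac"]  # e o ~ é ö €


def byte_length(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def char_length(s: str) -> int:
    """Number of Unicode code points in s."""
    return len(s)


for s in SAMPLES:
    print(f"{s!r}: bytes={byte_length(s)} code points={char_length(s)}")
```

The byte counts for é, ö and € are 2, 2 and 3, the same values that `string.length` returns in the session above, while the code point count is 1 in every case.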

Here is a more practical example: extracting the song title as a substring. If the band name contains only ASCII characters, the result is correct:

string.sub("Queensryche - Silent lucidity (live)", start=14, length=15);;
- : string = "Silent lucidity"

With the original spelling of the band name, which contains a multi-byte UTF-8 character (ÿ), the result is shifted by one character:

string.sub("Queensrÿche - Silent lucidity (live)", start=14, length=15);;
- : string = " Silent lucidit"
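The one-character shift is what byte-based indexing would produce, since ÿ takes two bytes in UTF-8. As a rough Python illustration (the functions `sub_bytes` and `sub_chars` are hypothetical and only mimic the two indexing schemes; they are not Liquidsoap's implementation):

```python
# Illustration only: byte-indexed versus code-point-indexed substrings.
TITLE = "Queensr\u00ffche - Silent lucidity (live)"  # \u00ff is ÿ


def sub_bytes(s: str, start: int, length: int) -> str:
    """Slice the UTF-8 bytes of s, as a byte-oriented string.sub would."""
    raw = s.encode("utf-8")
    # errors="replace" keeps the sketch from raising if a slice
    # happens to cut a multi-byte character in half.
    return raw[start:start + length].decode("utf-8", errors="replace")


def sub_chars(s: str, start: int, length: int) -> str:
    """Slice s by Unicode code points."""
    return s[start:start + length]


print(repr(sub_bytes(TITLE, 14, 15)))  # byte offsets: shifted result
print(repr(sub_chars(TITLE, 14, 15)))  # code point offsets: expected result
```

With byte offsets, every position after the ÿ is one higher than its character position, so `start=14` lands on the space before "Silent" instead of on the "S".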

Expected behavior

Each UTF-8 character should only be counted as one character so that calculations with string functions generate correct results.
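One common way to get such a count directly from UTF-8 bytes is to skip continuation bytes. The Python sketch below shows that technique under the assumption that the input is valid UTF-8; it is not taken from the Liquidsoap code base, and the name `utf8_length` is hypothetical:

```python
# Illustration only: count code points in UTF-8 data without decoding it.
def utf8_length(raw: bytes) -> int:
    """Count code points in raw, assuming raw is valid UTF-8.

    Continuation bytes have the bit pattern 10xxxxxx; every other
    byte starts a new code point, so only those are counted.
    """
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


name = "Queensr\u00ffche".encode("utf-8")  # \u00ff is ÿ
print(len(name), utf8_length(name))
```

This counts Unicode code points, not user-perceived characters: a letter followed by a combining accent would still count as two.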

Liquidsoap version

Liquidsoap 2.3.0+git@4980bc075
Copyright (c) 2003-2024 Savonet team
Liquidsoap is open-source software, released under GNU General Public License.
See <http://liquidsoap.info> for more information.

Liquidsoap build config

* Liquidsoap version  : 2.3.0+git@4980bc075

 * Compilation options
   - Release build       : false
   - Git SHA             : 4980bc075
   - OCaml version       : 4.14.2
   - OS type             : Unix
   - Libs versions       : alsa=0.3.0 angstrom=0.16.0 ao=0.2.4 asetmap=0.8.1 asn1-combinators=0.2.6 astring=0.8.5 base=v0.16.3 base.base_internalhash_types=v0.16.3 base.caml=v0.16.3 base.shadow_stdlib=v0.16.3 base64=3.5.1 bigarray=[distributed with Ocaml] bigarray-compat=1.1.0 bigstringaf=0.9.1 bjack=0.1.6 bos=0.2.1 bytes=[distributed with OCaml 4.02 or above] ca-certs=v0.2.3 camlp-streams camomile.lib=2.0 cohttp=5.3.1 cohttp-lwt=5.3.0 cohttp-lwt-unix=5.3.0 conduit=6.2.3 conduit-lwt=6.2.3 conduit-lwt-unix=6.2.3 cry=1.0.3 cstruct=6.2.0 ctypes=0.22.0 ctypes-foreign=0.22.0 ctypes.stubs=0.22.0 curl=0.9.2 domain-name=0.4.0 domain_shims dssi=0.1.5 dtools=0.4.5 dune-build-info=3.16.0 dune-private-libs.dune-section=3.16.0 dune-site=3.16.0 dune-site.private=3.16.0 duppy=0.9.4 eqaf=0.9 eqaf.bigstring=0.9 eqaf.cstruct=0.9 faad=0.5.2 fdkaac=0.3.3 ffmpeg-av=1.2.0 ffmpeg-avcodec=1.2.0 ffmpeg-avdevice=1.2.0 ffmpeg-avfilter=1.2.0 ffmpeg-avutil=1.2.0 ffmpeg-swresample=1.2.0 ffmpeg-swscale=1.2.0 fileutils=0.6.4 flac=0.5.1 flac.decoder=0.5.1 flac.ogg=0.5.1 fmt=0.9.0 fpath=0.7.3 frei0r=0.1.2 gd=1.1 gen=1.1 gmap=0.3.0 hkdf=1.0.4 inotify=2.0-62-g5e58536 integers ipaddr=5.6.0 ipaddr-sexp=5.6.0 ipaddr.unix=5.6.0 irc-client irc-client-unix ladspa=0.2.2 lame=0.3.7 lastfm=0.3.4 lilv=0.2.0 liquidsoap-lang=2.3.0 liquidsoap-lang.console=2.3.0 liquidsoap_alsa=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_ao=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_bjack=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_builtins=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_core=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_dssi=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_faad=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_fdkaac=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_ffmpeg=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_flac=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_frei0r=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_gd=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_irc=rolling-release-v2.3.x-2-g4980bc0 
liquidsoap_ladspa=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_lame=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_lastfm=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_lilv=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_lo=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_mad=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_ogg=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_ogg_flac=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_optionals=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_opus=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_osc=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_oss=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_portaudio=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_posix_time=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_prometheus=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_pulseaudio=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_runtime=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_samplerate=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_sdl=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_sdl_log_level=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_shine=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_soundtouch=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_speex=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_sqlite=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_srt=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_ssl=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_stdlib=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_stereotool=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_theora=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_tls=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_vorbis=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_xmlplaylist=rolling-release-v2.3.x-2-g4980bc0 liquidsoap_yaml=rolling-release-v2.3.x-2-g4980bc0 lo=0.2.0 logs=0.7.0 logs.fmt=0.7.0 logs.lwt=0.7.0 lwt=5.7.0 lwt.unix=5.7.0 macaddr=5.6.0 mad=0.5.3 magic-mime=1.3.1 mem_usage=0.1.1 memtrace=0.2.3 menhirLib=20231231 metadata=0.3.0 mirage-crypto=0.11.3 mirage-crypto-ec=0.11.3 mirage-crypto-pk=0.11.3 mirage-crypto-rng=0.11.3 
mirage-crypto-rng.unix=0.11.3 mm=0.8.5 mm.audio=0.8.5 mm.base=0.8.5 mm.image=0.8.5 mm.midi=0.8.5 mm.video=0.8.5 ocplib-endian ocplib-endian.bigstring ogg=0.7.4 ogg.decoder=0.7.4 opus=0.2.3 opus.decoder=0.2.3 osc osc-unix parsexp=v0.16.0 pbkdf portaudio=0.2.3 posix-base=5a7f328 posix-socket=5a7f328 posix-socket.constants=5a7f328 posix-socket.stubs=5a7f328 posix-socket.types=5a7f328 posix-time2=5a7f328 posix-time2.constants=5a7f328 posix-time2.stubs=5a7f328 posix-time2.types=5a7f328 posix-types=5a7f328 posix-types.constants=5a7f328 ppx_compare.runtime-lib=v0.16.0 ppx_hash.runtime-lib=v0.16.0 ppx_sexp_conv.runtime-lib=v0.16.0 prometheus=1.2 prometheus-app=1.2 ptime=1.1.0 ptime.clock.os=1.1.0 pulseaudio=0.1.6 re=1.11.0 result=1.5 rresult=0.7.0 samplerate=0.1.7 saturn_lockfree=0.4.1 sedlex=3.2 seq=[distributed with OCaml 4.07 or above] sexplib=v0.16.0 sexplib0=v0.16.0 shine=0.2.3 soundtouch=0.1.9 speex=0.4.2 speex.decoder=0.4.2 sqlite3=5.1.0 srt=0.3.1 srt.constants=0.3.1 srt.stubs=0.3.1 srt.stubs.locked=0.3.1 srt.types=0.3.1 ssl=0.7.0 stdlib-shims=0.3.0 stereotool=rolling-release-v2.3.x-2-g4980bc0 str=[distributed with Ocaml] stringext=1.6.0 theora=0.4.1 theora.decoder=0.4.1 threads=[distributed with Ocaml] threads.posix=[internal] tls=0.17.4 tsdl=v1.0.0 tsdl-image=0.5 tsdl-ttf=0.6 unix=[distributed with Ocaml] unix-errno=52c6ecb unix-errno.errno_bindings=52c6ecb unix-errno.errno_types=52c6ecb unix-errno.errno_types_detected=52c6ecb unix-errno.unix=52c6ecb uri=4.4.0 uri-sexp=4.4.0 uri.services=4.4.0 vorbis=0.8.1 vorbis.decoder=0.8.1 x509=0.16.5 xmlm=1.4.0 xmlplaylist=0.1.5 yaml=3.2.0 yaml.bindings=3.2.0 yaml.bindings.types=3.2.0 yaml.c=3.2.0 yaml.ffi=3.2.0 yaml.types=3.2.0 zarith=1.13
   - architecture        : amd64
   - host                : x86_64-pc-linux-gnu
   - target              : x86_64-pc-linux-gnu
   - system              : linux
   - ocamlopt_cflags     : -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC
   - native_c_compiler   : gcc -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -D_FILE_OFFSET_BITS=64
   - native_c_libraries  : -lm

 * Configured paths
   - mode              : posix
   - standard library  : /usr/share/liquidsoap/libs
   - scripted binaries : /usr/share/liquidsoap/bin
   - rundir            : /var/run/liquidsoap
   - logdir            : /var/log/liquidsoap
   - user cache        : $HOME/.cache/liquidsoap (override with $LIQ_CACHE_USER_DIR)
   - system cache      : /var/cache/liquidsoap (override with $LIQ_CACHE_SYSTEM_DIR)
   - camomile files    : /usr/share/liquidsoap/camomile

 * Supported input formats
   - MP3               : yes
   - AAC               : yes
   - Ffmpeg            : yes
   - Flac (native)     : yes
   - Flac (ogg)        : yes
   - Opus              : yes
   - Speex             : yes
   - Theora            : yes
   - Vorbis            : yes
   - WAV/AIFF          : yes (native)

 * Supported output formats
   - FDK-AAC           : yes
   - FFmpeg            : yes
   - MP3               : yes
   - MP3 (fixed-point) : yes
   - Flac (native)     : yes
   - Flac (ogg)        : yes
   - Opus              : yes
   - Speex             : yes
   - Theora            : yes
   - Vorbis            : yes
   - WAV/AIFF          : yes (native)

 * Tags
   - AAC               : yes
   - FFmpeg            : yes
   - FLAC (native)     : yes
   - Flac (ogg)        : yes
   - Native decoder    : yes
   - Vorbis            : yes

 * Input / output
   - ALSA              : yes
   - AO                : yes
   - FFmpeg            : yes
   - JACK              : yes
   - OSS               : yes
   - Portaudio         : yes
   - Pulseaudio        : yes
   - SRT               : yes

 * Audio manipulation
   - FFmpeg            : yes
   - LADSPA            : yes
   - Lilv              : yes
   - Samplerate        : yes
   - SoundTouch        : yes
   - StereoTool        : yes

 * Video manipulation
   - camlimages        : no (requires camlimages)
   - FFmpeg            : yes
   - frei0r            : yes
   - ImageLib          : no (requires imagelib)
   - SDL               : yes

 * MIDI manipulation
   - DSSI              : yes

 * Visualization
   - GD                : yes
   - Graphics          : no (requires graphics)
   - SDL               : yes

 * Additional libraries
   - FFmpeg filters    : yes
   - FFmpeg devices    : yes
   - inotify           : yes
   - irc               : yes
   - jemalloc          : no (requires jemalloc)
   - lastfm            : yes
   - lo                : yes
   - memtrace          : no (requires memtrace)
   - osc               : yes
   - ssl               : yes
   - sqlite3           : yes
   - tls               : yes
   - posix-time2       : yes
   - windows service   : no (requires winsvc)
   - YAML support      : yes
   - XML playlists     : yes

 * Monitoring
   - Prometheus        : yes

Installation method

From official packages in the release artifacts

Additional Info

Tested with Debian 12 and Ubuntu 24.04 (AMD64).

toots commented 3 weeks ago

Thanks, that was a great suggestion! #4111 should take care of it.