mchehab / zbar

ZBar is an open source software suite for reading bar codes from various sources, including webcams. As its development stopped in 2012, I took on the task of keeping it updated with the V4L2 API. This is the main repository for it. There's a clone at LinuxTV.org, and another one at gitlab.
https://linuxtv.org/downloads/zbar/
GNU Lesser General Public License v2.1

zbar incorrectly detects iso-8859-1 encoded QRCodes as big5 on musl libc #281

Open mgorny opened 9 months ago

mgorny commented 9 months ago

When running on musl libc, zbar incorrectly detects iso-8859-1 encoded QRcodes as "big5". I originally noticed this through a test failure in the segno package.

An example QRcode file is: test

On a glibc system this is decoded correctly:

$ zbarimg test.png 
QR-Code:Märchenbücher
scanned 1 barcode symbols from 1 images in 0 seconds

On a musl system, it gets decoded as:

$ zbarimg test.png 
QR-Code:M酺chenb𡡷her
scanned 1 barcode symbols from 1 images in 0 seconds

From debugging, I've established that the problem lies in zbar trying big5 first, and expecting iconv() to fail for this string, as it does on glibc:

$ iconv -f utf8 -t iso-8859-1 <<<'Märchenbücher' | iconv -f big5 -t utf8
M酺chenbiconv: illegal input sequence at position 8

However, it doesn't fail on musl libc:

$ iconv -f utf8 -t iso-8859-1 <<<'Märchenbücher' | iconv -f big5 -t utf8
M酺chenb𡡷her

Confirmed with zbar as of a549566ea11eb03622bd4458a1728ffe3f589163, musl 1.2.3 (Gentoo) and 1.2.4_git20230717 (Alpine).
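
To illustrate why "try encodings in order and take the first one that converts without error" is fragile, here is a minimal Python sketch. It is not zbar's actual code: `first_decodable` is a hypothetical helper, and Python's codecs have their own tables that may agree with neither glibc nor musl. The point is only that the result depends entirely on which candidates reject the input.

```python
from typing import Optional, Sequence


def first_decodable(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper for illustration; this is not how zbar is written.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
        return name
    return None


payload = "Märchenbücher".encode("iso-8859-1")

# UTF-8 rejects the lone 0xe4 byte, so the fallback is reached.
print(first_decodable(payload, ["utf-8", "iso-8859-1"]))  # iso-8859-1
```

Any permissive multi-byte candidate placed ahead of iso-8859-1 wins whenever its converter happens to accept the bytes, which is the situation described above for big5 on musl.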

mgorny commented 9 months ago

Apparently the difference is that glibc rejects codes for "user-defined" Big5 characters, whereas musl accepts them. If I shorten the string to Märchen, I can reproduce the same problem on a glibc system.

tormodvolden commented 2 months ago

Just to spell it out (please correct me if I am wrong), "ä" (a with umlaut) is 0xe4 in iso-8859-1, which is a valid start byte for Big5 (in the "Less frequently used characters" set). If the following byte is 0x40-0x7e or 0xa1-0xfe, the pair can be a valid Big5 character, so e.g. "är" will pass as big5, whereas "ä." (or a string ending with "ä") will fail.

So zbar will favour Big5 in such cases, although it should have favoured iso-8859-1, which is the default for QR codes per the standard.

"ü" (u with umlaut) is 0xfc in iso-8859-1, which is a valid start byte for Big5 in the "Reserved for user-defined characters" set. This fails on glibc but passes as Big5 on musl (independently of the following byte?).
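
The byte ranges above can be checked with a small Python sketch. This is a structural test only, not a real Big5 decoder, and it models neither glibc nor musl: `looks_like_big5` and `in_user_defined_zone` are hypothetical names. The trail-byte ranges are the ones quoted above; the lead-byte range 0x81-0xfe and the 0xf9d6-0xfefe user-defined zone are assumptions taken from the commonly cited Big5 layout.

```python
def looks_like_big5(data: bytes) -> bool:
    """Structural check: every high byte must start a lead/trail pair."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # plain ASCII byte, single-byte character
            i += 1
            continue
        # A lead byte needs a following trail byte (assumed lead range 0x81-0xfe).
        if not (0x81 <= lead <= 0xFE) or i + 1 >= len(data):
            return False
        trail = data[i + 1]
        if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
            return False
        i += 2
    return True


def in_user_defined_zone(lead: int, trail: int) -> bool:
    """True if the pair lies in the assumed 0xf9d6-0xfefe user-defined zone."""
    return 0xF9D6 <= ((lead << 8) | trail) <= 0xFEFE


for text in ["är", "ä.", "Märchen", "Märchenbücher"]:
    print(text, looks_like_big5(text.encode("iso-8859-1")))
```

Under these assumptions "är" and "Märchen" are structurally plausible Big5 and "ä." is not. "üc" (0xfc 0x63) is also structurally plausible but falls in the user-defined zone, which would match the glibc/musl difference described above.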

tormodvolden commented 2 months ago

And if we were to ignore Big5, "är" would pass as valid SJIS, which zbar currently favours over iso-8859-1.

tormodvolden commented 2 months ago

The wrong detection as big5 is also reported in #212.