lubo / zxinglight

A simple wrapper for ZXing C++
MIT License

Binary data. #10

Open kousu opened 5 years ago

kousu commented 5 years ago

The QR spec is supposed to support binary data, but this does not currently work. Not even zxing-cpp can handle it:

    $ zxing --more --verbose tests/fixtures/amen-02.png
    Hybrid binarizer failed: zxing::ReaderException: No code detected
    Global binarizer failed: zxing::ReaderException: No code detected

so before this can be committed, zxing-cpp needs to be repaired, or we need to do #7 first.

The test file is from https://sampleswap.org//samples-ghost/DRUM%20LOOPS%20and%20BREAKS/161%20to%20180%20bpm/128[kb]161_amenvar3.aif.mp3. It should be public domain, and in any case it is an extremely short sample, which should make it fall under fair use.

kousu commented 5 years ago

This depends on https://github.com/glassechidna/zxing-cpp/pull/80 getting merged.

Then, zxinglight needs to not use ::c_str() to get the data out. It needs to use ::size() and ::data().
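To illustrate why c_str()-style access loses binary payloads, here is a minimal sketch using Python's ctypes. It is only an analogy, not the zxinglight code: the payload is a made-up example, `.value` stands in for reading a NUL-terminated C string, and `string_at` with an explicit length stands in for reading ::data() together with ::size().

```python
import ctypes

# Hypothetical payload: binary data containing an embedded NUL byte.
payload = b"ab\x00cd"

# Copy the bytes into a C char array (ctypes appends a trailing NUL).
buf = ctypes.create_string_buffer(payload)

# .value treats the buffer as a NUL-terminated C string,
# so everything after the first NUL is lost.
as_c_string = buf.value

# Reading an explicit number of bytes from the buffer's address
# keeps the embedded NUL and everything after it.
as_sized_buffer = ctypes.string_at(ctypes.addressof(buf), len(payload))

print(as_c_string)      # truncated at the first NUL
print(as_sized_buffer)  # full payload
```

The same truncation would happen to any scanned code whose data contains a zero byte, which is common in binary payloads.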

kousu commented 5 years ago

I've written a patch but it should be considered a first draft. I'd like some feedback on the shape of the API (re: https://github.com/lubo/zxinglight/issues/4#issuecomment-478388365): should it have two output variables side by side, one for text and one for binary, like nu-book's and jsQR? Should it only be one?

🥳EDIT🥳: wtf? how did that test pass? @glassechidna hasn't merged my PR yet. 🤔

I think it would be nicest and most pythonic if it could only be one. The easiest flow, I think, is that if you scan a textual code you get str and if you scan a binary code you get bytes, and you can check which it is just by looking at the type. This is complicated though: you can sometimes tell that a payload is not text (e.g. it has nulls or invalid UTF-8 sequences; the null check in particular is a dead giveaway), but you can never tell for sure that a payload is text, because arbitrary binary can happen to be valid UTF-8. We could tell the reader what type to expect, but that's unlikely to work.
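A rough sketch of that one-sided check might look like the following. The function name `is_definitely_binary` is made up for illustration and is not part of zxinglight; the heuristic (NUL bytes or invalid UTF-8 mean "not text") is the one described above, and a False result only means "could be text".

```python
def is_definitely_binary(raw: bytes) -> bool:
    """Return True only when the payload cannot be text under this heuristic.

    A False result does NOT prove the payload is text: binary data can
    happen to be valid UTF-8 with no NUL bytes.
    """
    # NUL bytes are treated as a giveaway for binary data.
    if b"\x00" in raw:
        return True
    # Anything that is not valid UTF-8 is treated as binary.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


for sample in (b"hello", "h\u00e9llo".encode("utf-8"), b"ab\x00cd", b"\xff\xfe", b"\x41\x42"):
    print(sample, is_definitely_binary(sample))
```

The last sample shows the ambiguity: the two bytes 0x41 0x42 could be the text "AB" or two bytes of binary data, and nothing in the payload says which.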

The really tricky thing is that QR codes can mix character sets inline, so the decoder must pay attention as it runs and either canonicalize everything into a master character set (in practice, Unicode) or keep the chunks in separate pieces with their character sets marked. jsQR does both: it puts everything into JS strings (which are UTF-16, I think), transcoding them from UTF-8; if that fails it leaves the .text output blank, but it still records the source .chunks[]. Over in https://github.com/cozmo/jsQR/issues/129#issue-427838457 I suggest that we just declare as a community that UTF-8 is the only way to encode multilingual text (or any kind of text; it's identical to ASCII in the 7-bit range, so assuming UTF-8 should be compatible with any ancient QR codes that stick to ASCII, though not with ISO 8859-1 bytes above 0x7F). This solution would break the spec, but maybe this is a case where the spec needs to change. If we adopted that solution, zxinglight could pass bytes up to Python, try decoding them as UTF-8 and, if that fails, output the bytes instead.
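On the Python side, the UTF-8-first proposal could be sketched roughly like this. The name `text_or_bytes` is hypothetical and this is not the draft patch; it only assumes the C++ layer hands the raw payload up as bytes.

```python
from typing import Union


def text_or_bytes(raw: bytes) -> Union[str, bytes]:
    """Return str if the payload is valid UTF-8, otherwise the raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: hand the payload back untouched.
        return raw


# Hypothetical payloads standing in for scanned codes.
for payload in ("na\u00efve caf\u00e9".encode("utf-8"), b"\x89PNG\r\n", b"\xe9"):
    result = text_or_bytes(payload)
    kind = "text" if isinstance(result, str) else "binary"
    print(kind, repr(result))
```

Callers can then branch on the type of the result. The trade-off is the one discussed above: a binary payload that happens to be valid UTF-8 would come back as str.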