Unicode input with two bytes gets truncated to 1 byte, losing the most significant bits

msmilkshake commented 4 years ago

I am working on the beta project HyperMetro, stage 3/6, which introduces a file containing Unicode chars with values above 0xFF. For example, ř has the hex value 0x159, or binary 0b00000001_01011001.

When running the tests, I noticed some strange behavior. The test executed this line: main.execute("/connect \"Linka C\" \"I.P.Pavlova\" \"Linka A\" \"Petřiny\""); but my program actually received this input: /connect "Linka C" "I.P.Pavlova" "Linka A" "PetYiny"

After some digging through your testing framework, I found that in the class org.hyperskill.hstest.dynamic.input.SystemInMock the string passed to the execute() method is cast to a byte, char by char, over the length of the string, so any char above 0xFF loses its high byte. Here is a screenshot taken just before that happens for the example char ř:

[Screenshot: idea64_Vxkrc6okCO]

The variable c, with the decimal value 345 (0x159), is about to be truncated to a single byte, which yields 89 (0x59, uppercase Y). This corrupts the input and makes the tests fail.
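The arithmetic can be reproduced in isolation. This is only a minimal sketch: the class and method names are made up for illustration and are not taken from hs-test.

```java
public class TruncationDemo {
    // Mimics the narrowing cast: only the low 8 bits of the char survive.
    static char truncateToByte(char c) {
        byte b = (byte) c;          // 0x159 becomes 0x59
        return (char) (b & 0xFF);   // widen back without sign extension
    }

    public static void main(String[] args) {
        char original = '\u0159';   // ř, decimal 345
        char truncated = truncateToByte(original);
        // prints "345 -> 89 (Y)"
        System.out.println((int) original + " -> " + (int) truncated + " (" + truncated + ")");
    }
}
```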

I created a fork to work on this issue, but as a beginner I don't have the expertise to test it on my own. Here's what I changed (not tested):

@Override
public int read(byte[] b, int off, int len) throws IOException {
    if (len == 0) {
        return 0;
    }
    int c = read();
    if (c == -1) {
        return -1;
    }
    b[off] = (byte) (c & 0xFF);
    b[off + 1] = (byte) ((c >> 8) & 0xFF);
    int i = 2;
    try {
        for (; i < len * 2; i += 2) {
            if (c == '\n') {
                break;
            }
            c = read();
            if (c == -1) {
                break;
            }
            b[off + i] = (byte) (c & 0xFF);
            b[off + i + 1] = (byte) ((c >> 8) & 0xFF);
        }
    } catch (IOException ignored) {
    }
    return i;
}

Glad to be of any help to you.

aaaaaa2493 commented 4 years ago

@msmilkshake Hi! That's an interesting problem. Does your solution work with your code? That is, with this fix, do you actually get the correct input /connect \"Linka C\" \"I.P.Pavlova\" \"Linka A\" \"Petřiny\" ?

msmilkshake commented 4 years ago

@aaaaaa2493 No, it didn't work. I tried several approaches, but I am too much of a beginner to come up with a proper solution. At the very least, every char would need to be handled as two bytes for the problem to go away.
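For reference, one possible way to avoid the truncation is to encode the whole string to bytes with an explicit charset instead of casting each char, assuming the reading side decodes with that same charset. This is only a sketch and not the change that was made in the framework; `EncodedInput`, `streamFor` and `readLine` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EncodedInput {
    // Encodes the text as UTF-8, so chars above 0x7F become multi-byte sequences
    // instead of being cut down to their low byte.
    static InputStream streamFor(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads one line back, decoding with the same charset that was used to encode.
    static String readLine(InputStream in) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = "/connect \"Linka A\" \"Pet\u0159iny\"\n";
        System.out.println(readLine(streamFor(line)));
    }
}
```

A program that reads System.in through a Scanner or reader using the platform default charset would only see the right chars if that default matches the charset used for encoding, so the two sides have to agree.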

msmilkshake commented 4 years ago

Maybe a complete rethink of the whole input / output Mocks can be the key to fix this...

msmilkshake commented 4 years ago

@aaaaaa2493 Could you add me on Discord? HeyMilkshake#0270. I have some code that is very close to a solution, but it ends up displaying wrongly encoded chars.

aaaaaa2493 commented 4 years ago

@msmilkshake I added you on Discord

aaaaaa2493 commented 4 years ago

Fixed in #101

hyperskill / hs-test

Unicode input with two bytes gets truncated to 1 byte, losing the most significant bits #100