readhexstring has multiple problems

PertinentDetail commented 2 years ago

Unfortunately, readhexstring just doesn't work. It will fail any test you throw at it.

I have a lot of sympathy for this bug. The main problem is one of those blatant issues that's ridiculously hard to spot until you end up tracing the code line by line and see it go the wrong way at a branch. Even knowing readhexstring wasn't working, I stared at this code for over an hour and I couldn't see the problem.

readhexstring is supposed to skip non-hex characters and process only hex characters. The loops that do this look like:

        do
        {
            c[0] = xpost_file_getc(f);
            if (c[0] == EOF) ++eof;
        } while(!eof && strchr(hex, c[0]) != NULL);

The variable hex correctly contains all upper and lower case hex digits.

Ignoring the end of file handling (which I'll get to in a sec), this simplifies to:

        do
        {
            c[0] = xpost_file_getc(f);
        } while(strchr(hex, c[0]) != NULL);

strchr returns non-NULL if its second argument (a character) is contained in its first argument (a string). So this loop will keep looping as long as the character being read is a hex digit and will exit only when it reads a non-hex digit. This is exactly the reverse of what the test should be.

Flipping the != to == fixes the core problem, but there are others: the end of file handling has a couple of bugs too.

The first is that if we hit end of file between characters (that is, between pairs of hex digits) we'll always return one character too many.

There are no early exits (break, goto or return) in the main for loop. So, if we're not already at end of file at the start of the loop then we're guaranteed to hit the line that writes to s[n] and the loop increment (n++). This will happen even if we hit end of file before reading a hex digit into c[0].

It's easiest to see this by considering the case where readhexstring is called when we're already at the end of the file (although we don't know that yet as we haven't read the EOF character). Before the for loop, eof is initialised to 0 which guarantees the for loop will be run at least once and thus we'll return a single character string. Here's a simplification of the code for that case:

    int eof = 0;
    …
    for (n = 0; !eof && n < S.comp_.sz; n++)
    {
        do  { … no gotos … } while(!eof && …);              [exit do-while loop if we get to EOF but stay in for]
        if (!eof)
            … not taken if we're now at EOF …
        else
        {
            c[1] = '0';
        }
        s[n] = ((strchr(hex, toupper(c[0])) - hex) << 4)    [c[0] is EOF so we calculate NULL - hex]
             | (strchr(hex, toupper(c[1])) - hex);
    }                                                       [and we'll execute the n++ in the for before exiting]
    S.comp_.sz = n;                                         [so, we'll set .sz to 1]

The second end of file issue is that the logic for handling end of file between the first and second hex digits is wrong. It looks like the code is trying to pad the string with a '0' if there are an odd number of hex digits. Although this is the right behaviour for hex string literals (such as <901fa>) it doesn't appear to be correct for readhexstring. Even if this were the right behaviour, the logic is wrong. The code sets the second digit (c[1]) to '0' if end of file was encountered when reading the first digit (c[0]), when it should be doing it if the first digit was read normally (no end of file) but end of file was encountered when reading the second digit.

A simple test is:

{currentfile 3 string readhexstring exch print not {exit} if} loop
54657374696e670a

This should output "Testing\n"

Here are Try it online! links for that test, for xpost (which fails) and Ghostscript (which passes). Both these links run the output through od -c so that non-printable characters are clearly visible.

I've attached a more complex test with more non-hex characters. I've had to give it a .txt extension for upload to GitHub: readhexstring.txt. You'll need to add another test for input with an odd number of hex digits.

luser-dr00g commented 2 years ago

Thanks for digging into this. It seems to me that the whole code could probably be simpler overall since it doesn't need to worry about partial digits like the <hex> syntax does.

for( n=0; !eof && n < S.comp_.sz; n++ ){
    do (c[0] = xpost_file_getc(f)) == EOF && ++eof; while( ! (eof || strchr(hex, c[0])) );
    if( !eof ){
        do (c[1] = xpost_file_getc(f)) == EOF && ++eof; while( ! (eof || strchr(hex, c[1])) );
        if( !eof ) s[n] = ((strchr(hex, toupper(c[0])) - hex) << 4)
                            |  (strchr(hex, toupper(c[1])) - hex);
    }
}
S.comp_.sz = n;

This version seems more correct given your analysis but I'll put it through some tests before committing. And probably the style I have here is too compact per our style rules. The do...while loops could probably be factored into a helper function with a descriptive name.

luser-dr00g commented 2 years ago

Oops. This still has the n++ problem you already diagnosed.

luser-dr00g commented 2 years ago

Take 2.

int readhexdigit( Xpost_File *f, int *p ){
    int eof = 0;
    do (*p = xpost_file_getc(f)) == EOF && ++eof;
    while( ! (eof || strchr(hex, *p)) );
    return eof;
}

for( n = 0; !eof && n < S.comp_.sz; n++ ){
    eof += readhexdigit(f, &c[0]);
    if( !eof ) eof += readhexdigit(f, &c[1]);
    if( eof ) break;
    s[n] = ((strchr(hex, toupper(c[0])) - hex) << 4) | (strchr(hex, toupper(c[1])) - hex);
}
S.comp_.sz = n;

luser-dr00g commented 2 years ago

Written, tested, committed, pushed.

luser-dr00g / xpost

readhexstring has multiple problems #48