matt-kempster / m2c

A MIPS and PowerPC decompiler.
GNU General Public License v3.0
386 stars 46 forks source link

Should non-ASCII contents in .asci blocks be rejected? #257

Open Fuuzetsu opened 1 year ago

Fuuzetsu commented 1 year ago

I've made an issue at https://github.com/Decompollaborate/spimdisasm/issues/121 which shows that there are sometimes .asciz blocks emitted with non-ASCII characters in them.

m2c then happily takes them and outputs bytes from UTF-8 encoding (as far as I can tell).

That is, given asciz "奩" it will happily spit out e5 a5 a9 00. A brief grep brings me to parse_ascii_directive which at the very top says something about being wrong w.r.t. encodings: I guess this is exactly the issue it's talking about? I see few lines later a very explicit c.encode("utf-8"). Is this what MIPS assemblers would usually do or would they just interpret the whole input as ASCII to start with? I don't know.

Fuuzetsu commented 1 year ago

It was pointed out to me in linked ticket that this is a valid use-case of .asciz. So it's presumably also a valid use-case for m2c.

I guess however it's not the end of the story: m2c should presumably be aware of the original encoding.

So:

So I guess first m2c would have to make the encoding more flexible and then splat would need to make use of the new flag?

simonlindholm commented 1 year ago

I think assemblers typically read asm byte by byte; m2c should perhaps be refactored to do the same. I think the current behavior here isn't too bad though: for ASCII and UTF-8 it correctly preserves bytes, while for anything else it will cleanly fail with an error. And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using \x escapes for offending string literals. (Otherwise random tools will break, like git and decomp.me.) I suspect the comment in parse_ascii_directive may be wrong and the code actually correct.

The other place where string encodings come up is in the output layer, for string literals. Here we decode bytes as UTF-8 with fallback to \x escapes, which isn't necessarily correct, though non-UTF-8 successfully parsing as UTF-8 at least ought to be fairly rare... After this we have Python printing to stdout, the stdout copied into source code somehow, maybe git transforming the encoding in the process, and then the compiler eventually compiling the source, each step of which could result in encoding bugs. (But with UTF-8 it ought to be mostly reliable.)

So I think I could possibly be convinced to take a PR that adds a string encoding cmdline flag that works like you describe if there's a concrete use case, but I'm also pretty happy with the current behavior. A flag just for disabling the "test if it's UTF-8" behavior of the string literal formatting may be a better choice, if anything.

Fuuzetsu commented 1 year ago

And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using \x escapes for offending string literals.

Normally I'd agree but all this started with splatting out a 25+ year old game so if the original strings in that are not UTF-8 (in my case it turned out that they were so that's nice), there's not much choice, right? Unless whatever original encoding gets processed and then output as UTF-8 or something so that everything downstream just works?

simonlindholm commented 1 year ago

There is choice. You can either \x-escape non-ASCII chars in your strings to keep all .s in ASCII (maybe with strings transcoded into UTF-8 in a comment on the same line for ease of understandability), or do some fancy transcode-on-the-fly as part of the build chain (https://github.com/simonlindholm/asm-processor enables doing this for N64 games).