Open Fuuzetsu opened 1 year ago
It was pointed out to me in linked ticket that this is a valid use-case of .asciz
. So it's presumably also a valid use-case for m2c
.
I guess however it's not the end of the story: m2c
should presumably be aware of the original encoding.
So:
m2c
presumably needs some flag and encoding support that it uses in parse_ascii_directive
.splat
needs to use the same string_encoding
for m2c
as it used for spimdisasm
.So I guess first m2c
would have to make the encoding more flexible and then splat
would need to make use of the new flag?
I think assemblers typically read asm byte by byte; m2c should perhaps be refactored to do the same. I think the current behavior here isn't too bad though: for ASCII and UTF-8 it correctly preserves bytes, while for anything else it will cleanly fail with an error. And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using \x
escapes for offending string literals. (Otherwise random tools will break, like git and decomp.me.) I suspect the comment in parse_ascii_directive may be wrong and the code actually correct.
The other place where string encodings come up is in the output layer, for string literals. Here we decode bytes as UTF-8 with fallback to \x
escapes, which isn't necessarily correct, though non-UTF-8 successfully parsing as UTF-8 at least ought to be fairly rare... After this we have Python printing to stdout, the stdout copied into source code somehow, maybe git transforming the encoding in the process, and then the compiler eventually compiling the source, each step of which could result in encoding bugs. (But with UTF-8 it ought to be mostly reliable.)
So I think I could possibly be convinced to take a PR that adds a string encoding cmdline flag that works like you describe if there's a concrete use case, but I'm also pretty happy with the current behavior. A flag just for disabling the "test if it's UTF-8" behavior of the string literal formatting may be a better choice, if anything.
And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using \x escapes for offending string literals.
Normally I'd agree but all this started with splatting out a 25+ year old game so if the original strings in that are not UTF-8 (in my case it turned out that they were so that's nice), there's not much choice, right? Unless whatever original encoding gets processed and then output as UTF-8 or something so that everything downstream just works?
There is choice. You can either \x
-escape non-ASCII chars in your strings to keep all .s in ASCII (maybe with strings transcoded into UTF-8 in a comment on the same line for ease of understandability), or do some fancy transcode-on-the-fly as part of the build chain (https://github.com/simonlindholm/asm-processor enables doing this for N64 games).
I've made an issue at https://github.com/Decompollaborate/spimdisasm/issues/121 which shows that there are sometimes
.asciz
blocks emitted with non-ASCII characters in them.m2c
then happily takes them and outputs bytes from UTF-8 encoding (as far as I can tell).That is, given
asciz "奩"
it will happily spit oute5 a5 a9 00
. A briefgrep
brings me toparse_ascii_directive
which at the very top says something about being wrong w.r.t. encodings: I guess this is exactly the issue it's talking about? I see few lines later a very explicitc.encode("utf-8")
. Is this what MIPS assemblers would usually do or would they just interpret the whole input as ASCII to start with? I don't know.