pret / pokecrystal

Disassembly of Pokémon Crystal
https://pret.github.io/pokecrystal/
2.07k stars 778 forks source link

Fix segfaults at EOF for scan_includes, and add token/newline checks #1068

Closed Lorenzooone closed 1 year ago

Lorenzooone commented 1 year ago

@aaaaaa123456789 first noticed this for comments, but there are multiple.

mid-kid commented 1 year ago

Can you give an example of the situations you were up against, where this was a significant issue?

Nobody likes this processor, and IMO it ideally wouldn't be used at all, but since we have it, it's kept as simple as possible within reason, in order to make it fast and easier to think about, avoiding weird logic bugs that are later hard to fix as rgbds syntax evolves. The logic you introduce here isn't inmediately obvious to me, making me a bit nervous.

Lorenzooone commented 1 year ago

So, the segfaults themselves can happen because when ptr is set to 0 from the strchr, then the "for statement" increments ptr, making it so if(ptr) is true regrdless. This means that you end up reading from bad addresses. The gotos quit the for loop once nothing can be done.

For the include/incbin tokenization, while talking to @Rangi42 about the segfaults themselves, it came to light that having macros like "include_size" (or the opposite) was problematic, so tokenization should fix that.

Lastly, I made it so if there is a newline after an incbin/include and before a ", the process of searching for a " is interrupted, as it could be problematic (Rangi was also talking about that when discussing include_size).

Hope this clarifies everything!

If we were stricter, I should also make it so encountering a ; interrupts the " search process. It is something I could easily add.

Rangi42 commented 1 year ago

Can you give an example of the situations you were up against, where this was a significant issue?

@mid-kid The original minimal fix in Prism was by @aaaaaa123456789 :

            ptr = strchr(ptr, '\n');
            if (!ptr) {
                fprintf(stderr, "%s: no newline at end of file\n", filename);
+               free(contents);
+               return;
            }
            break;

And had this explanation:

If the last line of a file is a comment, and the file doesn't end in a newline, then strchr(ptr, '\n') returns a null pointer. Since break only breaks out of the switch, not the outer for loop, this leads to undefined behavior in the loop test, since ptr - contents is doing pointer arithmetic on a null pointer.

There was some further discussion about the other changes but I'm not clear on which ones are essential for avoiding segfaults in actual input files, which ones are for the sake of different/better error messages, and which ones are just refactoring. @Lorenzooone maybe you can clarify?

Anyway, in theory what we'd want is to use this regex (case-insensitive):

# Match INCLUDE or INCBIN with an optional label before it.
# Don't support edge cases like /* */ comments midway, or line continuations,
# or STR*() functions, or triple-quoted strings, or other rgbasm fanciness.
^
\s*
(?:
    [A-Z_][A-Z0-9_]*::?           # global label with one or two colons
  | \.[A-Z_][A-Z0-9_]*(?:\s|::?)  # local label with one or two colons or space
  | :                             # anonymous label
)?
\s*
INC(?:LUDE|BIN)
\s*
"([^"]+)"

This would avoid bugs on code like this, which tries to include "Hello world":

MACRO include_footprint_top
  INCBIN \1, 0, 2 * LEN_1BPP_TILE
  println "hello world"
ENDM
  include_footprint_top "gfx/footprints/bulbasaur.1bpp"
  include_footprint_top "gfx/footprints/ivysaur.1bpp"
Lorenzooone commented 1 year ago

This is the anti-segfaults part: https://github.com/pret/pokecrystal/pull/1068/commits/ee66031029147901bd94ad8c22713bf2797752b5 This alone already fixes them. Both the one found by ax6, and the ones that are in the other switch cases.

(There is also a ++ being converted to a +1, but that's not important)

Rangi42 commented 1 year ago

@Lorenzooone Thanks for pointing out the multiple commits. I got rid of the stderr reporting for "comment without newline at EOF", "unterminated string", and "INCLUDE/INCBIN without file path", because none of them are strictly relevant to scan_includes' functionality. I also did some refactoring, reformatting, and added comments, which I hope clarify the parsing logic.

Lorenzooone commented 1 year ago

Looks good to me! You may also want to have special handling for INC(LUDE/BIN) followed by ; before a ", as comments should be treated in a special way. But I don't think it's a realistic case. Just something which may be good for consistency.

Rangi42 commented 1 year ago

So, I added back the "not strictly relevant" warning messages along with your new "no file path" one, because I hadn't realized they were already present and want to keep this change minimal.

You may also want to have special handling for INC(LUDE/BIN) followed by ; before a ", as comments should be treated in a special way. But I don't think it's a realistic case. Just something which may be good for consistency.

Should they? Wouldn't this be appropriately covered by the "no file path after INCLUDE" warning?

INCLUDE "foo.asm"
SomeLabel: INCLUDE ; lol "this" is commented
INCLUDE "bar.asm"
Lorenzooone commented 1 year ago

So, I added back the "not strictly relevant" warning messages along with your new "no file path" one, because I hadn't realized they were already present and want to keep this change minimal.

You may also want to have special handling for INC(LUDE/BIN) followed by ; before a ", as comments should be treated in a special way. But I don't think it's a realistic case. Just something which may be good for consistency.

Should they? Wouldn't this be appropriately covered by the "no file path after INCLUDE" warning?

INCLUDE "foo.asm"
SomeLabel: INCLUDE ; lol "this" is commented
INCLUDE "bar.asm"

Not with the current logic, from what I can see. As the tokenization just looks at the char right after, not at what it instantly finds.

Rangi42 commented 1 year ago

Not with the current logic, from what I can see. As the tokenization just looks at the char right after, not at what it instantly finds.

Hmm. Can you give an example input file that you think should have this special handling?