vnmakarov / mir

A lightweight JIT compiler based on MIR (Medium Internal Representation) and C11 JIT compiler and interpreter based on MIR
MIT License
2.24k stars 145 forks source link

Add MIR_scan_string_s #341

Open iacore opened 1 year ago

iacore commented 1 year ago

Pass all test.

I'm not sure what the best API would be. At least MIR_scan_string_s should be safer.

More improvement opportunities: Once it is sure that no \0 is in input string, maybe other code can be simplified too?

vnmakarov commented 1 year ago

I am not sure about this PR. I am trying not to change API w/o significant advantages. But I'll think more about this.

rofl0r commented 1 year ago

it would appear this PR makes code slower for no good reason other than to appear as "safe" as Annex K

iacore commented 1 year ago

I made this PR because I don't want to append the final \0 for mmaped files.

\0 can appear inside a normal file. The current parser stop parsing at first \0. Maybe it should be an error, or ignore the \0?

snej commented 11 months ago

Not all strings are nul-terminated. Raw text files are a common case, as mentioned above. And this comes up a lot in FFI: I’ve worked with several languages whose strings are passed into C as unterminated (pointer, length) pairs. IIRC both Go and Python do this. And then there’s C++’s std::string_view which can hold arbitrary substrings.

If the overhead of calling strlen is a problem, that could be removed; instead, just keep the nul check, and set the max len to a huge value in the existing function so the nul byte will be hit first. (There’s no valid reason to have a 00 byte in either ASCII or UTF-8 text.)

On the other hand, I see the string API as mostly for debugging, so is it really important to save the overhead of copying the text into a nul-terminated buffer?

iacore commented 7 months ago

If the overhead of calling strlen is a problem, that could be removed; instead, just keep the nul check, and set the max len to a huge value in the existing function so the nul byte will be hit first. (There’s no valid reason to have a 00 byte in either ASCII or UTF-8 text.)

I have applied the suggestion in the last commit.