Assertion/crash if invalid UTF-8 text is passed to InputText

SlavicPotato commented 1 year ago

Version/Branch of Dear ImGui:

Version: 7c291ba31b3ad0651ad6341bf08d3bb9eeae2453 Branch: master

Back-end/Renderer/Compiler/OS

Back-ends: imgui_impl_dx11.cpp + imgui_impl_win32.cpp Compiler: msvc current Operating System: win10

My Issue:

Feeding invalid UTF-8 text to ImGui::InputText asserts here, if allowed to proceed it crashes in InputTextCallback due to bad BufTextLen passed to string::resize once the input text is emptied ("string too long" exception is raised). With a char array there are no immediate side effects if the assert is ignored.

Standalone, minimal, complete and verifiable example:

static char buf[255] = "\xE1\xC2"; // éÉ

ImGui::InputText("##input", buf, sizeof(buf));

or

static std::string buf = "\xE1\xC2";

ImGui::InputText("##input", &buf);

Empty the input field (i.e. Ctrl+A, Backspace) to trigger.

GamingMinds-DanielC commented 1 year ago

The assertion is on purpose, telling you about wrong usage, crashing after that is the only real option here. If the library tried to "fix" strings for you, it would effectively alter data in a way that might not be obvious to the users of your application. Just sanitize your UTF-8 strings before passing them along.

SlavicPotato commented 1 year ago

If the library tried to "fix" strings for you

I don't expect it to, handling invalid data a bit more gracefully would be nice though.

crashing after that is the only real option here

Doubt that.

Just sanitize your UTF-8 strings before passing them along.

I do, but ideally that shouldn't be necessary to avoid a crash.

ocornut commented 1 year ago

We ought to crash/assert on invalid code/logic but generally not on invalid data. We’ll look into it. A few years ago we did fix and test various invalid utf-8 cases but maybe something slipped thru the cracks or regressed.

GamingMinds-DanielC commented 1 year ago

crashing after that is the only real option here

Doubt that.

Well, with display only strings I see the issue with crashing, but since it is an input text I stand by my point. If an invalid UTF-8 string is edited, what should ImGui do? return a fixed string that has been manipulated beyond what the user edited? That could be counted as "corrupting" data. Or keep the invalid encoding and return an invalid UTF-8 string? Also not a great solution. With a crash (but that should be a controlled one, not just on certain sub-conditions) both bad options would be eliminated, forcing the app developer to fix it properly.

SlavicPotato commented 1 year ago

Well defined behavior when handling invalid data should generally be the norm, but there's no reasonable argument to be made that it should lead to a crash. No UI library that I'm aware of would handle it that way, both alternatives you brought up would be preferrable. Substituting with some predefined glyphs, for example, would do just fine. Moreover, that data could come from any external user supplied source, having to santitize it means more work for the developer and may incur significant performance penalties depending on the application. ocornut's stance that we shouldn't crash on invalid data is very much on point here.

With a crash (but that should be a controlled one, not just on certain sub-conditions)

Not sure what you're talking about here. This issue (negative arg passed to stl string::resize) is a result of UB, are you suggesting that the library should (or does) deliberately invoke a crash or raise an exception?

GamingMinds-DanielC commented 1 year ago

Well defined behavior when handling invalid data should generally be the norm, but there's no reasonable argument to be made that it should lead to a crash.

I'm not talking about "random" crashes here, those should of course be avoided. But well defined behavior could include guaranteed reproducible crashes on invalid input data (if you want to call an assertion that then terminates the process a crash). With that, mistakes are often easy to find for the developer by just looking into the call stack.

Also, of course there are situations where a crash is the better option. Not the best one overall, but possibly the best option open to a library that is not in possession of the data that is being worked on. Loosing work since the last save is frustrating. But now imagine loosing the work of potentially months (some people have bad or no backup procedures) because the application allowed you to continue working with and even save broken data. Because a library with little to no information about that data seemingly "handled" what it considered invalid instead of forcing the app to handle it properly by otherwise crashing.

And yes, raising an exception (which is basically a "crash" if not handled outside of the library) would be a proper way to react to bad inputs. Better that allowing the app to ignore it.

ocornut commented 1 year ago

We have a private branch with a partial rewrite of InputText rewriting some of the UTF-8<>ImWchar back and forth conversions which would make it easier for us to avoid being lossy on malformed UTF-8.

Practically speaking, I don't think anyone would CARE about saving contents with malformed UTF-8 to losslessly preserve the malformed UTF-8 bytes, so I wouldn't worry too much about it to be honest. I'm mostly concerned about not crashing if e.g. a random buffer of bytes is used in InputText(). This is low-priority but will investigate eventually.

SlavicPotato commented 1 year ago

Agreed, for example I'm concerned with users making bad data file edits (e.g. using latin 1s) leading to crashes. ImGui documentation is pretty clear that it only supports UTF-8, if my app inherits that requirement and someone ignores it and this leads to data loss, imo that's their fault for not rtfm.

ocornut commented 1 year ago

I confirmed that the bug is due to our back-and-forth between UTF-8 and decoded ImWchar and it seems not trivial to fix with the current code, rather it may have to wait until we unleash a new version of InputText().

ocornut / imgui

Assertion/crash if invalid UTF-8 text is passed to InputText #6397