logological / gpp

GPP, a generic preprocessor
https://logological.org/gpp
GNU Lesser General Public License v3.0

Add Support for UTF-8, detect other wide-char sequences, and cope with BOMs #30

Open duncanmac99 opened 5 years ago

duncanmac99 commented 5 years ago

As it stands, it seems that this program should almost handle UTF-8. The main task would be tinkering with one particular function, as well as (possibly) adding command-line args for handling certain peculiar (and often undesirable) situations.

logological commented 5 years ago

Further details on the proposed solution, or better yet, a pull request, would be most welcome.

duncanmac99 commented 5 years ago

However, the rest of the program expects regular (byte-size) characters, not wide characters. It would be possible to assemble it and not send back a wide character, but that would require more buffering in the function itself, which would be Messy.
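For illustration only, the kind of buffering this would involve can be sketched as follows. This is not GPP's code: `utf8_seq_len` is a hypothetical helper that reports how many bytes a UTF-8 sequence occupies, judging by its first byte, so that a reader function would know how many further bytes to collect before handing anything back.

```c
/* Hypothetical helper, not part of GPP.
 * Returns the total length in bytes (1 to 4) of the UTF-8 sequence that
 * starts with lead byte b, or 0 if b cannot start a sequence
 * (a continuation byte 0x80-0xBF, or a byte that is never valid as a
 * lead byte: 0xC0, 0xC1, 0xF5-0xFF). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;               /* U+0080 .. U+07FF */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;               /* U+0800 .. U+FFFF */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;               /* U+10000 .. U+10FFFF */
    return 0;                   /* not a valid lead byte */
}
```

A caller would still need to check that each following byte is a continuation byte (0x80 to 0xBF); that check is left out of this sketch.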

As for BOMs (byte order marks), Windows now expects one at the beginning of every UTF-8 and UTF-16 file. For more on that (for UTF-8), see:

https://social.msdn.microsoft.com/Forums/windowsapps/en-US/dd352270-8790-4b48-8492-17a4a6875e99/why-the-utf8-with-bom-marker-requirement?forum=winappswithhtml5

Also (for UTF-16):

https://docs.microsoft.com/en-us/windows/desktop/intl/using-byte-order-marks
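As a rough sketch of what coping with BOMs could mean in practice, the function below checks the first bytes of an input buffer for a UTF-8 or UTF-16 byte order mark. The name `bom_length` and its interface are assumptions made for this example, not anything in GPP; the byte values themselves (EF BB BF for UTF-8, FF FE and FE FF for UTF-16) are the standard ones.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GPP.
 * Returns the number of leading bytes in buf that form a byte order mark:
 * 3 for a UTF-8 BOM, 2 for a UTF-16 BOM (either byte order), 0 if none.
 * Note: a UTF-32 little-endian BOM (FF FE 00 00) also begins with FF FE
 * and is reported here as UTF-16; this sketch does not tell them apart. */
size_t bom_length(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;   /* UTF-8 */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return 2;   /* UTF-16, little-endian */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return 2;   /* UTF-16, big-endian */
    return 0;       /* no BOM recognised */
}
```

A preprocessor could use the result to skip the mark on input, and optionally write it back on output, before any macro scanning starts.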

logological commented 5 years ago

I'm afraid I still don't understand the problem. Can you post a minimal example of a UTF-8 or UTF-16 file that GPP doesn't handle correctly, along with the expected and observed output?