stephenberry / glaze

Extremely fast, in memory, JSON and interface library for modern C++
MIT License

utf-8 encoding with BOM #1338

Closed TheFeelipe closed 1 month ago

TheFeelipe commented 1 month ago

The glz::read_file_json function fails to read JSON files that are encoded as UTF-8 with a BOM.

Apparently it does not recognize the leading bytes that mark the file encoding. Does the library already provide a way to handle this?

error: expected_brace (5)

stephenberry commented 1 month ago

JSON requires UTF-8, so a BOM shouldn't be needed. The BOM is also not valid JSON when parsing into an object, so a file that begins with one is technically invalid JSON.

Section 8.1 of the RFC: "Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error."

Here is a helpful discussion on the topic: JSON Specification and the usage of BOM

Since the specification says that the BOM may be ignored, I can add a compile-time option that skips a leading BOM instead of reporting an error.

I'll keep this issue alive until that feature has been added.

Thanks for reporting this!

stephenberry commented 1 month ago

As I think about this more I don't really like the idea of supporting something that doesn't round-trip.

@TheFeelipe, I think it might be best to make your own file reading function into a std::string that discards this BOM.
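For readers landing here, the suggestion above could look roughly like the following. This is only a minimal sketch, assuming C++17; `strip_utf8_bom` and `read_file_without_bom` are hypothetical helper names and are not part of Glaze.

```cpp
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not a Glaze API): returns the text without a
// leading UTF-8 BOM (bytes EF BB BF). Text without a BOM is returned as-is.
inline std::string_view strip_utf8_bom(std::string_view text) {
    constexpr std::string_view bom = "\xEF\xBB\xBF";
    if (text.substr(0, bom.size()) == bom) {
        text.remove_prefix(bom.size());
    }
    return text;
}

// Hypothetical helper (not a Glaze API): reads a whole file into a
// std::string and discards a leading UTF-8 BOM if one is present.
// Returns std::nullopt when the file cannot be opened.
inline std::optional<std::string> read_file_without_bom(const std::string& path) {
    std::ifstream file(path, std::ios::binary);  // binary: keep bytes untouched
    if (!file) {
        return std::nullopt;
    }
    std::string contents((std::istreambuf_iterator<char>(file)),
                         std::istreambuf_iterator<char>());
    return std::string(strip_utf8_bom(contents));
}
```

The returned string can then be handed to a buffer-based read (for example glz::read_json, assuming the usual value-and-buffer form) in place of glz::read_file_json. Only the UTF-8 BOM is handled here; UTF-16 or UTF-32 input is out of scope for this sketch.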

If you can argue for why the BOM might be a good idea in some cases then I might reconsider. But, right now I'm thinking of avoiding this in Glaze because it isn't really something I want to encourage.

TheFeelipe commented 1 month ago

For some reason, accented and similar characters come out broken unless the file is saved as UTF-8 with BOM or as ANSI. For Unicode, UTF-8 with BOM is the more viable choice, at least on Windows with Visual Studio 2022, where even the .cpp files need to be encoded as UTF-8 with BOM.

As for writing my own function to discard the BOM, yes, that may be more viable.

stephenberry commented 1 month ago

I'm closing this, but let me know if you ever think Glaze ought to add some features around BOM handling.