mayah / tinytoml

A header only C++11 library for parsing TOML
BSD 2-Clause "Simplified" License

toml::parse() fails if UTF-8 BOM is present. #20

Closed (iamwired closed this issue 7 years ago)

iamwired commented 7 years ago

Hi,

With Visual Studio 2015, the following code:

{
    std::ifstream ifs(<utf-8 file with BOM marker>);
    toml::ParseResult pr = toml::parse(ifs);
    // ...
}

raised this assertion in the CRT:

File: minkernel\crts\ucrt\src\appcrt\convert\isctype.cpp Line: 36 Expression: c >= -1 && c <= 255

in this line,

while (current(&c) && (isalnum(c) || c == '_' || c == '-')) {

of the Lexer::nextKey() function, when the 3 bytes of the UTF-8 BOM are present at the start of the stream.

The code works well when the BOM is not present.
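A likely explanation for the assertion: plain char is signed by default on MSVC, so the BOM bytes 0xEF, 0xBB and 0xBF become negative values, and passing a negative value other than EOF to isalnum() is undefined behaviour (the debug CRT checks for exactly this with c >= -1 && c <= 255). A minimal sketch of a safe check follows. The helper name is_bare_key_char is hypothetical and is not part of tinytoml; it only mirrors the condition in the line quoted above.

```cpp
#include <cctype>

// Hypothetical helper, not taken from tinytoml: same test as the loop
// condition in Lexer::nextKey(), but the character is converted to
// unsigned char before it reaches std::isalnum, so bytes >= 0x80 can
// never be passed as negative values.
inline bool is_bare_key_char(char c) {
    const unsigned char uc = static_cast<unsigned char>(c);
    return std::isalnum(uc) != 0 || c == '_' || c == '-';
}
```

With this conversion a BOM byte is simply rejected as a non-key character (in the default "C" locale) instead of tripping the assertion. The parser would still have to decide what to do with the BOM itself.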

This workaround resolved the problem for me:

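{
    // ...
    std::filebuf* pbuf = ifs.rdbuf();
    unsigned char cUTF8_BOM[3] = { 0, 0, 0 };
    // Read the first 3 bytes; the read position is now just past them.
    const std::streamsize n = pbuf->sgetn((char *)cUTF8_BOM, sizeof(cUTF8_BOM));
    // UTF-8 BOM?
    const unsigned char g_UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };
    const bool hasBOM = (n == 3) && (g_UTF8_BOM[0] == cUTF8_BOM[0]) && (g_UTF8_BOM[1] == cUTF8_BOM[1]) && (g_UTF8_BOM[2] == cUTF8_BOM[2]);
    // No BOM: rewind to the start so the first bytes of the file are not skipped.
    // With a BOM the position is already right behind it, so nothing else is needed.
    if (!hasBOM)
        pbuf->pubseekpos(0, std::ios_base::in);

    // ...
}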

But I believe it could be easily fixed in your library.

Sincerely.

mayah commented 7 years ago

Actually, I really hate the UTF-8 BOM since it's nonsense. However, https://github.com/toml-lang/toml/issues/437 says a UTF-8 BOM must be allowed, so I'll fix it later.