hydrobyte / McJSON

A Delphi / Lazarus / C++Builder simple and small class for fast JSON parsing.
MIT License

Invalid char at pos 1 (UTF8 BOM support) #10

Closed: totyaxy closed this issue 1 year ago

totyaxy commented 1 year ago

Hi!

Unfortunately, many UTF-8 JSON files start with a BOM, which McJSON cannot read. I solved this in Lazarus, of course, but I don't know how the code would behave under Delphi (UTF-16?), so I'm not offering a pull request. My request/recommendation is that you also support UTF-8 BOM encoded files, at least for reading (I did it for writing as well, but that's less important).

The code itself is not complicated: if a BOM is detected in LoadFromStream, you only need to position the stream right after the BOM (and reading the BOM bytes to check them already moves the position there).

UTF-8 BOM: $EF $BB $BF
See: https://learn.microsoft.com/en-us/globalization/encoding/byte-order-mark

Since you also support Delphi, where, as far as I know, UTF-16 is the base encoding, UTF-16 LE/BE BOMs could be supported as well.
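To illustrate the request above, here is a minimal sketch of BOM detection on a stream, assuming Free Pascal (Lazarus). The names TBomKind and SkipBom are hypothetical and are not part of McJSON. It only recognises the UTF-8 and UTF-16 LE/BE marks mentioned here and does not try to distinguish a UTF-32 LE BOM, which also begins with $FF $FE.

```pascal
program BomKindDemo;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

uses
  Classes, SysUtils;

type
  // Hypothetical enumeration for illustration; not a McJSON type.
  TBomKind = (bkNone, bkUtf8, bkUtf16LE, bkUtf16BE);

// Inspects up to three bytes at the current position. On return the
// stream sits just after the BOM, or back where it started if none
// was found.
function SkipBom(Stream: TStream): TBomKind;
var
  Buf: array[0..2] of Byte;
  Start: Int64;
  Count, BomLen: Integer;
begin
  Result := bkNone;
  BomLen := 0;
  Start := Stream.Position;
  FillChar(Buf, SizeOf(Buf), 0);
  Count := Stream.Read(Buf, SizeOf(Buf));

  if (Count >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
  begin
    Result := bkUtf8;
    BomLen := 3;
  end
  else if (Count >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
  begin
    Result := bkUtf16LE;
    BomLen := 2;
  end
  else if (Count >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
  begin
    Result := bkUtf16BE;
    BomLen := 2;
  end;

  // Skip only the BOM bytes; everything after them stays unread.
  Stream.Position := Start + BomLen;
end;

const
  // UTF-8 BOM followed by the two bytes of "{}".
  Data: array[0..4] of Byte = ($EF, $BB, $BF, Ord('{'), Ord('}'));

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Data, SizeOf(Data));
    MS.Position := 0;
    if SkipBom(MS) = bkUtf8 then
      WriteLn('UTF-8 BOM skipped, position = ', MS.Position);
  finally
    MS.Free;
  end;
end.
```

Under Delphi the same idea should apply, but the unit names and console setup may differ, so treat this as a starting point rather than a drop-in patch.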

hydrobyte commented 1 year ago

Hi,

Nice hint! That was in my plans.

I think it is simple to implement BOM support, and LoadFromStream() is indeed the right place to filter the encoding.

Stay tuned.

Regards,

totyaxy commented 1 year ago

Thank you!

hydrobyte commented 1 year ago

Hi, about UTF-8 BOM, please test the latest update.

totyaxy commented 1 year ago

Hi, I saw the BOM bypass code, thank you! I'd add a tip to make it faster (we need the speed): the current solution copies the entire string, even if it is very large, and that is an unnecessary operation. I think we should detect the BOM on the stream, and then:

if UTF8BOMHeaderDetect(Stream) then Stream.Seek(Length(cUTF8BOMHeader), soFromCurrent);

Because Stream.Read reads from Position(!), you only have to shift the position. Or simply pass True for PostionAfterBOM to UTF8BOMHeaderDetect. My quick code for this:

type
  TUTF8BOMHeader = array [0..2] of byte;

const
  // https://en.wikipedia.org/wiki/Byte_order_mark
  cUTF8BOMHeader: TUTF8BOMHeader = ($EF, $BB, $BF);

// Returns True if a UTF-8 BOM starts at the current stream position.
// The position is left just after the BOM only when one is found and
// PostionAfterBOM is True; otherwise it is restored.
function UTF8BOMHeaderDetect(const Stream: TStream;
  const PostionAfterBOM: boolean = False): boolean;
var
  UTF8BOMHeader: TUTF8BOMHeader;
  StreamPositionBuf: int64;
begin
  Result := False;

  // Zero the buffer; Initialize() is meant for managed types only.
  FillChar(UTF8BOMHeader, SizeOf(UTF8BOMHeader), 0);

  // Compare against the bytes remaining, not the total stream size.
  StreamPositionBuf := Stream.Position;
  if Stream.Size - StreamPositionBuf < SizeOf(cUTF8BOMHeader) then
    exit;

  Stream.Read(UTF8BOMHeader, SizeOf(UTF8BOMHeader));

  if CompareMem(@UTF8BOMHeader, @cUTF8BOMHeader, SizeOf(cUTF8BOMHeader)) then
  begin
    if not (PostionAfterBOM) then Stream.Position := StreamPositionBuf;
    Result := True;
  end
  else
    Stream.Position := StreamPositionBuf;
end;

totyaxy commented 1 year ago

But by the way, your code also loads the UTF8 BOM encoded file, thanks!

hydrobyte commented 1 year ago

Nice hint about performance. I'll give it a try.

Thanks.
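
For reference, the performance idea discussed in this thread could look roughly like the sketch below: check for the BOM on the stream, rewind if it is absent, then read only the remaining bytes straight into the result string, so no second copy is made to strip the BOM. This assumes Free Pascal and a stream smaller than 2 GB. ReadUtf8SkippingBom is a hypothetical helper, not McJSON's actual LoadFromStream.

```pascal
program SkipBomLoadDemo;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

uses
  Classes, SysUtils;

const
  Utf8Bom: array[0..2] of Byte = ($EF, $BB, $BF);

// Hypothetical helper for illustration; not part of McJSON.
// Reads everything from the current position to the end of the
// stream, skipping a leading UTF-8 BOM if one is present.
function ReadUtf8SkippingBom(Stream: TStream): AnsiString;
var
  Head: array[0..2] of Byte;
  Start: Int64;
  Remaining: Longint;
begin
  Result := '';
  Start := Stream.Position;
  FillChar(Head, SizeOf(Head), 0);

  // Reading the three bytes already moves the position past a BOM;
  // rewind only when the bytes are missing or are not a BOM.
  if (Stream.Read(Head, SizeOf(Head)) <> SizeOf(Head)) or
     not CompareMem(@Head, @Utf8Bom, SizeOf(Utf8Bom)) then
    Stream.Position := Start;

  // Assumption: the remaining payload fits in a Longint.
  Remaining := Longint(Stream.Size - Stream.Position);
  if Remaining > 0 then
  begin
    SetLength(Result, Remaining);
    // Read directly into the string buffer: one allocation, one copy.
    Stream.ReadBuffer(Result[1], Remaining);
  end;
end;

const
  // UTF-8 BOM followed by the seven bytes of {"a":1}.
  Data: array[0..9] of Byte =
    ($EF, $BB, $BF, Ord('{'), Ord('"'), Ord('a'), Ord('"'),
     Ord(':'), Ord('1'), Ord('}'));

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Data, SizeOf(Data));
    MS.Position := 0;
    WriteLn(ReadUtf8SkippingBom(MS));
  finally
    MS.Free;
  end;
end.
```

The design choice is simply to let the stream position do the work: the BOM is never placed in the string, so there is nothing to delete or copy afterwards.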