isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

Unexpected symbol '\ufeff' at 0 #73

Closed piskvorky closed 5 years ago

piskvorky commented 5 years ago

Python==3.6.5, ijson==2.3:

import ijson, io
list(ijson.parse(io.StringIO('\ufeff{"hello": "world"}')))

UnexpectedSymbol: Unexpected symbol '\ufeff' at 0

This happens routinely on Windows, where many tools (Notepad, for example) insert the BOM character at the beginning of UTF-8 files.

What is the recommended way to deal with parsing JSON file streams on Windows?
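
For context, here is a minimal sketch of where the character comes from. It assumes the file was saved as UTF-8 with a BOM and then decoded with the plain utf-8 codec; the payload is the one from the report above.

```python
import codecs

# Bytes as a BOM-writing editor would save them: EF BB BF, then the JSON.
raw = codecs.BOM_UTF8 + b'{"hello": "world"}'

# The plain utf-8 codec keeps the BOM as an ordinary character, U+FEFF,
# so it becomes the first character the parser sees.
text = raw.decode("utf-8")
print(repr(text[0]))
```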

SalomonSmeke commented 5 years ago

The issue here is just the BOM character (with it, this is an invalid JSON stream).

For any buffer class which passes issubclass(MyClass, TextIOBase) I would do something like:

from io import StringIO

windows_buf = StringIO("\ufeff{}")
normal_buf = StringIO("{}")
empty_buf = StringIO("")

BOM = "\ufeff"

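def sanitize_TextIOBase(buf):
  # Consume the first character; if it is not a BOM, rewind so nothing is lost.
  # This needs a seekable buffer.
  first = buf.read(1)

  if first != BOM:
    buf.seek(0)

  return buf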

print(sanitize_TextIOBase(windows_buf).read()) # {}
print(sanitize_TextIOBase(normal_buf).read()) # {}
print(sanitize_TextIOBase(empty_buf).read()) # (nothing)

Or if you're certain that you'll have some JSON eventually and just wanna sanitize the start of the file fully:

from io import StringIO

weird_buf = StringIO("1234{}")
normal_buf = StringIO("{}")
empty_buf = StringIO("")
bad_buf = StringIO("123")

JSON_OPEN_GRAMMARS = {"{", "["}

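def sanitize_TextIOBase(buf):
  # Skip characters until the first "{" or "[" (or end of input), then
  # reposition the buffer so the next read starts at that opening character.
  read_c = "__"  # sentinel that fails the predicate, so the loop runs at least once
  predicate = lambda test_value: len(test_value) == 0 or test_value in JSON_OPEN_GRAMMARS
  seek = -1  # index of the most recently read character

  while not predicate(read_c):
    read_c = buf.read(1)
    seek += 1

  if read_c in JSON_OPEN_GRAMMARS:
    # Character offsets are fine for StringIO; for text files, seek() only
    # supports offsets returned by tell().
    buf.seek(seek)

  return buf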

print(sanitize_TextIOBase(weird_buf).read()) # {}
print(sanitize_TextIOBase(normal_buf).read()) # {}
print(sanitize_TextIOBase(empty_buf).read()) # (nothing)
print(sanitize_TextIOBase(bad_buf).read()) # (nothing)

SalomonSmeke commented 5 years ago

Here is some Stack Overflow information in case your buffer is not a subclass of TextIOBase: https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python
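
For the binary case, one option is Python's "utf-8-sig" codec. It drops a leading BOM while decoding and otherwise behaves like utf-8. The sketch below is only an illustration: as_text is a hypothetical helper name, and the BytesIO objects stand in for files opened in binary mode.

```python
import codecs
import io


def as_text(binary_stream):
    # Hypothetical helper: wrap a binary stream so that a leading UTF-8 BOM,
    # if present, is skipped during decoding.
    return io.TextIOWrapper(binary_stream, encoding="utf-8-sig")


with_bom = as_text(io.BytesIO(codecs.BOM_UTF8 + b'{"hello": "world"}')).read()
without_bom = as_text(io.BytesIO(b'{"hello": "world"}')).read()

print(with_bom)
print(without_bom)
```

The same codec name can be passed to open(path, encoding="utf-8-sig") when reading from disk. The resulting text stream should then be usable wherever the StringIO from the original report was.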

piskvorky commented 5 years ago

Thanks @SalomonSmeke, that's helpful.

Basically, there's no "built-in" support in ijson and we should sanitize inputs outside of ijson if possible (that seek(0) may be problematic for streams that are not seekable).
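
A rough sketch of sanitizing without seek(), for streams that cannot rewind. SkipBOM and NoSeek are hypothetical names made up for this illustration, and it assumes the consumer only ever calls read() on a text stream.

```python
import io

BOM = "\ufeff"


class SkipBOM:
    """Hypothetical wrapper: drops one leading BOM without calling seek()."""

    def __init__(self, stream):
        self._stream = stream
        self._first = True  # True until the first non-empty chunk is seen

    def read(self, size=-1):
        chunk = self._stream.read(size)
        if self._first and chunk:
            self._first = False
            if chunk.startswith(BOM):
                chunk = chunk[1:]
                if not chunk:
                    # The BOM was the whole chunk; read again so that an
                    # empty string is not mistaken for end of input.
                    chunk = self._stream.read(size)
        return chunk


class NoSeek:
    """Stand-in for a non-seekable text stream: it exposes read() only."""

    def __init__(self, text):
        self._inner = io.StringIO(text)

    def read(self, size=-1):
        return self._inner.read(size)


result = SkipBOM(NoSeek('\ufeff{"hello": "world"}')).read()
print(result)
```

Only the first character is ever inspected, so a U+FEFF appearing later in the stream is passed through untouched.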

SalomonSmeke commented 5 years ago

@piskvorky no problem friend.

For a more generic answer be sure to check out that Stack Overflow link!