Given that the issue here is just the BOM character (which makes this an invalid JSON stream), for any buffer class that passes issubclass(MyClass, TextIOBase) I would do something like:
from io import StringIO

windows_buf = StringIO("\ufeff{}")
normal_buf = StringIO("{}")
empty_buf = StringIO("")

BOM = "\ufeff"

def sanitize_TextIOBase(buf):
    # Consume the first character; rewind only if it was not a BOM.
    first = buf.read(1)
    if first != BOM:
        buf.seek(0)
    return buf

print(sanitize_TextIOBase(windows_buf).read())  # {}
print(sanitize_TextIOBase(normal_buf).read())  # {}
print(sanitize_TextIOBase(empty_buf).read())  # (nothing)
Or if you're certain that you'll have some JSON eventually and just wanna sanitize the start of the file fully:
from io import StringIO

weird_buf = StringIO("1234{}")
normal_buf = StringIO("{}")
empty_buf = StringIO("")
bad_buf = StringIO("123")

JSON_OPEN_GRAMMARS = {"{", "["}

def sanitize_TextIOBase(buf):
    # Read one character at a time until "{" or "[" appears, or until
    # the buffer is exhausted (read() then returns "").
    read_c = "__"
    predicate = lambda test_value: len(test_value) == 0 or test_value in JSON_OPEN_GRAMMARS
    seek = -1
    while not predicate(read_c):
        read_c = buf.read(1)
        seek += 1
    # Step back so the opening character is the next thing read.
    if read_c in JSON_OPEN_GRAMMARS:
        buf.seek(seek)
    return buf

print(sanitize_TextIOBase(weird_buf).read())  # {}
print(sanitize_TextIOBase(normal_buf).read())  # {}
print(sanitize_TextIOBase(empty_buf).read())  # (nothing)
print(sanitize_TextIOBase(bad_buf).read())  # (nothing)
Here is some Stack Overflow information in case your buffer is not a subclass of TextIOBase: https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python
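As a rough sketch of the kind of approach that applies when the source is a binary buffer rather than a TextIOBase: Python's built-in "utf-8-sig" codec drops a leading BOM while decoding and leaves BOM-free input untouched. The sample payload and the as_text helper below are made up for illustration.

```python
import io
import json

payload = b'{"key": "value"}'
with_bom = io.BytesIO(b"\xef\xbb\xbf" + payload)  # UTF-8 BOM followed by JSON
without_bom = io.BytesIO(payload)

def as_text(binary_buf):
    # Hypothetical helper: "utf-8-sig" skips a leading BOM if present
    # and otherwise behaves like plain UTF-8.
    return io.TextIOWrapper(binary_buf, encoding="utf-8-sig")

print(json.load(as_text(with_bom)))     # {'key': 'value'}
print(json.load(as_text(without_bom)))  # {'key': 'value'}
```

The same encoding name can be passed to open(path, encoding="utf-8-sig") for files on disk.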
Thanks @SalomonSmeke, that's helpful.
Basically, there's no "built-in" support in ijson and we should sanitize inputs outside of ijson if possible (that seek(0) may be problematic in streams).
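If seek() is the worry, one possible way around it is a thin wrapper that inspects only the first chunk it reads and never rewinds. This is a sketch, not something ijson provides: BOMStrippingReader is a hypothetical name, and it assumes the consumer only ever calls read(size) on a text stream.

```python
from io import StringIO

BOM = "\ufeff"

class BOMStrippingReader:
    """Hypothetical wrapper: drops a leading BOM without calling seek()."""

    def __init__(self, stream):
        self._stream = stream
        self._at_start = True

    def read(self, size=-1):
        chunk = self._stream.read(size)
        # Only the first non-empty chunk can contain the BOM.
        if self._at_start and chunk:
            self._at_start = False
            if chunk.startswith(BOM):
                # If the BOM was the whole chunk, read again so that ""
                # is returned only at a real end of stream.
                chunk = chunk[len(BOM):] or self._stream.read(size)
        return chunk

print(BOMStrippingReader(StringIO("\ufeff{}")).read())  # {}
print(BOMStrippingReader(StringIO("{}")).read())        # {}
```

Because nothing is rewound, the same idea should also apply to sockets or pipes wrapped as text streams, although that case is not exercised here.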
@piskvorky no problem friend.
For a more generic answer be sure to check out that S/Overflow link!
Python==3.6.5, ijson==2.3:
This happens routinely on Windows, which inserts the BOM character at the beginning of UTF-8 files.
What is the recommended way to deal with parsing JSON file streams on Windows?
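For a file opened in binary mode, one option (a sketch under assumptions, not an official ijson recommendation) is to peek at the first three bytes and consume them only when they match the UTF-8 BOM. skip_utf8_bom is a hypothetical helper, and it assumes the stream is an io.BufferedReader, which is what open(path, "rb") returns by default.

```python
import codecs
import io

def skip_utf8_bom(binary_stream):
    # peek() does not advance the position. It may return more or fewer
    # bytes than requested, so slice down to the BOM length before comparing.
    bom_len = len(codecs.BOM_UTF8)
    head = binary_stream.peek(bom_len)[:bom_len]
    if head == codecs.BOM_UTF8:
        binary_stream.read(bom_len)  # consume the BOM
    return binary_stream

# In-memory stand-ins for open(path, "rb")
with_bom = io.BufferedReader(io.BytesIO(codecs.BOM_UTF8 + b"{}"))
without_bom = io.BufferedReader(io.BytesIO(b"{}"))

print(skip_utf8_bom(with_bom).read())     # b'{}'
print(skip_utf8_bom(without_bom).read())  # b'{}'
```

The returned stream can then be handed to whatever parser expects bytes, with no seek() involved.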