Closed mbochk closed 6 years ago
This is currently an open issue; xref #13774.
We have tests, but it's not implemented on the writer side; the reader side should work.
Can you provide a reproducible example showing that this does not work? We can add that as a test.
import json
import pandas as pd

path = "path_to/example.txt"  # placeholder: point this at the attached file

try:
    # not working
    df1 = pd.read_json(path, encoding='cp1251')
except Exception:
    print("pd read failed")
else:
    print("pd read complete")

try:
    # the json module with an explicit encoding works (Python 2)
    with open(path, 'r') as f:
        js = json.load(f, encoding='cp1251')
    df2 = pd.DataFrame(js)
    assert df2.shape == (1, 19)
except Exception:
    print("json read failed")
else:
    print("json read complete")
I do get "pd read failed" and "json read complete" with the attached 'example.txt'. I had to rename the extension, but it should be valid JSON in 'cp1251' (Notepad++ says 'windows-1251', which is a synonym and gives the same results).
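As a possible workaround until the reader honors the option, the file can be decoded explicitly before the JSON parser sees it. The following is only a sketch: the record is a hypothetical two-field stand-in for the attached example.txt, and the temporary file is created just so the snippet runs on its own.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# Hypothetical stand-in for the attached file: one record, saved as cp1251.
records = [{"ADRES": u"Бесединское шоссе, дом 17, строение 10", "DMT": "17"}]
payload = json.dumps(records, ensure_ascii=False)

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="cp1251") as f:
    f.write(payload)

# Decode while reading, so the JSON parser only ever receives text,
# never raw cp1251 bytes.
with io.open(path, "r", encoding="cp1251") as f:
    text = f.read()

loaded = json.loads(text)
os.remove(path)
print(len(loaded))
```

The resulting list of dicts can then be handed to pd.DataFrame(loaded), which sidesteps the encoding handling inside pd.read_json entirely.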
Yep, I agree. Something is not getting decoded properly (works on py3, but not on py2). Want to have a look?
In [1]: pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-4b2729700154> in <module>()
----> 1 pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
347 obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
348 keep_default_dates, numpy, precise_float,
--> 349 date_unit).parse()
350
351 if typ == 'series' or obj is None:
/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in parse(self)
415
416 else:
--> 417 self._parse_no_numpy()
418
419 if self.obj is None:
/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in _parse_no_numpy(self)
632 if orient == "columns":
633 self.obj = DataFrame(
--> 634 loads(json, precise_float=self.precise_float), dtype=None)
635 elif orient == "split":
636 decoded = dict((str(k), v)
ValueError: Invalid octet in UTF-8 sequence when decoding 'string'
On Python 3.5:
In [1]: pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
Out[1]:
ADRES AdmArea DDOC DMT DREG KAD_KV KAD_RN KAD_ZU NDOC NREG SOOR STRT TDOC UNOM VLD \
0 Бесединское шоссе, дом 17, строение 10 [] 17.07.2015 17 22.07.2015 0 0 0 01-41-321 5015930 Строение 10 Распоряжение префектуры АО города Москвы 3811559 Дом
VYVAD geoData global_id system_object_id
0 адрес утвержден распорядительным документом {'center': [[37.7690069572664, 55.623022198294... 163879706 3811559
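The ValueError in the 2.7 traceback is what one would expect if the cp1251 bytes reach the parser undecoded and are then treated as UTF-8. The sketch below illustrates that with the standard codecs only; it does not call pandas, and the sample string is just a fragment of the ADRES value shown above.

```python
# -*- coding: utf-8 -*-
text = u"Бесединское шоссе"
raw = text.encode("cp1251")  # the bytes as they would sit in the file

# Cyrillic letters in cp1251 are single bytes in the 0xC0-0xFF range,
# which do not form valid UTF-8 sequences here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the right codec restores the original text.
roundtrip = raw.decode("cp1251")
print(utf8_ok, roundtrip == text)
```

So the file itself is fine; the failure only shows that the encoding argument never reached the step that turns bytes into text.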
duplicate of #13774
Code Sample, a copy-pastable example if possible
Problem description
It is not mentioned explicitly in the docstring that the encoding option is used in py3 only. Currently pd.read_json mostly ignores the encoding= option in Python 2. The function pd.common._get_handle warns about using encoding with compression, but otherwise silently continues without actually using encoding. It looks like the subtasks are split in a way that makes it awkward to pass encoding up to the json.loads call.
Expected Output
One might expect pandas to use encoding, to make life easier (as pandas usually does ;) ). Or at least properly warn that the option is ignored.
Output of pd.show_versions()