pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

json encoding for python 2 #15715

Closed mbochk closed 6 years ago

mbochk commented 7 years ago

Code Sample, a copy-pastable example if possible

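import json
import pandas as pd

# Does not work on Python 2: the encoding argument is not applied
pd.read_json(path, encoding='cp1251')

# Works on Python 2: json.load decodes the content with the given codec
with open(path, 'r') as f:
    js = json.load(f, encoding='cp1251')
pd.DataFrame(js)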

Problem description

The docstring does not state explicitly that the encoding option is only used on Python 3.

Currently pd.read_json mostly ignores the encoding= option on Python 2. The function pd.common._get_handle warns when encoding is used together with compression, but otherwise silently continues without actually applying the encoding.

It looks like the work is split across functions in a way that makes it awkward to pass encoding up to the json.loads call.

Expected Output

One might expect pandas to use the encoding and make life easier (as pandas usually does ;) ), or at least to warn properly that the option is ignored.
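Until the option is honoured, one possible workaround is to decode the file while opening it, so that the JSON parser only ever sees text. This is a minimal sketch, assuming the file is valid JSON in a known codec; the helper name `load_json_with_encoding` is hypothetical and not part of pandas.

```python
import io
import json


def load_json_with_encoding(path, encoding):
    """Decode the file while reading it, then parse the resulting text.

    Hypothetical helper, not part of pandas. io.open accepts an
    encoding argument on both Python 2 and Python 3.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return json.loads(f.read())
```

The parsed object can then be handed to the DataFrame constructor, for example `pd.DataFrame(load_json_with_encoding(path, 'cp1251'))`, which should give the same frame as the json.load variant in the code sample.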

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0.post20161110
Cython: 0.24.1
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

jreback commented 7 years ago

This is currently an open issue, xref #13774.

We have tests, but it's not implemented on the writer side; the reader side should work.

Can you provide a reproducible example showing that this does not work? We can add that as a test.

mbochk commented 7 years ago
import json
import pandas as pd

path = "path_to/example.txt"  # placeholder: point this at the attached file

try:
    # Does not work on Python 2
    df1 = pd.read_json(path, encoding='cp1251')
except Exception:
    print("pd read failed")
else:
    print("pd read complete")

try:
    # Works on Python 2: json.load applies the given codec
    with open(path, 'r') as f:
        js = json.load(f, encoding='cp1251')
    df2 = pd.DataFrame(js)
    assert df2.shape == (1, 19)
except Exception:
    print("json read failed")
else:
    print("json read complete")

example.txt

I get "pd read failed" and "json read complete" with the attached 'example.txt'. I had to rename the extension, but it should be valid JSON in 'cp1251' (Notepad++ says 'windows-1251', which is a synonym and gives the same results).

jreback commented 7 years ago

Yep, I agree. Something is not getting decoded properly (works on py3, but not on 2). Want to have a look?

In [1]: pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-4b2729700154> in <module>()
----> 1 pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')

/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    347         obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
    348                           keep_default_dates, numpy, precise_float,
--> 349                           date_unit).parse()
    350 
    351     if typ == 'series' or obj is None:

/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in parse(self)
    415 
    416         else:
--> 417             self._parse_no_numpy()
    418 
    419         if self.obj is None:

/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in _parse_no_numpy(self)
    632         if orient == "columns":
    633             self.obj = DataFrame(
--> 634                 loads(json, precise_float=self.precise_float), dtype=None)
    635         elif orient == "split":
    636             decoded = dict((str(k), v)

ValueError: Invalid octet in UTF-8 sequence when decoding 'string'
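
The message suggests that the raw bytes reached the parser and were treated as UTF-8. A small stdlib-only illustration of why that has to fail (the sample string is only illustrative, and nothing here calls pandas): Cyrillic text encoded as cp1251 is not valid UTF-8, while decoding with the right codec recovers it.

```python
# -*- coding: utf-8 -*-
text = u"Бесединское шоссе"
raw = text.encode("cp1251")  # the bytes as they would sit in a cp1251 file

# Interpreting those bytes as UTF-8 is rejected by the decoder.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the codec the file was written in restores the text.
roundtrip = raw.decode("cp1251")

print(decoded_as_utf8)    # False
print(roundtrip == text)  # True
```

So if the encoding argument never reaches the decoding step, an error of this kind is what one would expect for this file.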

On Python 3.5:

In [1]: pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
Out[1]: 
                                    ADRES AdmArea        DDOC  DMT        DREG  KAD_KV  KAD_RN  KAD_ZU       NDOC     NREG      SOOR  STRT                                      TDOC     UNOM  VLD  \
0  Бесединское шоссе, дом 17, строение 10      []  17.07.2015   17  22.07.2015       0       0       0  01-41-321  5015930  Строение    10  Распоряжение префектуры АО города Москвы  3811559  Дом   

                                         VYVAD                                            geoData  global_id  system_object_id  
0  адрес утвержден распорядительным документом  {'center': [[37.7690069572664, 55.623022198294...  163879706           3811559  
jreback commented 6 years ago

duplicate of #13774