pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.83k stars 17.99k forks source link

Incorrect json round-trip with orient='table' when dataframe contains duplicate index values #29025

Open afoda opened 5 years ago

afoda commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({"col1": [1, 2]}, index=[0, 0])
print(pd.read_json(df.to_json(orient='table'), orient='table'))
print(pd.read_json(df.to_json(orient='split'), orient='split'))

Output:

   index  col1
0      0     1
1      0     2
   col1
0     1
0     2

Problem description

I would expect the json roundtrip to output a dataframe that is identical to the input, regardless of the value of the orient parameter.

Expected Output

   col1
0     1
0     2
   col1
0     1
0     2

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.4.final.0 python-bits : 64 OS : Linux OS-release : 4.15.0-50-generic machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 0.25.1 numpy : 1.16.4 pytz : 2019.3 dateutil : 2.8.0 pip : 19.1.1 setuptools : 41.0.1 Cython : None pytest : 5.0.1 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : 0.9.3 psycopg2 : 2.8.3 (dt dec pq3 ext lo64) jinja2 : 2.10.1 IPython : 7.7.0 pandas_datareader: None bs4 : None bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.1.0 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pytables : None s3fs : None scipy : 1.3.0 sqlalchemy : 1.3.5 tables : None xarray : None xlrd : None xlwt : None xlsxwriter : None
gabriellm1 commented 5 years ago

Can I take this?

WillAyd commented 5 years ago

I don't think the JSON table schema specifies anything about the index:

https://pre-v1.frictionlessdata.io/json-table-schema/#schema

Could maybe argue the index could be represented as primaryKeys but let's see what others think

TomAugspurger commented 5 years ago

We write it as the primaryKey when it's unique

The schema field also contains a primaryKey field if the (Multi)index is unique.

from https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#table-schema

WillAyd commented 5 years ago

Ah thanks Tom. Any chance you know the history of that requirement? Is it because primary keys in databases are typically unique?

The JSON table schema doesn't require uniqueness for primary keys and that actually gets controlled by a unique constraint in the schema:

https://pre-v1.frictionlessdata.io/json-table-schema/#field-constraints

Was discussed in the issue to add it there as well:

https://github.com/dataprotocols/dataprotocols/issues/21

So this seems like it could be supported

TomAugspurger commented 5 years ago

Huh, I read it as needing to be unique: https://pre-v1.frictionlessdata.io/json-table-schema/#primary-key

A primary key is a field or set of fields that uniquely identifies each row in the table.

But I may be wrong.