pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.3k stars 17.8k forks

Add JSON export option for DataFrame #631

Closed wesm closed 12 years ago

aman-thakral commented 12 years ago

I actually need to do this for a current project I'm working on. I'll get started on tackling this if it is an open issue. I will probably be using the gviz api as reference (http://code.google.com/p/google-visualization-python/).

wesm commented 12 years ago

By all means go right ahead. @mikedewar may also be interested for his project https://github.com/mikedewar/D3py

mikedewar commented 12 years ago

Would be happy to see this exist! In fact I made a gist a while ago to do it:

https://gist.github.com/1486027

Please feel free to use as a starting point! Probably could do with a bit more consideration in terms of multiple levels of keys and other stuff about data frames that I don't know about yet.

wesm commented 12 years ago

Now if we want to be truly hardcore (and why wouldn't we be?) we should fork UltraJSON and make it DataFrame-specific to get the best performance

aman-thakral commented 12 years ago

An interesting idea. I'll have to examine the code, although my experience with C is somewhat limited. I may need to do some serious review, but it will be excellent practice nonetheless. Also, I had a look at the google-visualization-python api and I like the use of a "table description" that you can pass it to define the desired structure of the json string. This provides a great deal of flexibility that would be really useful, and would make using the string in something like google charts really easy.

Komnomnomnom commented 12 years ago

Hi all,

I've done some preliminary work in this direction. In my fork of ujson I've added some basic support for numpy. Right now it just handles some of the basic numpy scalars and 1D arrays. The implementation isn't perfect (I'm a bit concerned with casting everything) but it seems to work ok. The goal is to eventually add support for numpy N-dimensional arrays (possibly with a max limit on N) and pandas data types, specifically Series and DataFrame.

It's my first time dealing with the Python and Numpy C-APIs so any comments are welcome!

https://github.com/Komnomnomnom/ultrajson/commit/511ec035957817fb309577bafe267bc2b771f547

Komnomnomnom commented 12 years ago

Encoding support for DataFrame, Series and Index is now committed, as well as proper support for encoding numpy arrays. I'm still not sure how to properly handle decoding; right now I'm just passing the decoded dict / list to the relevant data-type's constructor.

I decided to encode the DataFrame index and column labels separately (it suits my purposes and I think it's more efficient to work on the underlying numpy arrays). So you end up with something like:

>>> df = DataFrame([[1,2,3], [4,5,6]], columns=['x', 'y', 'z'], index=['a', 'b'])
>>> ujson.encode(df)
'{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'

wesm commented 12 years ago

I think what's needed for @mikedewar's needs and others would be:

'[{"a":1,"b":2,"c":3}, {"a":4,"b":5,"c":6}]'

when you deserialize that and pass it to DataFrame, you get back the same DataFrame:

In [3]: DataFrame(json.loads('[{"a":1,"b":2,"c":3}, {"a":4,"b":5,"c":6}]'))
Out[3]: 
   a  b  c
0  1  2  3
1  4  5  6

However, this doesn't give you the row index, but that's not a big deal for the particular use case (feeding a DataFrame into d3 or something else)

Komnomnomnom commented 12 years ago

Ok, I was initially going to match the output of the to_dict() method but preferred the output above for my purposes. Note you can still recreate the DataFrame using:

>>> DataFrame(**ujson.loads('{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'))
   x  y  z
a  1  2  3
b  4  5  6

That said I don't think it would be too difficult to add an option to produce output like you mentioned. How about a labelled option where the output would be identical to the to_dict() method. e.g.

>>> df = DataFrame([[1,2,3], [4,5,6]], columns=['x', 'y', 'z'], index=['a', 'b'])
>>> ujson.dumps(df, labelled=True)
'{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'

Or is it absolutely necessary to suppress the index labels?
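The `**` reconstruction above works because the keys of the encoded object ("columns", "index", "data") match the keyword names the constructor accepts. Below is a minimal stdlib-only sketch of that idea. `Table` is a hypothetical stand-in for DataFrame, not pandas code, and its `to_dict()` only imitates the column-to-index nesting described for the proposed labelled option.

```python
import json
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for DataFrame that accepts the same three keywords."""
    data: list
    index: list
    columns: list

    def to_dict(self):
        # Nest as {column: {row label: value}}, the layout shown for labelled=True.
        return {
            col: {label: row[j] for label, row in zip(self.index, self.data)}
            for j, col in enumerate(self.columns)
        }


encoded = '{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'

# The decoded dict's keys line up with the constructor keywords, so ** works.
table = Table(**json.loads(encoded))

# Compact separators reproduce the whitespace-free style used in the thread.
labelled = json.dumps(table.to_dict(), separators=(",", ":"))
print(labelled)
```

If the encoded object carried any key the constructor does not accept, the `**` call would raise a TypeError, so this style of round trip depends on the two sides agreeing on names.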

wesm commented 12 years ago

I'm thinking it might be preferable to ship the relevant ultrajson code in pandas and use it to implement Series.to_json and DataFrame.to_json. But having multiple output options makes sense, including the "records format" where the index is ignored, or could be put in each JSON object in the list
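As a rough illustration of the two "records" variants mentioned here (index ignored, or index placed in each JSON object), the following sketch builds both from plain Python lists using only the standard library. The helper `to_records` and its `index_key` parameter are hypothetical names chosen for this example; they are not part of pandas or ujson.

```python
import json


def to_records(columns, index, data, index_key=None):
    """Encode rows as a JSON list of objects, optionally embedding the row label.

    `index_key` is a made-up parameter: when given, each record gets an extra
    entry under that key holding the row label.
    """
    records = []
    for label, row in zip(index, data):
        record = dict(zip(columns, row))
        if index_key is not None:
            record[index_key] = label
        records.append(record)
    return json.dumps(records, separators=(",", ":"))


columns, index, data = ["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]]

without_index = to_records(columns, index, data)
with_index = to_records(columns, index, data, index_key="index")

print(without_index)
print(with_index)
```

Embedding the label keeps the output usable as a flat list of objects, at the cost of a possible clash if a column already uses the chosen key.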

Komnomnomnom commented 12 years ago

Agreed, it would make sense for it to be included in pandas.

I think all the ujson code is required (as it will still have to deal with basic types), albeit tailored for numpy and pandas types. I can fork and attempt to introduce it into pandas if you point me in the right direction. Ujson is composed of several different C files, and I'm not sure where to put them or how to include them in the build process.

wesm commented 12 years ago

You would want to put it in a subdirectory of pandas/src and co-opt the extension configuration from the UltraJSON setup.py file

Komnomnomnom commented 12 years ago

I've finally got around to revisiting this. I've added support to my fork of ujson for different output formats when encoding pandas data types:

In [4]: df = DataFrame([[1,2,3], [4,5,6]], index=['a', 'b'], columns=['x', 'y', 'z']) 

In [5]: ujson.encode(df, format="headers")
Out[5]: '{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'

In [6]: ujson.encode(df, format="records")
Out[6]: '[{"x":1,"y":2,"z":3},{"x":4,"y":5,"z":6}]'

In [7]: ujson.encode(df, format="indexed")
Out[7]: '{"a":{"x":1,"y":2,"z":3},"b":{"x":4,"y":5,"z":6}}'

In [8]: ujson.encode(df, format="column_indexed")
Out[8]: '{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'

If format isn't specified encoding defaults to the column_indexed format as it matches the output of to_dict() and it can be given straight to the DataFrame constructor. All of the encoding / iteration is performed in ujson in C.
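For readers who want to see how the four layouts relate to each other, here is a pure-Python sketch that derives each one from the same columns, index and data lists. It is only an illustration of the output shapes shown above: the real encoding happens in C inside the ujson fork, and `encode_frame` and its `fmt` parameter are names invented for this example.

```python
import json


def encode_frame(columns, index, data, fmt="column_indexed"):
    """Pure-Python stand-in for the four layouts; not the C encoder."""
    if fmt == "headers":
        payload = {"columns": columns, "index": index, "data": data}
    elif fmt == "records":
        # One object per row, row labels dropped.
        payload = [dict(zip(columns, row)) for row in data]
    elif fmt == "indexed":
        # Row label -> {column: value}.
        payload = {label: dict(zip(columns, row)) for label, row in zip(index, data)}
    elif fmt == "column_indexed":
        # Column -> {row label: value}, the nesting to_dict() produces.
        payload = {
            col: {label: row[j] for label, row in zip(index, data)}
            for j, col in enumerate(columns)
        }
    else:
        raise ValueError(f"unknown format: {fmt!r}")
    return json.dumps(payload, separators=(",", ":"))


columns, index, data = ["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]]
for fmt in ("headers", "records", "indexed", "column_indexed"):
    print(fmt, encode_frame(columns, index, data, fmt))
```

Only "headers" preserves the order of rows and columns explicitly as arrays; the other layouts rely on object key order or drop the row labels altogether.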

I've added similar support for Series and Index (although some of the formats don't suit them it tries to handle them sensibly)


In [9]: s = Series([10, 20, 30, 40, 50, 60], name="myseries", index=[6,7,8,9,10,15])

In [10]: ujson.encode(s, format="headers")
Out[10]: '{"name":"myseries","index":[6,7,8,9,10,15],"data":[10,20,30,40,50,60]}'

In [11]: ujson.encode(s, format="records")
Out[11]: '[10,20,30,40,50,60]'

In [12]: ujson.encode(s, format="indexed")
Out[12]: '{"6":10,"7":20,"8":30,"9":40,"10":50,"15":60}'

In [13]: ujson.encode(s, format="column_indexed")
Out[13]: '{"6":10,"7":20,"8":30,"9":40,"10":50,"15":60}'

In [14]: i = Index([23, 45, 18, 98, 43, 11], name="myindex")

In [15]: ujson.encode(i, format="headers")
Out[15]: '{"name":"myindex","data":[23,45,18,98,43,11]}'

In [16]: ujson.encode(i, format="records")
Out[16]: '[23,45,18,98,43,11]'

In [17]: ujson.encode(i, format="indexed")
Out[17]: '[23,45,18,98,43,11]'

In [18]: ujson.encode(i, format="column_indexed")
Out[18]: '[23,45,18,98,43,11]'
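One detail worth noting in the Series output above: JSON object keys are always strings, so the integer index labels appear quoted in the "indexed" layout, whereas the "headers" layout keeps them as numbers inside an array. The stdlib-only sketch below shows the same effect with a plain dict and one possible way to restore the integer labels after decoding; it does not use the ujson fork.

```python
import json

index = [6, 7, 8, 9, 10, 15]
values = [10, 20, 30, 40, 50, 60]

# json.dumps converts the integer keys to strings, as any JSON encoder must.
encoded = json.dumps(dict(zip(index, values)), separators=(",", ":"))
print(encoded)

# After decoding, the keys are strings rather than integers.
decoded = json.loads(encoded)
print(list(decoded)[:3])

# Converting the keys back is the caller's job in this sketch.
restored = {int(key): value for key, value in decoded.items()}
print(restored == dict(zip(index, values)))
```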

My next step is to integrate this into pandas but I'd welcome any comments. Are there values for the format argument that would fit better with existing pandas code?

wesm commented 12 years ago

Hm, I'll think about the API. What you propose looks pretty good and you could just go for that for now, adding a to_json method to Series and DataFrame. I think it would make sense to ship a pared down version of ujson in pandas (and have lots of tests, of course). Could put the source code in pandas.io or somewhere like that.

Komnomnomnom commented 12 years ago

ujson is pure C, no python file except for setup.py and some test classes. I think all of it is required though (apart from its test code and metafiles) so it can properly handle whatever type happens to be in the DataFrame etc.

wesm commented 12 years ago

Right, so you would just need to set it up to build as a submodule inside pandas and wire it up with the new object instance methods, and write appropriate tests. If you do some of the heavy lifting to set this up and make a pull request I can integrate and round things out in a few weeks

Komnomnomnom commented 12 years ago

Hi Wes,

I've improved the performance a bit and made some other tweaks and improvements, most notably I've added support for direct decoding to numpy arrays which gets rid of the list to numpy array conversion step.

I've updated the README on my fork with more information and some simple benchmarks, https://github.com/Komnomnomnom/ultrajson. Although there were a couple of surprises I'm pretty happy with the overall performance.

Integrating with pandas and the pandas build was a lot more straightforward than I expected. I should send through a pull request later on today (I'll attach it to this issue if I can).

Oh and I've changed the format argument to 'orient', seems to fit better with other DataFrame methods and format clashes with a Python built-in. I also added the 'values' format which only encodes the DataFrame values array, ignoring column and index labels.
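To make the naming point concrete, here is a small sketch of what a clash with the built-in looks like: a parameter called `format` hides the built-in `format()` inside the function body, while `orient` leaves it reachable. Both functions are hypothetical illustrations rather than code from the fork, and the second one handles only the 'values' layout, which drops the column and index labels.

```python
import json


def encode_with_format(data, format="values"):
    # Here `format` is the string argument, so calling it raises TypeError.
    return format(len(data), "d")


def encode_with_orient(data, orient="values"):
    if orient != "values":
        raise ValueError(f"unsupported orient in this sketch: {orient!r}")
    row_count = format(len(data), "d")  # the built-in format() is still reachable
    # 'values' keeps only the nested value arrays, with no labels.
    return json.dumps(data, separators=(",", ":")), row_count


data = [[1, 2, 3], [4, 5, 6]]

try:
    encode_with_format(data)
    shadowing_error = None
except TypeError as exc:
    shadowing_error = str(exc)

print(shadowing_error)
print(encode_with_orient(data))
```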

wesm commented 12 years ago

Addressed by #1263, #1309

PhE commented 12 years ago

All issues related to DataFrame.to_json() seem to be closed, but version 0.8.1 has no DataFrame.to_json() method. Has this feature been released?

changhiskhan commented 12 years ago

It's not part of pandas for now due to issues with MinGW. It's in a separate project for now and we will revisit this issue when we can. Thanks.

On Aug 16, 2012 10:18 AM, "Philippe Entzmann" wrote:

All issues related to DataFrame.to_json() seems closed, but on version 0.8.1 there is not DataFrame.to_json() method. Is this feature released ?


PhE commented 12 years ago

MinGW issues are Windows related, I suppose. I'm on Linux; can you point me to the project/branch? (I am not a git/GitHub expert.) Thanks.

wesm commented 12 years ago

it's pydata/pandasjson