Closed wesm closed 12 years ago
By all means go right ahead. @mikedewar may also be interested for his project https://github.com/mikedewar/D3py
Would be happy to see this exist! In fact I made a gist a while ago to do it:
https://gist.github.com/1486027
Please feel free to use as a starting point! Probably could do with a bit more consideration in terms of multiple levels of keys and other stuff about data frames that I don't know about yet.
Now if we want to be truly hardcore (and why wouldn't we be?) we should fork UltraJSON and make it DataFrame-specific to get the best performance
An interesting idea. I'll have to examine the code, although my experience with C is somewhat limited. I may need to do some serious review, but it will be excellent practice nonetheless. Also, I had a look at the google-visualization-python api and I like the use of a "table description" that you can pass it to define the desired structure of the json string. This provides a great deal of flexibility that would be really useful, and would make using the string in something like google charts really easy.
Hi all,
I've done some preliminary work in this direction. In my fork of ujson I've added some basic support for numpy. Right now it just handles some of the basic numpy scalars and 1D arrays. The implementation isn't perfect (I'm a bit concerned with casting everything) but it seems to work ok. The goal is to eventually add support for numpy N-dimensional arrays (possibly with a max limit on N) and pandas data types, specifically Series and DataFrame.
It's my first time dealing with the Python and Numpy C-APIs so any comments are welcome!
https://github.com/Komnomnomnom/ultrajson/commit/511ec035957817fb309577bafe267bc2b771f547
Encoding support for DataFrame, Series and Index is now committed, as well as proper support for encoding numpy arrays. Still not sure how to properly handle decoding, right now I'm just passing the decoded dict / list to the relevant data-type's constructor.
I decided to encode the DataFrame index and column labels separately (it suits my purposes and I think it's more efficient to work on the underlying numpy arrays). So you end up with something like:
>>> df = DataFrame([[1,2,3], [4,5,6]], columns=['x', 'y', 'z'], index=['a', 'b'])
>>> ujson.encode(df)
'{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'
I think what's needed for @mikedewar's needs and others would be:
'[{"a":1,"b":2,"c":3}, {"a":4,"b":5,"c":6}]'
when you deserialize that and pass it to DataFrame, you get back the same DataFrame:
In [3]: DataFrame(json.loads('[{"a":1,"b":2,"c":3}, {"a":4,"b":5,"c":6}]'))
Out[3]:
   a  b  c
0  1  2  3
1  4  5  6
However, this doesn't give you the row index, but that's not a big deal for the particular use case (feeding a DataFrame into d3 or something else)
Ok, I was initially going to match the output of the to_dict() method but preferred the output above for my purposes. Note you can still recreate the DataFrame using:
>>> DataFrame(**ujson.loads('{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'))
   x  y  z
a  1  2  3
b  4  5  6
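The round trip above works because the JSON keys ("columns", "index", "data") match the constructor's keyword names, so the decoded dict can be unpacked with `**`. A minimal stdlib-only sketch of that idea, where `build_table` is a hypothetical stand-in for the DataFrame constructor and not part of pandas or ujson:

```python
import json


def build_table(columns, index, data):
    """Hypothetical stand-in for a constructor that accepts
    columns / index / data keyword arguments."""
    # Map each row label to a {column: value} dict for that row.
    return {label: dict(zip(columns, row)) for label, row in zip(index, data)}


payload = '{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'

# The decoded dict's keys line up with the keyword names, so ** unpacking works.
table = build_table(**json.loads(payload))
print(table)
```

If the encoder ever emitted a key the constructor does not accept, the `**` unpacking would raise a TypeError, so the key names are effectively part of the format.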
That said, I don't think it would be too difficult to add an option to produce output like you mentioned. How about a labelled option where the output would be identical to the to_dict() method, e.g.:
>>> df = DataFrame([[1,2,3], [4,5,6]], columns=['x', 'y', 'z'], index=['a', 'b'])
>>> ujson.dumps(df, labelled=True)
'{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'
Or is it absolutely necessary to suppress the index labels?
I'm thinking it might be preferable to ship the relevant ultrajson code in pandas and use it to implement Series.to_json and DataFrame.to_json. But having multiple output options makes sense, including the "records format" where the index is ignored, or could be put in each JSON object in the list.
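To make the two "records" variants concrete, here is a rough pure-Python sketch (not the ujson or pandas implementation). The function name `to_records` and the `index_key` parameter are assumptions made up for illustration:

```python
import json


def to_records(columns, index, data, index_key=None):
    """Encode rows as a JSON list of objects.

    If index_key is None the row labels are dropped; otherwise each
    object gets an extra entry holding its row label under that key.
    (Both the function and index_key are hypothetical.)
    """
    records = []
    for label, row in zip(index, data):
        record = dict(zip(columns, row))
        if index_key is not None:
            record[index_key] = label
        records.append(record)
    return json.dumps(records, separators=(",", ":"))


columns, index, data = ["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]]
print(to_records(columns, index, data))
print(to_records(columns, index, data, index_key="index"))
```

Embedding the label keeps the row identity for consumers such as d3, at the cost of a possible clash if a column already uses the chosen key.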
Agreed, it would make sense for it to be included in pandas.
I think all the ujson code is required (as it will still have to deal with basic types), albeit tailored for numpy and pandas types. I can fork and attempt to introduce it into pandas if you point me in the right direction. Ujson is composed of several different c files, I'm not sure where to put them and how to include them in the build process.
You would want to put it in a subdirectory of pandas/src and co-opt the extension configuration from the UltraJSON setup.py file.
I've finally got around to revisiting this. I've added support to my fork of ujson for different output formats when encoding pandas data types:
In [4]: df = DataFrame([[1,2,3], [4,5,6]], index=['a', 'b'], columns=['x', 'y', 'z'])
In [5]: ujson.encode(df, format="headers")
Out[5]: '{"columns":["x","y","z"],"index":["a","b"],"data":[[1,2,3],[4,5,6]]}'
In [6]: ujson.encode(df, format="records")
Out[6]: '[{"x":1,"y":2,"z":3},{"x":4,"y":5,"z":6}]'
In [7]: ujson.encode(df, format="indexed")
Out[7]: '{"a":{"x":1,"y":2,"z":3},"b":{"x":4,"y":5,"z":6}}'
In [8]: ujson.encode(df, format="column_indexed")
Out[8]: '{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'
If format isn't specified, encoding defaults to the column_indexed format, as it matches the output of to_dict() and it can be given straight to the DataFrame constructor. All of the encoding / iteration is performed in ujson in C.
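For readers without the fork at hand, the four layouts can be approximated in plain Python. This is only a sketch using the standard json module on lists, not the C implementation described above; `encode_frame` is a hypothetical helper, and only the format names are taken from the session:

```python
import json


def encode_frame(columns, index, data, fmt="column_indexed"):
    """Approximate the four layouts for a table given as plain lists."""
    if fmt == "headers":
        obj = {"columns": columns, "index": index, "data": data}
    elif fmt == "records":
        # One object per row; row labels are dropped.
        obj = [dict(zip(columns, row)) for row in data]
    elif fmt == "indexed":
        # Row label -> {column: value}
        obj = {label: dict(zip(columns, row)) for label, row in zip(index, data)}
    elif fmt == "column_indexed":
        # Column -> {row label: value}
        obj = {
            col: {label: row[j] for label, row in zip(index, data)}
            for j, col in enumerate(columns)
        }
    else:
        raise ValueError("unknown format: %r" % (fmt,))
    return json.dumps(obj, separators=(",", ":"))


columns, index, data = ["x", "y", "z"], ["a", "b"], [[1, 2, 3], [4, 5, 6]]
for fmt in ("headers", "records", "indexed", "column_indexed"):
    print(fmt, encode_frame(columns, index, data, fmt))
```

Only "headers" keeps the labels and values in separate arrays; the other three repeat labels as object keys, which is friendlier to JavaScript consumers but more verbose.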
I've added similar support for Series and Index (although some of the formats don't suit them, it tries to handle them sensibly):
In [9]: s = Series([10, 20, 30, 40, 50, 60], name="myseries", index=[6,7,8,9,10,15])
In [10]: ujson.encode(s, format="headers")
Out[10]: '{"name":"myseries","index":[6,7,8,9,10,15],"data":[10,20,30,40,50,60]}'
In [11]: ujson.encode(s, format="records")
Out[11]: '[10,20,30,40,50,60]'
In [12]: ujson.encode(s, format="indexed")
Out[12]: '{"6":10,"7":20,"8":30,"9":40,"10":50,"15":60}'
In [13]: ujson.encode(s, format="column_indexed")
Out[13]: '{"6":10,"7":20,"8":30,"9":40,"10":50,"15":60}'
In [14]: i = Index([23, 45, 18, 98, 43, 11], name="myindex")
In [15]: ujson.encode(i, format="headers")
Out[15]: '{"name":"myindex","data":[23,45,18,98,43,11]}'
In [16]: ujson.encode(i, format="records")
Out[16]: '[23,45,18,98,43,11]'
In [17]: ujson.encode(i, format="indexed")
Out[17]: '[23,45,18,98,43,11]'
In [18]: ujson.encode(i, format="column_indexed")
Out[18]: '[23,45,18,98,43,11]'
My next step is to integrate this into pandas but I'd welcome any comments. Are there values for the format argument that would fit better with existing pandas code?
Hm, I'll think about the API. What you propose looks pretty good and you could just go for that for now, adding a to_json method to Series and DataFrame. I think it would make sense to ship a pared down version of ujson in pandas (and have lots of tests, of course). Could put the source code in pandas.io or somewhere like that.
ujson is pure C, no python file except for setup.py and some test classes. I think all of it is required though (apart from its test code and metafiles) so it can properly handle whatever type happens to be in the DataFrame etc.
Right, so you would just need to set it up to build as a submodule inside pandas and wire it up with the new object instance methods, and write appropriate tests. If you do some of the heavy lifting to set this up and make a pull request I can integrate and round things out in a few weeks
Hi Wes,
I've improved the performance a bit and made some other tweaks and improvements, most notably I've added support for direct decoding to numpy arrays which gets rid of the list to numpy array conversion step.
I've updated the README on my fork with more information and some simple benchmarks, https://github.com/Komnomnomnom/ultrajson. Although there were a couple of surprises I'm pretty happy with the overall performance.
Integrating with pandas and the pandas build was a lot more straightforward than I expected. I should send through a pull request later on today (I'll attach it to this issue if I can).
Oh and I've changed the format argument to 'orient', seems to fit better with other DataFrame methods and format clashes with a Python built-in. I also added the 'values' format which only encodes the DataFrame values array, ignoring column and index labels.
Addressed by #1263, #1309
All issues related to DataFrame.to_json() seem to be closed, but in version 0.8.1 there is no DataFrame.to_json() method. Has this feature been released?
It's not part of pandas for now due to issues with MinGW. It's in a separate project for now and we will revisit this issue when we can. Thanks.
MinGW issues are Windows related, I suppose. I'm on Linux, can you point me to the project/branch ? (I am not a git/github master) Thanks.
it's pydata/pandasjson
I actually need to do this for a current project I'm working on. I'll get started on tackling this if it is an open issue. I will probably be using the gviz api as reference (http://code.google.com/p/google-visualization-python/).