wesm closed this issue 11 years ago
This is invalid JSON (according to jsonlint). Generalized inference is IMHO too complicated, but #3804 should be able to do some of this. Close this issue?
Is it feasible to grab the user section? (Actually this example from the other thread is better): convert the (data, posts) section to a DataFrame.
```python
s = r'''{
    "status": "success",
    "data": {
        "posts": [
            {"id": 1, "title": "A blog post", "body": "Some useful content"},
            {"id": 2, "title": "Another blog post", "body": "More content"}
        ]
    }
}'''

read_json(s, grab_nest=('data', 'posts'))  # some better argument name
```

```
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post
```
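For what it's worth, the `json_normalize` function that later landed in pandas (spelled `pd.json_normalize` since pandas 1.0) grew a `record_path` argument that does essentially this `grab_nest`; a minimal sketch:

```python
import pandas as pd

s = {
    "status": "success",
    "data": {
        "posts": [
            {"id": 1, "title": "A blog post", "body": "Some useful content"},
            {"id": 2, "title": "Another blog post", "body": "More content"},
        ]
    },
}

# record_path walks down ('data', 'posts') before building the frame
df = pd.json_normalize(s, record_path=["data", "posts"])
print(df.columns.tolist())  # ['id', 'title', 'body']
```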
This is probably getting too cute.....
```python
In [52]: def extract(df, l):
   ....:     for e in l:
   ....:         df = df[e]
   ....:     return df
   ....:

In [54]: DataFrame.extract = extract

In [56]: DataFrame(pd.read_json(s).extract(['data', 'posts']))
Out[56]:
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post
```
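Without monkey-patching `DataFrame`, the same helper works as a plain function on the decoded object (a sketch; the stdlib `json` stands in here for `pd.io.json`):

```python
import json

import pandas as pd


def extract(obj, keys):
    # walk down the nested keys and return the inner object
    for k in keys:
        obj = obj[k]
    return obj


s = '''{"status": "success",
        "data": {"posts": [
            {"id": 1, "title": "A blog post", "body": "Some useful content"},
            {"id": 2, "title": "Another blog post", "body": "More content"}]}}'''

df = pd.DataFrame(extract(json.loads(s), ("data", "posts")))
print(df.shape)  # (2, 3)
```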
Ha! Perhaps less overhead to do `pd.DataFrame(extract(pd.io.json.loads(s), ('data', 'posts')))`, but in either case we would lose the datetime parsing atm.
This could be a reasonably OK solution... it gets tricky with `orient` (?), and then `parse_dates` or whatever?
...the other choice is just to loads/dumps/parse? :s
Here's a biggish nested JSON (~200MB): https://github.com/hayd/sf-city-lots-json
I think I'd want to extract `['features']`.
I'm on an incredibly old MacBook Air, hence the slow timings:
```python
In [9]: %time with open('citylots.json', 'r') as f: pd.read_json(pd.io.json.dumps(extract(pd.io.json.loads(f.read()), ['features'])))
CPU times: user 29.13 s, sys: 28.27 s, total: 57.40 s
Wall time: 304.71 s

In [10]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(pd.io.json.loads(f.read()), ['features']))
CPU times: user 11.96 s, sys: 11.79 s, total: 23.75 s
Wall time: 136.50 s

In [11]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 13.47 s, sys: 10.41 s, total: 23.88 s
Wall time: 77.47 s
```
What would be an extreme case for reading in JSON?
Much better after I figured out that all I needed to do was clone the repository! (These also include full dtype conversions, FYI.)
```python
In [3]: %time with open('citylots.json', 'r') as f: pd.read_json(pd.io.json.dumps(extract(pd.io.json.loads(f.read()), ['features'])))
CPU times: user 13.03 s, sys: 0.50 s, total: 13.53 s
Wall time: 15.12 s

In [6]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(pd.io.json.loads(f.read()), ['features']))
CPU times: user 6.03 s, sys: 0.08 s, total: 6.11 s
Wall time: 6.13 s

In [7]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 6.27 s, sys: 0.16 s, total: 6.44 s
Wall time: 6.45 s
```
see #3876
Wow, I should never do any data analysis on that laptop... (sorry, I forgot that you had to clone it).
But what I mean is, you'd lose the control that the `read_json` arguments give you. It will be interesting to see if this use case comes up a lot "in the wild".
...really, this kind of stuff should be done at the other end, e.g. with http://www.elasticsearch.org/guide/reference/api/get/ (getting the `_source` directly).
I have a JSON normalization function I can clean up and make a PR from, before anyone goes crazy writing one, to save you some time. It would be nice to have a higher-performance one at some point, though.
Hey @wesm and @hayd! I've been keeping an eye on this thread for a few days now; really impressive, all the work that built up to this, thank you. Anyway, I thought you might want to know that I could go crazy writing some JSON normalization soon. Perhaps I should wait? :)
Thanks for everything.
Related: https://groups.google.com/forum/#!topic/pydata/XkiWtZKT698 (json is a list of nested dictionaries...)
Hey @nehalecky , I think @wesm says he has something in the works already, so perhaps if you can hold off til he's pushed, then you could hack that? :)
Is there a simple way of grabbing nested keys when constructing a pandas DataFrame from JSON? Using the example JSON below, how would I build a DataFrame that uses the column header `['id_str', 'text', 'user.screen_name']`? (I.e. how do I get the `'screen_name'` from the `'user'` key without flattening the JSON?) Thanks, Vishal

```python
{u'_id': ObjectId('4f8b95e8a504d022e2000000'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Mon Apr 16 03:45:44 +0000 2012',
 u'entities': {u'hashtags': [],
               u'urls': [{u'display_url': u'stks.co/3M4X',
                          u'expanded_url': u'http://stks.co/3M4X',
                          u'indices': [72, 92],
                          u'url': u'http://t.co/mGdTYusF'}],
               u'user_mentions': []},
 u'favorited': False,
 u'geo': None,
 u'id': 191734090783916032L,
 u'id_str': u'191734090783916032',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'place': None,
 u'possibly_sensitive': False,
 u'possibly_sensitive_editable': True,
 u'processed_metadata': {u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                         u'search_queries': [u'$AAPL', u'$MSFT', u'$GOOG'],
                         u'source': u'Twitter Streaming API'},
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://stocktwits.com" rel="nofollow">StockTwits Web</a>',
 u'text': u'Interesting infographic on the internet and evolution of social media \u2794 http://t.co/mGdTYusF $FB $GOOG $TWIT $LNKD $AOL',
 u'truncated': False,
 u'user': {u'_id': u'speculatethemkt',
           u'contributors_enabled': False,
           u'created_at': u'Tue Nov 30 02:28:20 +0000 2010',
           u'default_profile': False,
           u'default_profile_image': False,
           u'description': u"I'm a 22-year old full-time forex trader living the location independent lifestyle. Author of The Trading Elite \u2794 http://amzn.to/I0nacY",
           u'favourites_count': 1,
           u'follow_request_sent': None,
           u'followers_count': 19658,
           u'following': None,
           u'friends_count': 596,
           u'geo_enabled': False,
           u'id': 221226895,
           u'id_str': u'221226895',
           u'is_translator': False,
           u'lang': u'en',
           u'listed_count': 6,
           u'location': u'Portland, OR',
           u'name': u'Jared M.',
           u'notifications': None,
           u'processed_metadata': {u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                                   u'search_queries': [u'$AAPL', u'$MSFT', u'$GOOG'],
                                   u'source': u'Twitter Streaming API'},
           u'profile_background_color': u'4f4f4f',
           u'profile_background_image_url': u'http://a0.twimg.com/profile_background_images/438353849/TWITTER-BG.jpg',
           u'profile_background_image_url_https': u'https://si0.twimg.com/profile_background_images/438353849/TWITTER-BG.jpg',
           u'profile_background_tile': False,
           u'profile_image_url': u'http://a0.twimg.com/profile_images/2039368182/twitpic1_normal.png',
           u'profile_image_url_https': u'https://si0.twimg.com/profile_images/2039368182/twitpic1_normal.png',
           u'profile_link_color': u'bd0000',
           u'profile_sidebar_border_color': u'eeeeee',
           u'profile_sidebar_fill_color': u'efefef',
           u'profile_text_color': u'333333',
           u'profile_use_background_image': True,
           u'protected': False,
           u'screen_name': u'speculatethemkt',
           u'show_all_inline_media': True,
           u'statuses_count': 492,
           u'time_zone': u'Pacific Time (US & Canada)',
           u'url': u'http://www.speculatethemarkets.com',
           u'utc_offset': -28800,
           u'verified': False}}
```
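For the record, the `json_normalize` that was eventually added to pandas flattens nested dicts into dotted column names, which gives exactly that `user.screen_name` column; a sketch on a trimmed-down version of the tweet above (fields abbreviated):

```python
import pandas as pd

tweet = {
    "id_str": "191734090783916032",
    "text": "Interesting infographic ...",
    "user": {"screen_name": "speculatethemkt", "followers_count": 19658},
}

# nested dicts become dotted columns: user.screen_name, user.followers_count
df = pd.json_normalize(tweet)
print(df[["id_str", "text", "user.screen_name"]])
```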
I am trying to convert this kind of 3- or 4-level nested JSON into a pandas DataFrame with every attribute present in it. I am able to extract up to 2 levels. How can I do it for the rest?