Handling of nested JSON records #1067

wesm closed 11 years ago

wesm commented 12 years ago
Is there a simple way of grabbing nested keys when constructing a pandas DataFrame from JSON? Using the example JSON below, how would I build a DataFrame that uses column_header = ['id_str', 'text', 'user.screen_name'] (i.e., how do I get 'screen_name' from the 'user' key without flattening the JSON)?



{   u'_id': ObjectId('4f8b95e8a504d022e2000000'),
    u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Mon Apr 16 03:45:44 +0000 2012',
    u'entities': {   u'hashtags': [],
                     u'urls': [   {   u'display_url': u'',
                                      u'expanded_url': u'',
                                      u'indices': [72, 92],
                                      u'url': u''}],
                     u'user_mentions': []},
    u'favorited': False,
    u'geo': None,
    u'id': 191734090783916032L,
    u'id_str': u'191734090783916032',
    u'in_reply_to_screen_name': None,
    u'in_reply_to_status_id': None,
    u'in_reply_to_status_id_str': None,
    u'in_reply_to_user_id': None,
    u'in_reply_to_user_id_str': None,
    u'place': None,
    u'possibly_sensitive': False,
    u'possibly_sensitive_editable': True,
    u'processed_metadata': {   u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                               u'search_queries': [   u'$AAPL',
                               u'source': u'Twitter Streaming API'},
    u'retweet_count': 0,
    u'retweeted': False,
    u'source': u'<a href="" rel="nofollow">StockTwits Web</a>',
    u'text': u'Interesting infographic on the internet and evolution of social media \u2794 $FB $GOOG $TWIT $LNKD $AOL',
    u'truncated': False,
    u'user': {   u'_id': u'speculatethemkt',
                 u'contributors_enabled': False,
                 u'created_at': u'Tue Nov 30 02:28:20 +0000 2010',
                 u'default_profile': False,
                 u'default_profile_image': False,
                 u'description': u"I'm a 22-year old full-time forex trader living the location independent lifestyle. Author of The Trading Elite \u2794",
                 u'favourites_count': 1,
                 u'follow_request_sent': None,
                 u'followers_count': 19658,
                 u'following': None,
                 u'friends_count': 596,
                 u'geo_enabled': False,
                 u'id': 221226895,
                 u'id_str': u'221226895',
                 u'is_translator': False,
                 u'lang': u'en',
                 u'listed_count': 6,
                 u'location': u'Portland, OR',
                 u'name': u'Jared M.',
                 u'notifications': None,
                 u'processed_metadata': {   u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                                            u'search_queries': [   u'$AAPL',
                                            u'source': u'Twitter Streaming API'},
                 u'profile_background_color': u'4f4f4f',
                 u'profile_background_image_url': u'',
                 u'profile_background_image_url_https': u'',
                 u'profile_background_tile': False,
                 u'profile_image_url': u'',
                 u'profile_image_url_https': u'',
                 u'profile_link_color': u'bd0000',
                 u'profile_sidebar_border_color': u'eeeeee',
                 u'profile_sidebar_fill_color': u'efefef',
                 u'profile_text_color': u'333333',
                 u'profile_use_background_image': True,
                 u'protected': False,
                 u'screen_name': u'speculatethemkt',
                 u'show_all_inline_media': True,
                 u'statuses_count': 492,
                 u'time_zone': u'Pacific Time (US & Canada)',
                 u'url': u'',
                 u'utc_offset': -28800,
                 u'verified': False}}
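
One way to get those three columns without flattening the whole record is to resolve each dotted column name against the nested dict. This is only a minimal sketch, not a pandas feature: `get_path` and `select_columns` are hypothetical helper names, and the sample record is trimmed to the fields in question.

```python
from functools import reduce


def get_path(record, dotted_key, default=None):
    """Follow a dotted key such as 'user.screen_name' through nested dicts."""
    def step(node, key):
        # Stop descending as soon as a level is missing or is not a dict.
        return node.get(key, default) if isinstance(node, dict) else default
    return reduce(step, dotted_key.split("."), record)


def select_columns(records, columns):
    """Build one flat row per record, keeping only the requested columns."""
    return [{col: get_path(rec, col) for col in columns} for rec in records]


# Trimmed version of the tweet above (hypothetical subset of its fields).
tweets = [{
    "id_str": "191734090783916032",
    "text": "Interesting infographic on the internet and evolution of social media",
    "user": {"screen_name": "speculatethemkt", "followers_count": 19658},
}]

column_header = ["id_str", "text", "user.screen_name"]
rows = select_columns(tweets, column_header)
print(rows[0]["user.screen_name"])
```

The resulting list of flat dicts can then be handed to `pandas.DataFrame(rows, columns=column_header)`.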
hayd commented 11 years ago


jreback commented 11 years ago

This is invalid JSON (according to jsonlint). Generalized inference is IMHO too complicated, but #3804 should be able to do some of this. Close this issue?

hayd commented 11 years ago

Is it feasible to grab the user section, (actually this example from the other thread is better):

Convert the (data, posts) section to DataFrame

s = r'''{
    "status": "success",
    "data": {
        "posts": [
            {
                "id": 1,
                "title": "A blog post",
                "body": "Some useful content"
            },
            {
                "id": 2,
                "title": "Another blog post",
                "body": "More content"
            }
        ]
    }
}'''

read_json(s, grab_nest=('data', 'posts'))  # some better argument name
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post
jreback commented 11 years ago

This is probably getting too cute.....

In [52]: def extract(df, l):
   ....:     for e in l:
   ....:         df = df[e]
   ....:     return df

In [54]: DataFrame.extract = extract

In [56]: DataFrame(pd.read_json(s).extract(['data','posts']))
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post
hayd commented 11 years ago

Ha! Perhaps less overhead to do pd.DataFrame(extract(, ('data', 'posts'))), but in either case we would lose the datetime parsing atm.
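
Walking the parsed JSON instead of a DataFrame might look like the following sketch. It assumes the document is parsed with the standard-library `json.loads`; the first argument of the call above is not shown in the thread, so that choice is an assumption. The JSON string is a compact copy of the blog-post example.

```python
import json


def extract(obj, keys):
    """Descend through nested containers, one key (or list index) at a time."""
    for key in keys:
        obj = obj[key]
    return obj


s = r'''{
    "status": "success",
    "data": {"posts": [
        {"id": 1, "title": "A blog post", "body": "Some useful content"},
        {"id": 2, "title": "Another blog post", "body": "More content"}
    ]}
}'''

# Parse once, then walk down to the nested list of records.
posts = extract(json.loads(s), ("data", "posts"))
print(posts)
```

`posts` is a list of flat dicts, which `pandas.DataFrame(posts)` accepts directly. This path bypasses `read_json`, so its dtype and date handling are not applied.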

hayd commented 11 years ago

This could be a reasonably ok solution... tricky with orient (?), then parse_dates or whatever?

...other choice is just to loads/dumps/parse? :s

hayd commented 11 years ago

here's a bigish nested json: ~200mb

I think I'd want to extract ['features'].

I'm on an incredibly old macbook air, hence slow timings:

In [9]: %time with open('citylots.json', 'r') as f: pd.read_json(, ['features'])))
CPU times: user 29.13 s, sys: 28.27 s, total: 57.40 s
Wall time: 304.71 s

In [10]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(, ['features']))
CPU times: user 11.96 s, sys: 11.79 s, total: 23.75 s
Wall time: 136.50 s

In [11]: %time with open('citylots.json', 'r') as f: pd.read_json(
CPU times: user 13.47 s, sys: 10.41 s, total: 23.88 s
Wall time: 77.47 s

What would count as an extreme case for reading in JSON?

jreback commented 11 years ago

After I figured out all I needed to do was clone the repository! (these also include full dtype conversions) FYI

In [3]: %time with open('citylots.json', 'r') as f: pd.read_json(, ['features'])))
CPU times: user 13.03 s, sys: 0.50 s, total: 13.53 s
Wall time: 15.12 s

In [6]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(, ['features']))
CPU times: user 6.03 s, sys: 0.08 s, total: 6.11 s
Wall time: 6.13 s

In [7]: %time with open('citylots.json', 'r') as f: pd.read_json(
CPU times: user 6.27 s, sys: 0.16 s, total: 6.44 s
Wall time: 6.45 s
jreback commented 11 years ago

see #3876

hayd commented 11 years ago

Wow, I should never do any data analysis on that laptop... (sorry I forgot that you had to clone it).

But what I mean is, you'd lose the control from the read_json arguments. Will be interesting to see if this use case comes up a lot "in the wild".

... really, this kind of stuff should be done at the other end, e.g. with (getting the _source directly).

wesm commented 11 years ago

I have a JSON normalization function I can clean up and make a PR for before anyone goes crazy writing one, to save you some time. It would be nice to have a higher-performance one at some point, though.

nehalecky commented 11 years ago

Hey @wesm and @hayd! I've been keeping my eye on this thread for a few days now; really impressive, all the work that built up to this, thank you. Anyway, I thought you might like to know that I could go crazy writing some JSON normalization soon. Perhaps I should wait? :)

Thanks for all.

hayd commented 11 years ago

Related:!topic/pydata/XkiWtZKT698 (json is a list of nested dictionaries...)

Hey @nehalecky, I think @wesm says he has something in the works already, so perhaps if you can hold off until he's pushed, then you could hack on that? :)

I am trying to convert this kind of JSON, nested 3 or 4 levels deep, into a pandas DataFrame with every attribute present in it. I am able to extract up to 2 levels. How can I handle the rest?
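
For records nested three or four levels deep, one option is to flatten each record recursively, joining the keys with a separator. The sketch below is illustrative only: `flatten` is a hypothetical helper, the sample record is made up for the example, and lists are kept as values rather than expanded. pandas also provides `json_normalize`, which flattens nested dicts in a similar dotted-key style.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts of any depth into one dict with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so deeper levels get keys such as 'a.b.c.d'.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are stored unchanged under the joined key.
            flat[full_key] = value
    return flat


# Made-up record with four levels of nesting under 'user'.
record = {
    "id": 1,
    "user": {"name": "Jared", "profile": {"location": {"city": "Portland"}}},
    "tags": ["a", "b"],
}

row = flatten(record)
print(sorted(row))
```

Applying `flatten` to every record gives a list of flat dicts, one per row, that `pandas.DataFrame` can consume.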