libdynd / dynd-python

Python exposure of dynd
http://libdynd.org

Experiences with github data #369

Open mrocklin opened 9 years ago

mrocklin commented 9 years ago

OK, so I'm playing with some GitHub data. First challenge: find a correct datashape.

Fortunately, the Python datashape library has some (very non-robust) heuristics for this kind of thing.

Here is a trace of a sanitized IPython session:

    In [1]: import json

    In [2]: import gzip

    In [3]: from toolz.curried import *

    In [4]: f = gzip.open('2015-01-04-8.json.gz')

    In [5]: records = map(json.loads, f)

    In [6]: groups = groupby('type', records)

    In [7]: groups.keys()
    Out[7]: 
    [u'ReleaseEvent',
     u'PublicEvent',
     u'PullRequestReviewCommentEvent',
     u'ForkEvent',
     u'MemberEvent',
     u'PullRequestEvent',
     u'IssueCommentEvent',
     u'PushEvent',
     u'DeleteEvent',
     u'CommitCommentEvent',
     u'WatchEvent',
     u'IssuesEvent',
     u'CreateEvent',
     u'GollumEvent']

    In [8]: groups['PushEvent'][0]
    Out[8]: 
    {u'actor': {u'avatar_url': u'https://avatars.githubusercontent.com/u/5645229?',
      u'gravatar_id': u'',
      u'id': 5645229,
      u'login': u'zchan0',
      u'url': u'https://api.github.com/users/zchan0'},
     u'created_at': u'2015-01-04T08:00:02Z',
     u'id': u'2491976742',
     u'payload': {u'before': u'32d7c05c3b30a8a7bd978157234ec64341ca93e3',
      u'commits': [{u'author': {u'email': u'zchan0@users.noreply.github.com',
         u'name': u'Zchan'},
        u'distinct': True,
        u'message': u'Create README.md',
        u'sha': u'a585dd0ab0ded34e4a9cacd22ef664e75afc9e66',
        u'url': u'https://api.github.com/repos/zchan0/Calculator/commits/a585dd0ab0ded34e4a9cacd22ef664e75afc9e66'}],
      u'distinct_size': 1,
      u'head': u'a585dd0ab0ded34e4a9cacd22ef664e75afc9e66',
      u'push_id': 537858380,
      u'ref': u'refs/heads/master',
      u'size': 1},
     u'public': True,
     u'repo': {u'id': 28285661,
      u'name': u'zchan0/Calculator',
      u'url': u'https://api.github.com/repos/zchan0/Calculator'},
     u'type': u'PushEvent'}

    In [9]: from datashape import discover

    In [10]: discover(groups['PushEvent'][0])
    Out[10]: 
    dshape("""{
      actor: {
        avatar_url: string,
        gravatar_id: null,
        id: int64,
        login: string,
        url: string
        },
      created_at: datetime,
      id: int64,
      payload: {
        before: string,
        commits: 1 * {
          author: {email: string, name: string},
          distinct: bool,
          message: string,
          sha: string,
          url: string
          },
        distinct_size: int64,
        head: string,
        push_id: int64,
        ref: string,
        size: int64
        },
      public: bool,
      repo: {id: int64, name: string, url: string},
      type: string
      }""")

    In [11]: discover(groups['PushEvent'][1])
    Out[11]: 
    dshape("""{
      actor: {
        avatar_url: string,
        gravatar_id: null,
        id: int64,
        login: string,
        url: string
        },
      created_at: datetime,
      id: int64,
      payload: {
        before: string,
        commits: 1 * {
          author: {email: string, name: string},
          distinct: bool,
          message: string,
          sha: string,
          url: string
          },
        distinct_size: int64,
        head: string,
        push_id: int64,
        ref: string,
        size: int64
        },
      public: bool,
      repo: {id: int64, name: string, url: string},
      type: string
      }""")

    In [12]: discover(groups['PushEvent'][2])
    Out[12]: 
    dshape("""{
      actor: {
        avatar_url: string,
        gravatar_id: null,
        id: int64,
        login: string,
        url: string
        },
      created_at: datetime,
      id: int64,
      payload: {
        before: string,
        commits: 1 * {
          author: {email: string, name: string},
          distinct: bool,
          message: datetime,
          sha: string,
          url: string
          },
        distinct_size: int64,
        head: string,
        push_id: int64,
        ref: string,
        size: int64
        },
      public: bool,
      repo: {id: int64, name: string, url: string},
      type: string
      }""")

    In [13]: ds = """{
      actor: {
        avatar_url: string,
        gravatar_id: string,
        id: int64,
        login: string,
        url: string
        },
      created_at: datetime,
      id: int64,
      payload: {
        before: string,
        commits: var * {
          author: {email: string, name: string},
          distinct: bool,
          message: string,
          sha: string,
          url: string
          },
        distinct_size: int64,
        head: string,
        push_id: int64,
        ref: string,
        size: int64
        },
      public: bool,
      repo: {id: int64, name: string, url: string},
      type: string
      }"""

    In [14]: x = nd.array(groups['PushEvent'], dtype=ds)
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-14-60b3a629bf27> in <module>()
    ----> 1 x = nd.array(groups['PushEvent'], dtype=ds)

    NameError: name 'nd' is not defined

    In [15]: from dynd import nd
    In [16]: x = nd.array(groups['PushEvent'], dtype=ds)
    ---------------------------------------------------------------------------
    BroadcastError                            Traceback (most recent call last)
    <ipython-input-16-60b3a629bf27> in <module>()
    ----> 1 x = nd.array(groups['PushEvent'], dtype=ds)

    dynd/nd/array.pyx in dynd.nd.array.array.__init__ (/home/travis/build/libdynd/dynd-python/build/temp.linux-x86_64-2.7/array.cxx:1414)()

    BroadcastError: Input python dict has key "org", but no such field is in destination dynd type {actor : {avatar_url : string, gravatar_id : string, id : int64, login : string, url : string}, created_at : datetime, id : int64, payload : {before : string, commits : var * {author : {email : string, name : string}, distinct : bool, message : string, sha : string, url : string}, distinct_size : int64, head : string, push_id : int64, ref : string, size : int64}, public : bool, repo : {id : int64, name : string, url : string}, type : string}
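
The loading and grouping steps at the top of the session appear to be run under Python 2 with toolz. For readers on Python 3, here is a minimal stdlib-only sketch of the same idea, assuming the archive file holds one JSON object per line (which is how the session iterates over it). The function name `group_events_by_type` is made up for illustration and is not part of dynd, datashape, or toolz.

```python
import gzip
import json
from collections import defaultdict


def group_events_by_type(path):
    """Read a gzipped file of newline-delimited JSON and group records by 'type'.

    Hypothetical helper, not part of any library mentioned in this thread.
    """
    groups = defaultdict(list)
    # Text mode ("rt") yields str lines, which json.loads accepts directly.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            groups[record["type"]].append(record)
    return dict(groups)
```

Calling `group_events_by_type('2015-01-04-8.json.gz')` should give a dict keyed by event type, comparable to `groups` in the session.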

Requests

  1. Can we optionally ignore keys that I don't specify in the datashape? Failing that, can the error message recommend a datashape that would fit the data?
  2. Can we print datashapes with line breaks and indentation? There is code in the Python datashape module that might help with that.
  3. I wanted to give the commit hashes the datashape string["ascii", 40], but I couldn't find the right syntax (or perhaps this isn't supported).
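
Until something like request 1 exists, one possible workaround is to drop the unspecified keys in plain Python before handing the records to `nd.array`. The sketch below is only an illustration: the `project` function and the nested-dict `schema` format are invented here and are not dynd or datashape APIs.

```python
def project(record, schema):
    """Keep only the keys of `record` that appear in `schema`.

    `schema` is a hypothetical nested dict, not a real datashape:
      - a dict value means "recurse into this sub-record",
      - a one-element list holding a dict means "list of sub-records",
      - anything else (e.g. None) means "keep the value as-is".
    """
    out = {}
    for key, sub in schema.items():
        if key not in record:
            continue  # missing fields are left for the caller to handle
        value = record[key]
        if isinstance(sub, dict) and isinstance(value, dict):
            out[key] = project(value, sub)
        elif (isinstance(sub, list) and sub and isinstance(sub[0], dict)
              and isinstance(value, list)):
            out[key] = [project(v, sub[0]) if isinstance(v, dict) else v
                        for v in value]
        else:
            out[key] = value
    return out


# Toy example: the "org" key and a few nested extras are dropped.
schema = {"id": None,
          "repo": {"id": None, "name": None},
          "payload": {"commits": [{"sha": None}]}}
record = {"id": "1",
          "org": {"id": 7},
          "repo": {"id": 5, "name": "a/b", "url": "x"},
          "payload": {"commits": [{"sha": "abc", "message": "m"}], "size": 1}}
cleaned = project(record, schema)
```

This silently discards data, so it only makes sense when the extra keys (such as "org") really are not needed downstream.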

@izaid just mentioned the following:

"Just thinking about that now, it seems like we'd have to do a pass of the entire dataset to discover the datashape."

I am totally 100% ok with this. Performance is not yet even on my radar.
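
To make the "pass of the entire dataset" idea concrete, here is a rough pure-Python sketch. It is not how datashape's `discover` or dynd works; the names `describe`, `merge`, and `discover_all` are hypothetical. It records, for every field, the set of Python type names seen across all records, and takes the union of keys, so a field such as "org" that only some records carry still shows up in the result.

```python
def describe(value):
    """Return a rough structural description of one decoded JSON value."""
    if isinstance(value, dict):
        return {k: describe(v) for k, v in value.items()}
    if isinstance(value, list):
        merged = None  # None means "no element seen yet"
        for item in value:
            merged = merge(merged, describe(item))
        return [merged]
    return {type(value).__name__}  # leaf: a set of type names


def merge(a, b):
    """Combine two descriptions into one that covers both."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # Union of keys: a field missing on one side is simply carried over.
        return {k: merge(a.get(k), b.get(k)) for k in set(a) | set(b)}
    if isinstance(a, list) and isinstance(b, list):
        return [merge(a[0], b[0])]
    if isinstance(a, set) and isinstance(b, set):
        return a | b
    return {"<mixed>"}  # structurally incompatible, e.g. dict vs. string


def discover_all(records):
    """Single pass over all records, merging their descriptions."""
    shape = None
    for record in records:
        shape = merge(shape, describe(record))
    return shape


records = [
    {"id": 1, "msg": "a", "commits": [{"sha": "x"}]},
    {"id": 2, "msg": None, "org": {"id": 3}, "commits": []},
]
shape = discover_all(records)
```

The sketch does not track which fields are optional, and it does not try to guess dates or fixed-width strings; it only shows that one linear pass is enough to collect the union of keys and types.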

mrocklin commented 9 years ago

BTW, I'm playing with this dataset:

    wget http://data.githubarchive.org/2015-01-{01..30}-{0..23}.json.gz

izaid commented 9 years ago

@mrocklin We've still got some way to go on your requests, but ndt.json.discover is a start. I think the next step is bringing the JSON parser to Python and extending it with some of the things you ask for. The parser is already written, so it's really just a case of adding the features you want to it.