trickvi / datapackage

Manage and load dataprotocols.org Data Packages
GNU General Public License v3.0

Integration with pandas #22

Closed: psychemedia closed this issue 8 years ago

psychemedia commented 10 years ago

pandas seems to have quite a lot of traction at the moment, and provides a really powerful way of working with tabular datasets.

Having datapackage support as part of pandas.io could be really useful. Is anyone looking at getting datapackages into pandas?

trickvi commented 10 years ago

@psychemedia no, I don't think anyone has looked into it but I think it would be worth the effort. Are you interested?

psychemedia commented 10 years ago

I will try to up my game in terms of coding and see if I can do a proof of concept at least!

trickvi commented 10 years ago

That would be awesome! Just getting started is often the best motivator for finishing.

psychemedia commented 10 years ago

Trivially, for a datapackage with a single data file:

import pandas as pd

def as_pandas_DataFrame(self):
    # Build a DataFrame from the package's rows (a list of dicts)
    return pd.DataFrame(list(self.data))

will create a pandas DataFrame from the data field, but then I think work needs to be done on deciding how to map the types identified in the JSON file onto pandas column dtypes.
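
For example, a first pass at that mapping might look something like this (the dtype choices, and the apply_schema helper itself, are guesses rather than anything the spec prescribes):

import pandas as pd

# Hypothetical mapping from Data Package field types to pandas dtypes;
# the right-hand choices are assumptions, not part of the spec
DTYPE_MAP = {
    'string': object,
    'integer': 'int64',
    'number': 'float64',
    'boolean': 'bool',
}

def apply_schema(df, schema):
    # Made-up helper: cast each column named in the schema onto its
    # mapped dtype, leaving unmapped types untouched
    for field in schema['fields']:
        dtype = DTYPE_MAP.get(field['type'])
        if dtype is not None:
            df[field['id']] = df[field['id']].astype(dtype)
    return df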

psychemedia commented 10 years ago

I'm having trouble running DataPackage on py3.4 atm, so here's a doodle of a minimal datapackage-from-gist loader that crudely loads "standard" CSV data into pandas DataFrames using the core datapackage-specified types (currently number, integer and string).

This is actually the inverse of what I'd originally imagined, which was to write a datapackage from one or more pandas DataFrames (i.e. a pandas DataFrame write method); there's a sketch of that direction after the loader below.

import json

import requests
import pandas as pd
from io import StringIO

def getGist(url):
    # Fetch a gist's JSON description from the GitHub API
    r = requests.get(url)
    return json.loads(r.text)

def dataBundleFromGist(url):
    # Map core datapackage field types onto Python types for read_csv
    typeAlt = {'number': float, 'integer': int, 'string': str}

    gist = getGist(url)
    databundle = {}
    databundle['datapackage.json'] = json.loads(gist['files']['datapackage.json']['content'])
    databundle['dataframes'] = {}

    # Load each CSV resource into a typed DataFrame, keyed by its path
    for resource in databundle['datapackage.json']['resources']:
        dtypes = {}
        for col in resource['schema']['fields']:
            dtypes[col['id']] = typeAlt[col['type']]
        databundle['dataframes'][resource['path']] = pd.read_csv(
            StringIO(gist['files'][resource['path']]['content']),
            dtype=dtypes)

    return databundle
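
For the inverse direction mentioned above, a rough sketch might look something like this; the function name, file layout and dtype-kind mapping are all just guesses at how it could work, not a worked-out writer:

import json

import pandas as pd

# Rough inverse mapping from pandas dtype kinds to datapackage field
# types; my assumption, not anything from the spec
KIND_MAP = {'i': 'integer', 'f': 'number', 'O': 'string', 'b': 'boolean'}

def dataPackageFromDataFrame(df, name, path='.'):
    # Hypothetical writer: dump the data as CSV alongside a minimal
    # datapackage.json describing its inferred schema
    csvname = name + '.csv'
    df.to_csv('{}/{}'.format(path, csvname), index=False)

    fields = [{'id': col, 'type': KIND_MAP.get(df[col].dtype.kind, 'string')}
              for col in df.columns]
    descriptor = {
        'name': name,
        'resources': [{'path': csvname, 'schema': {'fields': fields}}],
    }
    with open('{}/datapackage.json'.format(path), 'w') as f:
        json.dump(descriptor, f, indent=2)
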
femtotrader commented 9 years ago

Pandas shouldn't be a dependency (unless you have a good reason to include it).

If Pandas is installed and the user wants a Pandas DataFrame, they should get it; but if Pandas is not installed (or the user doesn't want a DataFrame), it should return raw Python types (lists, dicts, strings, ...).

# Optional import: just record whether pandas is available
try:
    import pandas as pd
    _HAS_PANDAS = True
except ImportError:
    _HAS_PANDAS = False

Could be used to keep pandas optional.
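
Building on that guard, an optional export method might then look something like this (the method name and its enclosing class are hypothetical, not the library's actual API):

def as_dataframe(self):
    # Hypothetical method on the package class; relies on the
    # _HAS_PANDAS flag and pd import from the guard above
    if not _HAS_PANDAS:
        raise ImportError('pandas is required for as_dataframe(); '
                          'use the raw lists/dicts instead')
    return pd.DataFrame(list(self.data))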

vitorbaptista commented 8 years ago

The new datapackage library (https://github.com/frictionlessdata/datapackage-py) doesn't explicitly support exporting to Pandas either. However, it should be easy to create a DataFrame from the dicts/lists that it exposes. Unless that flow turns out to be difficult, I'd rather keep the core tool simple.
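
For instance, assuming each resource exposes its parsed rows as a list of dicts, the conversion is nearly a one-liner (the attribute names below are my guesses at the API, so treat this as a sketch rather than working code):

import pandas as pd

def frames_from_package(pkg):
    # Hypothetical helper: assumes pkg.resources yields objects with a
    # `descriptor` dict and a `data` list of row dicts (unverified names)
    return {res.descriptor['path']: pd.DataFrame(res.data)
            for res in pkg.resources}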

I'm open for suggestions, though. Feel free to open an issue on https://github.com/frictionlessdata/datapackage-py about this and we can discuss over there. Meanwhile, I suggest we close this issue.

femtotrader commented 8 years ago

Maybe we should have a https://github.com/frictionlessdata/pandas-datapackage project which depends on https://github.com/frictionlessdata/datapackage-py.

vitorbaptista commented 8 years ago

We do have a structure for importing/exporting data to different storage backends (like BigQuery, SQLAlchemy, etc.). Pandas might be a special case, given that exporting to it might be very simple, or maybe not. Regardless, I created a new issue to track this at https://github.com/frictionlessdata/datapackage-py/issues/73, so we can talk there and close this one.

trickvi commented 8 years ago

Since the package will be maintained by the frictionlessdata project, where this issue is now tracked, I'm closing this issue.