Closed psychemedia closed 8 years ago
@psychemedia no, I don't think anyone has looked into it but I think it would be worth the effort. Are you interested?
I will try to up my game in terms of coding and see if I can do a proof of concept at least!
That would be awesome! Just getting started is often the best motivator for finishing
Trivially, for a datapackage with a single data file:
```python
import pandas as pd

def as_pandas_DataFrame(self):
    # Assumes self.data yields the resource's rows as dicts
    return pd.DataFrame(list(self.data))
```
will create a pandas DataFrame from the data field, but then I think work needs to be done on deciding how to map the types identified in the JSON file onto pandas column dtypes?
I'm having trouble running DataPackage on py3.4 atm, so here's a doodle of a minimal datapackage_from_gist loader that crudely loads "standard" CSV data into pandas DataFrames using the core datapackage-specified types (currently number, integer and string).
This is actually the inverse of what I'd originally imagined, which was to write a datapackage from one or more pandas dataframes (i.e. a pandas DataFrame write method).
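As a rough sketch of that inverse flow, a minimal datapackage.json descriptor could be derived from a DataFrame by mapping pandas dtype kinds back onto Data Package field types. The helper name and the kind-to-type table below are illustrative assumptions, not part of any existing library:

```python
import json

import pandas as pd

# Assumed mapping from pandas dtype "kind" codes to core Data Package types
KIND_TO_TYPE = {'i': 'integer', 'u': 'integer', 'f': 'number',
                'b': 'boolean', 'O': 'string'}

def datapackage_from_dataframe(df, name, path):
    # Infer a Data Package field type for each column from its pandas dtype
    fields = [{'id': col, 'type': KIND_TO_TYPE.get(df[col].dtype.kind, 'string')}
              for col in df.columns]
    return {'name': name,
            'resources': [{'path': path, 'schema': {'fields': fields}}]}

df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})
descriptor = datapackage_from_dataframe(df, 'example', 'data.csv')
print(json.dumps(descriptor, indent=2))
```

A real write method would also need to serialise each resource's rows to CSV alongside the descriptor; this only covers the schema side.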
```python
import json
from io import StringIO

import pandas as pd
import requests

def getGist(url):
    # Fetch the gist (metadata plus file contents) from the GitHub API
    r = requests.get(url)
    return json.loads(r.text)

def dataBundleFromGist(url):
    # Map core Data Package field types onto Python types for read_csv
    typeAlt = {'number': float, 'integer': int, 'string': str}
    gist = getGist(url)
    databundle = {}
    databundle['datapackage.json'] = json.loads(gist['files']['datapackage.json']['content'])
    databundle['dataframes'] = {}
    for csv in databundle['datapackage.json']['resources']:
        # Build a column dtype mapping from the resource's schema
        dtypes = {}
        for col in csv['schema']['fields']:
            dtypes[col['id']] = typeAlt[col['type']]
        databundle['dataframes'][csv['path']] = pd.read_csv(
            StringIO(gist['files'][csv['path']]['content']), dtype=dtypes)
    return databundle
```
Pandas shouldn't be a dependency (unless you have a good reason to include it). If Pandas is installed and the user wants a Pandas DataFrame, they should get one; but if Pandas is not installed (or the user doesn't want a DataFrame), it should return raw Python types (lists, dicts, strings, ...).
```python
try:
    import pandas as pd
    _HAS_PANDAS = True
except ImportError:
    _HAS_PANDAS = False
```

could be used.
The new datapackage library (https://github.com/frictionlessdata/datapackage-py) doesn't explicitly support exporting to Pandas either. However, it should be easy to create a DataFrame from the dicts/lists that it exposes. Unless that flow is difficult, I'd rather keep the core tool simple.
I'm open to suggestions, though. Feel free to open an issue on https://github.com/frictionlessdata/datapackage-py about this and we can discuss over there. Meanwhile, I suggest we close this issue.
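That dicts-to-DataFrame step might look something like the sketch below. The rows and schema here are made-up illustrative data shaped like what a datapackage library might expose, not actual datapackage-py output:

```python
import pandas as pd

# Illustrative rows and schema; note the values arrive as strings,
# as they would from a parsed CSV
rows = [{'name': 'alice', 'score': '3'}, {'name': 'bob', 'score': '5'}]
schema = {'fields': [{'id': 'name', 'type': 'string'},
                     {'id': 'score', 'type': 'integer'}]}

# Build the frame from the raw rows, then coerce each column to the
# Python type implied by its schema field
df = pd.DataFrame(rows)
type_map = {'number': float, 'integer': int, 'string': str}
for field in schema['fields']:
    df[field['id']] = df[field['id']].astype(type_map[field['type']])
```

So the conversion itself is a few lines of user code, which supports keeping it out of the core library.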
Maybe we should have a https://github.com/frictionlessdata/pandas-datapackage project which depends on https://github.com/frictionlessdata/datapackage-py
We do have a structure for importing/exporting data to different storages (like BigQuery, SQLAlchemy, etc.). Pandas might be a special case, as exporting to it could turn out to be very simple, or maybe not. Regardless, I created a new issue to track this in https://github.com/frictionlessdata/datapackage-py/issues/73 so we can talk there and close this one.
Since the package will be maintained by the frictionlessdata project, where this issue is tracked, I'm closing this issue.
pandas seems to have quite a lot of traction at the moment, and provides a really powerful way of working with tabular datasets. Having datapackage support as part of `pandas.io` could be really useful? Is anyone looking at getting datapackages into pandas?