R and Pandas instructions for using Data Package

Mikanebu commented 7 years ago

User stories

As a Consumer [R user] I want to load a Data Package from R so that I can immediately start playing with it.

As a Publisher I want to send a link to an R user colleague about how to use my data so that they can grab it and start using it

As a Consumer [Python user] I want to load a Data Package from Python using pandas, so that I can immediately start playing with it.

As a Publisher I want to send a link to a Python user colleague about how to use my data so that they can grab it and start using it.

Acceptance Criteria

[x] When I visit a showcase page there are instructions for R users (that are easy to find and link to)
[x] When I visit a showcase page there are instructions for Python users (that are easy to find and link to)

Tasks

[x] Install R
- [x] Create template instructions (and test)
- [x] Embed in page
- [x] Make sure we can direct link to it!
[x] Install Python
- [x] Create template instructions (and test)
- [x] Embed in page
- [x] Make sure we can direct link to it!

Analysis

Instructions on using R in DataHub

To use a Data Package in R, follow the instructions below:

install.packages("devtools")

library(devtools)
install_github("hadley/readr")
install_github("ropenscilabs/jsonvalidate")
install_github("ropenscilabs/datapkg")

#Load client
library(datapkg)

#Get Data Package
datapackage <- datapkg_read("https://bits.datapackaged.com/metadata/core/house-prices-us/_v/latest")

#Package info
print(datapackage)

#Open actual data in RStudio Viewer
View(datapackage$data$cities)

Instructions on using Pandas in DataHub

To generate Pandas data frames from JSON Table Schema descriptors, install the jsontableschema-pandas plugin. To load the resources of a data package as Pandas data frames, use the datapackage.push_datapackage function. The storage object it returns works as a container for the Pandas data frames.

To work with Data Packages in Pandas, install the following packages:

$ pip install datapackage
$ pip install jsontableschema-pandas

To get a Data Package, run the following code:

import datapackage

data_url = 'https://bits.datapackaged.com/metadata/core/s-and-p-500/_v/latest/datapackage.json'

# load the Data Package into a pandas storage
storage = datapackage.push_datapackage(data_url, 'pandas')
# list the datasets in this package
print(storage.buckets)
# access a dataset inside the storage, e.g. the first one
print(storage[storage.buckets[0]])
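
For readers who want a feel for what "a container of data frames keyed by resource" means, here is a minimal sketch that uses only pandas and the standard library. It is not how datapackage.push_datapackage is implemented. The descriptor, the CSV content, and the load_frames helper are hypothetical stand-ins invented for illustration, and the sketch only handles the "date" field type.

```python
import io
import json

import pandas as pd

# Hypothetical, inlined stand-ins for a datapackage.json and its CSV file.
DESCRIPTOR = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "name": "prices",
      "path": "prices.csv",
      "schema": {
        "fields": [
          {"name": "date", "type": "date"},
          {"name": "price", "type": "number"}
        ]
      }
    }
  ]
}
""")
FILES = {"prices.csv": "date,price\n2017-01-01,10.5\n2017-02-01,11.25\n"}


def load_frames(descriptor, files):
    """Return a dict mapping each resource name to a pandas DataFrame."""
    frames = {}
    for resource in descriptor["resources"]:
        fields = resource["schema"]["fields"]
        # Use the schema to decide which columns pandas should parse as dates.
        date_columns = [f["name"] for f in fields if f["type"] == "date"]
        frames[resource["name"]] = pd.read_csv(
            io.StringIO(files[resource["path"]]),
            parse_dates=date_columns,
        )
    return frames


frames = load_frames(DESCRIPTOR, FILES)
print(list(frames))
print(frames["prices"].dtypes)
```

The dict plays roughly the role that storage plays above: one data frame per resource, looked up by name.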

Mikanebu commented 7 years ago

@rufuspollock please see the analysis (instructions) above. We also discussed implementing this on the front page. We could do the following:

<pre>
  <code>
    ...

    #Get Data Package
    datapackage <- datapkg_read("https://bits.datapackaged.com/metadata/{{ publisher }}/{{ package }}/_v/latest")

    ...

    #Open actual data in RStudio Viewer
    {% for resource in dataset.resources %}
      View(datapackage$data$"{{ resource.name }}")
    {% endfor %}
  </code>
</pre>
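
To make the idea of per-package instructions concrete, here is a rough sketch of the same substitution done with plain Python string formatting instead of the page template. It is not the dpr-api implementation: render_r_instructions and BASE_URL are hypothetical names, and the publisher, package, and resource values are the ones from the R example earlier in this thread.

```python
BASE_URL = "https://bits.datapackaged.com/metadata"  # base URL used in this thread


def render_r_instructions(publisher, package, resource_names):
    """Build the R snippet for one data package (hypothetical helper)."""
    url = "{}/{}/{}/_v/latest".format(BASE_URL, publisher, package)
    lines = [
        "library(datapkg)",
        "",
        "#Get Data Package",
        'datapackage <- datapkg_read("{}")'.format(url),
        "",
        "#Open actual data in RStudio Viewer",
    ]
    for name in resource_names:
        # [[ ]] indexing also works for names that are not valid R identifiers.
        lines.append('View(datapackage$data[["{}"]])'.format(name))
    return "\n".join(lines)


snippet = render_r_instructions("core", "house-prices-us", ["cities"])
print(snippet)
```

Building the URL as a single string avoids emitting `+` between quoted fragments, which R would not accept as string concatenation.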

rufuspollock commented 7 years ago

We also discussed about implementation of this in front page. We can following:

@Mikanebu I don't think we want this on the front page - we want it on the data showcase page for each data package (and potentially also in the docs).

anuveyatsu commented 7 years ago

FIXED in https://github.com/frictionlessdata/dpr-api/commit/0581907b087db89a01deefb3d6d398c8c4e410bf
