R package for working with Frictionless Data Packages.
- Package: class for working with data packages
- Resource: class for working with data resources
- Profile: class for working with profiles
- validate: function for validating data package descriptors
- infer: function for inferring data package descriptors
To install the latest distribution of R on your computer, select one of the mirror sites of the Comprehensive R Archive Network, choose the link for your operating system, and follow the wizard instructions.
Windows users can:
(Mac) OS X and Linux users may need to follow different steps depending on their system version to install R successfully, so it is recommended to read the instructions on the CRAN site carefully.
Even more detailed installation instructions can be found in the R Installation and Administration manual.
To install RStudio, you can download RStudio Desktop with Open Source License and follow the wizard instructions:
To install the datapackage package, first install the devtools package, which makes it possible to install packages from GitHub.
# Install the devtools package if it is not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
Install datapackage.r
# And then install the development version from github
devtools::install_github("frictionlessdata/datapackage-r")
# load the package using
library(datapackage.r)
The code examples in this readme require R 3.3 or higher. You can find more examples in the vignettes directory.
descriptor <- '{
"resources": [
{
"name": "example",
"profile": "tabular-data-resource",
"data": [
["height", "age", "name"],
[180, 18, "Tony"],
[192, 32, "Jacob"]
],
"schema": {
"fields": [
{"name": "height", "type": "integer" },
{"name": "age", "type": "integer" },
{"name": "name", "type": "string" }
]
}
}
]
}'
dataPackage <- Package.load(descriptor)
dataPackage
## <Package>
## Public:
## addResource: function (descriptor)
## clone: function (deep = FALSE)
## commit: function (strict = NULL)
## descriptor: active binding
## errors: active binding
## getResource: function (name)
## infer: function (pattern)
## initialize: function (descriptor = list(), basePath = NULL, strict = FALSE,
## profile: active binding
## removeResource: function (name)
## resourceNames: active binding
## resources: active binding
## save: function (target, type = "json")
## valid: active binding
## Private:
## basePath_: C:/Users/akis_/Documents/datapackage-r
## build_: function ()
## currentDescriptor_: list
## currentDescriptor_json: NULL
## descriptor_: NULL
## errors_: list
## nextDescriptor_: list
## pattern_: NULL
## profile_: Profile, R6
## resources_: list
## resources_length: NULL
## strict_: FALSE
resource <- dataPackage$getResource('example')
# convert to json and add indentation with jsonlite prettify function
jsonlite::prettify(helpers.from.list.to.json(resource$read()))
## [
## [
## 180,
## 18,
## "Tony"
## ],
## [
## 192,
## 32,
## "Jacob"
## ]
## ]
##
JSON objects are not among R’s base data types. The jsonlite package is used internally to convert JSON data to list objects. Function inputs can be JSON strings, files or lists, and outputs are lists, so that you can easily process your data further in R and export it as desired. The examples below show how to use the jsonlite package to convert the output back to JSON with indentation whitespace. For more details about handling JSON, see the jsonlite documentation or vignettes.
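As a minimal sketch of that round trip, assuming only that the jsonlite package is installed (the JSON string below is made up for illustration), fromJSON turns a JSON string into an R list and toJSON turns the list back into indented JSON:

```r
library(jsonlite)

# Hypothetical JSON string, used only for illustration
json_text <- '{"name": "example", "resources": [{"name": "cities", "format": "csv"}]}'

# JSON -> R list; simplifyVector = FALSE keeps nested arrays as plain lists
as_list <- fromJSON(json_text, simplifyVector = FALSE)
as_list$resources[[1]]$name

# R list -> JSON; auto_unbox = TRUE writes length-one vectors as scalars
as_json <- toJSON(as_list, pretty = TRUE, auto_unbox = TRUE)
as_json
```

Without auto_unbox = TRUE, jsonlite writes every length-one value as a one-element array, which is why the outputs in this readme show values such as ["utf-8"].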
A class for working with data packages. It provides various capabilities such as loading a local or remote data package, inferring a data package descriptor, saving a data package descriptor and many more.
Consider we have some local csv files in a data directory. Let’s create a data package based on this data using the Package class:
inst/extdata/readme_example/cities.csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
inst/extdata/readme_example/population.csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
First we create a blank data package:
dataPackage <- Package.load()
Now we’re ready to infer a data package descriptor based on the data files we have. Because we have two csv files we use the glob pattern csv:
jsonlite::toJSON(dataPackage$infer('csv'), pretty = TRUE)
## {
## "profile": ["tabular-data-package"],
## "resources": [
## {
## "path": ["cities.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["cities"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## },
## {
## "path": ["population.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["population"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["year"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["population"],
## "type": ["integer"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## }
## ]
## }
jsonlite::toJSON(dataPackage$descriptor, pretty = TRUE)
## {
## "profile": ["tabular-data-package"],
## "resources": [
## {
## "path": ["cities.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["cities"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## },
## {
## "path": ["population.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["population"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["year"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["population"],
## "type": ["integer"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## }
## ]
## }
The infer method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema etc. Let’s tweak it a little bit:
dataPackage$descriptor$resources[[2]]$schema$fields[[2]]$type <- 'year'
dataPackage$commit()
## [1] TRUE
dataPackage$valid
## [1] TRUE
Because our resources are tabular, we can read them as tabular data:
jsonlite::toJSON(dataPackage$getResource("population")$read(keyed = TRUE),auto_unbox = FALSE,pretty = TRUE)
## [
## {
## "city": ["london"],
## "year": [2017],
## "population": [8780000]
## },
## {
## "city": ["paris"],
## "year": [2017],
## "population": [2240000]
## },
## {
## "city": ["rome"],
## "year": [2017],
## "population": [2860000]
## }
## ]
Let’s save our descriptor to disk. After that we can update our datapackage.json as we want, make some changes etc:
dataPackage$save('datapackage.json')
To continue the work with the data package we just load it again, but this time using the local datapackage.json:
dataPackage <- Package.load('datapackage.json')
# Continue the work
This was a basic introduction to the Package class. To learn more, take a look at the Package class API reference.
A class for working with data resources. You can read or iterate tabular resources using the iter/read methods, and read any resource as bytes using the rawIter/rawRead methods.
Consider we have a local csv file. It could also be inline data or a remote link; all of these are supported by the Resource class (except local files for in-browser usage, of course). But say it’s cities.csv for now:
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
Let’s create and read a resource. We use the static Resource.load method to instantiate a resource. Because the resource is tabular we can use the resource$read method with the keyed option to get a list of keyed rows:
resource <- Resource.load('{"path": "cities.csv"}')
resource$tabular
## [1] TRUE
jsonlite::toJSON(resource$read(keyed = TRUE), pretty = TRUE)
## [
## {
## "city": ["london"],
## "location": ["\"51.50 -0.11\""]
## },
## {
## "city": ["paris"],
## "location": ["\"48.85 2.30\""]
## },
## {
## "city": ["rome"],
## "location": ["\"41.89 12.51\""]
## }
## ]
As we can see, our locations are just strings, but they should be geopoints. Also, Rome’s location is not available, yet it is just an N/A string instead of null. First we have to infer the resource metadata:
jsonlite::toJSON(resource$infer(), pretty = TRUE)
## {
## "path": ["cities.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["cities"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## }
jsonlite::toJSON(resource$descriptor, pretty = TRUE)
## {
## "path": ["cities.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["cities"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## }
# resource$read( keyed = TRUE )
# # Fails with a data validation error
Let’s fix the unavailable location. There is a missingValues property in the Table Schema specification. As a first try we set missingValues to N/A in resource$descriptor$schema. The resource descriptor can be changed in place, but all changes should be committed with resource$commit():
resource$descriptor$schema$missingValues <- 'N/A'
resource$commit()
## [1] TRUE
resource$valid # FALSE
## [1] FALSE
resource$errors
## [[1]]
## [1] "Descriptor validation error:\n data.schema.missingValues - is the wrong type"
As good citizens we’ve decided to check the resource descriptor’s validity. And it’s not valid! We should use a list for the missingValues property. Also, don’t forget to include an empty string as a missing value:
resource$descriptor$schema[['missingValues']] <- list('', 'N/A')
resource$commit()
## [1] TRUE
resource$valid # TRUE
## [1] TRUE
All good. It looks like we’re ready to read our data again:
jsonlite::toJSON(resource$read( keyed = TRUE ), pretty = TRUE)
## [
## {
## "city": ["london"],
## "location": ["\"51.50 -0.11\""]
## },
## {
## "city": ["paris"],
## "location": ["\"48.85 2.30\""]
## },
## {
## "city": ["rome"],
## "location": ["\"41.89 12.51\""]
## }
## ]
Now we see that:
- locations are lists with numeric latitude and longitude
- Rome’s location is a null value
And because there were no errors while reading the data, we can be sure that our data is valid against our schema. Let’s save our resource descriptor:
resource$save('dataresource.json')
Let’s check the newly created dataresource.json. It contains the path to our data file, the inferred metadata and our missingValues tweak:
{
"path": "data.csv",
"profile": "tabular-data-resource",
"encoding": "utf-8",
"name": "data",
"format": "csv",
"mediatype": "text/csv",
"schema": {
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}
}
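To picture what the missingValues and geopoint settings in this descriptor mean for a single column, here is a rough base R sketch. It is an illustration of the idea only, not the package’s implementation; parse_location is a hypothetical helper and the raw values are copied from the cities.csv listing above:

```r
# Raw string cells as they appear in the location column of cities.csv
raw_locations <- c("51.50,-0.11", "48.85,2.30", "N/A")

# Same values as the missingValues property in the descriptor
missing_values <- c("", "N/A")

# Hypothetical helper (not part of datapackage.r): turn one raw cell into
# a numeric pair, or NA when the cell is listed as a missing value
parse_location <- function(cell, missing_values) {
  if (cell %in% missing_values) {
    return(NA)
  }
  as.numeric(strsplit(cell, ",", fixed = TRUE)[[1]])
}

locations <- lapply(raw_locations, parse_location, missing_values = missing_values)
str(locations)
```

The first two cells become numeric pairs, while the N/A cell becomes NA because it matches an entry in missing_values.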
If we decide to improve it even more, we can update the dataresource.json file and then open it again using the local file name:
resource <- Resource.load('dataresource.json')
# Continue the work
This was a basic introduction to the Resource class. To learn more, take a look at the Resource class API reference.
A component to represent JSON Schema profile from Profiles Registry:
profile <- Profile.load('data-package')
profile$name # data-package
## [1] "data-package"
profile$jsonschema # List of JSON Schema contents
valid_errors <- profile$validate(descriptor)
valid <- valid_errors$valid # TRUE if valid descriptor
valid
## [1] TRUE
A standalone function to validate a data package descriptor:
valid_errors <- validate('{"name": "Invalid Datapackage"}')
A standalone function to infer a data package descriptor.
descriptor <- infer("csv",basePath = '.')
jsonlite::toJSON(descriptor, pretty = TRUE)
## {
## "profile": ["tabular-data-package"],
## "resources": [
## {
## "path": ["cities.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["cities"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## },
## {
## "path": ["population.csv"],
## "profile": ["tabular-data-resource"],
## "encoding": ["utf-8"],
## "name": ["population"],
## "format": ["csv"],
## "mediatype": ["text/csv"],
## "schema": {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["year"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["population"],
## "type": ["integer"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
## }
## ]
## }
The package supports foreign keys as described in the Table Schema specification. This means that if your data package descriptor uses the resources[]$schema$foreignKeys property for some resources, data integrity will be checked on reading operations.
Consider we have a data package:
DESCRIPTOR <- '{
"resources": [
{
"name": "teams",
"data": [
["id", "name", "city"],
["1", "Arsenal", "London"],
["2", "Real", "Madrid"],
["3", "Bayern", "Munich"]
],
"schema": {
"fields": [
{"name": "id", "type": "integer"},
{"name": "name", "type": "string"},
{"name": "city", "type": "string"}
],
"foreignKeys": [
{
"fields": "city",
"reference": {"resource": "cities", "fields": "name"}
}
]
}
}, {
"name": "cities",
"data": [
["name", "country"],
["London", "England"],
["Madrid", "Spain"]
]
}
]
}'
Let’s check relations for the teams resource:
package <- Package.load(DESCRIPTOR)
teams <- package$getResource('teams')
teams$checkRelations()
## Error: Foreign key 'city' violation in row '4'
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
As we can see, there is a foreign key violation. That’s because our lookup table cities doesn’t contain the city of Munich, but we have a team from there. We need to fix it in the cities resource:
package$descriptor$resources[[2]]$data <- rlist::list.append(package$descriptor$resources[[2]]$data, list('Munich', 'Germany'))
package$commit()
## [1] TRUE
teams <- package$getResource('teams')
teams$checkRelations()
## [1] TRUE
# TRUE
Fixed! But checking is not the only operation available. We can use the relations argument of the resource$iter/read methods to dereference resource relations:
jsonlite::toJSON(teams$read(keyed = TRUE, relations = FALSE), pretty = TRUE)
## [
## {
## "id": [1],
## "name": ["Arsenal"],
## "city": ["London"]
## },
## {
## "id": [2],
## "name": ["Real"],
## "city": ["Madrid"]
## },
## {
## "id": [3],
## "name": ["Bayern"],
## "city": ["Munich"]
## }
## ]
With relations = TRUE, instead of a plain city name we get a list containing the city data. These resource$iter/read methods will fail with the same error as resource$checkRelations if there is an integrity issue, but only if the relations = TRUE flag is passed.
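The integrity check and the dereferencing described in this section can be pictured with plain data frames. The following is a rough base R sketch, not the package’s implementation; fk_violations is a hypothetical helper and the data frames mirror the teams and cities resources above:

```r
teams <- data.frame(
  id = 1:3,
  name = c("Arsenal", "Real", "Bayern"),
  city = c("London", "Madrid", "Munich"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  name = c("London", "Madrid"),
  country = c("England", "Spain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: indices of rows in `child` whose key value
# has no match in the key column of `parent`
fk_violations <- function(child, child_key, parent, parent_key) {
  which(!(child[[child_key]] %in% parent[[parent_key]]))
}

# The third data row (Bayern, Munich) has no matching city
before <- fk_violations(teams, "city", cities, "name")

# After adding the missing city there are no violations left
cities <- rbind(
  cities,
  data.frame(name = "Munich", country = "Germany", stringsAsFactors = FALSE)
)
after <- fk_violations(teams, "city", cities, "name")

# Dereferencing: attach the referenced city row to each team
joined <- merge(teams, cities, by.x = "city", by.y = "name")
joined
```

The sketch counts data rows only, so the violating row has index 3 here, whereas the error message shown earlier refers to row 4.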
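Package representation
- valid: Boolean
- errors: List.<Error>
- profile: Profile
- descriptor: Object
- resources: List.<Resource>
- resourceNames: List.<string>
- getResource(name): Resource | null
- addResource(descriptor): Resource
- removeResource(name): Resource | null
- infer(pattern): Object
- commit(strict): Boolean
- Package.load(descriptor, basePath, strict): Package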
valid: Boolean
Validation status. It is always true in strict mode.
Returns: Boolean - validation status

errors: List.<Error>
Validation errors. It is always empty in strict mode.
Returns: List.<Error> - validation errors

profile: Profile
Profile

descriptor: Object
Descriptor
Returns: Object - schema descriptor

resources: List.<Resource>
Resources

resourceNames: List.<string>
Resource names

getResource(name): Resource | null
Return a resource
Returns: Resource | null - resource instance if it exists
Param | Type |
---|---|
name | string |

addResource(descriptor): Resource
Add a resource
Returns: Resource - added resource instance
Param | Type |
---|---|
descriptor | Object |

removeResource(name): Resource | null
Remove a resource
Returns: Resource | null - removed resource instance if it exists
Param | Type |
---|---|
name | string |

infer(pattern): Object
Infer metadata
Param | Type | Default |
---|---|---|
pattern | string | false |

commit(strict): Boolean
Update package instance if there are in-place changes in the descriptor.
Returns: Boolean - true on success and false if not modified
Throws: DataPackageError - raises any error that occurred in the process
Param | Type | Description |
---|---|---|
strict | boolean | alter strict mode for further work |
Example
dataPackage <- Package.load('{
"name": "package",
"resources": [{"name": "resource", "data": ["data"]}]
}')
dataPackage$descriptor$name # package
## [1] "package"
dataPackage$descriptor$name <- 'renamed-package'
dataPackage$descriptor$name # renamed-package
## [1] "renamed-package"
dataPackage$commit()
## [1] TRUE
save(target)
Save data package to target destination.
If the target path has a zip file extension, the package will be zipped and saved entirely. If it has a json file extension, only the descriptor will be saved.
Returns: boolean - true on success
Throws: DataPackageError - error if something goes wrong
Param | Type | Description |
---|---|---|
target | string | path where to save a data package |
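The extension rule described for save can be sketched in base R as follows. This is only an illustration of the rule as stated above: save_mode is a hypothetical helper, not part of datapackage.r, and raising an error for other extensions is an assumption of the sketch rather than documented behavior:

```r
# Hypothetical helper (not part of datapackage.r): report what would be
# written for a given target path, based only on its file extension
save_mode <- function(target) {
  ext <- tolower(tools::file_ext(target))
  switch(ext,
    zip = "whole package, zipped",
    json = "descriptor only",
    # Assumption of this sketch: anything else is rejected
    stop("unsupported target extension: ", ext)
  )
}

save_mode("datapackage.json")
save_mode("datapackage.zip")
```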
Package.load(descriptor, basePath, strict): Package
Factory method to instantiate the Package class.
This method is async and it should be used with await keyword or as a Promise.
Returns: Package - data package class instance
Throws: DataPackageError - raises error if something goes wrong
Param | Type | Description |
---|---|---|
descriptor | string | Object | package descriptor as local path, url or object. If the path has a zip file extension it will be unzipped to the temp directory first. |
basePath | string | base path for all relative paths |
strict | boolean | strict flag to alter validation behavior. Setting it to true leads to throwing errors on any operation with invalid descriptor |
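Resource representation
- valid: Boolean
- errors: List.<Error>
- profile: Profile
- descriptor: Object
- name: string
- inline: boolean
- local: boolean
- remote: boolean
- multipart: boolean
- tabular: boolean
- source: List | string
- headers: List.<string>
- schema: tableschema.Schema
- iter: AsyncIterator | Stream
- read: List.<List> | List.<Object>
- checkRelations: boolean
- rawIter: Iterator | Stream
- rawRead: Buffer
- infer: Object
- commit: boolean
- save: boolean
- Resource.load: Resource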
valid: Boolean
Validation status. It is always true in strict mode.
Returns: Boolean - validation status

errors: List.<Error>
Validation errors. It is always empty in strict mode.
Returns: List.<Error> - validation errors

profile: Profile
Profile

descriptor: Object
Descriptor
Returns: Object - schema descriptor

name: string
Name

inline: boolean
Whether resource is inline

local: boolean
Whether resource is local

remote: boolean
Whether resource is remote

multipart: boolean
Whether resource is multipart

tabular: boolean
Whether resource is tabular

source: List | string
Source
Combination of resource.source and resource.inline/local/remote/multipart provides a predictable interface to work with resource data.

headers: List.<string>
Headers
Only for tabular resources
Returns: List.<string> - data source headers

schema: tableschema.Schema
Schema
Only for tabular resources

iter: AsyncIterator | Stream
Iterate through the table data
Only for tabular resources
It emits rows cast based on the table schema (async for loop). With a stream flag, a Node stream is returned instead of an async iterator. Data casting can be disabled.
Returns: AsyncIterator | Stream - async iterator/stream of rows:
- [value1, value2] - base
- {header1: value1, header2: value2} - keyed
- [rowNumber, [header1, header2], [value1, value2]] - extended
Throws: TableSchemaError - raises any error that occurred in this process
Param | Type | Description |
---|---|---|
keyed | boolean | iter keyed rows |
extended | boolean | iter extended rows |
cast | boolean | disable data casting if false |
forceCast | boolean | instead of raising on the first row with a cast error, return an error object to replace the failed row. This allows iterating over the whole data file even if it is not compliant with the schema. Example of output stream: [['val1', 'val2'], TableSchemaError, ['val3', 'val4'], ...] |
relations | boolean | if true, foreign key fields will be checked and resolved to their references |
stream | boolean | return Node Readable Stream of table rows |
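The three row shapes listed above (base, keyed, extended) can be pictured with plain R lists. This is a minimal sketch of the shapes only, assuming one row taken from population.csv and an arbitrary row number chosen for illustration; it does not call the package:

```r
headers <- c("city", "year", "population")
values <- list("london", 2017L, 8780000L)
row_number <- 2L  # arbitrary value, for illustration only

# base: just the values, in header order
base_row <- values

# keyed: the same values, named by their headers
keyed_row <- setNames(values, headers)

# extended: row number, headers and values together
extended_row <- list(row_number, headers, values)

keyed_row$population
str(extended_row)
```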
read: List.<List> | List.<Object>
Read the table data into memory
Only for tabular resources; the API is the same as resource.iter has, except for the limit parameter.
Returns: List.<List> | List.<Object> - list of rows:
- [value1, value2] - base
- {header1: value1, header2: value2} - keyed
- [rowNumber, [header1, header2], [value1, value2]] - extended
Param | Type | Description |
---|---|---|
limit | integer | limit of rows to read |
checkRelations: boolean
It checks foreign keys and raises an exception if there are integrity issues.
Only for tabular resources
Returns: boolean - true if there are no issues
Throws: DataPackageError - raises if there are integrity issues

rawIter: Iterator | Stream
Iterate over data chunks as bytes. If stream is true, a Node Stream will be returned.
Returns: Iterator | Stream - Iterator/Stream
Param | Type | Description |
---|---|---|
stream | boolean | Node Stream will be returned |

rawRead: Buffer
Returns resource data as bytes.
Returns: Buffer - Buffer with resource data
infer: Object
Infer resource metadata like name, format, mediatype, encoding, schema and profile.
It commits these changes into the resource instance.
Returns: Object - resource descriptor

commit(strict): boolean
Update resource instance if there are in-place changes in the descriptor.
Returns: boolean - true on success and false if not modified
Throws:
Param | Type | Description |
---|---|---|
strict | boolean | alter strict mode for further work |

save(target): boolean
Save resource to target destination.
For now only the descriptor will be saved.
Returns: boolean - true on success
Throws: DataPackageError - raises error if something goes wrong
Param | Type | Description |
---|---|---|
target | string | path where to save a resource |
Resource.load(descriptor, basePath, strict): Resource
Factory method to instantiate the Resource class.
This method is async and it should be used with await keyword or as a Promise.
Returns: Resource - resource class instance
Throws: DataPackageError - raises error if something goes wrong
Param | Type | Description |
---|---|---|
descriptor | string | Object | resource descriptor as local path, url or object |
basePath | string | base path for all relative paths |
strict | boolean | strict flag to alter validation behavior. Setting it to true leads to throwing errors on any operation with invalid descriptor |
Profile representation
- name: string
- jsonschema: Object
- validate(descriptor): Object
- Profile.load(profile): Profile
name: string
Name

jsonschema: Object
JsonSchema

validate(descriptor): Object
Validate a data package descriptor against the profile.
Returns: Object - a {valid, errors} object
Param | Type | Description |
---|---|---|
descriptor | Object | retrieved and dereferenced data package descriptor |

Profile.load(profile): Profile
Factory method to instantiate the Profile class.
This method is async and it should be used with await keyword or as a Promise.
Returns: Profile - profile class instance
Throws: DataPackageError - raises error if something goes wrong
Param | Type | Description |
---|---|---|
profile | string | profile name in registry or URL to JSON Schema |
validate(descriptor): Object
This function is async so it has to be used with await keyword or as a Promise.
Returns: Object - a {valid, errors} object
Param | Type | Description |
---|---|---|
descriptor | string | Object | data package descriptor (local/remote path or object) |
infer(pattern): Object
This function is async so it has to be used with await keyword or as a Promise.
Returns: Object - data package descriptor
Param | Type | Description |
---|---|---|
pattern | string | glob file pattern |
DataPackageError: base class for all DataPackage errors.
TableSchemaError: base class for all TableSchema errors.
The project follows the Open Knowledge International coding standards. There are common commands to work with the project. The recommended way to get started is to create, activate and load the package environment. To install the package and development dependencies into the active environment:
devtools::install_github("frictionlessdata/datapackage-r", dependencies=TRUE)
To write a test:
library(testthat)
test_that("addition works", {
  actual <- 1 + 1   # value under test
  expected <- 2     # expected result
  expect_equal(actual, expected)
})
To run tests:
devtools::test()
More detailed information about how to create and run tests can be found in the testthat package documentation.
Only breaking and the most important changes are described in NEWS.md. The full changelog can be found in the nicely formatted commit history.