ropensci / git2rdata

An R package for storing and retrieving data.frames in git repositories.
https://ropensci.github.io/git2rdata/
GNU General Public License v3.0

Make use of Frictionless Table Schema to store metadata #66

Open peterdesmet opened 3 years ago

peterdesmet commented 3 years ago

Suggestion: rather than using a custom format to store metadata about fields and their data types, it might be worth looking into the Frictionless Table Schema. It is a specification for storing information about tabular data as a JSON (or potentially YAML) file. Similar elements could be borrowed, and the schema can be extended with the properties that git2rdata specifically needs.

Here's a snippet from an example (taken from https://github.com/inbo/datapackage/blob/b049504a1396bfddf7af7e595f5b856da02375d0/inst/extdata/datapackage.json#L133-L154)

{
  "name": "count",
  "type": "integer",
  "constraints": {
    "required": false,
    "minimum": 1
  }
},
{
  "name": "age",
  "type": "string",
  "constraints": {
    "required": false,
    "enum": [
      "adult",
      "subadult",
      "juvenile",
      "offspring",
      "undefined"
    ]
  }
}
peterdesmet commented 3 years ago

Discussed on June 29. A non-invasive option for git2rdata would be a function that generates a datapackage.json file, which effectively makes a collection of .tsv files a Data Package.

Example:

dep.tsv
dep.yml
obs.tsv
obs.yml

Call function:

make_datapackage(name = "my-dataset", license = "CC0-1.0", resources = c("dep.tsv", "obs.tsv"))

Result:

dep.tsv
dep.yml
obs.tsv
obs.yml
datapackage.json

With datapackage.json:

{
  "name": "my-dataset",
  "profile": "tabular-data-package",
  "licenses": { "name": "CC0-1.0" },
  "resources": [
    {
      "name": "dep",
      "path": "dep.tsv",
      "dialect": ...,
      "schema": ...
    },
    {
      "name": "obs",
      "path": "obs.tsv",
      "dialect": ...,
      "schema": ...
    }
  ]
}
florisvdh commented 3 years ago

Seems like a good idea to me.

It appears that the git2rdata .yml files would be kept out of the datapackage? While their main use is specific to R (variable type, git2rdata version, etc.), they also define the factor levels, which are only represented as numerical indices in the .tsv. Are those levels contained in the JSON file as well?

peterdesmet commented 3 years ago

git2rdata .yml files would be kept out of the datapackage?

No, they can remain there.

Factor levels can be expressed as an enum in a datapackage, but that only works if the values in the csv are the factor levels, not the factor indices.
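
For illustration, a minimal sketch of that difference (the file names are hypothetical):

library(git2rdata)

df <- data.frame(
  age = factor(c("adult", "juvenile"), levels = c("adult", "juvenile"))
)

# Verbose storage writes the labels themselves, so a Table Schema enum
# on the levels would match the stored values
write_vc(df, "df_verbose", sorting = "age", optimize = FALSE)

# Optimized storage writes the level indices; the labels only live in
# the .yml metadata, so an enum on the labels would not match
write_vc(df, "df_optimized", sorting = "age", optimize = TRUE)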

ThierryO commented 2 years ago

tsv files seem to be an unknown file format for the general public. Therefore I'm thinking of using them only with write_vc(optimize = TRUE). As the optimized format requires either the git2rdata package to read it or an expert to figure it out, the tsv format is not an issue there. For the version with write_vc(optimize = FALSE) we have IMHO two options: 1) keep it tab-delimited but use the .txt file extension, or 2) switch to csv with , as separator and . as decimal mark.

florisvdh commented 2 years ago

IMHO tab-delimited data has fewer reading and interpretation problems than csv, since tabs are rarely (if ever) used within data values (strings), while commas and semicolons often occur in them (and both are used as separators in csv). So I think the choice to use tsv was a smart one!

Maybe the discussion then is about the file extension. Is a .tsv extension recognized as a tabular data file on Windows? One could choose between .txt and .tsv based on that.

ThierryO commented 2 years ago

The main problem is that some users don't recognise the tsv format, and hence don't know what to do with it. Which file format are people more likely to recognise? And do they know how to open or import that format into e.g. Excel?

peterdesmet commented 2 years ago

The most recognizable is .csv. It is my preferred option for data, especially because the extension is almost synonymous with "data". I would reserve its use for data that are indeed comma delimited. @florisvdh I think using commas as a delimiter is fine: most programs handle "-escaped data values well.

If you want to stick with tab-delimited, you could opt for .tsv or .txt. Both will be read out of the box by Excel (a tab-delimited .csv will not), but with the typical date and number handling issues that Excel has. GBIF downloads are tab-delimited .txt files. I think the main downside of using .txt is that it does not imply "data" the way .csv does.

ElsLommelen commented 2 years ago

Admittedly, I also like the rarely used .tsv format for saving data with git2rdata, but for a different reason than @florisvdh: users don't know it, so they are less eager to open and edit the files manually. And if they for some reason add data to a data repo without using write_vc(), they will use .csv, making these manually added files easily distinguishable. So the lesser-known format prevents uninformed users from messing up files generated by git2rdata, while it remains easy to use for informed users.

By uninformed users, I mean users who don't know git2rdata but do work with R a lot. People who hardly use R don't know .csv either, so for them it makes no difference: you have to explain anyway how to open it. But users who know R very well and are not aware that the git2rdata format is used to save the data may mess up a whole data repo when they are used to working with .csv without git2rdata. Of course it is possible to remove these commits afterwards, but it saves a lot of time (and frustration) for both maintainer and user if the user notices beforehand that these are not just .csv files. (I have already been in this situation, where a coworker fortunately asked for some information beforehand, only because she was not familiar with .tsv.) And as always: don't assume people read the documentation before contributing, so I think it is good to use a lesser-known format for git2rdata as a wake-up call for contributors.

ThierryO commented 2 years ago

@ElsLommelen the idea is to have two flavours of data formats. The optimized version remains .tsv and is intended for hardcore users who prefer efficiency over human readability. The non-optimized version is intended for cases where the file should be easy to read by a larger audience. Therefore I'll switch to .csv in that case.

Note that changes made outside of git2rdata are detected, regardless of the file format (tsv or csv). When you place the files under version control, you can always revert changes made by a user, as sketched below. Note that updating the data without adding or removing variables, or changing their order, is possible by design.
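
A rough sketch of that workflow, assuming the files live in a git repository managed with git2r:

library(git2r)
library(git2rdata)

repo <- repository(".")

# Suppose a user edited df.tsv by hand; read_vc() compares the file
# against the metadata stored in df.yml and flags the mismatch
df <- read_vc("df", root = repo)

# Discard the manual edits by restoring the committed version
checkout(repo, path = "df.tsv")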

peterdesmet commented 2 years ago

I think the frictionless R package is now mature enough (submitted for peer review, with a CRAN submission to follow) to take the next step of implementing the function suggested in this issue in git2rdata.

See https://github.com/ropensci/git2rdata/issues/66#issuecomment-870600599 for the initial design discussion. With frictionless it would be possible to add a datapackage.json with the correct data types as follows:

library(frictionless)
library(git2rdata)
library(magrittr)

# Create a data frame
df_original <- data.frame(
  id = c(1L, 2L),
  timestamp = c(
    as.POSIXct("2020-03-01 12:00:00", tz = "EET"),
    as.POSIXct("2020-03-01 18:45:00", tz = "EET")
  ),
  life_stage = factor(c("adult", "adult"), levels = c("adult", "juvenile"))
)

# Write to file with write_vc()
git2rdata::write_vc(df_original, "df")

# Read back with read_vc(); data types are restored from the metadata
df_returned <- git2rdata::read_vc("df")

# Create Frictionless Package
package <-
  create_package() %>%
  add_resource(
    resource_name = "df",
    data = "df.tsv",
    schema = create_schema(df_returned), # Use df_returned to pass on all data type properties
    delim = "\t"
  )

# Write Frictionless Data Package to disk
write_package(package) # This will not overwrite existing files

This can be wrapped in a function to which one provides the data resources to be bundled:

make_datapackage(name = "my-dataset", license = "CC0-1.0", resources = c("dep.tsv", "obs.tsv"))
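
For reference, a rough sketch of what such a wrapper could look like, built on the frictionless calls above (make_datapackage and its signature are hypothetical and do not exist in git2rdata yet):

library(frictionless)
library(git2rdata)

make_datapackage <- function(name, license, resources, root = ".") {
  # Hypothetical wrapper: bundle git2rdata .tsv files into a Data Package
  package <- create_package()
  package$name <- name
  package$licenses <- list(list(name = license))
  for (path in resources) {
    resource_name <- tools::file_path_sans_ext(basename(path))
    # Read via git2rdata so the schema inherits the stored data types
    df <- read_vc(resource_name, root = root)
    package <- add_resource(
      package,
      resource_name = resource_name,
      data = path,
      schema = create_schema(df),
      delim = "\t"
    )
  }
  write_package(package, directory = root)
}

make_datapackage(
  name = "my-dataset", license = "CC0-1.0",
  resources = c("dep.tsv", "obs.tsv")
)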