ropensci / software-review

rOpenSci Software Peer Review.

git2rdata, a companion package for git2r #263

Closed ThierryO closed 4 years ago

ThierryO commented 5 years ago

Summary

Package: git2rdata
Title: Store and Retrieve Data.frames in a Git Repository
Version: 0.0.1.9000
Date: 2018-11-12
Authors@R: c(
  person(
    "Thierry", "Onkelinx", role = c("aut", "cre"), 
    email = "thierry.onkelinx@inbo.be", 
    comment = c(ORCID = "0000-0001-8804-4216")),
  person(
    "Research Institute for Nature and Forest",
    role = c("cph", "fnd"), email = "info@inbo.be"))
Description: Make versioning of data.frame easy and efficient using git repositories.
Depends: R (>= 3.4.0)
Imports:
  assertthat,
  git2r (>= 0.23.0),
  methods,
  readr
Suggests:
  knitr,
  rmarkdown,
  testthat
License: GPL-3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 6.1.1
URL: https://github.com/inbo/git2rdata
BugReports: https://github.com/inbo/git2rdata/issues
Collate:
    'write_vc.R'
    'auto_commit.R'
    'clean_data_path.R'
    'git2rdata-package.R'
    'meta.R'
    'read_vc.R'
    'recent_commit.R'
    'reexport.R'
    'rm_data.R'
VignetteBuilder: knitr

https://github.com/inbo/git2rdata

Reproducibility, because it helps to store dataframes as plain text files without loss of information, while minimizing file sizes and diffs.

Anyone who wants to work with medium-sized dataframes and keep them under version control. This is useful for recurrent analyses on growing datasets, e.g. each year new data is added and a new report is created. A snapshot of the raw data can be stored as plain text files under version control.

The initial idea was to add this functionality to git2r. After a discussion with the maintainer, we decided to create a separate package. We use functions from readr and improve on them by storing the metadata.

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

We use read_vc() rather than vc_read() because it matches read.table(), read_csv(), ...

@jennybc @stewid

sckott commented 5 years ago

thanks for your submission @ThierryO - editors are discussing now

noamross commented 5 years ago

Editor checks:


Editor comments

Thanks for your submission @ThierryO! This is an interesting package and it passes checks with nothing but this from goodpractice::gp():

  ✖ omit "Date" in DESCRIPTION. It is not required and it
    gets invalid quite often. A build date will be added to the package
    when you perform `R CMD build` on it.

I'm seeking reviewers for it now. FYI, because of the thorny nature of the problem you're tackling - data versioning workflows - I anticipate a lot of discussion around high-level concepts and architecture. I'm going to look for reviewers with a lot of experience on these workflow issues, and encourage them to start that discussion before checking off all the low-level boxes so as to have an efficient and productive review.


Reviewers: @jennybc @brodieG Due date: 04-01-2019

noamross commented 5 years ago

Reviewers assigned: @jennybc and @brodieG. At reviewers' request, the final review date is extended to deal with travel and holidays. However, I request that reviewers provide some comment on the high-level workflow/architecture issues that I think are likely to be the main points of discussion sooner, before a more detailed look at the code. Roughly:

ThierryO commented 5 years ago

Thanks for assigning the reviewers. I'm looking forward to their comments. There were some minor changes to the package. The latest version is available at inbo/git2rdata#5

brodieG commented 5 years ago

Hi Thierry. This looks like an interesting package to review; looking forward to it.

One small request: do you mind creating tags/releases that we can work off of so there is no issue of the code changing under our feet as we are reviewing it? Ideally this would also be accompanied by a unique package version in DESCRIPTION. This will ensure we're all exactly on the same page about what version of the code is being reviewed. Thank you.

ThierryO commented 5 years ago

Good point. There is currently a single release tag v0.0.1.9000 which is the current HEAD of the master branch. I suggest that you use this for the review. I won't push to the master branch during the review process.

The current changes in the develop branch are mainly cosmetic, except one hotfix.

brodieG commented 5 years ago

Sounds good, I will do all of my work off of v0.0.1.9000.

brodieG commented 5 years ago

Overview

I haven't reviewed the internals of the package very closely yet, but overall I like the concept and the initial cut at the implementation. There is good attention to detail (e.g. helpful error messages), parameter checking, generally good documentation, and good test coverage of the key functions. These all demonstrate a commitment to robust programming practices.

At the request of the editor I'm going to focus on the top level questions first rather than go through the step by step review form. I did take the liberty to add some comments that fall outside of the scope of the questions asked, and would appreciate some feedback from @noamross on which points should fall under the scope of this review vs not. I'm finding this topic pretty interesting, but I don't want to over-burden the author by extending the review into areas that while interesting and related do not need to be addressed for the purposes of the review.

Editor Questions

Q1

Is the expected workflow well communicated and implemented in the documentation and vignette?

Vignette

The vignette provides end to end "toy" examples for both the versioned and unversioned workflows. These examples give a good sense of how the functions operate and how the user is expected to interact with them.

There are a few areas of a versioned workflow that would benefit from additional exploration:

Documentation

The man documentation for most of the functions I looked at does not have usage examples. It should, to satisfy rOpenSci's "multiple points of entry" principle. Also, while I don't think this is a strict CRAN requirement, I have gotten feedback from CRAN maintainers that examples should be included.

Q2

How does the workflow and model of the package work with other frameworks for versioning and reproducibility? Are concepts compatible and separable?

I do not have extensive experience with "reproducible workflows" beyond package testing, so what follows is based on first principles rather than first hand knowledge.

It strikes me that the package is more intertwined with a git workflow than it needs to be. While write_vc and read_vc produce outputs that work well with version control, there is no real need for an explicit link. For example, if write_vc returned file paths instead of file hashes, we could just do:

git2r::add(path=write_vc(df, 'df'))

instead of:

write_vc(df, 'df', repo, stage=TRUE)

To me the semantics are much clearer in the first example, or at worst, just as good as the second. The big difference is that you can strip away a large amount of the complexity in your code. You could potentially even ditch S4 in favor of S3 as the version control would be explicitly selected by the user, e.g. in the case of an imaginary svnr package:

svnr::commit(write_cv(df, 'df'))

It might even be reasonable to completely remove the git2r dependency, with no real loss in functionality. I'll freely admit it is entirely possible I'm missing use cases where the tight integration with git2r is beneficial, but I don't think I've seen clear examples of that. Please feel free to point them out to me if they already exist and I missed them, or make some up for review.

One possible example is that we have to resolve the question of how to link the data to the code, as I suggested in Q1, which would be a little easier to do with the repository object directly available, but I'm not sure that tilts the balance back to the more complex implementation. Also, detection of unversioned files would be easier with the repository object. That said, neither of these pieces of functionality currently exists, and the complexity to support the explicit git integration does.

In terms of integrating with other vcs, the expectation seems to be implicitly that someone will write an S4 method that supports it. That's a reasonably tall ask when the alternative above is possible.

Aside: is there a particular value to the file hashes that I'm missing? I would imagine in most cases people won't care what they are, and when they do, they can just call git2r::hashfile on the file names anyway.
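For illustration, a sketch of what this could look like if write_vc returned plain file paths (the returned paths here are hypothetical, and the files are assumed to exist on disk from the preceding write):

paths <- c("df.tsv", "df.yml")  # hypothetical return value: plain file paths
git2r::hashfile(paths)          # blob hashes, computed only when needed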

Q3

Is the package API (relatively) intuitive and work well with the workflow?

Yes, although there are some exceptions and possible improvements:

rm_data

The semantics of rm_data(..., type='tsv'), and rm_data(..., type='yml') are different. For the former the tsv data files are removed, whereas for the latter the yml files are removed, but only if there is not a corresponding tsv file. This is only indirectly documented via the vignette examples.

We would recommend that you split up the behavior across functions, possibly called:

You could document the difference in behavior in the function itself as a partial remedy, but the behavior is different enough that it feels as though these should be different functions so the semantics are clear.

Another issue to consider is whether you should allow the removal of unversioned files without some kind of warning. I realize that if you unlink or file.remove in R you'd better be sure you know what you are doing, but particularly for data that is in a version controlled directory, but not itself under version control, aggressive file deletion might be undesirable behavior.

Versioned vs. Unversioned Modes

As mentioned in Q2, I think it is detrimental to try to intertwine these more than strictly necessary.

Possible Improvements

I would encourage you to create an intermediary git2rdata data type that inherits from data frame, possibly with the meta data attached as an attribute. Let's call the new data type git2rdat. We can then define (all in S3 here for convenience):

as.git2rdat.data.frame
write_vc.git2rdat
write_vc.data.frame      # calls as.git2rdat.data.frame and forward
all.equal.git2rdat

Which then simplify the workflow in the vignette:

all.equal(read_vc(file = "git/test", root = repo), x)

Is now equivalent to:

y <- read_vc(file = "git/test", root = repo)
sorted.x <- x[order(x$x), colnames(y)]
rownames(sorted.x) <- NULL
all.equal(sorted.x, y, check.attributes = FALSE)
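A minimal sketch of what such an S3 layer could look like (the class name, attribute name, and all details below are illustrative only, not part of git2rdata):

as.git2rdat <- function(x, sorting = character()) {
  stopifnot(is.data.frame(x), all(sorting %in% colnames(x)))
  attr(x, "git2rdat_sorting") <- sorting
  class(x) <- c("git2rdat", class(x))
  x
}

all.equal.git2rdat <- function(target, current, ...) {
  sorting <- attr(target, "git2rdat_sorting")
  normalise <- function(d) {
    d <- as.data.frame(d)
    class(d) <- "data.frame"   # drop the wrapper class to avoid re-dispatch
    d <- d[, sort(colnames(d)), drop = FALSE]
    if (length(sorting)) d <- d[do.call(order, unname(d[sorting])), , drop = FALSE]
    rownames(d) <- NULL
    d
  }
  # compare the normalised versions so row/column order no longer matters
  all.equal(normalise(target), normalise(current), check.attributes = FALSE)
}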

Additional Comments

More Workflow Comments

I think it would be helpful to have more aggressive defaults to facilitate the workflow. For example, why not have the following:

write_vc <-
  function(x, file=deparse(substitute(x)), root=getOption('proj.root'))

Then things like the following just work:

git2r::add(write_vc(iris))
git2r::commit()

This would produce 'iris.tsv' and 'iris.yml' at getOption('proj.root').

Obviously you will need something more sophisticated than what I use above (e.g. to handle bad names or automatically detecting a project root, or setting a default data folder under the auto-detected project root), but I find the simplicity of the above compelling.

Implications of Versioning Data in Git

@jennybc probably has a good view on this already, so I'll defer to her input here, but while we wait for it here is my view on this:

Git is not really meant to store data frames, so I think it would be really useful to have a more systematic assessment of whether this is a Bad Idea or not, and what the limits are.

You mention "medium sized data". I suspect different people will have different ideas of what that means. I recommend that you do some analysis of what the implications are of repeatedly committing a data set of the maximal size you intend your package to be used with. You should examine the effects on the size and performance of the repository over enough commits that you can provide guidelines to your users about what is okay to commit and what is probably too much. You should probably also compare the effects of commits with small changes vs. large changes.

It may be that there is actually no problem with this at all, but it is unusual enough that it seems worth covering. At a minimum, I would imagine prospective users to be curious about it. There seem to exist several frameworks that separate the data from the repository so there probably is some reason for concern.

Rationale

I think the vignette would benefit from a more thorough analysis of the rationale for the package. In the review issue you give the following as a potential application:

E.g. each year new data is added to the data and a new report is created. A snapshot of the raw data can be stored as plain text files under version control.

Based on my experience this is not a great reason to use version control. The data is effectively self versioned via the date variable so the version control is redundant (certainly, the old data could change too, but even that can be tracked explicitly in the data via a report ID for explicit analysis).

I feel that the version control is most useful for data that is not supposed to change. The reason for this is that when data is supposed to change, the natural thing to do is to analyze the changes to understand the meaning of the changes. Diffs are terrible tools for this. They will tell you that number A changed to number B, but will do so in a way that is very difficult to meaningfully contextualize. This applies to tabular data in general. Unless you are looking at a tiny data set, the table view is almost always useless. Even iris with its puny 150 rows is almost always viewed as plots or otherwise subjected to statistical computation.

On the other hand, for things that are not supposed to change, at least without explicit review, knowing what changed is typically the biggest part of the battle. There a diff based workflow can be very helpful.

I see two obvious applications:

Clearly you have these things in mind, but it might be worth being more explicit about the benefit of versioned data in the context of reproducible analysis, and then explicitly highlight workflows that do that.

Research Of Comparable Solutions

I found:

More generally, it might be worth doing some level of literature review to see what solutions exist already and whether they offer any insights on best practices to apply to your package.

Thomas's paper covers some of this, and also references daff, which I was going to suggest separately. Edwin De Jonge wrote an R wrapper for it. I have not personally used it, but I've done a fair bit of diffing of data frames with line-based diffs and honestly they do not work that well. Your idea of sorting rows/columns will help to some extent, but you should assess whether it makes sense to bring in a tabular diff.

Another idea raised in the same paper that I found interesting is to store data in long format. It could obviate the need for daff by making all changes row-based, but will make the data larger. I don't know how much compression would help. This obviously introduces additional complexity that in the end may not justify itself, but I did find the idea of reducing the data to a form where row-based diffs would work perfectly very appealing.
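As a toy illustration of the long-format idea (plain base R, not git2rdata functionality): in the long table, a change to a single cell of the wide table touches exactly one row.

wide <- data.frame(id = 1:2, height = c(10, 12), weight = c(5, 7))
long <- data.frame(
  id       = rep(wide$id, times = 2),
  variable = rep(c("height", "weight"), each = nrow(wide)),
  value    = c(wide$height, wide$weight)
)
long[order(long$id, long$variable), ]  # a stable order keeps the diffs row-based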

Also interesting:

I have not used any of these solutions, so I cannot vouch for their efficacy or safety.

Storage format

Dependencies

Do you really need readr? You are dealing with a special case where you know the origin of the data, so you are less likely to need the special features offered by readr. Also, if you're explicitly targeting medium-sized data, do you need high performance? Unlike assertthat, readr carries along many of its own dependencies. I would strongly encourage you to consider whether write.table could do the job well enough, even if it requires a little extra work.

noamross commented 5 years ago

Thank you @brodieG for a pre-review as detailed as many of our full reviews! All of this is in scope for the review. I think the discussion of other frameworks is quite helpful. One does not have to adopt a totally different model as a result, but I think using that context to guide the API and documentation is valuable. I find it increasingly common and helpful to put a "Related Work" section in the README/vignettes to compare a package's approach to other options.

@ThierryO, I'd suggest waiting for @jennybc's input before making decisions and starting work on any major changes, but do respond to @brodieG's comments as you see fit.

ThierryO commented 5 years ago

Thank you @brodieG for the elaborate review. Here are my ideas on the topics you raised.

Editors questions

Q1

Vignette

Adding an example where the data is actually used is a good suggestion. I have presented our workflow as a poster at the International Statistical Ecology Conference ISEC 2018. Please note that this poster predates the conception of git2rdata. A precursor of the functionality was then available in n2khelper.

Tracking analysis code is outside the scope of git2rdata. However, I do that in our n2kanalysis package. This package uses the output of git2rdata::recent_commit() as an entry point for the data. The code version is documented as a list of used packages and their versions (including the commit hash for GitHub packages). n2kanalysis even has file and status fingerprints in each object, which makes it possible to check the integrity of the object.

If we assume that the code to create and store the data.frames is all stored in a package foo, then git2rdata::auto_commit() can be used to document this. The commit message will read "Automated commit from 'foo'" and adds the full list of loaded packages and their versions at the moment of the commit. This is of course not tamper-proof, but that is not the goal of git2rdata.

Documentation

If required, I can add some toy examples to the man documentation. How extensive should these examples be?

Q2

The initial idea of the package was to read and write data.frames into a git repository. Hence the strong link with Git and git2r. Then I had the idea that simply writing data.frames without version control (or to a different version control system than Git) might be relevant as well for other users.

Currently write_vc() returns hashes because I use these hashes in our workflow (they are used to calculate the file fingerprints). The file names are returned as the names of the hashes. So the examples with git2r::add() and svnr::commit() are currently possible when you write them as below. I agree that it is slightly more cumbersome. Returning a named vector with the file paths and adding the hashes as names is an option.

git2r::add(path = names(write_vc(df, 'df')))
svnr::commit(names(write_cv(df, 'df')))

The current version only has methods that dispatch on the root argument, making S3 a relevant option, although it would require that root is the first and mandatory argument. S4 allows writing methods for different arguments. E.g. we could think of methods for x objects that are something other than a data.frame (e.g. a SpatialPointsDataFrame).

Q3

rm_data

Possible improvement

I'm not sure there is much benefit in creating a new class. We often just want to write the object to a file / VCS and not bother with any intermediate formats.

The all.equal() example is rather verbose because I want to stress the fact that write_vc() alters the order of the rows and the columns. I'm not sure that the user would need an all.equal() method for git2rdata. This equality should be guaranteed by the package itself (and it is, through unit tests).

Additional comments

More Workflow Comments

There already is a PR which implements root = "." as default. Another option would be to use rprojroot::find_root() to find a suitable root. A sensible order of defaults would be is_git_root, is_svn_root, is_r_package, is_rstudio_project.
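A rough sketch of that fallback order, assuming rprojroot is available (criteria combined with `|` are accepted if any of them matches while searching upwards from the working directory):

root_default <- function(path = ".") {
  criterion <- rprojroot::is_git_root |
    rprojroot::is_svn_root |
    rprojroot::is_r_package |
    rprojroot::is_rstudio_project
  rprojroot::find_root(criterion, path = path)
}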

file = deparse(substitute(x)) is problematic as it becomes file = "." when piping is used, as demonstrated in the example below. Furthermore, it assumes that people use informative names for their object (which they seldom do). I prefer to force the user to think about naming the file.

library(dplyr)
test <- function(x, file = deparse(substitute(x))) {
  message(file)
}
test(iris)
iris %>%
  test()

Implications of Versioning Data in Git

A separate vignette with guidelines and caveats on using git2rdata with a Git repository would indeed be useful. I'd like to postpone this; it allows me to try it on larger data sets and see where it breaks.

Rationale

In my line of work, I don't have any control over the raw data sources and hence cannot guarantee that I can reconstruct the raw data. I've been repeating the same analyses every year on several data sources. At least two of them had a major overhaul of the database structure during the past five years, and there is no version control within the database itself.

In the past we published results which were questioned by one of our stakeholders. They disagreed with the results and feared that the results would lead to a detrimental change in policy. The simple fact that we were able to state that both the data and the code were placed under version control instantly increased our trustworthiness. After a walk-through of the data and the code, they now agree on the methodology and hence the results.

So the focus is not on "the analysis of what changed in the data", but rather on "what data was used in this version of the analysis". git2rdata is therefore just a tool in a fully reproducible analytical pipeline, not the full pipeline.

Research Of Comparable Solutions

I'll have a look at these suggestions over the next weeks.

Storage format

Storing each object in a dedicated folder is an interesting suggestion. Another benefit of this is that we could allow the user to specify "file" extensions. E.g. write_vc(x, file = "test.txt") would store the data in test.txt/data.tsv and test.txt/metadata.yml.

Dependencies

An earlier version of git2rdata relied on write.table() and read.table(). The downside was that storing characters was somewhat limited: NA, \t and \n were not allowed when stored unquoted.

Personally, I like tibble over data.frame because of the superior print() method. Adding only the tibble dependency would also add at least fansi, pillar and utf8. Leaving aside the dependencies already listed under 'Suggests:', it will also require cli, crayon and rlang.

While thinking about adding tibble, readr came into view. I agree that it again adds several dependencies: BH, hms and pkgconfig (plus R6 and Rcpp, which are only dependencies of git2rdata's Suggested dependencies).

However, I see several gains in using readr

  1. read_vc() returns a tibble (agreed, rather a matter of taste than a critical improvement)
  2. read_tsv() is faster than read.table(), as claimed by the readr README, Efficient R programming, this blog, ...
  3. write_tsv() handles NA and special characters efficiently by adding quotes only when really needed, and hence code that I don't have to write.
  4. read_tsv() allows specifying the column types when reading the data.

IMHO, these gains outweigh the extra dependencies.

noamross commented 5 years ago

@jennybc A reminder that your review was due last week, please get it in as soon as you can.

noamross commented 5 years ago

While waiting for the second review (which I think might take until after rstudio::conf), some other quick responses:

Examples

If required, I can add some toy examples to the man documentation. How extensive should these examples be?

There's a lot of flexibility in this, but generally, these should be enough to demonstrate minimal usage/syntax. I usually make sure to have one example with only required arguments and examples that together include all optional arguments, and that print/show examples of the output value.

Dependencies

We outline some low-dependency options for tibbles in the development guide, but if the author can justify the trade-offs for dependencies we defer to their choice.

ThierryO commented 5 years ago

@noamross: Our main arguments in favour of readr are:

  1. write_tsv() handles NA and special characters efficiently by adding quotes only when really needed, and hence code that I don't have to write.
  2. read_tsv() allows specifying the column types when reading the data.

Removing the readr dependency would require us to write a lot more code to get the same result, or to limit functionality. E.g. in earlier versions (without readr) we could not allow writing the string "NA" because storing it without quotes makes it indistinguishable from the missing string NA. Not allowing "NA" to be stored seemed a better choice than not allowing missing values in character columns. readr has the option to add quotes only in cases where it is needed to eliminate ambiguity.

We'd like to get feedback from @noamross, @jennybc and @brodieG on this topic.

brodieG commented 5 years ago

@ThierryO, I haven't responded as I wanted to give @jennybc a chance to comment, but since you're asking explicitly, some responses to the narrow questions regarding readr. Here is some simple code that deals with I think most of the pathological inputs you're describing:

df <- data.frame(
  a=c(NA,"NA",'"NA"','""NA"'),
  b=c(1,2,3,'N\tA'),
  c=letters[1:4],
  d=runif(4),
  stringsAsFactors=FALSE
)
chr.types <- vapply(df, function(x) tail(class(x), 1), "")
chr.cols <- chr.types == 'character'
special <- logical(ncol(df))
special[chr.cols] <- vapply(df[chr.cols], function(x) any(grepl("\"|\t", x)), NA)

write.table(
  df, sep='\t', quote=which(special), 'df1.tsv', na='NANANA',
  qmethod='double', row.names=FALSE
)
read.table(
  'df1.tsv', sep='\t', stringsAsFactors=FALSE, na.strings='NANANA', header=TRUE
)

You will need to generate a string that doesn't exist in the data to use as an NA string. A simple low-cost approach is to find the length of the longest string with nchar and just use strrep('NA', len + 1), although that means you will also have to disallow strings longer than INT_MAX - 1 (32 bit), as INT_MAX is currently the string length limit in R.
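A hedged sketch of that suggestion (the helper name is made up): build an NA placeholder strictly longer than any string in the character columns, so it cannot collide with real data.

na_string_for <- function(df) {
  chr <- unlist(df[vapply(df, is.character, NA)], use.names = FALSE)
  len <- if (length(chr)) max(nchar(chr), na.rm = TRUE) else 0L
  strrep("NA", len + 1L)  # longer than any existing string, so no clash possible
}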

This also quotes an entire column if any element contains either a tab or a quote. However, both these cases (and cases where there are columns with strings that are INT_MAX long, which would probably be a terrible use case for this anyway) are likely to be rare.

Undoubtedly readr will be much more sophisticated about how it handles this business, but perhaps given these examples the added value of the sophistication may seem lower. What is not low is the weight of the dependencies.

If you really still feel that there is substantial added value to the readr functionality, I won't stand in the way of you including it. It just seems from where I stand that you can get most of what you're looking for without readr and its dependencies (aside: it seems readr could easily remove tibble as a dependency itself).

Re: returning tibbles, I don't see that as a positive. If someone likes tibbles and has the package installed, it is trivial for them to turn data frames into tibbles. Likewise if someone prefers data.tables. I think it makes more sense for the package to be data-structure neutral unless the data structure is integral to the package itself. Again, not something that would prevent me from approving the package, but chaining the package to a specific non-base data structure when it isn't necessary seems sub-optimal in the general use case.

If you do decide to leave readr as a dependency, then one thing you can do is record whether the input was a tibble or not, and then if not, return a normal data frame to the user.
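For instance (the `tibble` field in the metadata is hypothetical, not the package's actual yml schema):

restore_class <- function(y, meta) {
  # return the same kind of object the user originally stored
  if (isTRUE(meta$tibble)) tibble::as_tibble(y) else as.data.frame(y)
}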

EDIT: Forgot to mention this, read.table has a colClasses parameter that allows you to specify the classes of the columns in the tsv.

noamross commented 5 years ago

New reviewer added: @jtr13, review due 2018-02-05

jtr13 commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

Functionality

Final approval (post-review)


Review Comments

Note

In the interest of not delaying the review, I was not able to fully test all functionality claims.

Documentation comments

Statement of need

The first two paragraphs of the vignette provide a much more detailed explanation of the problem that the software is designed to solve than the README. I know there's been a lot of discussion about what should go where and the importance of not having to maintain duplicate text. That said, I am of the opinion that important information like this should be in the README since my sense is that vignettes are often overlooked. (More so now since build_vignettes = TRUE doesn't work).

Examples

I see that examples were added in the develop branch. More comments would be helpful to guide the user through the examples.

Community Guidelines

There is no CONTRIBUTING file nor contributing guidelines in the README. URL, BugReports and Maintainer are provided in the DESCRIPTION file.

Functionality comments

I was not able to perform a thorough review to confirm functional and performance claims.

All code in the vignette worked as expected as did additional toy examples.

General comments

The package claims to address three needs:

1) version control of data

2) optimization of data storage

3) meta data storage to preserve data types, factor levels, etc.

Taking these in reverse order: storing meta data in a .yml file has clear and obvious benefits. Losing factor information in particular is a source of frustration for R users and the package provides functions that could be useful to even beginning R users.

The benefits of optimization are not as clearly apparent. It would be helpful to be provided with more context, that is, answers to questions such as: in typical use cases, how much smaller are the text files created by git2rdata functions? Is there a time tradeoff?

Finally, with regard to version control, the contribution of git2rdata--to the best of my understanding--is to sort rows and columns to minimize diffs. The work of actually maintaining version histories is performed by git, with git2r providing the interface to R.

It should be noted though that git2r is optional even for using the main git2rdata functions: write_vc and read_vc. Particularly for beginning users, using these functions with the RStudio git interface would be a simpler workflow.

Given this scenario, the package name, git2rdata is somewhat misleading and doesn't capture the breadth of contributions that the package makes. In addition, it limits flexibility in making these functions available to other version controls as suggested in other reviews. From the name it sounds as though the package would perform similar functions to git2r in a new domain, that is, add functionality for git commands that are not available in git2r. That, however, is not what it does. The vignette title "Storing Dataframes" is more on the mark; perhaps a name along the lines of storedf would be a better fit. (On a personal note, I've had a lot of trouble with git2r so I wonder if the link might scare off users who would otherwise benefit from this package.)

Some minor points: it's not clear what the benefit is of using rm_data() vs. other means to remove files or overwrite them.

If the sorting parameter isn't essential, it would be helpful to have a way to not use it (sorting = FALSE) or have some other default. For example, if all you care about is preserving factor information, it would be nice to have a simple workflow for doing so.

ThierryO commented 5 years ago

Thank you for your feedback, @jtr13.

Documentation comments

I'll update the README (see https://github.com/inbo/git2rdata/issues/12). A CONTRIBUTING.md file is present in the .github folder. A link will be added to the README.

I'll add some comments to the examples in the help files.

General comments

I'll probably separate (and elaborate) the current vignette into several vignettes.

Making the sorting optional is OK when not storing the files under version control. However, it is then up to the user to apply the sorting when writing to a file path and doing the version control outside git2rdata, which is something they will probably forget, ending up with potentially large diffs...

rm_data() and prune_meta() are useful in an automated workflow where changes in the raw data can lead to the removal of data sets. First apply rm_data() to remove all tsv files. Then write all the data using write_vc(). This (re)creates the tsv files which are still needed. Since all tsv files were removed at the start, the ones we no longer need are no longer present. The final step is to use prune_meta() to remove leftover yml files with no matching tsv file.

Import, suggest or remove git2r?

The functionality of git2rdata seems a bit overkill when not using version control. Storing dataframes as binary files using save() or saveRDS() is probably more efficient in such a case. Git is a very popular version control system, so it is most likely the one used in combination with git2rdata. Hence we think the package will be more easily found when the name refers to git.

The git2r dependency is not strictly necessary when not using git, or in an ad hoc workflow where the user manually writes the data, stages the required files and creates the commits.

Another use case is an automated one with recurrent data storage and subsequent analysis. The user writes a script that queries a raw data source and stores the resulting dataframes under version control. Under such a scenario, the user will most likely want to commit all dataframes written by the function. The user can use git2r::add(git2rdata::write_vc()), or the more convenient and readable git2rdata::write_vc(stage = TRUE). The user probably also wants to commit the changes at the end of the script. One option is git2r::commit(), potentially in combination with git2r::push(). Another option is to use git2rdata::auto_commit() or git2rdata::auto_commit(push = TRUE). The main benefit of auto_commit() is that it creates a standardised commit message listing the loaded R packages and their versions, thus documenting the code that was used to create the commit. Suppose that the script is stored in a custom package; then the message from auto_commit() is sufficient to document which code was used to create the commit. The code itself should document why and how the data is created.

One way to document the data used in a subsequent analysis is to use the commit hash of the data that is checked out. In our use case, we have several different dataframes stored under version control. The same statistical model is fit to each of them. Not every dataframe will be updated in the next commit of the data. Knowing that the data underlying a model didn't change eliminates the need to refit that specific model, thus saving time and resources. This is where recent_commit() is useful, as it lists the latest commit in which a file was changed.

To summarise:

noamross commented 5 years ago

Thanks @brodieG and @jtr13 for your reviews, and @ThierryO for your thoughtful comments. I think this discussion has helped clarify the workflows that the package does and does not try to address, and the extended documentation you discuss would go a long way in helping users understand this.

A couple of observations: dependency weight and the return of tibbles seems to be one of the most contentious issues among our author and reviewer communities. We mostly defer to authors on this. In general, we find dependency weight to be a bigger issue for developers than users. We also find it helpful to give users choices as well as sensible dependencies. A package-level option such as options(git2rdata.tibbles=TRUE) may be useful.

The tightness of the recommended workflows with version control, and utility of git2r in facilitating this, even if individual functions can be separated into their vcs and non-vcs components, seems to me to warrant having git2r Imported so as to make combined convenience functions available to the users. (I find automated versioning of a data set changing due to updates and corrections a compelling example case, and suggest an example incorporating continuous integration may be useful.)

Finally, while we won't require a package name change, I agree with both @ThierryO that having git in the name is useful for discovery, and @jtr13 that the current name with git2r (and the description above, "a companion package for git2r") does make it seem that it primarily extends the domain of the git2r package. Something that evokes the workflow may be more informative.

brodieG commented 5 years ago

@ThierryO I see that you have made several changes. Do you have a new version we should work off of now (if you do, a tag of that version would be great)? Do you have other changes in the works we should wait for you to complete before we move to the detailed review? @noamross what's the time frame for the detailed review? One thing I'm wondering is whether it makes sense to wait for the vignettes outlined in @ThierryO most recent response.

noamross commented 5 years ago

@brodieG I suggest waiting for @ThierryO to complete current changes including vignettes before giving the package another look.

ThierryO commented 5 years ago

I'm working on the package in my spare time. I'll notify you when a new version is available for review.

ThierryO commented 5 years ago

@noamross : what is the policy on suggested packages? I'm thinking about using ggplot2 for plots in the vignettes.

noamross commented 5 years ago

Sorry for the late reply on this one, @ThierryO. Suggested packages are fine, especially for the purpose of displaying outputs in vignettes.

ThierryO commented 5 years ago

A new version (tagged v0.0.3) is ready for another round of review.

noamross commented 5 years ago

Thanks @ThierryO! @brodieG and @jtr13, let us know if you are satisfied with these changes, including covering any areas you weren't able to in the last review.

@ThierryO, can you give a brief summary of changes? I notice there isn't a NEWS entry for v0.0.3, or is it effectively the same summary as 0.0.2?

ThierryO commented 5 years ago

I forgot to update the NEWS. The differences between 0.0.2.9000 and 0.0.3 are mainly cosmetic changes and fixed typos. I bumped the version number because I didn't want to change 0.0.2.9000 into 0.0.2.

brodieG commented 5 years ago

@noamross what is the new deadline for reviews?

noamross commented 5 years ago

We normally aim for 2 weeks for the second round, but as the first round was partial/specialized here, let's go for 3 weeks, or April 10.

brodieG commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Note: one small change was needed to get package to build locally (see patch where we change git2r::add(..., force=TRUE) because on my system I have a global gitignore for dot files.)

Functionality

A few "Package Guidelines" comments:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

Overview

This is a useful and well written package. Congratulations to Thierry for crafting it.

The code is clean and clear. For the most part, parameters are validated and corner cases are accounted for. Test coverage is complete (at least on a lines of code basis). Clearly Thierry is very conscientious and I am as confident as I can be from static "eyeball" analysis alone that there are probably not too many bugs lurking in the code outside of the potential ones I highlight in the issues section next. I also encountered no bugs in my manual testing. This is far from a guarantee of course.

I see no obvious performance issues.

The documentation is extensive, particularly the vignettes. I'm really impressed by how seriously our initial round of feedback on the matter was taken. The function documentation is complete, particularly with the examples, but it is a bit too terse (more details later).

Issues That Should Be Addressed

These are important and my approval of the package is contingent on them being addressed in some way:

Strongly recommended

Recommended:

Every other comment in the review is provided for the author to decide whether to implement or not.

File Storage

In an earlier iteration I had suggested that files be stored in a subfolder for convenience and future flexibility. In reviewing the code and thinking of the additional validation required in read_vc and prune_meta, I am becoming increasingly convinced that this achieves another important role: discouraging non-vc "use" of the data from interfering with the vc function. Two motivating examples:

You could have something like:

iris.grx
+- UUID (a UUID identifier you can use to check this is actually a vc dir)
+- meta (this corresponds to iris.yml)
+- data (this corresponds to iris.tsv)

grx is a made up scary sounding extension. Obviously everything is still stored in plain text so diffs, etc., continue to work fine.

Then export_vc('iris.grx') can pull out the tsv and yml.

This is a very cheap way to make your package more robust to file system interactions you do not control.

Typos

See attached patch: git2rdata_typo_patch.txt

Vignettes

Generally

Title case in headers, particularly the top level vignette title? Not something I particularly care about, but I hear some can be fastidious about it.

Getting Started

Another frequently used csv read/write set of functions is data.table::fread/fwrite. Suffers the same factor loss infirmity as the other functions reviewed.

In Re:

POSIXct is written as an integer to the data. The class and the origin are stored in the metadata. Timestamps are always stored and returned as UTC.

POSIXct is numeric, not integer, consider:

> options(digits=22)
> as.numeric(as.POSIXct('2019-01-01 10:01:03.13413'))
[1] 1546354863.134130001068

I'm not sure this has any practical significance, although you may need to check that sub-second precision is not lost, and if it is, at least document it and warn when it happens.
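A quick demonstration of the sub-second concern, using only base R:

x <- as.POSIXct("2019-01-01 10:01:03.13413", tz = "UTC")
as.integer(x)  # 1546336863: the fractional seconds are silently dropped
as.POSIXct(as.integer(x), origin = "1970-01-01", tz = "UTC")  # "2019-01-01 10:01:03 UTC"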

~~Additionally, how comfortable are you with time zones and POSIXct? I am not all, and know just enough about them to be a little scared. Have you confirmed that timestamps with different timezones are correctly converted to UTC? It seems not (see comments in meta section).~~

Suggested Workflow

Introduction

The database defines a variable number of dataframe (e.g. species can be added or removed). We have defined a standard analysis which should run for each group. We want to repeat the analyses with some predefined frequency (e.g. once every year). In order to make the analyses reproducible, we want to store the relevant data in a git repository.

This paragraph might be confusing to some. You should either generalize further, or lay out a clear schema of what the database contains. Right now you are at a middle level where you provide some detail, but not enough to be able to infer the exact schema. Does each table contain one species? What is a group in this context? A species, or a sub-group in each table?

Storing Dataframes

See comments about content list below.

Automated Workflow

I find this section more ambitious than it needs to be. It tries to introduce several concepts at once, when perhaps the single most important one is just the use of store_data with a single instance of the data. lapplying over several instances of the data adds a layer of complexity that will make it harder to understand the simpler and more likely use case where there is just one instance of data. After all, aren't the different instances of the data supposed to be tracked by the VCS, rather than by a nested list in the user's workspace?

Another drawback of the multi-instance example is that it interferes with the previous instances. The reader is left wondering what on earth this weird nested list (content) we're creating is since we don't even use it until much later.

I think you could introduce store_data ahead of the "Third Commit" and use that for the third commit. Text could be "After having done this two months in a row, you are probably wondering if we could automate some of these steps...". This also provides motivation for the "Automated workflow". This may require you to rework the Second commit to delete a data set so you have an example to motivate the functionality of store_data.

Nit: store_data should at a minimum pluralize the df argument to make it clear that it contains multiple data frames, or even better use df_list, or alternatively, use function(..., repo) {}.

Analysis Workflow

Great example! One nit: maybe add newlines or otherwise try to better format the meta data displayed ahead of the table? It's a bit overwhelming with the hashes and temp path names. If at least each element is on its own line it will be easier to see what's going on.

Other Thoughts

Example Data Structure

Not a big deal at all, but it is very odd to me to see a database with tables of identical structure that are only different in the dataset identifier. This would normally be stored as one table with a dataset column.

Associating Code With Data

Your footnote about "code changing" reminded me that it is useful to be able to associate the data with the version of the code that produced it. Perhaps write_vc should warn in stage=TRUE mode if there are other unstaged changes? This would improve the probability that analysis scripts residing in the same repository as the data are also committed. It would be a shame if an analysis became irreproducible because the code that generated it was not committed along with the data. This is not critical to address at all, just a thought.
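Something along these lines could work (a sketch only; `repo` is a git2r repository object and the status fields follow git2r::status()):

warn_if_unstaged <- function(repo) {
  st <- git2r::status(repo)
  if (length(st$unstaged) > 0) {
    warning(
      length(st$unstaged), " file(s) with unstaged changes; ",
      "consider committing the analysis code together with the data."
    )
  }
  invisible(st)
}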

Efficiency in Terms of Storage and Time

This is a great vignette. Very informative.

You should promote the data storage statistics to a more prominent part of the package, e.g. the README with something along the lines of:

Data stored with git2rdata will generate git repositories x% smaller than if stored as RDSes, up to y% smaller than if written as is, etc...

The only additional bit of information that would be worthwhile to add is some guidelines for how people should extrapolate file growth, e.g. made up:

With a data frame of the same composition as airbag and in-memory size X, you can expect to use X * (file change % per commit) * (typical compression factor) * number of commits. For airbag this translates to ~6MB over 100 commits. For reference, the repository for <name of popular package with many commits> is ZZ MBs in size.

Optimizing storage for ...

Good vignette. See typos in patch.

Generic Documentation Comments

Descriptions

In general these are too terse. Several do not describe what the function actually does, e.g.:

You are better off making the titles briefer and the description more descriptive, e.g. for meta title becomes "Optimize Data for Storage", and description goes into more detail of what happens to data frames and vectors, and whether there are actually any user-facing use cases, or if this is just exposed to the user for study or whatever.

Nit Picks

Title Case. While this is not strictly requested in Writing R Extensions, it is universally done in the base packages.

Almost all sections of the documentation should be written in full sentences. Make sure you have periods at the ends of your sentences, even for short descriptions, parameters, return values.

See base functions such as ?mean for stylistic cues.

Common Parameters

@param x: "the 'data.frame" - stray quote mark?

@param file: "the name of the file without file extension". This could use more explanation, e.g. "the base file name shared by the tsv and yml files that contain or will contain the version control ready data".

Other

Examples should cleanup on exit, as per CRAN policies. Currently some create temporary files (e.g. list_data) that are not fully cleaned up (e.g. repo, root dir).

Good use of "See Also" to link related functions.

For the re-exported functions, the display link would be best if it showed the package name, e.g. git2r::commit, to make it completely obvious what is going on. It is slightly disorienting to land on the "commit" help page, only to be told to go see commit. Additionally, you should add a period after the function name as "Description" is supposed to be at least a sentence.

Functions

Parameter Validation

It is great to see that the functions validate their parameters carefully.

One additional thing you could do, although this isn't common practice, is to also check that the dispatching parameter actually is what it is supposed to be. Just because a "character" method was called doesn't mean that the parameter is character since in R you can call methods directly.
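For example (the generic and method names below are illustrative, not git2rdata's):

meta2 <- function(x, ...) UseMethod("meta2")

meta2.character <- function(x, ...) {
  # dispatch chose the character method, but a direct call to
  # meta2.character() could pass anything, so check anyway
  stopifnot(is.character(x))
  enc2utf8(x)
}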

list_data

In:

gsub(paste0("^", root, "/"), "", data_files)

There is the risk that root could contain regex characters with special meaning. You can use something like:

has.root <- substr(data_files, 1, nchar(root) + 1) == paste0(root, "/")
data_files[has.root] <-
  substr(data_files[has.root], nchar(root) + 2, nchar(data_files[has.root]))

I have not tested this, but you get the idea. Watch out for corner cases. I don't think gsub(..., fixed=TRUE) can work because you no longer can use the start of string anchor.

meta

Docs

The description should give some indication of what the function does, not just that write_vc applies it. Additionally, more detail on the data.frame method is warranted (i.e. it applies meta on the columns).

Character Method

Good job of handling corner cases (e.g. strings containing quotes).

Factor Method

  if (isTRUE(optimize)) {

This can cause problems for users who might pass a "truthy" value expecting it to work, as is so often the case in R. You should change the function to either validate the parameters itself or have more sophisticated handling of this value. I guess more generally the method should validate its input since it can be called directly.
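For example, isTRUE() silently rejects values that an ordinary if () would happily accept:

isTRUE(TRUE)            # TRUE
isTRUE(1)               # FALSE, even though if (1) takes the TRUE branch
isTRUE("TRUE")          # FALSE
isTRUE(c(TRUE, TRUE))   # FALSE: not a length-one logical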

Logical Method

See Factor above.

POSIXct Method

UPDATE: I think this is actually okay, although it will change the display representation for users. Maybe record time zone and try to restore?

I am not an expert on this, but assuming that the timezone is UTC causes problems, e.g. on my system (OSX):

> x <- data.frame(as.POSIXct('1970-01-01'))
> write_vc(x, 'hello', strict=FALSE)
19c2fa6018df9fd9f6e6eb33ddd5f02d33383335 
                             "hello.tsv" 
9aca22d70c2aab43e6cece767909924d5c1d5a50 
                             "hello.yml" 
Warning message:
In meta.data.frame(x, optimize = optimize, na = na, sorting = sorting,  :
  no sorting applied
> read_vc('hello')
  as.POSIXct..1970.01.01..
1      1970-01-01 05:00:00
> as.POSIXct('1970-01-01')
[1] "1970-01-01 EST"
> Sys.timezone()
[1] "America/New_York"
> 

This risks being a headache to resolve.

Date Method

I don't think tz="UTC" does anything here?

    z <- format(x, format = "%Y-%m-%d", tz = "UTC")

data.frame Method

This function should check its parameters, sorting in particular.

That ..generic is a reserved column name should be documented in write_vc and also probably in meta.

In:

    x <- x[do.call(order, x[sorting]), , drop = FALSE] # nolint

You need unname(x[sorting]) to avoid potential collisions between order parameters and column names.
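A quick illustration of the collision:

df <- data.frame(decreasing = c(2, 1, 3))
try(do.call(order, df))     # the column is swallowed by order()'s `decreasing` argument
do.call(order, unname(df))  # 2 1 3: the column is used as a sort key, as intended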

You need to handle the case where sorting is zero length (included in patches).

> meta(iris, sorting=character())
[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
<0 rows> (or 0-length row.names)

Further down, in:

  if (has_name(dots, "old")) {

Probably not of great consequence, but it feels "wrong" to have an undocumented internal optional parameter. I imagine there aren't many if any situations that will lead to a collision, but it would feel better to me to make it a formal argument and document it, even if as .old and marked as "for internal use".

Finally, probably okay in this context because you overwrite z:

  z <- lapply(z, `attr<-`, "meta", NULL)

But you should be aware, if you were not already, that calling setter functions indirectly is not recommended, as it is possible that under some circumstances they will modify the element by reference.

> x <- c(a=1)
> x
a 
1 
> y <- `names<-`(x, 'b')
> y
b 
1 
> x  # `x` changed by reference!
b 
1 

read_vc

Docs

The description should describe what the function does, e.g. "Retrieves a data.frame stored with write_vc from file or git repository storage."

It is not clear from the documentation that this is an S3 generic with methods. Adding @rdname read_vc to the methods should mostly address this. You may also want to make it clear that the generic dispatches on the second argument, which is quite uncommon (in fact I discovered it was possible when reviewing this package). You should also consider adding some commentary about how it might be relatively straightforward to extend git2rdata to work with other VCS systems by adding S3 methods for this generic, modeled on the git_repository method.
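As a purely hypothetical illustration of such an extension (the `svn_working_copy` class and its `$path` field do not exist anywhere; they stand in for whatever an svn backend would provide):

read_vc.svn_working_copy <- function(file, root) {
  # delegate to the plain-path method after resolving the working copy root
  read_vc(file = file, root = root$path)
}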

File integrity

read_vc should take some measures to ensure that neither the data nor the metadata is corrupted. It seems reasonably likely that someone will modify one of the files, either intentionally by hand or unintentionally, e.g. via git checkout.

Some possible checks:

write_vc

Documentation

That ..generic is a reserved column name should be documented in write_vc and also probably in meta.

write_vc(..., sorting=TRUE) # sort all columns?

Performance

anyDuplicated on a data frame is costly:

> library(ggplot2)
> treeprof::treeprof(fn <- write_vc(diamonds, 'diamonds', sorting=names(diamonds)))
Profiling
auto gc: running with gc() first
First run in 2.08 seconds
Looping to 5 seconds
Parsing Rprof
Done
Ticks: 1676; Iterations: 3; Time Per: 2.009 seconds; Time Total: 6.026 seconds; Time Ticks: 1.676

                                                              milliseconds
write_vc ---------------------------------------------------- : 2009 -   0
    write_vc.character -------------------------------------- : 2009 -   0
        meta ------------------------------------------------ : 1435 -   0
        |   meta.data.frame --------------------------------- : 1435 -   0
        |       anyDuplicated ------------------------------- : 1332 -   0
        |       |   anyDuplicated.data.frame ---------------- : 1332 -   0
        |       |       anyDuplicated ----------------------- : 1332 -   0
        |       |           do.call ------------------------- : 1332 -   0
        |       |               <Anonymous> ----------------- : 1332 -   0
        |       |                   mapply ------------------ : 1332 - 241
        |       as.data.frame ------------------------------- :   41 -   0
        |       |   as.data.frame.list ---------------------- :   41 -   0
        |       |       do.call ----------------------------- :   41 -   0
        |       |           <Anonymous> --------------------- :   41 -   0
        |       |               row.names<- ----------------- :   16 -   0
        |       |               |   row.names<-.data.frame -- :   16 -   0
        |       |               anyDuplicated --------------- :   14 -   0
        |       |               |   anyDuplicated.default --- :   14 -  14
        |       |               as.data.frame --------------- :   11 -   0
        |       |                   as.data.frame.integer --- :   11 -   2
        |       lapply -------------------------------------- :   31 -   0
        |       |   FUN ------------------------------------- :   31 -  11
        |       |       meta -------------------------------- :   14 -   0
        |       [ ------------------------------------------- :   31 -   0
        |           [.tbl_df -------------------------------- :   31 -   0
        |               do.call ----------------------------- :   26 -   0
        |                   <Anonymous> --------------------- :   26 -   0
        |                       do.call --------------------- :   25 -   0
        |                           order ------------------- :   25 -  25
        write.table ----------------------------------------- :  531 -   0
        |   .External2 -------------------------------------- :  526 - 526
        hashfile -------------------------------------------- :   41 -  41

If you'd like, you can take advantage of the radix sort that the data.table authors contributed to R, especially since you are already ordering the data:

# license: GPL >= 2
duplicated2 <- function(dat) {  # `dat` must be ordered!!!
  if (nrow(dat) > 1) {
    dupes <- rowSums(
      vapply(dat, function(x) x[-1] == x[-length(x)], logical(nrow(dat) - 1))
    ) == ncol(dat)
    c(FALSE, dupes)
  } else if (nrow(dat) == 1) FALSE else logical()
}
dat <- ggplot2::diamonds
sorting <- names(dat)
dat.o <- dat[do.call(order, unname(dat)),]  # you do this anyway
system.time(anyDuplicated(dat.o))
   user  system elapsed 
  1.466   0.019   1.496 
system.time(dupes <- duplicated(dat.o))
   user  system elapsed 
  1.319   0.017   1.344 
system.time(dupes2 <- duplicated2(dat.o))
   user  system elapsed 
  0.024   0.011   0.036 
all.equal(dupes, dupes2)
[1] TRUE

I have not thoroughly tested this or thought through corner cases.

I happen to be writing a blog post about the newfound powers of order, which is why I am aware of this. Prior to the data.table contribution this type of improvement would not have been possible. Additionally, it turns out that you need to order your data anyway, so this is a win-win.

Interface

write_vc requires repetitive inputs. I mentioned this previously: one could use substitute() to capture the variable name and avoid things such as:

write_vc(iris, 'iris')

The author brings up the magrittr usage, but it's easy enough to fall back to requiring input when the captured name is not syntactic. This is an example of the type of complexity I mentioned in my original review comment. Note that the function below would need more validation than is shown here.

write_vc2 <- function(x, file = substitute(x)) {
  # capture the unquoted argument so `write_vc2(iris)` can infer "iris"
  if (is.name(file)) {
    file <- as.character(file)
    if (!grepl("[a-zA-Z][a-zA-Z0-9]*", file)) file <- NA_character_
  } else if (!is.character(file)) file <- NA_character_
  if (is.na(file)) stop("`file` must be a syntactically valid file name")
  file
}
write_vc2(iris)
[1] "iris"
iris %>% write_vc2
Error in write_vc2(.) : `file` must be a syntactically valid file name
iris %>% write_vc2('iris')
[1] "iris"

You'll need to think about the regex.
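
For what it's worth, a stricter check might look something like this (illustrative only; you may want to allow more characters):

# anchor the pattern and restrict to characters that are safe in file names
valid_name <- function(file) {
  grepl("^[a-zA-Z][a-zA-Z0-9._-]*$", file)
}
valid_name("iris")      # TRUE
valid_name("my data!")  # FALSE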

I may be too particular about this type of thing, but as a user I really dislike unnecessary work, and I can see myself getting very annoyed at having to type things like write_vc(iris, 'iris') when all the information the function needs to deduce my intent is already there.

Other

For this type of error, which can be resolved with strict=FALSE, it might be helpful to include the resolution pathway in the message itself, along with a warning to make sure that's what you actually intend to do:

Error: new data uses more variables for sorting

e.g. as below or a variation to below:

Error: new data uses more variables for sorting; if you really intend to sort by
`c('Sepal.Length', 'Sepal.Width')` instead of `c('Sepal.Length')` use
`strict=FALSE`.

Irreproducible Bug

Somehow at some point the yml file got corrupted. I have no idea what I was doing or how I got to this. I can't reproduce it. I'm reporting just in case you run into it too so that at least you know it's happened to someone else:

> write_vc(iris, 'iris')
Error in yaml.load(string, error.label = error.label, ...) : 
  (<redacted>/iris.yml) Scanner error: while scanning a simple key at line 15, column 5 could not find expected ':' at line 16, column 1

rm_data / prune_data

Unintended Deletions (PROBLEM)

> list.files(pattern='do-not-remove-me')
character(0)
> writeLines('I am extremely important!', 'do-not-remove-me.yml')
> list.files(pattern='do-not-remove-me')
[1] "do-not-remove-me.yml"
> prune_meta('.', '.')
> list.files(pattern='do-not-remove-me')
character(0)

We just nuked our critical yml file. While a very careful user would realize this, many would not. To highlight that this is a real-life issue, I just noticed this:

Unstaged changes:
    Deleted:    _pkgdown.yml
    Deleted:    appveyor.yml

Please see the top level File Storage comments.

Documentation

The caveats noted in the workflow vignette need to appear in the function documentation as well. I'm thinking in particular about:

Caveat: when applied on a path, it will remove all data files, without warning. Even when the path points to a git repository. So use rm_data() and prune_meta() with care.

Other

More generally, what is the purpose of being able to rm_data the data separately from the meta data, either in the current guise or in the pre-prune_meta guise? It seems in the examples we either are over-writing data with new data, or deleting both the data and meta data.

You could simplify the user-facing interface by replacing both functions with rm_vc, which would remove the yml and tsv files at the same time. If the concern is that you are trying to manage a mix of yml and tsv files floating around that may not be of the vc variety alongside ones that are, then I think there are bigger issues, as noted above.

Adding a pattern parameter that matches that of list.files could be useful.
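
To illustrate, a rough sketch of what I have in mind (rm_vc and its arguments are hypothetical, and a real implementation would also need to verify that the files were actually created by git2rdata):

# remove matching tsv/yml pairs in one call, filtered by a list.files-style pattern
rm_vc <- function(root = ".", path = ".", pattern = NULL, recursive = TRUE) {
  files <- list.files(
    file.path(root, path), pattern = pattern,
    recursive = recursive, full.names = TRUE
  )
  targets <- files[grepl("\\.(tsv|yml)$", files)]
  file.remove(targets)
  invisible(targets)
}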

relabel

Docs

The title should make it clearer that it is the factor levels in the git2rdata metadata that are relabeled, not the factor itself.

The description stops short of actually describing what the function does.

Add a reference to the "Relabelling a Factor" section of the "Optimizing storage for ..." vignette, as you do in rm_data for the workflow vignette.

jtr13 commented 5 years ago

(I have additional suggestions listed below, but my approval is not contingent on their completion.)

Note

A reminder that I was not able to fully test all functionality claims.

Followup from previous comments

README

I recommended updating the README to reflect the explanation, provided in the previous version of the vignette, of the problem the software is designed to solve. The README has been updated. I would suggest further editing the Rationale section so that it is framed in terms of problems and solutions.

Again, I see three main rationales:

1) version control of data

2) optimization of data storage

3) meta data storage to preserve data types, factor levels, etc.

Currently the README pays more attention to meta data (3) than to version control (1) and optimization (2).

Finally, I recommend adding another section, perhaps called "Getting Started," to the README that provides a short summary of the contents of each vignette. That way, the theory/rationale for the package would be separate from the how-to.

Building vignettes

I mentioned that build_vignettes = TRUE wasn't working. The installation instructions have been updated and now include a workaround for installing the vignettes. (Changes are happening in the remotes package, so I wouldn't be surprised if the method for installing vignettes changes again soon. Mentioning as a heads-up to keep an eye on this, especially given the high quality of the vignettes -- it's important to make sure they're seen!)
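
For reference, one possible workaround at the time of writing (the exact remotes arguments may differ from what the README recommends):

# drop --no-build-vignettes from the default build options so the vignettes
# are built on installation; Suggests are needed to knit them
remotes::install_github(
  "inbo/git2rdata",
  dependencies = TRUE,
  build_opts = c("--no-resave-data", "--no-manual")
)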

Examples

I noted that more comments would be helpful to guide the user through the examples. Clear examples with comments are now provided.

CONTRIBUTING file

I noted that there was no CONTRIBUTING file nor contributing guidelines in the README. A CONTRIBUTING file was added to the .github folder with a link in the README. (I think the root directory is a better place for CONTRIBUTING since it's more likely to be seen earlier; it won't pop up from .github until the contributor submits a pull request as far as I can tell.)

Rationale

I asked about the benefits of the optimization that the package provides. An excellent vignette, Efficiency in terms of storage and time, was added that addresses this issue in depth.

Package name

I suggested changing the package name. The author considered the issue and holds that the name is appropriate.

rm_data() and prune_meta()

I asked about the benefit of using rm_data(). The author responded: "The rm_data() and prune_meta() are useful in an automated workflow where changes in the raw data can lead to the removal of data sets. First apply rm_data() to remove all tsv files. Then write all the data using write_vc(). This (re)creates the tsv files which are still needed. Since all tsv were removed at the start, the ones we don't need any more are no longer present. The final step is to use prune_meta() to remove leftover yml files with no matching tsv file...."

Clearly prune_meta() serves a purpose in looking for matches. I am also convinced that rm_data() is useful; it's harder than I remembered to remove all files of a certain type from R. (Can't do file.remove("*.tsv") for example.) These functions have additional benefits such as defaulting to recursive = TRUE.
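
To make the described workflow concrete, a sketch along these lines (the repository path, data sets, and sorting column are all hypothetical; the argument order follows the prune_meta('.', '.') usage shown earlier):

library(git2rdata)

repo <- "path/to/analysis-repo"                  # hypothetical repository path
current_datasets <- list(                        # hypothetical data sets
  observations = data.frame(id = 1:3, count = c(4L, 2L, 7L)),
  sites = data.frame(id = 1:2, name = c("A", "B"))
)

# 1. remove all existing tsv files so stale data sets disappear
rm_data(repo, "data")

# 2. rewrite every data set that is still needed
for (name in names(current_datasets)) {
  write_vc(
    current_datasets[[name]], file = file.path("data", name),
    root = repo, sorting = "id"
  )
}

# 3. drop leftover yml files whose tsv counterpart was not recreated
prune_meta(repo, "data")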

Sorting

I asked about an option to turn off sorting. The author responded: "Making the sorting optional is OK when not storing the files under version control. However, then it is up to the user to use the sorting when writing to a file path and doing the version control outside git2rdata diffs..." Currently a warning is given if no sorting parameter is supplied. I believe that previously an error was thrown; I think the warning is a better option here.

New comments

pull() conflict

The pull() function reexported from git2r conflicts with dplyr::pull(). Not much you can do here, just flagging it.
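
One way to sidestep the masking is to qualify the calls, e.g.:

library(dplyr)
library(git2rdata)          # its pull() now masks dplyr::pull()
dplyr::pull(mtcars, mpg)    # column extraction
# git2rdata::pull(repo)     # fetch and merge on a git2r repository object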

Warnings

Should some of the warnings perhaps be messages instead? "Warning" seems too strong, for example, in this case:

library(git2rdata)
    write_vc(iris[1:6, ], file = "iris", sorting = "Sepal.Length")
#> 11d423327dc1f67e16183dbd9871ad931d3b9689 
#>                               "iris.tsv" 
#> fb962b490dbe4038512e52b72b8182e11b49a60d 
#>                               "iris.yml"
    write_vc(
        iris[1:6, -2], "iris", sorting = "Sepal.Length", strict = FALSE
    )
#> Warning in write_vc.character(iris[1:6, -2], "iris", sorting = "Sepal.Length", : new data has a different number of variables
#> deleted variables: Sepal.Width
#> 1a649561f36566acf21c2f167e69abc54907badb 
#>                               "iris.tsv" 
#> af0219b9d8ebdea4d30c833b14d3fadc22a2f750 
#>                               "iris.yml"

Created on 2019-04-17 by the reprex package (v0.2.1.9000)

Error messages

It appears to be common practice to pair warnings with errors, though it seems to me it would be better not to emit the warnings in that case, so that the error isn't lost in the shuffle. For example:

> write_vc(iris[1,], "iris")
1d7d77ee4b27932996ebcfcba2c25ec9c5aa7630 
                              "iris.tsv" 
38fdea6690db69a1da66fa5f06f47b37cf3cfccc 
                              "iris.yml" 
Warning message:
In meta.data.frame(x, optimize = optimize, na = na, sorting = sorting) :
  no sorting applied

> write_vc(iris[,2:4], "iris" )
Error: new data has a different number of variables
deleted variables: Sepal.Length, Species
In addition: Warning message:
In meta.data.frame(x, optimize = optimize, na = na, sorting = sorting,  :
  no sorting applied

Also, it would be helpful to get a clear message that on error the data file was not changed.

Documentation

read_vc() has the same examples as write_vc(), which makes sense since they work together, though you could cut the write_vc() examples that aren't followed by read_vc().

Vignettes

"Getting started"

Small comment:

library(git2rdata)
write_vc(x = x, file = "first_test", root = path, strict = FALSE)

produces the "no sorting applied" warning described above. It would be helpful if a link were provided here to the optimization vignette where this is clearly explained. (I tend to think something went wrong if I get a warning running vignette code without a heads-up.)

Overall, the vignettes are incredibly informative and add great value to the package. As noted above, it would be helpful to provide short summaries of the vignettes in the README, with a suggested order, such as:

  1. "Getting Started": ...

  2. "Optimizing storage for version control" ...

  3. "Suggested workflow for storing a variable set of dataframes under version control" ...

  4. "Efficiency in terms of storage and time" ... performance benchmarks...

ThierryO commented 5 years ago

Thanks to both @brodieG and @jtr13 for your detailed reviews. Your comments will allow me to improve the quality of the package. I'll look into them and notify you when the next version is available. I'll be giving an oral presentation on git2rdata at useR!2019. It would be nice to have the package on rOpenSci by then.

Below is a response to a few of the issues raised by @brodieG. Can I get some feedback on these, as some of them potentially require a major overhaul of the package?

General issues

File format

Using a custom file format does have the benefit that the user is less likely to tamper with the (meta)data. However, a comment on an earlier version explicitly asked that the data be stored in an open, standard format.

IMHO, the benefits of using a clear open standard outweigh making the format tamper-proof. Imagine the case in which the data is sent to someone with no knowledge of git2rdata, or to somebody who wants to use the data with other software. That is doable with the current format. Using unknown, 'scary sounding' extensions would make it much harder, even when the data is stored as plain text.

git2rdata does not aim to prevent any tampering with the data. Rather, it makes it easier to document the data's history using a VCS.

Currently I haven't tested how read_vc() reacts when the data and metadata no longer match. Some checks are easy to implement in read_vc(), e.g. changes in the metadata, new variables added to the data, missing variables in the data, a new order of the variables in the data, the order of the observations, ...
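
As an illustration, a check on the variable names alone might look something like this (the helper and its arguments are hypothetical, not git2rdata internals):

# compare the column names found in the tsv with those recorded in the yml
check_variables <- function(data_cols, meta_cols) {
  missing <- setdiff(meta_cols, data_cols)
  extra <- setdiff(data_cols, meta_cols)
  if (length(missing) || length(extra)) {
    stop(
      "data and metadata no longer match\n",
      if (length(missing)) {
        paste0("missing variables: ", paste(missing, collapse = ", "), "\n")
      },
      if (length(extra)) {
        paste0("unexpected variables: ", paste(extra, collapse = ", "))
      },
      call. = FALSE
    )
  }
  invisible(TRUE)
}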

I'm not sure we should prevent users from making changes to the (information content of the) data itself using other software. Think of using the GitHub website to correct typos in the data, to remove observations, or even to add a few observations. If these changes corrupt the data, the user can revert them using the VCS or a backup.

metadata pruning

The rm_data() and prune_meta() combo is only relevant in a workflow using a VCS in which a variable number of objects is written. Using these functions in combination with a VCS is semi-safe, as the user can revert the changes.

I could make these functions safer by adding something like a git2rdata.index file. Such a file would record all files written by git2rdata. rm_data() and prune_meta() could then use this information to limit the removal of files to those created by git2rdata.
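
A rough illustration of that idea (the index file name and helper are hypothetical, not part of git2rdata):

# every write_vc() call would append its files to the index;
# rm_data() and prune_meta() would only touch files listed there
record_in_index <- function(root, files) {
  index <- file.path(root, ".git2rdata.index")
  known <- if (file.exists(index)) readLines(index) else character(0)
  writeLines(sort(unique(c(known, files))), index)
  invisible(index)
}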

Time zone issue

IMHO, the issue is rather that print.data.frame doesn't show the time zone.

library(git2rdata)
x <- data.frame(TS = as.POSIXct('1970-01-01'))
fn <- write_vc(x, 'hello')

## Warning in meta.data.frame(x, optimize = optimize, na = na, sorting =
## sorting, : no sorting applied

z <- read_vc('hello')
z

##                    TS
## 1 1969-12-31 23:00:00

z$TS

## [1] "1969-12-31 23:00:00 UTC"

Sys.timezone()

## [1] "Europe/Brussels"
brodieG commented 5 years ago

Badge

For the badge I'm going off of the guide:

Badges for continuous integration and test coverage, the badge for rOpenSci peer-review once it has started (see below)

File Format

It's not critical that you have a custom "format" for git2rdata. I will note, though, that the "format" I propose still stores the data in an open format in a completely accessible way: diffs will show it, CSV readers will be able to read it, etc. I'm pretty sure the concern would only really exist if the above weren't true, i.e. if you needed specialized tools to access the data.

Fundamentally, the git2rdata object is a new, non-standard format, as it is the combination of the data and the meta data. By putting the files together in a folder you make that format explicit and signal to users that if they mess with the data, they may well break the git2rdata object. You can even have a README.txt in the folder explaining this instead of using a scary extension. This is probably better anyway, and it will give recipients access to the data with the appropriate warning.

Finally, if you still prefer to have the yml and tsv files floating around, I won't stop you, so long as you take measures to ensure that the file pruning and deletion only affect files that were created by git2rdata. Do note that there is an inherent fragility in having an additional meta data file floating around. What happens if someone sends you a new data/meta-data set? If you have a folder, you are all set. An alternative would be to add a UUID to the yml so that you can be sure (or at least fairly certain) that the yml actually belongs to git2rdata. Additionally, it will be particularly important that you check for data/meta-data corruption, although I would personally put those checks in under any implementation.

To recap, based on your feedback and further thought, this is what I would have the folder look like for e.g. iris:

iris (or maybe iris.git2rdata)
+ README.txt
+ git2rdata.version    # text file with git2rdata version, see below
+ iris.tsv
+ iris.yml                     # contains hash of data file

The git2rdata.version file will allow you to make changes to the directory format in the future, as you can check it on read and, if necessary, convert to the new format.
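
A sketch of the version check this layout would enable (the helper is hypothetical):

# refuse to read anything that does not carry a version stamp, and return the
# stamp so read_vc() could convert older layouts if needed
read_git2rdata_version <- function(dir) {
  version_file <- file.path(dir, "git2rdata.version")
  if (!file.exists(version_file)) {
    stop("'", dir, "' does not look like a git2rdata folder", call. = FALSE)
  }
  package_version(readLines(version_file, n = 1))
}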

ThierryO commented 5 years ago

Version 0.0.4 is ready for review.

Changes as listed in NEWS.md

BREAKING FEATURES

NEW FEATURES

Bugfixes

Other changes

Changes not explicitly mentioned in NEWS.md

Suggestions not implemented

File structure

Warning for unstaged data

Checking that the dispatching parameter actually is what it is supposed to be.

Storing timezone

Pattern in rm_data() and prune_meta()

Automatic naming of file

CONTRIBUTING file in the root

message() instead of warning() when strict = FALSE

Different examples for read_vc() and write_vc()

brodieG commented 5 years ago

From a quick skim this looks good. I'll review more thoroughly sometime next week and report back.

brodieG commented 5 years ago

Final approval (post-review)

Looks good to me. I did not test all the new functionality, but I did confirm that the accidental yml removal, which was the blocking issue, is fixed. It also looks like all the author-accepted suggestions are indeed in the new version, and I have no concerns with the author-rejected suggestions. This was all tested against the v0.0.4 tag.

Nice job finding corner cases in the new duplicate function.

Some minor comments, none of which gets in the way of approval:

@noamross next steps?

noamross commented 5 years ago

Approved! Thanks @ThierryO for submitting and @brodieG and @jtr13 for your reviews! I know this has been a long road, but it's been very thorough and we're glad to have this useful and well-vetted package in our collection. @ThierryO, should you choose to address the remaining small comments, you may, but we are good to go.

To-dos:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

In our developer's guide, this chapter starts the 3rd section that's about guidance for maintenance after acceptance.

Welcome aboard! We'd love to host a blog post about your package - either a short introduction to it with one example or a longer post with some narrative about its development or something you learned, and an example of its use. If you are interested, review the instructions, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

ThierryO commented 5 years ago

Thanks for accepting the package.

And thanks to @brodieG and @jtr13 for their reviews, which improved the package a lot. I'm happy to acknowledge this by adding you as reviewers in the DESCRIPTION.

The README on GitHub starts by referring to the package's pkgdown website. There, the vignette() code is automatically linked to the HTML vignettes. Such a link is IMHO more stable than a hard-coded link. Is this sufficient, or do you prefer a hard-coded link? Maybe the reference to the pkgdown website should be more prominent.

I'll give a 15-minute talk on git2rdata at the upcoming useR! 2019 conference. I could rework that presentation into a blog post. Would that be a good idea @stefaniebutland?

brodieG commented 5 years ago

I did miss the message about the pkgdown site, or didn't make the connection in the next section. It's probably fine as it is. FWIW that link is broken right now.

stefaniebutland commented 5 years ago

I'll give a 15-minute talk on git2rdata at the upcoming useR! 2019 conference. I could rework that presentation into a blog post. Would that be a good idea @stefaniebutland?

That sounds good @ThierryO. Since useR! is July 9-12, we could publish prior to your talk so that people can refer to it during the conference. In the talk you could invite people to submit their use cases to our discussion forum. That might get your work and the package some extra attention. After your talk, you could add any new insights as a comment on the post.

To publish before useR!, let me know and submit a draft for review by Tuesday, July 2. To publish after, submit a draft when you are ready.