thanks for your submission @ThierryO - editors are discussing now
Thanks for your submission @ThierryO! This is an interesting package and it passes checks with nothing but this from `goodpractice::gp()`:
✖ omit "Date" in DESCRIPTION. It is not required and it
gets invalid quite often. A build date will be added to the package
when you perform `R CMD build` on it.
I'm seeking reviewers for it now. FYI, because of the thorny nature of the problem you're tackling - data versioning workflows - I anticipate a lot of discussion around high-level concepts and architecture. I'm going to look for reviewers with a lot of experience on these workflow issues, and encourage them to start that discussion before checking off all the low-level boxes, so as to have an efficient and productive review.
Reviewers: @jennybc @brodieG Due date: 04-01-2019
Reviewers assigned: @jennybc and @brodieG. At reviewers' request, the final review date is extended to deal with travel and holidays. However, I request that reviewers provide some comment on the high-level workflow/architecture issues that I think are likely to be the main points of discussion sooner, before a more detailed look at the code. Roughly:
Thanks for assigning the reviewers. I'm looking forward to their comments. There were some minor changes to the package. The latest version is available at inbo/git2rdata#5
Hi Thierry. This looks like an interesting package to review; looking forward to it.
One small request: do you mind creating tags/releases that we can work off of so there is no issue of the code changing under our feet as we are reviewing it? Ideally this would also be accompanied by a unique package version in DESCRIPTION. This will ensure we're all exactly on the same page about what version of the code is being reviewed. Thank you.
Good point. There is currently a single release tag v0.0.1.9000 which is the current HEAD of the master branch. I suggest that you use this for the review. I won't push to the master branch during the review process.
The current changes in the develop branch are mainly cosmetic, except one hotfix.
Sounds good, I will do all of my work off of v0.0.1.9000.
I haven't reviewed the internals of the package very closely yet, but overall I like the concept and the initial cut at the implementation. There is good attention to detail (e.g. helpful error messages), parameter checking, generally good documentation, and good test coverage of the key functions. These all demonstrate a commitment to robust programming practices.
At the request of the editor I'm going to focus on the top level questions first rather than go through the step by step review form. I did take the liberty to add some comments that fall outside of the scope of the questions asked, and would appreciate some feedback from @noamross on which points should fall under the scope of this review vs not. I'm finding this topic pretty interesting, but I don't want to over-burden the author by extending the review into areas that while interesting and related do not need to be addressed for the purposes of the review.
Is the expected workflow well communicated and implemented in the documentation and vignette?
The vignette provides end to end "toy" examples for both the versioned and unversioned workflows. These examples give a good sense of how the functions operate and how the user is expected to interact with them.
There are a few areas of a versioned workflow that would benefit from additional exploration:
The man documentation for most of the functions I looked at does not have usage examples. It should, to satisfy rOpenSci's "multiple points of entry principle". Also, while I don't think this is a strict CRAN requirement, I have gotten feedback from CRAN maintainers that examples should be included.
How does the workflow and model of the package work with other frameworks for versioning and reproducibility? Are concepts compatible and separable?
I do not have extensive experience with "reproducible workflows" beyond package testing, so what follows is based on first principles rather than first hand knowledge.
It strikes me that the package is more intertwined with a git workflow than it needs to be. While `write_vc` and `read_vc` produce outputs that work well with version control, there is no real need for an explicit link. For example, if `write_vc` returned file paths instead of file hashes, we could just do:

```r
git2r::add(path = write_vc(df, 'df'))
```
instead of:
```r
write_vc(df, 'df', repo, stage = TRUE)
```
To me the semantics are much clearer in the first example, or at worst, just as good as the second. The big difference is that you can strip away a large amount of the complexity in your code. You could potentially even ditch S4 in favor of S3 as the version control would be explicitly selected by the user, e.g. in the case of an imaginary svnr package:
```r
svnr::commit(write_vc(df, 'df'))
```
It might even be reasonable to completely remove the `git2r` dependency, with no real loss in functionality. I'll freely admit it is entirely possible I'm missing use cases where the tight integration with `git2r` is beneficial, but I don't think I've seen clear examples of that. Please feel free to point them out to me if they already exist and I missed them, or make some up for review.
One possible example is that we have to resolve the question of how to link the data to the code, as I suggested in Q1, which would be a little easier to do with the repository object directly available, but I'm not sure that tilts the balance back to the more complex implementation. Also, detection of unversioned files would be easier with the repository object. That said, neither piece of functionality currently exists, and the complexity to support the explicit git integration does.
In terms of integrating with other vcs, the expectation seems to be implicitly that someone will write an S4 method that supports it. That's a reasonably tall ask when the alternative above is possible.
Aside: is there a particular value of the file hashes that I'm missing? I would imagine in most cases people won't care what they are, and when they do, they can just call `git2r::hashfile` on the names anyway.
Is the package API (relatively) intuitive, and does it work well with the workflow?
Yes, although there are some exceptions and possible improvements:
The semantics of `rm_data(..., type = 'tsv')` and `rm_data(..., type = 'yml')` are different. For the former the `tsv` data files are removed, whereas for the latter the `yml` files are removed, but only if there is no corresponding `tsv` file. This is only indirectly documented via the vignette examples.
We would recommend that you split the behavior across functions, possibly called:

- `rm_data` or `rm_tsv` to remove the data.
- `prune_meta` or `prune_yml` to remove the metadata that has no matching data file (I gather this is what `rm_data(..., type = 'yml')` does).

You could document the difference in behavior in the function itself as a partial remedy, but the behavior is different enough that it feels as though these should be separate functions so the semantics are clear.
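To make the proposed split concrete, here is a minimal base-R sketch of the pruning behaviour, using the hypothetical name `prune_yml` from the list above (this is not an existing `git2rdata` function):

```r
# Hypothetical sketch of prune_yml: delete .yml metadata files that have
# no matching .tsv data file in the same directory.
prune_yml <- function(path) {
  yml <- list.files(path, pattern = "\\.yml$", full.names = TRUE)
  tsv <- list.files(path, pattern = "\\.tsv$", full.names = TRUE)
  orphaned <- yml[!(sub("\\.yml$", ".tsv", yml) %in% tsv)]
  file.remove(orphaned)
  orphaned
}

# Demo in a throwaway directory: a.yml has a matching a.tsv, b.yml does not.
d <- tempfile("prune_demo")
dir.create(d)
file.create(file.path(d, c("a.tsv", "a.yml", "b.yml")))
prune_yml(d)   # removes only b.yml
list.files(d)  # "a.tsv" "a.yml"
```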
Another issue to consider is whether you should allow the removal of unversioned files without some kind of warning. I realize that if you `unlink` or `file.remove` in R you'd better be sure you know what you are doing, but particularly for data that is in a version-controlled directory, yet not itself under version control, aggressive file deletion might be undesirable behavior.
As mentioned in Q2, I think it is detrimental to try to intertwine these more than strictly necessary.
I would encourage you to create an intermediary `git2rdata` data type that inherits from data frame, possibly with the metadata attached as an attribute. Let's call the new data type `git2rdat`. We can then define (all in S3 here for convenience):

```r
as.git2rdat.data.frame
write_vc.git2rdat
write_vc.data.frame  # calls as.git2rdat.data.frame and forwards
all.equal.git2rdat
```
These would then simplify the workflow in the vignette:

```r
all.equal(read_vc(file = "git/test", root = repo), x)
```

is now equivalent to:

```r
y <- read_vc(file = "git/test", root = repo)
sorted.x <- x[order(x$x), colnames(y)]
rownames(sorted.x) <- NULL
all.equal(sorted.x, y, check.attributes = FALSE)
```
I think it would be helpful to have more aggressive defaults to facilitate the workflow. For example, why not have the following:

```r
write_vc <- function(x, file = deparse(substitute(x)), root = getOption('proj.root'))
```

Then things like the following just work:

```r
git2r::add(write_vc(iris))
git2r::commit()
```

This would produce 'iris.tsv' and 'iris.yml' at `getOption('proj.root')`.
Obviously you will need something more sophisticated than what I use above (e.g. to handle bad names or automatically detecting a project root, or setting a default data folder under the auto-detected project root), but I find the simplicity of the above compelling.
@jennybc probably has a good view on this already, so I'll defer to her input here, but while we wait for it here is my view on this:
Git is not really meant to store data frames, so I think it would be really useful to have a more systematic assessment of whether this is a Bad Idea or not, and what the limits are.
You mention "medium sized data". I suspect different people will have different ideas of what that means. I recommend that you do some analysis of the implications of repeatedly committing a data set of the maximal size you intend your package to be used with. You should examine the effects on the size and performance of the repository over enough commits that you can provide guidelines to your users about what is okay to commit and what is probably too much. You should probably also compare the effects of commits with small changes vs. large changes.
It may be that there is actually no problem with this at all, but it is unusual enough that it seems worth covering. At a minimum, I would imagine prospective users to be curious about it. There seem to exist several frameworks that separate the data from the repository so there probably is some reason for concern.
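As a starting point for such an analysis, a rough shell sketch could look like the following (the commit count and data size are purely illustrative; requires git on the PATH):

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q .
for i in 1 2 3; do
  # Regenerate a tab-separated file with a small change each round,
  # then commit it and let the repository history grow.
  seq 1 10000 | awk -v i="$i" '{print $1 "\t" $1 * i}' > data.tsv
  git add data.tsv
  git -c user.email=demo@example.com -c user.name=demo commit -q -m "commit $i"
done
# Size of the history after repeated commits of the full file:
du -sk .git
```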
I think the vignette would benefit from a more thorough analysis of the rationale for the package. In the review issue you give the following as a potential application:
E.g. each year new data is added to the data and a new report is created. A snapshot of the raw data can be stored as plain text files under version control.
Based on my experience this is not a great reason to use version control. The data is effectively self versioned via the date variable so the version control is redundant (certainly, the old data could change too, but even that can be tracked explicitly in the data via a report ID for explicit analysis).
I feel that version control is most useful for data that is not supposed to change. The reason for this is that when data is supposed to change, the natural thing to do is to analyze the changes to understand their meaning. Diffs are terrible tools for this. They will tell you that number A changed to number B, but will do so in a way that is very difficult to meaningfully contextualize. This applies to tabular data in general. Unless you are looking at a tiny data set, the table view is almost always useless. Even `iris` with its puny 150 rows is almost always viewed as plots or otherwise subjected to statistical computation.
On the other hand, for things that are not supposed to change, at least without explicit review, knowing what changed is typically the biggest part of the battle. There a diff based workflow can be very helpful.
I see two obvious applications:
Clearly you have these things in mind, but it might be worth being more explicit about the benefit of versioned data in the context of reproducible analysis, and then explicitly highlight workflows that do that.
I found:
More generally, it might be worth doing some level of literature review to see what solutions exist already and whether they offer any insights on best practices to apply to your package.
Thomas's paper covers some of this, and also references `daff`, which I was going to suggest separately. Edwin De Jonge wrote an R wrapper for it. I have not personally used it, but I've done a fair bit of diffing of data frames with line-based diffs and honestly they do not work that well. Your idea of sorting rows/columns will help to some extent, but you should assess whether it makes sense to bring in a tabular diff.
Another idea raised in the same paper that I found interesting is to store data in long format. It could obviate the need for `daff` by making all changes row-based, but it will make the data larger. I don't know how much compression would help. This obviously introduces additional complexity that in the end may not justify itself, but I did find the idea of reducing the data to a form where row-based diffs would work perfectly very appealing.
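As an illustration of the long-format idea, here is a hand-rolled base-R reshape; it only shows the shape of the data, and is not a `git2rdata` feature:

```r
# Wide data: one row per id, one column per measurement.
wide <- data.frame(
  id = 1:3,
  height = c(4.2, 5.1, 6.3),
  width = c(0.7, 1.1, 1.5)
)

# Long data: one row per (id, variable) pair. Editing a single cell of
# the wide table now changes exactly one line of the stored text file,
# which keeps line-based diffs minimal (at the cost of more rows).
long <- data.frame(
  id = rep(wide$id, times = 2),
  variable = rep(c("height", "width"), each = nrow(wide)),
  value = c(wide$height, wide$width)
)
long
```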
Also interesting:
I have not used any of these solutions, so I cannot vouch for their efficacy or safety.
Do you really need `readr`? You are dealing with a special case where you know the origin of the data, so you are less likely to need the special features offered by `readr`. Also, if you're explicitly targeting medium-sized data, do you need high performance on this? Unlike `assertthat`, `readr` carries along many of its own dependencies. I would strongly encourage you to consider whether `write.table` could do the job well enough, even if it requires a little extra work.
Thank you @brodieG for a pre-review as detailed as many of our full reviews! All of this is in-scope for the review. I think the discussion of other frameworks is quite helpful. One does not have to adopt a totally different model as a result, but I think using that context to guide the API and documentation would be valuable. I find it increasingly common and helpful to put a "Related Work" section in README/vignettes to compare a package's approach to other options.
@ThierryO, I'd suggest waiting for @jennybc's input before making decisions and starting work on any major changes, but do respond to @brodieG's comments as you see fit.
Thank you @brodieG for the elaborate review. Here are my ideas on the topics you raised.
Adding an example where the data is actually used is a good suggestion. I have presented our workflow as a poster at the International Statistical Ecology Conference ISEC 2018. Please note that this poster predates the conception of `git2rdata`. A precursor of the functionality was then available in `n2khelper`.
Tracking analysis code is outside the scope of `git2rdata`. However, I do that in our `n2kanalysis` package. This package uses the output of `git2rdata::recent_commit()` as an entry point for the data. The code version is documented as a list of used packages and their versions (including the commit hash for GitHub packages). `n2kanalysis` even has file and status fingerprints in each object, which makes it possible to check the integrity of the object.
If we assume that the code to create and store the data.frames is all stored in a package `foo`, then `git2rdata::auto_commit()` can be used to document this. The commit message will read "Automated commit from 'foo'" and add the full list of loaded packages and their versions at the moment of the commit. This is of course not tamper-proof, but that is not the goal of `git2rdata`.
If required, I can add some toy examples to the man documentation. How extensive should these examples be?
The initial idea of the package was to read and write data.frames into a git repository, hence the strong link with Git and `git2r`. Then I had the idea that simply writing data.frames without version control (or to a different version control system than Git) might be relevant for other users as well.
Currently `write_vc()` returns hashes because I use these hashes in our workflow (they are used to calculate the file fingerprints). The file names are returned as the names of the hashes. So the examples with `git2r::add()` and `svnr::commit()` are currently possible when you write them as below. I agree that it is slightly more cumbersome. Returning a named vector with file paths and adding the hashes as names is an option.

```r
git2r::add(path = names(write_vc(df, 'df')))
svnr::commit(names(write_vc(df, 'df')))
```
The current version has only methods dispatching on the `root` argument, making S3 a relevant option, although that would require `root` to be the first and mandatory argument. S4 allows writing methods for different arguments. E.g. we could think of methods for `x` objects that are something other than a data.frame (e.g. a SpatialPointsDataFrame).
- A warning when `rm_data()` removes unversioned files in a git repository is a good idea. Maybe something similar to the `override` argument of `write_vc()`?
- Splitting `rm_data()` over two functions is a great suggestion.
- I'm not sure there is much benefit in creating a new class. We often just want to write the object to a file / vcs and don't bother about any intermediate formats.
The `all.equal()` example is rather verbose because I want to stress the fact that `write_vc()` alters the order of the rows and the columns. I'm not sure that the user would need an `all.equal()` method for `git2rdata`. This equality should be warranted by the package itself (and is, through unit tests).
There already is a PR which implements `root = "."` as the default. Another option would be to use `rprojroot::find_root()` to find a suitable root. A sensible order of defaults would be `is_git_root`, `is_svn_root`, `is_r_package`, `is_rstudio_project`.
`file = deparse(substitute(x))` is problematic as it becomes `file = "."` when piping is used, as demonstrated in the example below. Furthermore, it assumes that people use informative names for their objects (which they seldom do). I prefer to force the user to think about naming the file.
```r
library(dplyr)
test <- function(x, file = deparse(substitute(x))) {
  message(file)
}
test(iris)
iris %>%
  test()
```
A separate vignette with guidelines and caveats on using `git2rdata` on a Git repository would be useful indeed. I'd like to postpone this, as it allows me to try the package on larger data sets and see where it breaks.
In my line of work, I don't have any control over the raw data source. And hence cannot guarantee that I can reconstruct the raw data. I've been repeating the same analyses every year on several data sources. At least two of them had a major overhaul of the database structure during the past five years. And there is no version control within the database itself.
In the past we published results which were questioned by one of our stakeholders. They disagreed with the results and feared that the results would lead to a detrimental change in policy. The simple fact that we were able to state that both the data and the code were placed under version control instantly increased our trustworthiness. After a walk-through of the data and the code, they now agree on the methodology and hence the results.
So the focus is not on "the analysis of what changed in the data", but rather on "what data was used in this version of the analysis". Therefore `git2rdata` is just a tool in a fully reproducible analytical pipeline, not the full pipeline.
I'll have a look at these suggestions over the next weeks.
Storing each object in a dedicated folder is an interesting suggestion. Another benefit of this is that we could allow the user to specify "file" extensions. E.g. `write_vc(x, file = "test.txt")` would store the data in `test.txt/data.tsv` and `test.txt/metadata.yml`.
An earlier version of `git2rdata` relied on `write.table()` and `read.table()`. The downside of that is that storing characters was somewhat limited: `NA`, `\t` and `\n` were not allowed when storing them unquoted.
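The unquoted `NA` limitation can be reproduced with a few lines of base R; the literal string "NA" comes back as a missing value:

```r
# The string "NA" and a real missing value are indistinguishable once
# written unquoted: read.table() treats both fields as NA.
df <- data.frame(x = c("NA", "b"), stringsAsFactors = FALSE)
f <- tempfile(fileext = ".tsv")
write.table(df, f, sep = "\t", quote = FALSE, row.names = FALSE)
back <- read.table(f, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
is.na(back$x)  # TRUE FALSE: the literal "NA" was read back as missing
```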
Personally, I like `tibble` over `data.frame` because of the superior `print()` method. Adding only the `tibble` dependency would also add at least `fansi`, `pillar` and `utf8`. In case you omit the dependencies in 'Suggests:' it will also require `cli`, `crayon` and `rlang`.
While thinking about adding `tibble`, `readr` came into view. I agree that it again adds several dependencies: `BH`, `hms` and `pkgconfig` (plus `R6` and `Rcpp`, which are only dependencies of Suggested dependencies of `git2rdata`).

However, I see several gains in using `readr`:

- `read_vc()` returns a `tibble` (agreed, rather a matter of taste than a critical improvement).
- `read_tsv()` is faster than `read.table()`, as claimed by the `readr` README, Efficient R Programming, this blog, ...
- `write_tsv()` handles `NA` and special characters efficiently by adding quotes only when really needed, and hence code that I don't have to write.
- `read_tsv()` allows specifying the column type when reading the data.

IMHO, these gains outweigh the extra dependencies.
@jennybc A reminder that your review was due last week, please get it in as soon as you can.
While waiting for the second review (which I think might take until after rstudio::conf), some other quick responses:
If required, I can add some toy examples to the man documentation. How extensive should these examples be?
There's a lot of flexibility in this, but generally, these should be enough to demonstrate minimal usage/syntax. I usually make sure to have one example with only required arguments and examples that together include all optional arguments, and that print/show examples of the output value.
We outline some low-dependency options for tibbles in the development guide, but if the author can justify the trade-offs for dependencies we defer to their choice.
@noamross: Our main arguments in favour of `readr` are:

- `write_tsv()` handles `NA` and special characters efficiently by adding quotes only when really needed, and hence code that I don't have to write.
- `read_tsv()` allows specifying the column type when reading the data.

Removing the `readr` dependency would require us to write a lot more code to get the same result, or to limit functionality. E.g. in earlier versions (without `readr`) we could not allow writing the string `"NA"`, because storing it without quotes makes it indistinguishable from the missing string `NA`. Not allowing the storage of `"NA"` seemed a better choice than not allowing the storage of missing values in characters. `readr` has the option to add quotes only in cases where it is needed to eliminate ambiguity.
We'd like to get feedback from @noamross, @jennybc and @brodieG on this topic.
@ThierryO, I haven't responded as I wanted to give @jennybc a chance to comment, but since you're asking explicitly, some responses to the narrow questions regarding readr. Here is some simple code that deals with, I think, most of the pathological inputs you're describing:
```r
df <- data.frame(
  a = c(NA, "NA", '"NA"', '""NA"'),
  b = c(1, 2, 3, 'N\tA'),
  c = letters[1:4],
  d = runif(4),
  stringsAsFactors = FALSE
)
chr.types <- vapply(df, function(x) tail(class(x), 1), "")
chr.cols <- chr.types == 'character'
special <- logical(ncol(df))
special[chr.cols] <- vapply(df[chr.cols], function(x) any(grepl("\"|\t", x)), NA)
write.table(
  df, 'df1.tsv', sep = '\t', quote = which(special), na = 'NANANA',
  qmethod = 'double', row.names = FALSE
)
read.table(
  'df1.tsv', sep = '\t', stringsAsFactors = FALSE, na.strings = 'NANANA',
  header = TRUE
)
```
You will need to generate a string that doesn't exist in the data to use as an NA string. A simple low-cost approach is to find the length of the longest string with `nchar` and just use `strrep('NA', len + 1)`, although that means you will also have to disallow strings longer than INT_MAX - 1 (32 bit), as INT_MAX is currently the string length limit in R.
This also quotes an entire column if any element contains either a tab or a quote. However, both these cases (and cases where there are columns with strings that are INT_MAX long, which would probably be a terrible use case for this anyway) are likely to be rare.
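The sentinel idea above can be sketched in a couple of lines; the helper name `make_na_string` is made up for illustration:

```r
# Build an NA marker strictly longer than every string in the data, so it
# cannot collide with any real value.
make_na_string <- function(x) {
  len <- if (length(x)) max(nchar(x), na.rm = TRUE) else 0
  strrep("NA", len + 1)
}

x <- c("foo", "NANA", NA)
na_str <- make_na_string(x)
na_str          # "NANANANANA"
na_str %in% x   # FALSE: safe to use as the na marker
```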
Undoubtedly `readr` will be much more sophisticated about how it handles this business, but perhaps given these examples the added value of the sophistication may seem lower. What is not low is the weight of the dependencies.

If you really still feel that there is substantial added value in the `readr` functionality, I won't stand in the way of you including it. It just seems from where I stand that you can get most of what you're looking for without `readr` and its dependencies (aside: it seems `readr` could easily remove `tibble` as a dependency itself).
Re: returning tibbles, I don't see that as a positive. If someone likes tibbles and has the package installed, it is trivial for them to turn data frames into tibbles. Likewise if someone prefers data.tables. I think it makes more sense for a package to be data-structure neutral unless the data structure is integral to the package itself. Again, not something that would prevent me from approving the package, but chaining the package to a specific non-base data structure when it isn't necessary seems sub-optimal in the general use case.
If you do decide to leave readr as a dependency, then one thing you can do is record whether the input was a tibble or not, and then if not, return a normal data frame to the user.
EDIT: Forgot to mention this: `read.table` has a `colClasses` parameter that allows you to specify the classes of the columns in the tsv.
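For completeness, a tiny illustration of `colClasses`, which covers the "specify column types on read" point without `readr`:

```r
# colClasses forces column types at read time; here it keeps the leading
# zeros that default type guessing would turn into the integer 7.
f <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "007\t1.5"), f)
df <- read.table(f, sep = "\t", header = TRUE,
                 colClasses = c("character", "numeric"))
df$id     # "007"
df$value  # 1.5
```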
New reviewer added: @jtr13, review due 2018-02-05
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[ ] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
The package contains a `paper.md` matching JOSS's requirements with:
[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
[x] Installation: Installation succeeds as documented.
[not reviewed] Functionality: Any functional claims of the software been confirmed.
[not reviewed] Performance: Any performance claims of the software been confirmed.
[not reviewed] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[not reviewed] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
In the interest of not delaying the review, I was not able to fully test all functionality claims.
The first two paragraphs of the vignette provide a much more detailed explanation of the problem that the software is designed to solve than the README. I know there's been a lot of discussion about what should go where and the importance of not having to maintain duplicate text. That said, I am of the opinion that important information like this should be in the README, since my sense is that vignettes are often overlooked. (More so now, since `build_vignettes = TRUE` doesn't work.)
I see that examples were added in the `develop` branch. More comments would be helpful to guide the user through the examples.
There is no CONTRIBUTING file nor contributing guidelines in the README. `URL`, `BugReports` and `Maintainer` are provided in the DESCRIPTION file.
I was not able to perform a thorough review to confirm functional and performance claims.
All code in the vignette worked as expected as did additional toy examples.
The package claims to address three needs:
1) version control of data
2) optimization of data storage
3) meta data storage to preserve data types, factor levels, etc.
Taking these in reverse order: storing metadata in a `.yml` file has clear and obvious benefits. Losing factor information in particular is a source of frustration for R users, and the package provides functions that could be useful to even beginning R users.
The benefits of optimization are not as clearly apparent. It would be helpful to be provided with more context, that is, answers to questions such as: in typical use cases, how much smaller are the text files created by `git2rdata` functions? Is there a time tradeoff?
Finally, with regard to version control, the contribution of `git2rdata`--to the best of my understanding--is to sort rows and columns to minimize diffs. The work of actually maintaining version histories is performed by git, with `git2r` providing the interface to R.
It should be noted, though, that `git2r` is optional even for using the main `git2rdata` functions: `write_vc` and `read_vc`. Particularly for beginning users, using these functions with the RStudio git interface would be a simpler workflow.
Given this scenario, the package name `git2rdata` is somewhat misleading and doesn't capture the breadth of contributions that the package makes. In addition, it limits flexibility in making these functions available to other version control systems, as suggested in other reviews. From the name it sounds as though the package would perform similar functions to `git2r` in a new domain, that is, add functionality for git commands that are not available in `git2r`. That, however, is not what it does. The vignette title "Storing Dataframes" is more on the mark; perhaps a name along the lines of `storedf` would be a better fit. (On a personal note, I've had a lot of trouble with `git2r`, so I wonder if the link might scare off users who would otherwise benefit from this package.)
Some minor points: it's not clear what the benefit is of using `rm_data()` vs. other means to remove or overwrite files.
If the sorting parameter isn't essential, it would be helpful to have a way to not use it (`sorting = FALSE`) or have some other default. For example, if all you care about is preserving factor information, it would be nice to have a simple workflow for doing so.
Thank you for your feedback, @jtr13.
I'll update the README (see https://github.com/inbo/git2rdata/issues/12). A CONTRIBUTING.md file is present in the `.github` folder. A link will be added to the README.
I'll add some comments to the examples in the help files.
I'll probably separate (and elaborate) the current vignette into several vignettes.
Making the sorting optional is OK when not storing the files under version control. However, then it is up to the user to apply the sorting when writing to a file path and doing the version control outside `git2rdata`, which is something they will probably forget, ending up with potentially large diffs...
`rm_data()` and `prune_meta()` are useful in an automated workflow where changes in the raw data can lead to the removal of data sets. First apply `rm_data()` to remove all tsv files. Then write all the data using `write_vc()`. This (re)creates the tsv files which are still needed. Since all tsv files were removed at the start, the ones we no longer need are no longer present. The final step is to use `prune_meta()` to remove leftover yml files with no matching tsv file.
Why the strong link with `git2r`? The functionality of `git2rdata` seems a bit of overkill when not using version control. Storing dataframes as binary files using `save()` or `saveRDS()` is probably more efficient in such a case. Git is a very popular version control system, so it is most likely the one used in combination with `git2rdata`. Hence we think the package will be more easily found when the name refers to git.
The `git2r` dependency is not strictly necessary when not using git, or in an ad hoc workflow where the user manually writes data, stages the required files and creates the commits.

Another use case is an automated one with recurrent data storage and subsequent analysis. The user writes a script that queries a raw data source and stores the resulting dataframes under version control. Under such a scenario, the user will most likely want to commit all dataframes written by the function. The user can use `git2r::add(git2rdata::write_vc())`, or the more convenient and readable `git2rdata::write_vc(stage = TRUE)`. The user will probably also want to commit the changes at the end of the script. One option is `git2r::commit()`, potentially in combination with `git2r::push()`. Another option is to use `git2rdata::auto_commit()` or `git2rdata::auto_commit(push = TRUE)`. The main benefit of `auto_commit()` is that it creates a standardised commit message listing the loaded R packages and their versions, thus documenting the code that was used to create the commit. Suppose that the script is stored in a custom package. Then the message from `auto_commit()` is sufficient to document which code was used to create the commit. The code itself should document why and how the data is created.
One way to document the data used in subsequent analysis is to use the commit hash of the data that is checked out. In our use case, we have several different dataframes stored under version control. The same statistical model is fit to each of them. Not every dataframe will be updated in the next commit of the data. Knowing that the data underlying a model didn't change eliminates the need to refit that specific model, thus saving time and resources. This is where `recent_commit()` is useful, as it lists the latest commit in which a file was changed.
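Put together, the automated workflow above might look like this sketch (the repository path, the data and the argument names are illustrative and may differ from the actual signatures):

```r
library(git2rdata)

repo <- repository("path/to/data-repo")

# write the queried dataframe and stage the tsv/yml pair in one go
write_vc(observations, file = "observations", root = repo,
         sorting = "id", stage = TRUE)

# commit with a standardised message listing the loaded packages and versions
auto_commit(package = "myanalysis", root = repo)

# later: the last commit in which this data set changed, e.g. to decide
# whether the corresponding model needs refitting
recent_commit(file = "observations", root = repo, data = TRUE)
```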
To summarise:
- `auto_commit()` and `recent_commit()` provide useful functionality in some workflows, enough to warrant the dependency on `git2r`, at least as a Suggests package.
- `git2r` as an Imports package has the benefit of cleaner code. About one third of the current code base depends on `git2r`. It also allows re-exporting core `git2r` functions such as `repository()`, `status()`, `commit()`, `pull()` and `push()`. This reduces the need for the user to load both `git2r` and `git2rdata` for the basic functionality in an automated workflow.

Thanks @brodieG and @jtr13 for your reviews, and @ThierryO for your thoughtful comments. I think this discussion has helped clarify the workflows that the package does and does not try to address, and the extended documentation you discuss would go a long way in helping users understand this.
A couple of observations: dependency weight and the return of tibbles are among the most contentious issues in our author and reviewer communities. We mostly defer to authors on this. In general, we find dependency weight to be a bigger issue for developers than users. We also find it helpful to give users choices as well as sensible defaults. A package-level option such as `options(git2rdata.tibbles = TRUE)` may be useful.
The tightness of the recommended workflows with version control, and the utility of `git2r` in facilitating this, even if individual functions can be separated into their vcs and non-vcs components, seems to me to warrant having `git2r` Imported so as to make combined convenience functions available to users. (I find automated versioning of a data set changing due to updates and corrections a compelling example case, and suggest an example incorporating continuous integration may be useful.)
Finally, while we won't require a package name change, I agree with both @ThierryO that having git in the name is useful for discovery, and @jtr13 that the current name, echoing `git2r` (and the description above, "a companion package for git2r"), does make it seem that it primarily extends the domain of the `git2r` package. Something that evokes the workflow may be more informative.
@ThierryO I see that you have made several changes. Do you have a new version we should work off of now (if you do, a tag of that version would be great)? Do you have other changes in the works we should wait for you to complete before we move to the detailed review? @noamross what's the time frame for the detailed review? One thing I'm wondering is whether it makes sense to wait for the vignettes outlined in @ThierryO's most recent response.
@brodieG I suggest waiting for @ThierryO to complete current changes including vignettes before giving the package another look.
I'm working on the package in my spare time. I'll notify when a new version is available for review.
@noamross : what is the policy on suggested packages? I'm thinking about using ggplot2 for plots in the vignettes.
Sorry for the late reply on this one, @ThierryO. Suggested packages are fine, especially for the purpose of displaying outputs in vignettes.
Thanks @ThierryO! @brodieG and @jtr13, let us know if you are satisfied with these changes, including covering any areas you weren't able to in the last review.
@ThierryO, can you give a brief summary of changes? I notice there isn't a NEWS entry for v0.0.3, or is it effectively the same summary as 0.0.2?
I forgot to update the NEWS. The differences between 0.0.2.9000 and 0.0.3 are mainly cosmetic and fixed typos. I bumped the version number because I didn't want to change 0.0.2.9000 into 0.0.2.
@noamross what is the new deadline for reviews?
We normally aim for 2 weeks for the second round, but as the first round was partial/specialized here, let's go for 3 weeks, or April 10.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
Note: one small change was needed to get the package to build locally (see patch, where we change to `git2r::add(..., force=TRUE)` because on my system I have a global gitignore for dot files).
`URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

A few "Package Guidelines" comments:
Does the author of the `git2r` package welcome the use of `git2rdata` as this package name?

Estimated hours spent reviewing:
This is a useful and well written package. Congratulations to Thierry for crafting it.
The code is clean and clear. For the most part, parameters are validated and corner cases are accounted for. Test coverage is complete (at least on a lines of code basis). Clearly Thierry is very conscientious and I am as confident as I can be from static "eyeball" analysis alone that there are probably not too many bugs lurking in the code outside of the potential ones I highlight in the issues section next. I also encountered no bugs in my manual testing. This is far from a guarantee of course.
I see no obvious performance issues.
The documentation is extensive, particularly the vignettes. I'm really impressed by how seriously our initial round of feedback on the matter was taken. The function documentation is complete, particularly with the examples, but it is a bit too terse (more details later).
These are important and my approval of the package is contingent on them being addressed in some way:
- `rm_data`/`prune_meta` comments, and File Storage comments
- `meta` comments

Strongly recommended
Recommended:
- `do.call` and sorting corner cases for `meta` (see function section)
- `write_vc` optimization suggested.

Every other comment in the review is provided for the author to decide whether to implement or not.
In an earlier iteration I had suggested that files be stored in a subfolder for convenience and future flexibility. In reviewing the code and thinking of additional validation required in `read_vc` and `prune_meta`, I am becoming increasingly convinced that this achieves another important role: discouraging non-vc "use" of the data from interfering with the vc function. Two motivating examples:
You could have something like:
iris.grx
+- UUID (a UUID identifier you can use to check this is actually a vc dir)
+- meta (this corresponds to iris.yml)
+- data (this corresponds to iris.tsv)
`grx` is a made-up, scary-sounding extension. Obviously everything is still stored in plain text, so diffs, etc., continue to work fine.

Then `export_vc('iris.grx')` can pull out the tsv and yml.
This is a very cheap way to make your package more robust to file system interactions you do not control.
See attached patch: git2rdata_typo_patch.txt
Title case in headers, particularly the top level vignette title? Not something I particularly care about, but I hear some can be fastidious about it.
Another frequently used csv read/write set of functions is `data.table::fread`/`fwrite`. It suffers the same factor loss infirmity as the other functions reviewed.
In Re:
POSIXct is written as an integer to the data. The class and the origin are stored in the metadata. Timestamps are always stored and returned as UTC.
POSIXct is numeric, not integer, consider:
> options(digits=22)
> as.numeric(as.POSIXct('2019-01-01 10:01:03.13413'))
[1] 1546354863.134130001068
I'm not sure this has any practical significance, although you may need to check that sub-second precision is not lost, and if it is, at least document and warn that this is happening when it happens.
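A quick base-R check of what an integer round trip does to sub-second precision (independent of the package):

```r
x <- as.POSIXct("2019-01-01 10:01:03.13413", tz = "UTC")
# round-tripping through integer truncates the fractional seconds
y <- as.POSIXct(as.integer(x), origin = "1970-01-01", tz = "UTC")
as.numeric(x) - as.numeric(y)  # non-zero when sub-second precision is lost
```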
~~Additionally, how comfortable are you with time zones and POSIXct? I am not at all, and know just enough about them to be a little scared. Have you confirmed that timestamps with different timezones are correctly converted to UTC? It seems not (see comments in the `meta` section).~~
The database defines a variable number of dataframe (e.g. species can be added or removed). We have defined a standard analysis which should run for each group. We want to repeat the analyses with some predefined frequency (e.g. once every year). In order to make the analyses reproducible, we want to store the relevant data in a git repository.
This paragraph might be confusing to some. You should either generalize further, or lay out a clear schema of what the database contains. Right now you are at a middle level where you provide some detail, but not enough to infer the exact schema. Does each table contain one species? What is a group in this context? A species, or a sub-group in each table?
See comments about the `content` list below.
I find this section more ambitious than it needs to be. It tries to introduce several concepts at once, when perhaps the single most important one is just the use of `store_data` with a single instance of the data. `lapply`ing over several instances of the data adds a layer of complexity that will make it harder to understand the simple and more likely use case where there is just one instance of data. After all, aren't the different instances of the data supposed to be tracked by the VCS system, rather than by a nested list in the user's workspace?
Another drawback of the multi-instance example is that it interferes with the previous instances. The reader is left wondering what on earth this weird nested list (`content`) we're creating is, since we don't even use it until much later.
I think you could introduce `store_data` ahead of the "Third Commit" and use that for the third commit. The text could be "After having done this two months in a row, you are probably wondering if we could automate some of these steps...". This also provides motivation for the "Automated workflow". This may require you to rework the Second Commit to delete a data set so you have an example to motivate the functionality of `store_data`.
Nit: `store_data` should at a minimum pluralize the `df` argument to make it clear that it contains multiple data frames, or even better use `df_list`, or alternatively, use `function(..., repo) {}`.
Great example! One nit: maybe add newlines or otherwise try to better format the meta data displayed ahead of the table? It's a bit overwhelming with the hashes and temp path names. If at least each element is on its own line it will be easier to see what's going on.
Not a big deal at all, but it is very odd to me to see a database with tables of identical structure that are only different in the dataset identifier. This would normally be stored as one table with a dataset column.
Your footnote about "code changing" reminded me that it is useful to be able to associate the data with the version of the code that ran it. Perhaps the `write_vc` function should warn when in `stage=TRUE` mode if there are other unstaged changes. This would improve the probability that scripts with analysis residing in the same repository as the data are also committed. It would be a shame if an analysis became irreproducible because the code that generated it was not committed along with the data. This is not critical to address at all. Just a thought.
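A rough sketch of what such a warning could look like, using `git2r::status()` (this is my suggestion, not existing package code):

```r
# warn if the repository still holds unstaged changes (likely analysis code)
st <- git2r::status(repo)
if (length(st$unstaged) > 0) {
  warning(
    "the repository has unstaged changes; consider committing the analysis ",
    "code together with the data to keep the analysis reproducible"
  )
}
```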
This is a great vignette. Very informative.
You should promote the data storage statistics to a more prominent part of the package, e.g. the README with something along the lines of:
Data stored with `git2rdata` will generate git repositories x% smaller than if stored as RDSes, up to y% smaller than if written as is, etc...
The only additional bit of information that would be worthwhile to add is some guidelines for how people should extrapolate file growth, e.g. made up:
With a data frame of the same composition as `airbag` and in-memory size X, you can expect to use X * (file change % per commit) * (typical compression factor) * number of commits. For `airbag` this translates to ~6MB over 100 commits. For reference, the repository for <name of popular package with many commits> is ZZ MBs in size.
Good vignette. See typos in patch.
In general these are too terse. Several do not describe what the function actually does, e.g.:

- `read_vc`: "Note that the dataframe has to be written with write_vc() before it can be read with read_vc()."
- `meta`: "write_vc applies this function automatically on your data.frame."

You are better off making the titles briefer and the descriptions more descriptive, e.g. for `meta` the title becomes "Optimize Data for Storage", and the description goes into more detail about what happens to data frames and vectors, and whether there are actually any user-facing use cases, or if this is just exposed to the user for study.
Title Case. While this is not strictly requested in Writing R Extensions, it is universally done in the base packages.
Almost all sections of the documentation should be written in full sentences. Make sure you have periods at the ends of your sentences, even for short descriptions, parameters, return values.
See base functions such as `?mean` for stylistic cues.
- `@param x`: "the 'data.frame" — stray quote mark?
- `@param file`: "the name of the file without file extension". This could use more explanation, e.g. "the base file name shared by the tsv and yml files that contain or will contain the version control ready data".
Examples should clean up on exit, as per CRAN policies. Currently some create temporary files (e.g. `list_data`) that are not fully cleaned up (e.g. `repo`, `root` dir).
Good use of "See Also" to link related functions.
For the re-exported functions, the display link would be best if it showed the package name, e.g. `git2r::commit`, to make it completely obvious what is going on. It is slightly disorienting to land on the "commit" help page, only to be told to go see `commit`. Additionally, you should add a period after the function name, as "Description" is supposed to be at least a sentence.
It is great to see that the functions validate their parameters carefully.
One additional thing you could do, although this isn't common practice, is to also check that the dispatching parameter actually is what it is supposed to be. Just because a "character" method was called doesn't mean that the parameter is character since in R you can call methods directly.
In:
gsub(paste0("^", root, "/"), "", data_files)
There is the risk that `root` could contain regex characters with special meaning. You can use something like:
has.root <- substr(data_files, 1, nchar(root) + 1) == paste0(root, "/")
data_files[has.root] <-
  substr(data_files[has.root], nchar(root) + 2, nchar(data_files[has.root]))
I have not tested this, but you get the idea. Watch out for corner cases. I don't think `gsub(..., fixed=TRUE)` can work because you can no longer use the start-of-string anchor.
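A simpler variant along the same lines, using base R's `startsWith()` (available since R 3.3.0); equally untested, and the same corner-case caveats apply:

```r
prefix <- paste0(root, "/")
has_root <- startsWith(data_files, prefix)
# drop the fixed-length prefix without any regex interpretation
data_files[has_root] <- substring(data_files[has_root], nchar(prefix) + 1)
```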
The description should give some indication of what the function does, not just that `write_vc` applies it. Additionally, more detail on the data.frame method is warranted (i.e. it applies `meta` on the columns).
Good job of handling corner cases (e.g. strings containing quotes).
if (isTRUE(optimize)) {
Can cause problems for users that might pass a "truthy" value expecting this to work, as is so often the case in R. You should change the function to either validate the parameters itself, or have more sophisticated handling of this value. More generally, the method should validate its input since it can be called directly.
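One possible shape for that validation at the top of the method (a sketch, not the package's current code):

```r
# accept only a single, non-missing logical value for `optimize`
if (!is.logical(optimize) || length(optimize) != 1L || is.na(optimize)) {
  stop("`optimize` must be a single TRUE or FALSE")
}
```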
See Factor above.
UPDATE: I think this is actually okay, although it will change the display representation for users. Maybe record time zone and try to restore?
I am not an expert on this, but assuming that the timezone is UTC causes problems, e.g. on my system (OSX):
> x <- data.frame(as.POSIXct('1970-01-01'))
> write_vc(x, 'hello', strict=FALSE)
19c2fa6018df9fd9f6e6eb33ddd5f02d33383335
"hello.tsv"
9aca22d70c2aab43e6cece767909924d5c1d5a50
"hello.yml"
Warning message:
In meta.data.frame(x, optimize = optimize, na = na, sorting = sorting, :
no sorting applied
> read_vc('hello')
as.POSIXct..1970.01.01..
1 1970-01-01 05:00:00
> as.POSIXct('1970-01-01')
[1] "1970-01-01 EST"
> Sys.timezone()
[1] "America/New_York"
>
This risks being a headache to resolve.
I don't think `tz = "UTC"` does anything here:
z <- format(x, format = "%Y-%m-%d", tz = "UTC")
This function should check its parameters, `sorting` in particular.
That `..generic` is a reserved column name should be documented in `write_vc` and also probably in `meta`.
In:
x <- x[do.call(order, x[sorting]), , drop = FALSE] # nolint
You need `unname(x[sorting])` to avoid potential collisions between `order()` parameters and column names.
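To illustrate the collision (hypothetical data: a column whose name matches one of `order()`'s own arguments is consumed as that argument instead of being used as a sort key):

```r
df <- data.frame(decreasing = c(2, 1), x = c("b", "a"))

# do.call(order, df)        # "decreasing" is matched to order()'s `decreasing`
#                           # argument, so it is not sorted on at all
do.call(order, unname(df))  # both columns are treated as sort keys
```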
You need to handle the case where `sorting` is zero length (included in patches).
> meta(iris, sorting=character())
[1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<0 rows> (or 0-length row.names)
Further down, in:
if (has_name(dots, "old")) {
Probably not of great consequence, but it feels "wrong" to have an undocumented internal optional parameter. I imagine there aren't many, if any, situations that will lead to a collision, but it would feel better to me to make it a formal argument and document it, even if as `.old` and marked "for internal use".
Finally, probably okay in this context because you overwrite `z`:
z <- lapply(z, `attr<-`, "meta", NULL)
But you should be aware, if you were not already, that calling setter functions indirectly is not recommended, as under some circumstances they will modify the element by reference.
> x <- c(a=1)
> x
a
1
> y <- `names<-`(x, 'b')
> y
b
1
> x # `x` changed by reference!
b
1
The description should describe what the function does, e.g. "Retrieves a `data.frame` stored with `write_vc` from file or git repository storage."
It is not clear from the documentation that this is an S3 generic with methods. Adding `@rdname read_vc` to the methods should mostly address this. You may also want to make it clear that the generic dispatches on the second argument, which is quite uncommon (in fact, I discovered it was possible when reviewing this package). You should also consider adding some commentary about how it might be relatively straightforward to extend `git2rdata` to work with other VCS systems by adding S3 methods for this generic, modeled on the `git_repository` method.
`read_vc` should take some measures to ensure that neither the data nor the metadata are corrupted. It seems reasonably likely that someone will modify one of the files, either intentionally by hand or unintentionally, e.g. via git checkout.
Some possible checks:

- the `..generic` entries are there and are what they should be
- the metadata entries are well formed (e.g. there is a `class` element that is character, for factors the other elements are present, etc.)

That `..generic` is a reserved column name should be documented in `write_vc` and also probably in `meta`.
write_vc(..., sorting=TRUE) # sort all columns?
`anyDuplicated` on a data frame is costly:
> library(ggplot2)
> treeprof::treeprof(fn <- write_vc(diamonds, 'diamonds', sorting=names(diamonds)))
Profiling
auto gc: running with gc() first
First run in 2.08 seconds
Looping to 5 seconds
Parsing Rprof
Done
Ticks: 1676; Iterations: 3; Time Per: 2.009 seconds; Time Total: 6.026 seconds; Time Ticks: 1.676
milliseconds
write_vc ---------------------------------------------------- : 2009 - 0
write_vc.character -------------------------------------- : 2009 - 0
meta ------------------------------------------------ : 1435 - 0
| meta.data.frame --------------------------------- : 1435 - 0
| anyDuplicated ------------------------------- : 1332 - 0
| | anyDuplicated.data.frame ---------------- : 1332 - 0
| | anyDuplicated ----------------------- : 1332 - 0
| | do.call ------------------------- : 1332 - 0
| | <Anonymous> ----------------- : 1332 - 0
| | mapply ------------------ : 1332 - 241
| as.data.frame ------------------------------- : 41 - 0
| | as.data.frame.list ---------------------- : 41 - 0
| | do.call ----------------------------- : 41 - 0
| | <Anonymous> --------------------- : 41 - 0
| | row.names<- ----------------- : 16 - 0
| | | row.names<-.data.frame -- : 16 - 0
| | anyDuplicated --------------- : 14 - 0
| | | anyDuplicated.default --- : 14 - 14
| | as.data.frame --------------- : 11 - 0
| | as.data.frame.integer --- : 11 - 2
| lapply -------------------------------------- : 31 - 0
| | FUN ------------------------------------- : 31 - 11
| | meta -------------------------------- : 14 - 0
| [ ------------------------------------------- : 31 - 0
| [.tbl_df -------------------------------- : 31 - 0
| do.call ----------------------------- : 26 - 0
| <Anonymous> --------------------- : 26 - 0
| do.call --------------------- : 25 - 0
| order ------------------- : 25 - 25
write.table ----------------------------------------- : 531 - 0
| .External2 -------------------------------------- : 526 - 526
hashfile -------------------------------------------- : 41 - 41
If you'd like you can take advantage of the radix sort that the data.table authors contributed to R, especially since you are already ordering the data:
# license: GPL >= 2
duplicated2 <- function(dat) {  # `dat` must be ordered!!!
  if (nrow(dat) > 1) {
    dupes <- rowSums(
      vapply(dat, function(x) x[-1] == x[-length(x)], logical(nrow(dat) - 1))
    ) == ncol(dat)
    c(FALSE, dupes)
  } else if (nrow(dat) == 1) FALSE else logical()
}
dat <- ggplot2::diamonds
sorting <- names(dat)
dat.o <- dat[do.call(order, unname(dat)),] # you do this anyway
system.time(anyDuplicated(dat.o))
user system elapsed
1.466 0.019 1.496
system.time(dupes <- duplicated(dat.o))
user system elapsed
1.319 0.017 1.344
system.time(dupes2 <- duplicated2(dat.o))
user system elapsed
0.024 0.011 0.036
all.equal(dupes, dupes2)
[1] TRUE
I have not thoroughly tested this or thought through corner cases.
I happen to be writing a blog post about the newfound powers of `order`, which is why I am aware of this. Prior to the `data.table` contribution this type of improvement would not have been possible. Additionally, it turns out that you need to order your data anyway, so this is a win-win.
`write_vc` requires repetitive inputs. I mentioned this previously, where one could substitute for a variable name to avoid things such as:
write_vc(iris, 'iris')
The author brings up the `magrittr` usage, but it's easy enough to fall back to requiring input if the name is non-syntactic. This is an example of the type of complexity I mentioned in my original review comment. Note that the function below requires more validation than what is done here.
write_vc2 <- function(x, file=substitute(x)) {
if(is.name(file)) {
file <- as.character(file)
if(!grepl("[a-zA-Z][a-zA-Z0-9]*", file)) file <- NA_character_
} else if (!is.character(file)) file <- NA_character_
if(is.na(file)) stop("`file` must be a syntactically valid file name")
file
}
write_vc2(iris)
[1] "iris"
iris %>% write_vc2
Error in write_vc2(.) : `file` must be a syntactically valid file name
iris %>% write_vc2('iris')
[1] "iris"
You'll need to think about the regex.
I may be too particular about this type of thing, but as a user I really dislike unnecessary work, and I can see how I would get very annoyed at having to type things like `write_vc(iris, 'iris')` a lot when all the information is already there for the function to deduce my intent.
For this type of error message that can be resolved with `strict=FALSE`, it might be helpful to add the resolution pathway, with a warning to ensure that's what you actually intend to do:
Error: new data uses more variables for sorting
e.g. as below or a variation to below:
Error: new data uses more variables for sorting; if you really intend to sort by
`c('Sepal.Length', 'Sepal.Width')` instead of `c('Sepal.Length')` use
`strict=FALSE`.
Somehow at some point the yml file got corrupted. I have no idea what I was doing or how I got to this. I can't reproduce it. I'm reporting just in case you run into it too so that at least you know it's happened to someone else:
> write_vc(iris, 'iris')
Error in yaml.load(string, error.label = error.label, ...) :
(<redacted>/iris.yml) Scanner error: while scanning a simple key at line 15, column 5 could not find expected ':' at line 16, column 1
> list.files(pattern='do-not-remove-me')
character(0)
> writeLines('I am extremely important!', 'do-not-remove-me.yml')
> list.files(pattern='do-not-remove-me')
[1] "do-not-remove-me.yml"
> prune_meta('.', '.')
> list.files(pattern='do-not-remove-me')
character(0)
We just nuked our critical yml file. While the very careful user would realize this, many would not. To highlight that this is a real-life issue, I just noticed this:
Unstaged changes:
Deleted: _pkgdown.yml
Deleted: appveyor.yml
Please see the top level File Storage comments.
The caveats noted in the workflow vignette need to appear in the function documentation as well. I'm thinking in particular about:
Caveat: when applied on a path, it will remove all data files, without warning. Even when the path points to a git repository. So use rm_data() and prune_meta() with care.
More generally, what is the purpose of being able to `rm_data` the data separately from the meta data, either in the current guise or in the pre-`prune_meta` guise? It seems in the examples we are either over-writing data with new data, or deleting both the data and the meta data.
You could simplify the user-facing interface by replacing both functions with `rm_vc` to remove both the yml and tsv files at the same time. If the concern is that you are trying to manage a situation where there is a mix of non-vc yml and tsv files floating around with the vc ones, then I think there are bigger issues, as noted above.
Adding a `pattern` parameter that matches that of `list.files` could be useful.
The title should be clearer that it is factor levels in `git2rdata` meta data that are relabelled, not the factor itself.
The description stops short of actually describing what the function does.
Add a reference to the "Relabelling a Factor" section of the "Optimizing storage for ..." vignette, as you do in `rm_data` for the workflow.
(I have additional suggestions listed below, but my approval is not contingent on their completion.)
A reminder that I was not able to fully test all functionality claims.
I recommended updating the README to reflect the explanation of the problem that the software is designed to solve provided in the previous version of the vignette. The README has been updated. I would suggest further editing the Rationale section so that it is framed in terms of problems and solutions.
Again, I see three main rationales:
1) version control of data
2) optimization of data storage
3) meta data storage to preserve data types, factor levels, etc.
Currently more attention is paid in the README to meta data (3) rather than version control (1) and optimization (2).
Finally, I recommend adding another section, perhaps called "Getting Started," to the README that provides a short summary of the contents of each vignette. As such, the theory/rationale for the package would be separate from the how-to.
I mentioned that `build_vignettes = TRUE` wasn't working. The installation instructions were updated with clear instructions that include a workaround for installing the vignettes. (Changes are happening with the `remotes` package, so I wouldn't be surprised if the method for installing vignettes changes again soon. Mentioning as a heads-up to keep an eye on this, especially given the high quality of the vignettes -- important to make sure that they're seen!)
I noted that more comments would be helpful to guide the user through the examples. Clear examples with comments are now provided.
I noted that there was no CONTRIBUTING file nor contributing guidelines in the README. A CONTRIBUTING file was added to the `.github` folder with a link in the README. (I think the root directory is a better place for CONTRIBUTING since it's more likely to be seen earlier; as far as I can tell, it won't pop up from `.github` until the contributor submits a pull request.)
I asked about the benefits of the optimization that the package provides. An excellent vignette, Efficiency in terms of storage and time, was added that addresses this issue in depth.
I suggested changing the package name. The author considered the issue and holds that the name is appropriate.
`rm_data()` and `prune_meta()`
I asked about the benefit of using `rm_data()`. The author responded: "The `rm_data()` and `prune_meta()` are useful in an automated workflow where changes in the raw data can lead to the removal of data sets. First apply `rm_data()` to remove all tsv files. Then write all the data using `write_vc()`. This will (re)create the tsv files which are still needed. Since all tsv were removed at the start, the ones we don't need any more are no longer present. The final step is to use `prune_meta()` to remove leftover yml files with no matching tsv file...."
Clearly `prune_meta()` serves a purpose in looking for matches. I am also convinced that `rm_data()` is useful; it's harder than I remembered to remove all files of a certain type from R. (You can't do `file.remove("*.tsv")`, for example.) These functions have additional benefits such as defaulting to `recursive = TRUE`.
I asked about an option to turn off sorting. The author responded: "Making the sorting optional is OK when not storing the files under version control. However, then it is up to the user to use the sorting when writing to a file path and doing the version control outside `git2rdata` [...] diffs..." Currently a warning is given if no sorting parameter is supplied. I believe that previously an error was thrown; I think the warning is the better option here.
`pull()` conflict

The `pull()` function re-exported from `git2r` conflicts with `dplyr::pull()`. Not much you can do here, just flagging it.
Should some of the warnings perhaps be messages instead? "Warning" seems too strong for example in this case:
library(git2rdata)
write_vc(iris[1:6, ], file = "iris", sorting = "Sepal.Length")
#> 11d423327dc1f67e16183dbd9871ad931d3b9689
#> "iris.tsv"
#> fb962b490dbe4038512e52b72b8182e11b49a60d
#> "iris.yml"
write_vc(
iris[1:6, -2], "iris", sorting = "Sepal.Length", strict = FALSE
)
#> Warning in write_vc.character(iris[1:6, -2], "iris", sorting = "Sepal.Length", : new data has a different number of variables
#> deleted variables: Sepal.Width
#> 1a649561f36566acf21c2f167e69abc54907badb
#> "iris.tsv"
#> af0219b9d8ebdea4d30c833b14d3fadc22a2f750
#> "iris.yml"
Created on 2019-04-17 by the reprex package (v0.2.1.9000)
It appears to be common practice to include warnings with errors though it seems to me it would be better not to get the warnings, so that the error isn't lost in the shuffle. For example:
> write_vc(iris[1,], "iris")
1d7d77ee4b27932996ebcfcba2c25ec9c5aa7630
"iris.tsv"
38fdea6690db69a1da66fa5f06f47b37cf3cfccc
"iris.yml"
Warning message:
In meta.data.frame(x, optimize = optimize, na = na, sorting = sorting) :
no sorting applied
> write_vc(iris[,2:4], "iris" )
Error: new data has a different number of variables
deleted variables: Sepal.Length, Species
In addition: Warning message:
In meta.data.frame(x, optimize = optimize, na = na, sorting = sorting, :
no sorting applied
Also, it would be helpful to get a clear message that on error the data file was not changed.
read_vc() has the same examples as write_vc(), which makes sense since they work together, though you could cut the write_vc() examples that aren't followed by read_vc().
"Getting started"
Small comment:
library(git2rdata)
write_vc(x = x, file = "first_test", root = path, strict = FALSE)
produces the "no sorting applied" warning described above. It would be helpful if a link were provided here to the optimization vignette where this is clearly explained. (I tend to think something went wrong if I get a warning running vignette code without a heads-up.)
Overall, the vignettes are incredibly informative and add great value to the package. As noted above, it would be helpful to provide short summaries of the vignettes in the README, with a suggested order, such as:
"Getting Started": ...
"Optimizing storage for version control" ...
"Suggested workflow for storing a variable set of dataframes under version control" ...
"Efficiency in terms of storage and time" ... performance benchmarks...
Thanks to both @brodieG and @jtr13 for your detailed reviews. Your comments allow me to improve the quality of the package. I'll look into them and notify you when the next version is available. I'll be giving an oral presentation on git2rdata at useR!2019. It would be nice to have the package on rOpenSci by then.
Below is a response to a few issues raised by @brodieG. Can I get some feedback on this, as some of these potentially require a major overhaul of the package?
git2r: We mentioned the name git2rdata and the author of git2r gave a thumbs up.

Using a custom file format has indeed the benefit that the user is less likely to tamper with the (meta)data. However, a comment on an earlier version explicitly asks to store the data in an open standard format. IMHO, the benefits of using a clear open standard outweigh making it tamper-proof. Imagine the case in which the data is sent to someone without knowledge of git2rdata, or somebody who wants to use the data with other software. This is doable with the current format. Using unknown 'scary sounding' extensions would make that much harder, even when the data is stored in plain text format.

git2rdata does not aim to prevent tampering with the data. It rather makes it easier to document the history using a VCS.
Currently, I haven't tested how read_vc() reacts when the data and metadata no longer match. Some checks are easy to implement in read_vc(), e.g. changes in the metadata, new variables added to the data, missing variables in the data, a new order of the variables in the data, the order of the observations, ...
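A sketch of what such a check could look like (illustrative only; the real metadata layout in git2rdata may differ):

```r
# Compare the columns recorded in the metadata with the columns actually
# present in the data, and fail loudly on any mismatch.
check_columns <- function(data, meta_cols) {
  missing_vars <- setdiff(meta_cols, names(data))
  new_vars <- setdiff(names(data), meta_cols)
  if (length(missing_vars)) stop("missing variables: ", toString(missing_vars))
  if (length(new_vars)) stop("new variables: ", toString(new_vars))
  if (!identical(names(data), meta_cols)) stop("variables reordered")
  invisible(TRUE)
}

check_columns(iris, names(iris))              # silently passes
try(check_columns(iris[, -2], names(iris)))   # error: missing Sepal.Width
```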
I'm not sure if we should prevent users from making changes to the (information content of the) data itself using other software. Think of using the GitHub website to correct typos in the data, to remove observations from the data or even to add a few observations. If these changes corrupt the data, the user can revert them using the VCS or the backup.
The rm_data() and prune_meta() combo are only relevant in a workflow using a VCS in which a variable number of objects is written. Using these functions in combination with a VCS is semi-safe as the user can revert the changes.

I could make these functions safer by adding something like a git2rdata.index file. Such a file would record all files written by git2rdata. rm_data() and prune_meta() can then use this information to limit the removal of files to those created by git2rdata.
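The proposed index could be as simple as a text file of relative paths; a base-R sketch (hypothetical, not part of the package):

```r
# Hypothetical index mechanism: every write registers the file; pruning is
# restricted to files that appear in the index.
index_path <- function(root) file.path(root, ".git2rdata.index")

register_file <- function(root, file) {
  idx <- index_path(root)
  known <- if (file.exists(idx)) readLines(idx) else character(0)
  writeLines(union(known, file), idx)
}

removable_files <- function(root, candidates) {
  idx <- index_path(root)
  known <- if (file.exists(idx)) readLines(idx) else character(0)
  intersect(candidates, known)  # never touch files git2rdata did not create
}

root <- tempfile("git2rdata_demo")
dir.create(root)
register_file(root, "iris.tsv")
removable_files(root, c("iris.tsv", "manual_notes.txt"))  # -> "iris.tsv"
```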
IMHO, the issue is rather that print.data.frame doesn't show the time zone.
library(git2rdata)
x <- data.frame(TS = as.POSIXct('1970-01-01'))
fn <- write_vc(x, 'hello')
## Warning in meta.data.frame(x, optimize = optimize, na = na, sorting =
## sorting, : no sorting applied
z <- read_vc('hello')
z
## TS
## 1 1969-12-31 23:00:00
z$TS
## [1] "1969-12-31 23:00:00 UTC"
Sys.timezone()
## [1] "Europe/Brussels"
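This can be made explicit with format(), which controls the timezone used for display independently of the print method:

```r
# print.data.frame() omits the timezone suffix, but format() can show it:
x <- as.POSIXct("1970-01-01", tz = "UTC")

format(x, usetz = TRUE)                          # -> "1970-01-01 UTC"
format(x, tz = "Europe/Brussels", usetz = TRUE)  # -> "1970-01-01 01:00:00 CET"
```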
For the badge I'm going off of the guide:
Badges for continuous integration and test coverage, the badge for rOpenSci peer-review once it has started (see below)
It's not critical that you have a custom "format" for git2rdata. I will note though that the "format" I propose still stores the data in open format in a completely accessible way. Diffs will show it, CSV readers will be able to read it, etc. I'm pretty sure that the concern would only really exist in the case where the above isn't true, where you need specialized tools to access the data.
Fundamentally the git2rdata object is a new non-standard format as it is the combination of the data and the meta data. By putting the files together in a folder you are making that format explicit and giving the signal to users that if you mess with the data, you may well break the git2rdata object. You can even have a README.txt in the folder explaining this instead of using a scary extension. This is probably better anyway, and will allow recipients access to the data with the appropriate warning.
Finally, if you still prefer to have the yml and tsv files floating around, I won't stop you so long as you take measures to ensure that the file pruning and deletion only affects files that were created by git2rdata
. Do note that there is an inherent fragility at having an additional meta data file floating around. What happens if someone sends you a new data/meta-data set? If you have a folder then you are all set. An alternative would be to add a UUID in the yml so that you can be sure (or at least very certain) that the yml actually belongs to git2rdata. Additionally, it will be particularly important that you check for data / meta-data corruption, although I would personally put those checks in under any implementation.
To recap, based on your feedback and further thought, this is what I would have the folder look like for e.g. iris:
iris (or maybe iris.git2rdata)
+ README.txt
+ git2rdata.version # text file with git2rdata version, see below
+ iris.tsv
+ iris.yml # contains hash of data file
The git2rdata.version
file will allow you to make changes in the future to the directory format as you can check it on read, and if necessary, convert to the new format.
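A minimal sketch of the read-time check this enables (hypothetical helper; the conversion step itself is elided, and the version number is just an example from this thread):

```r
# Hypothetical: compare the stored on-disk format version with the current
# one and upgrade the directory layout when it is older.
read_with_upgrade <- function(dir, current = package_version("0.0.4")) {
  vfile <- file.path(dir, "git2rdata.version")
  stored <- if (file.exists(vfile)) {
    package_version(readLines(vfile)[1])
  } else {
    package_version("0.0.0")  # directory predates the version file
  }
  if (stored < current) {
    message("on-disk format ", stored, " is older; converting to ", current)
    # ... convert the directory layout here ...
    writeLines(as.character(current), vfile)
  }
  invisible(stored)
}

d <- tempfile("git2rdata_dir")
dir.create(d)
writeLines("0.0.1", file.path(d, "git2rdata.version"))
read_with_upgrade(d)  # messages about the conversion and bumps the version file
```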
Version 0.0.4 is ready for review.
- write_vc() stores the git2rdata version number to the metadata. Use upgrade_data() to update existing data.
- read_vc() checks the metadata hash. A mismatch results in an error.
- is_git2rmeta() validates metadata.
- list_data() lists files with valid metadata.
- rm_data() and prune_meta() remove files with valid metadata. Other files are untouched.
- list_data(), rm_data() and prune_meta().
- write_vc() and relabel() handle empty strings ('') in characters and factors (#24).
- read_vc() no longer treats # as a comment character.
- read_vc() handles non-ASCII characters on Windows.
- root containing regex expressions.
- vignette("workflow", package = "git2rdata").
- vignette("efficiency", package = "git2rdata")
- vignette("plain_text", package = "git2rdata")
- print.data.frame() doesn't show the timezone.
- root is removed from a file path without using regex.
- if (isTRUE(x)) { is replaced by is.flag(x); if (x) {.
- The tz argument is removed from format() in meta.Date().
- order() uses unname(x[sorting]).
- vignette("workflow", package = "git2rdata") gained a section "Structuring Git2rdata Objects Within a Project".
- git2rdata will result in a warning. Since we recommend using version control, the user can revert unwanted changes using version control.
- git2rdata helps in making the analysis reproducible and traceable. It doesn't enforce it; that would be hard to do without forcing the user into a particular workflow. vignette("workflow", package = "git2rdata") suggests two possible workflows.
- git2rdata:::meta.character(1:10).
- rm_data() and prune_meta() vignette("workflow", package = "git2rdata").
- base, readr and data.table all require the user to specify the filename.
- .github as suggested in inbo/git2rdata#10.
- strict = FALSE
- read_vc() and write_vc()
- man-roxygen

From a quick skim this looks good. I'll review more thoroughly sometime next week and report back.
Looks good to me. I did not test all the new functionality, but did confirm that the accidental yml removal which was the blocking issue is fixed. It also does look like all the author-accepted suggestions are indeed in the new version, and I have no concerns with the author-rejected suggestions. This was all tested against v0.0.4 tag.
Nice job finding corner cases in the new duplicate function.
Some minor comments, none of which gets in the way of approval:
@noamross next steps?
Approved! Thanks @ThierryO for submitting and @brodieG and @jtr13 for your reviews! I know this has been a long road, but it's been very thorough and we're glad to have this useful and well-vetted package in our collection. @ThierryO, should you choose to address the remaining small comments, you may, but we are good to go.
To-dos:

- ropensci URL. We no longer transfer Appveyor projects to the ropensci Appveyor account, so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.
In our developer's guide, this chapter starts the 3rd section that's about guidance for maintenance after acceptance.
Welcome aboard! We'd love to host a blog post about your package - either a short introduction to it with one example or a longer post with some narrative about its development or something you learned, and an example of its use. If you are interested, review the instructions, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.
Thanks for accepting the package.
And thanks to @brodieG and @jtr13 for their reviews, which improved the package a lot. I'm happy to acknowledge this by adding you as reviewers in the DESCRIPTION.
The README on GitHub starts by referring to the pkgdown website of the package. There the vignette() code is automatically linked to the HTML vignette. Such a link is IMHO more stable than a hard-coded link. Is this sufficient or do you prefer a hard-coded link? Maybe the message on the pkgdown website should be more prominent.
I'll give a 15' talk on git2rdata at the upcoming useR! 2019 conference. I could rework that presentation into a blog post. Would that be a good idea @stefaniebutland?
I did miss the message about the pkgdown site, or didn't make the connection in the next section. It's probably fine as it is. FWIW that link is broken right now.
I'll give a 15' talk on git2rdata at the upcoming useR! 2019 conference. I could rework that presentation into a blog post. Would that be a good idea @stefaniebutland?
That sounds good @ThierryO. Since useR! is July 9-12, we could publish prior to your talk so that people could refer to it during the conference. In the talk you could invite people to submit their use cases to our discussion forum. That might get your work and the package some extra attention. After your talk, you could add any new insights as a comment on the post.
To publish before useR!, let me know and submit a draft for review by Tuesday July 2. To publish after, submit a draft when you are ready.
Summary
What does this package do? (explain in 50 words or less): Store dataframes as plain text files along with metadata to ensure that attributes like factor levels are maintained. The dataframes are optimized in order to minimize both file size and diffs, making it useful in combination with version control.
Paste the full DESCRIPTION file inside a code block below:
https://github.com/inbo/git2rdata
reproducibility, because it helps to store dataframes as plain text files without loss of information, while minimizing file sizes and diffs.
Anyone who wants to work with medium-sized dataframes and have them under version control. This is useful in case of recurrent analyses on growing datasets, e.g. each year new data is added and a new report is created. A snapshot of the raw data can be stored as plain text files under version control.
The initial idea was to add this functionality into git2r. After a discussion with the maintainer, we decided to create a separate package. We use functions from read_r and improve them by storing the metadata.

Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.

Detail
[X] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

[X] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
We use read_vc() rather than vc_read() because it matches read.table(), read_csv(), ...

If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
@jennybc @stewid