Hi @cboettig. Thanks for the submission. I'll be editing this package.
I get errors in the tests or examples when running devtools::check(cran = FALSE, run_dont_test = TRUE). My guess is that those errors are in the \dontrun{} sections of the examples. Could you please fix those errors before the review starts?
One issue raised by goodpractice::gp(). Could you look into this please?
Please also fix this R CMD check NOTE: Namespace in Imports field not imported from: ‘dbplyr’. All declared Imports should be used.
I also note some genuine spelling errors. You may wish to look at devtools::spell_check() yourself.
avialable ark.Rd:29
databaase unark.Rd:5,89
ouput ark.Rd:12
I will be looking for reviewers.
@lmullen I continue to learn tricks from you! I should remember to run the \donttest examples and spell checks. I think all these issues should be fixed in the current master: https://github.com/cboettig/arkdb/commit/1a33432c71befda38364be05163e7ad275d11d94
@cboettig Thanks for fixing those. Everything looks good to start the review, and I've solicited reviewers.
Reviewers: @richfitz and @expectopatronum. Due date: 2018-07-06
Thanks to Verena and Rich for being willing to do this review. I've set a due date 3 weeks from now. Here are the reviewer's guide and the reviewer's template.
@cboettig Could you please add the rOpenSci review badge to your repository if you haven't done so already?
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, including DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS:
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a paper.md matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 3
I was able to install the package without errors on Windows 10. I ran the following commands with the following output:
Warning message: vignette ‘arkdb’ not found
I did not yet tick the check box next to Packaging guidelines. As far as I can see most/all required items are fulfilled, but I have a few remarks:
@return is not documented for ark.

Additionally I have a few minor remarks that I would change in the code (unless I miss something and this is the better way to do it):
Use while (start <= end) instead of repeat (in ark.R:71); unless I miss some corner case there is no need for a repeat loop.

In general it is a well-written, light-weight package. The intention of the package is clear and the documentation useful. To highlight just a few of the things I liked:
Thanks for the helpful review, @expectopatronum.
@expectopatronum Thanks for this awesome review :tada:. Very helpful. I thought I'd wait for @richfitz's comments, but I might as well check these off now and ask a few follow-up questions. (Rich has given me a hint meanwhile on at least one other thing I should improve.)
- no documentation/help for ?arkdb is provided
Good catch, fixed!
- consider adding tests for error cases
I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?
@return is not documented for ark

The order of parameters in ark(db_con, dir, lines) and unark(files, db_con, lines) does not seem intuitive to me. Unless it makes more sense for piping (magrittr) with db_con %>% ark and files %>% unark, I would change the order to (db_con, dir/files, lines).
Grouping these, since they are related (ark returns a directory/path, now documented!; unark returns the db_con). I think there's a good reason for this, but I see the confusion too. As you noted, I expect most uses not to actually use the return object, so it is returned invisibly, as something that could potentially be used in piping. For instance, imagine using arkdb to migrate from, say, a MySQL DB (ick!) to a MonetDBLite database:
old <- dbConnect(MySQL(), ...)
new <- dbConnect(MonetDBLite(), "my_db")

old %>%
  ark(tempdir()) %>%
  fs::dir_ls() %>%
  unark(new)
This is why ark returns dir while unark returns db_con, and is also why ark takes the arguments in the order db_con, dir, while unark takes files first and db_con as the second argument. I could be talked out of this, as I can see how reversing the order is confusing, but I think of it as: the first argument is the input (as required by piping), and the second is the destination. ark operates on an existing database and outputs files, unark operates on files. What do you think?
repeat vs while

Yeah, looks like I could have used a while loop -- but as this is the block of code I stole from @richfitz, I'll let him comment, as there may be a good case for repeat. I'm not well-informed on the difference here.
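For reference, a minimal standalone illustration of the two forms (a hypothetical counter loop, not the actual ark.R chunking code):

# Illustration only, not the ark.R chunk loop: a repeat loop with an
# explicit break condition...
i <- 1
repeat {
  if (i > 3) break
  message("chunk ", i)
  i <- i + 1
}

# ...reads a little more directly as a while loop:
i <- 1
while (i <= 3) {
  message("chunk ", i)
  i <- i + 1
}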
Okay, pushing these few edits, lemme know what you think!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
RGF: As @cboettig notes above, this is adapted from a similar bit of code I wrote a few months back but I don't think that counts as a COI as I regularly hate on my own code...
The package includes all the following forms of documentation, including DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS:
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a paper.md matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 1
I have not read the other review as I write this review (and having now looked through the other review, it looks like Carl already mostly dealt with those comments). Apologies this is so late.
This package provides an interface for loading data into a database from disk, and extracting data from a database to disk, via R, without needing to do the entire operation in memory. This will help with very large databases (on the order of hundreds of millions of rows or more) where fitting data into memory is not an option. It will also increase the responsiveness of such operations (probably at the cost of overall speed) making large scale import and extraction from R a more pleasant experience.
This package, with about a hundred lines of code and two exported functions, fits a "micropackage" description, which makes reviewing a lot easier.
The biggest flaw I see with the package is that it violates the Open–closed principle (I know this is not OOP, but I feel this is one that really transcends any paradigm). Carl has used readr::write_tsv and readr::read_tsv for the archive/import - presumably they satisfy his personal preferences for speed/reliability/future compatibility. But there's no real reason why utils::read.csv / utils::write.csv might not be used, or data.table::fread, or probably even something like feather (I don't really see that readr is a necessary strong dependency here either).
The package could be extended to take either three arguments, or some sort of wrapper object, that includes an extension and two functions (read/write). This interacts somewhat with the compression handling, but I think it's probably still doable. As it is, if someone has an aversion to tsv (e.g., they want to point something that requires csv at the archived database) they cannot use the package without changing the source.
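To sketch what I mean, such a wrapper could be as simple as the following (hypothetical names, not an existing arkdb API):

# Hypothetical sketch of a pluggable format wrapper; tsv_format / csv_format
# and a format argument are illustrative, not part of arkdb.
tsv_format <- list(
  extension = "tsv",
  read  = function(file, ...) readr::read_tsv(file, ...),
  write = function(x, path, ...) readr::write_tsv(x, path, ...)
)

csv_format <- list(
  extension = "csv",
  read  = function(file, ...) readr::read_csv(file, ...),
  write = function(x, path, ...) readr::write_csv(x, path, ...)
)

# ark()/unark() could then call format$write()/format$read() rather than
# hard-coding readr::write_tsv()/readr::read_tsv().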
The example in the README produces a significant number of warnings: Warning: Closing open result set, pending rows. These should be silenced if expected, eliminated if not.
The progress and general chattiness need to be tuneable (e.g., with a verbose and a progress argument). Ideally both would be tuneable, as progress bars are not very log-friendly but messages are (you can see this play out a bit with the vignette build - the progress output ends up on the parent terminal for me).
The log output as it stands is not incredibly informative after it has been run:
Exporting in 50000 line chunks:
airlines
...Done! (in 0.005350828 secs)
Exporting in 50000 line chunks:
airports
...Done! (in 0.02871299 secs)
Exporting in 50000 line chunks:
flights
[\] chunk 7...Done! (in 5.914837 secs)
Exporting in 50000 line chunks:
planes
...Done! (in 0.05566597 secs)
Exporting in 50000 line chunks:
weather
...Done! (in 0.4096012 secs)
Consider a little bit of reordering and indenting to get to
Exporting 'airlines' in 50000 line chunks:
...Done! (in 0.005350828 secs)
Exporting 'airports' in 50000 line chunks:
...Done! (in 0.02871299 secs)
Exporting 'flights' in 50000 line chunks:
[\] chunk 7...Done! (in 5.914837 secs)
(and so on)
How is the default of lines chosen? Is there some heuristic for best performance? My gut instinct is that as it increases in size, the performance will get better, to a point. 10K lines might well be at the point where increasing the chunk size has limited benefit, but I am curious if you looked? (I think I settled around a similar number - perhaps 50K in my application.)
In ark.R
if (is(db_con, "SQLiteConnection") |
is(db_con, "MySQLConnection") |
Sys.getenv("arkdb_windowing") == "FALSE") {
query <- paste("SELECT * FROM", tablename, "LIMIT",
lines, "OFFSET", (start-1)*lines)
} else {
## Postgres can do windowing
query <- paste("SELECT * FROM",
tablename,
"WHERE rownum BETWEEN",
(start - 1) * lines,
"AND",
start * lines)
}
(The | should be ||.) But this also surely is not a complete test? There will be other DBI wrappers (thinking here of MariaDB; there are at least two Postgres DB wrappers, so there may be similar duplication among MySQL). I think that the polarity of this test is also incorrect. Every SQL dialect can do the first form, so that should be the fallback, not the Postgres extension. So rather than testing "known not to support", test for "known to support". You might also include a package-cached capability test (keyed against class) for a SELECT * FROM table WHERE rownum BETWEEN 1 AND 2 - if you don't get a syntax error it should be ok to use.
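For instance, a cached capability test could look roughly like this (hypothetical helper, not arkdb code):

# Cache keyed against the connection class, stored in a package-level
# environment; the probe query follows the "known to support" idea above.
.arkdb_capabilities <- new.env(parent = emptyenv())

has_window_support <- function(db_con, tablename) {
  key <- class(db_con)[[1]]
  if (is.null(.arkdb_capabilities[[key]])) {
    .arkdb_capabilities[[key]] <- tryCatch({
      DBI::dbGetQuery(db_con, paste("SELECT * FROM", tablename,
                                    "WHERE rownum BETWEEN 1 AND 2"))
      TRUE
    }, error = function(e) FALSE)
  }
  .arkdb_capabilities[[key]]
}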
In unark.R:normalize_con:

if (is(db_con, "src_dbi")) {

I thought that inherits was still the way to test this?
Your postgres windowed code is not tested. You might try that on travis?
Your normalize_con code is not tested for the non-src_dbi path.
I am confused by the point of the untested early exit in unark_file - when would this be wanted?
There are also a few error paths untested (see covr output)
In unark.R
d <- reader()
body <- paste0(c(header, d$data), "\n", collapse = "")
p$tick()
chunk <- readr::read_tsv(body, ...)
DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)
I might be missing something but surely one can compute the headers ahead of time (0'th iteration) and then do:
chunk <- reader()
names(chunk) <- header
p$tick()
DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)
with modifications to read_chunked like:
read_chunked <- function(con, n) {
assert_connection(con)
next_chunk <- readLines(con, n)
if (length(next_chunk) == 0L) {
# if we don't stop, we will hit an error!
stop("connection has already been completely read")
}
function() {
data <- next_chunk
next_chunk <<- read_tsv(con, n, header = FALSE) # <-- change here
complete <- length(next_chunk) == 0L # <-- probable change here too
list(data = data, complete = complete)
}
}
(This is quite possibly something I could do in my code that this comes from too!)
I would probably be inclined to split ark_chunk into a "read" and a "save" step.
@richfitz and @expectopatronum Thank you both for your thorough reviews.
@cboettig: You've already started on Verena's review, of course, but could you please address these two reviews and make any necessary changes by July 30?
@richfitz Thanks, this is excellent and the detail much appreciated! Addresses a bunch of things that have puzzled or bugged me so really appreciate your expertise here.
Just skimming the review for now, but one point where I could use a bit of additional feedback is the hard-wiring of the read_tsv() bit. I agree that should be abstracted out (it's issue #3), but I could use a bit more guidance/ideas on how to make it more general, and how general it is worth making.
For instance, supporting other delimited calls from readr should be simple, since they all take the same argument structure. I think you're right about base types, though currently the read_ side of the code relies on readr's algorithm to determine the compression type. Could do this by parsing the file name instead; not sure I want to introduce a dependency on libmagic / wand just to detect compression type by magic number...
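For example, a crude file-name-based guess would be something like this (hypothetical helper, just to sketch the idea):

# Hypothetical sketch: guess compression from the file extension instead of
# relying on readr or magic numbers.
guess_compression <- function(path) {
  switch(tools::file_ext(path),
         gz  = "gzip",
         bz2 = "bzip2",
         xz  = "xz",
         "none")
}

guess_compression("flights.tsv.bz2")  # "bzip2"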
Other parsers get more tricky -- e.g. I wanted to support the fst package to considerably speed up the process, but it doesn't seem to support the notion of append. I haven't looked into feather feasibility-wise, but I think its authors say it's not a stable/archival format anyway, which cuts against the spirit of ark (and it is meanwhile slower than other modern binary formats like fst). Of course compression is a bigger issue with many of these, particularly those that do not operate over the standard R connection. Ideas here are welcome, but I'd be tempted to punt on this until a later date....
A somewhat related challenge that's not directly raised by the reviews is the issue of type stability. While some of the binary formats I just mentioned support this out of the box, types (dates, factors, integer vs numeric, etc.) can change across a write-out / read-in round trip. At the moment I feel this is an out-of-scope issue for a first release, but maybe a metadata file should be written with column types when archiving a database? (This could work within a family like readr; it's harder to support automatically across different engines -- the heterogeneity of file-based I/O really makes me appreciate the standardization provided by DBI / ODBC. On the database side the interoperability just works without me having to do anything!)
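As a rough sketch of that metadata idea (not implemented, and readr-specific): capture the column spec when archiving and hand it back on import:

# Hypothetical sketch only: record the readr column specification when
# archiving, so a later unark could restore the same column types.
chunk     <- readr::read_csv(readr::readr_example("mtcars.csv"))
col_spec  <- readr::spec(chunk)
spec_file <- tempfile(fileext = ".col_spec.rds")
saveRDS(col_spec, spec_file)   # written alongside the archived table

# Later, on re-import, supply the saved spec so column types survive:
chunk2 <- readr::read_csv(readr::readr_example("mtcars.csv"),
                          col_types = readRDS(spec_file))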
Up against a deadline now but should be able to get this done by July 30.
@cboettig - I've sketched out a proof-of-concept for generalising away from tsv here https://github.com/cboettig/arkdb/pull/5
While doing that I noticed this section:
compress <- match.arg(compress)
rs <- DBI::dbSendQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))
size <- DBI::dbFetch(rs)
end <- size[[1]][[1]]
SELECT 1 FROM for efficiency, I believe.
DBI::dbGetQuery(db_con, ...)?

@richfitz Thanks much! I'll take a look tonight.
Quick question: do you mean SELECT COUNT(1) FROM? My SQL is terrible, but my understanding is that it's the same to most interpreters (though I believe COUNT doesn't count NA/NULLs; not sure, but perhaps I avoid this a bit by selecting all columns?). Remarkable how difficult it is to ask a DB how many rows a table has -- apparently there are other DB-specific ways that are only approximate. I guess it comes from a world where DBs are constantly being written to and the notion of a number of rows is a more ephemeral concept anyhow....
Re dbGetQuery ... because I don't use DBI much and don't know how it works. Is that the same as dbSendQuery + dbFetch (i.e. when no lazy eval of queries is desired, like here)?
Don't listen to me on the SELECT 1 thing - I should know better than to suggest micro-optimisations without checking first:
microbenchmark::microbenchmark(
carl = DBI::dbGetQuery(db$con, "SELECT COUNT(*) FROM flights"),
rich = DBI::dbGetQuery(db$con, "SELECT COUNT(1) FROM flights"))
## Unit: milliseconds
## expr min lq mean median uq max neval
## carl 2.404747 2.536132 2.818021 2.687870 3.002489 4.136245 100
## rich 8.938237 9.306043 9.851665 9.705794 10.161812 12.921496 100
We ran into this problem at work but the tables in question have something on the order of 4 billion rows - so the approximate counts get quite useful...
Re dbGetQuery - use the approach as above though; otherwise (from ?dbSendQuery):
The ‘dbSendQuery()’ method only submits and synchronously executes the SQL query to the database engine. It does not extract any records - for that you need to use the ‘dbFetch()’ method, and then you must call ‘dbClearResult()’ when you finish fetching the records you need. For interactive use, you should almost always prefer ‘dbGetQuery()’.
So you're missing the dbClearResult(), which would ideally go inside a tryCatch or withCallingHandlers, at which point you're better off just using dbGetQuery...
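i.e. the row count can be fetched in one call; a small illustration with an in-memory SQLite database standing in for the real connection:

# Illustration only (in-memory SQLite stands in for the real connection).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "flights", data.frame(x = 1:10))

# dbGetQuery() submits, fetches, and clears the result in one step:
end <- DBI::dbGetQuery(con, "SELECT COUNT(*) FROM flights")[[1]]
end  # 10

DBI::dbDisconnect(con)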
I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?
As I come from software engineering, I can think of several cases that could be tested (some of which might be overkill). I would at least test what happens when the file/database connections are not working, though. Other tests could be passing wrong parameters (e.g., names of tables that don't exist), ... But as I said, that might be overkill :)
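For example, a couple of illustrative testthat cases (the exact expectations depend on how ark/unark signal these errors, so treat this only as a sketch):

# Illustrative sketch only; exact error classes/messages depend on arkdb.
library(testthat)

test_that("unark errors on a file that does not exist", {
  db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  expect_error(unark("no-such-file.tsv.bz2", db))
  DBI::dbDisconnect(db)
})

test_that("ark errors when given something that is not a DBI connection", {
  expect_error(ark(list(), tempdir()))
})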
This is why ark returns dir while unark returns db_con, and is also why ark takes the arguments in the order db_con, dir, while unark takes files first and db_con as the second argument. I could be talked out of this, as I can see how reversing the order is confusing, but I think of it as: the first argument is the input (as required by piping), and the second is the destination. ark operates on an existing database and outputs files, unark operates on files. What do you think?
Yes, that's completely reasonable and fine with me!
Great job!
@richfitz @expectopatronum @lmullen
Okay, I think I've hit just about all of these now.
@expectopatronum I have added error handling assertions for the argument types at the top of ark and unark.
@richfitz I've summarized my replies to your issues in https://github.com/cboettig/arkdb/issues/6. In particular, see https://github.com/cboettig/arkdb/pull/10, which implements your pluggable formats (supporting csv, tsv, and base methods), and https://github.com/cboettig/arkdb/pull/8, which I think largely sidesteps the issue of the different routines trying to window the DB queries (a big thanks to @krlmlr for the pointers there). The original methods are still supported as an option. Messaging and error handling are also improved following @richfitz's suggestions.
Okay, I think this is a much better package now, though there are no doubt things that could be better still or new issues I've introduced. Over to you all now; thanks for all the input and looking forward to your feedback.
@cboettig Perfect, I ticked the final checkbox. From my point of view the package is good to go!
Glad to see these revisions, @cboettig.
@expectopatronum Thanks very much for completing your thorough review.
@richfitz Thanks for going back and forth to improve this package. Would you be able to look over Carl's changes within roughly one week?
Looks good to me!
Thanks, @richfitz and @expectopatronum. @cboettig, we will consider this package successfully onboarded. :rocket:
Carl, could you please work on the following?
To-dos:
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/arkdb?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/arkdb)
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.
We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.
Thanks @lmullen! I've ticked off the items above and transferred the package over to rOpenSci now. Badges / CI etc. seem to be working fine. Planning a CRAN release soon.
@lmullen can this issue be closed? 🙏 I'm asking because of my packages registry experiments, where I noticed the review badge of arkdb was still orange. 😉
@maelle Yes, definitely. Closing now.
Summary
Read large, compressed text files into a database in chunks. Write all tables in a database out into compressed tsv files.
https://github.com/cboettig/arkdb
data deposition! Because good archiving formats are crucial. Kinda crazy how many researchers and repositories distribute data only through APIs and uncompressed sql dumps.
Researchers working with large data files
I don't think so?
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.

Detail

Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: