
submission: arkdb #224

Closed cboettig closed 5 years ago

cboettig commented 6 years ago

Summary

Read large, compressed text files into a database in chunks. Write all tables in a database out to compressed tsv files.

Package: arkdb
Version: 0.0.0.9000
Title: Archive and Unarchive Databases Using Flat Files
Description: Flat text files provide a more robust, compressable,
  and portable way to store tables.  This package provides convenient
  functions for exporting tables from relational database connections
  into compressed text files and streaming those text files back into
  a database without requiring the whole table to fit in working memory.
Authors@R: c(person("Carl", "Boettiger", 
                    email = "cboettig@gmail.com", 
                    role = c("aut", "cre", "cph"),
                    comment=c(ORCID = "0000-0002-1642-628X")),
              person("Richard", "FitzJohn", role = "ctb"))
URL: https://github.com/cboettig/arkdb
BugReports: https://github.com/cboettig/arkdb/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
ByteCompile: true
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
Roxygen: list(markdown = TRUE)
Imports: 
    dbplyr,
    DBI,
    readr,
    RSQLite,
    progress,
    dplyr,
    methods
Suggests: 
    nycflights13,
    testthat,
    knitr,
    covr,
    fs,
    rmarkdown

https://github.com/cboettig/arkdb

Data deposition! Because good archiving formats are crucial. Kinda crazy how many researchers and repositories distribute data only through APIs and uncompressed SQL dumps.

Researchers working with large data files

I don't think so?

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail


R CMD check results
0 errors | 0 warnings | 0 notes
lmullen commented 6 years ago

Editor checks:


Editor comments

Hi @cboettig. Thanks for the submission. I'll be editing this package.

I get errors in the tests or examples when running devtools::check(cran = FALSE, run_dont_test = TRUE). My guess is that those errors are in the \dontrun{} sections of the examples. Could you please fix those errors before the review starts?

One issue was raised by goodpractice::gp(). Could you please look into this?

Fix this R CMD check NOTE:
    Namespace in Imports field not imported from: ‘dbplyr’
    All declared Imports should be used.

I also note some genuine spelling errors. You may wish to look at devtools::spell_check() yourself.

avialable      ark.Rd:29
databaase      unark.Rd:5,89
ouput          ark.Rd:12

I will be looking for reviewers.

cboettig commented 6 years ago

@lmullen I continue to learn tricks from you! I should remember to run the donttest examples and spell checks. I think all these issues should be fixed in the current master: https://github.com/cboettig/arkdb/commit/1a33432c71befda38364be05163e7ad275d11d94

lmullen commented 6 years ago

@cboettig Thanks for fixing those. Everything looks good to start the review, and I've solicited reviewers.

lmullen commented 6 years ago

Reviewers: @richfitz and @expectopatronum. Due date: 2018-07-06

Thanks to Verena and Rich for being willing to do this review. I've set a due date 3 weeks from now. Here are the reviewer's guide and the reviewer's template.

@cboettig Could you please add the rOpenSci review badge to your repository if you haven't done so already?

expectopatronum commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 3


Review Comments

I was able to install the package without errors on Windows 10. I ran the following commands with the following output:

gp_output (goodpractice::gp() results, collapsed)

I did not yet tick the check box next to Packaging guidelines. As far as I can see most/all required items are fulfilled, but I have a few remarks:

Additionally, I have a few minor remarks about things I would change in the code (unless I'm missing something and this is the better way to do it):

In general it is a well-written, lightweight package. The intention of the package is clear and the documentation is useful. To highlight just a few of the things I liked:

lmullen commented 6 years ago

Thanks for the helpful review, @expectopatronum.

cboettig commented 6 years ago

@expectopatronum Thanks for this awesome review :tada:. Very helpful. I thought I'd wait for @richfitz's comments, but I might as well check these off now and ask a few follow-up questions. (Rich has meanwhile given me a hint on at least one other thing I should improve.)

  • no documentation/help for ?arkdb is provided

Good catch, fixed!

  • consider adding tests for error cases

I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?

  • @return is not documented for ark
  • The order of parameters in ark(db_con, dir, lines) and unark(files, db_con, lines) seems unintuitive to me. Unless it makes more sense for piping (magrittr) with db_con %>% ark and files %>% unark, I would change the order to (db_con, dir/files, lines).

Grouping these, since they are related (ark returns a directory/path, now documented!; unark returns the db_con). I think there's a good reason for this, but I see the confusion too. As you noted, I expect most uses not to actually use the return object, so it is returned invisibly, as something that could potentially be used in piping. For instance, imagine using arkdb to migrate from, say, a MySQL DB (ick!) to a MonetDBLite database:

# connect to the source and destination databases
old <- dbConnect(MySQL(), ...)
new <- dbConnect(MonetDBLite(), "my_db")

# archive the old database to flat files, then stream them into the new one
old %>%
  ark(tempdir()) %>%
  fs::dir_ls() %>%
  unark(new)

This is why ark returns dir while unark returns db_con, and is also why ark takes the arguments in the order db_con, dir, while unark takes files first and db_con as the second argument. I could be talked out of this, as I can see how reversing the order is confusing, but I think of it as: the first argument is the input (as required by piping), and the second is the destination. ark operates on an existing database and outputs files; unark operates on files. What do you think?

repeat vs while: Yeah, it looks like I could have used a while loop -- but as this is the block of code I stole from @richfitz, I'll let him comment, as there may be a good case for repeat. I'm not well-informed on the difference here.

Okay, pushing these few edits, lemme know what you think!

richfitz commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

RGF: As @cboettig notes above, this is adapted from a similar bit of code I wrote a few months back but I don't think that counts as a COI as I regularly hate on my own code...

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 1


Review Comments

I have not read the other review as I write this review (though having now looked through it, it seems Carl has already mostly dealt with those comments). Apologies this is so late.

This package provides an interface for loading data into a database from disk, and extracting data from a database to disk, via R, without needing to do the entire operation in memory. This will help with very large databases (on the order of hundreds of millions of rows or more) where fitting data into memory is not an option. It will also increase the responsiveness of such operations (probably at the cost of overall speed) making large scale import and extraction from R a more pleasant experience.

This package, with about a hundred lines of code and two exported functions, fits a "micropackage" description, which makes reviewing a lot easier.
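For orientation, typical use looks roughly like the following (a sketch based on the nycflights13 data in Suggests and the ark(db_con, dir, lines) / unark(files, db_con, lines) signatures discussed in the reviews; details may differ from the package's own README):

library(arkdb)

# build a small example database to archive
db <- DBI::dbConnect(RSQLite::SQLite(), file.path(tempdir(), "flights.sqlite"))
DBI::dbWriteTable(db, "flights", nycflights13::flights)

# database -> compressed text files, streamed in chunks
dir <- file.path(tempdir(), "archive")
dir.create(dir, showWarnings = FALSE)
ark(db, dir, lines = 50000)

# compressed text files -> a fresh database, again in chunks
db2 <- DBI::dbConnect(RSQLite::SQLite(), file.path(tempdir(), "flights2.sqlite"))
unark(list.files(dir, full.names = TRUE), db2, lines = 50000)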

Major comments

The biggest flaw I see with the package is that it violates the open-closed principle (I know this is not OOP, but I feel this is one principle that really transcends any paradigm). Carl has used readr::write_tsv and readr::read_tsv for the archive/import - presumably they satisfy his personal preferences for speed/reliability/future compatibility. But there's no real reason why utils::read.csv/utils::write.csv might not be used, or data.table::fread, or even something like feather, probably (I don't really see that readr is a necessary strong dependency here either).

The package could be extended to take either three extra arguments (or some sort of wrapper object) that supply an extension and two functions (read/write). This interacts somewhat with the compression handling, but I think it's probably still doable. As it is, if someone has an aversion to tsv (e.g., they want to point something that requires csv at the archived database), they cannot use the package without changing the source.
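To make that concrete, a minimal sketch of what such a wrapper object might look like (the names streamable_table and streamable_base_csv are illustrative, not arkdb's API):

streamable_table <- function(read, write, extension) {
  # bundle a read function, a write function, and a file extension so that
  # ark()/unark() need not hard-code readr's tsv functions
  stopifnot(is.function(read), is.function(write), is.character(extension))
  structure(list(read = read, write = write, extension = extension),
            class = "streamable_table")
}

# example: a base-R csv flavour
streamable_base_csv <- streamable_table(
  read  = function(file, ...) utils::read.csv(file, stringsAsFactors = FALSE, ...),
  write = function(x, path, ...) utils::write.csv(x, path, row.names = FALSE, ...),
  extension = "csv"
)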

Misc comments:

The example in the README produces a significant number of warnings (Warning: Closing open result set, pending rows); these should be silenced if expected, eliminated if not.

The progress reporting and general chattiness need to be tuneable (e.g., with a verbose and a progress argument). Ideally both would be tuneable independently, as progress bars are not very log-friendly but messages are (you can see this play out a bit with the vignette build - the progress output ends up on the parent terminal for me).
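A rough sketch of the kind of tuning I mean (the function and argument names are illustrative only, not a proposed API):

chatty_export <- function(tables, verbose = TRUE, progress = interactive()) {
  # independent switches: messages stay log-friendly, the progress bar is
  # reserved for interactive sessions
  p <- if (progress) progress::progress_bar$new(total = length(tables)) else NULL
  for (tbl in tables) {
    if (verbose) message("Exporting '", tbl, "'")
    if (!is.null(p)) p$tick()
  }
  invisible(tables)
}

chatty_export(c("airlines", "flights"), verbose = TRUE, progress = FALSE)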

The log output, as it is, is not incredibly informative after it has been run:

Exporting in 50000 line chunks:
airlines
...Done! (in 0.005350828 secs)
Exporting in 50000 line chunks:
airports
...Done! (in 0.02871299 secs)
Exporting in 50000 line chunks:
flights
[\] chunk 7...Done! (in 5.914837 secs)
Exporting in 50000 line chunks:
planes
...Done! (in 0.05566597 secs)
Exporting in 50000 line chunks:
weather
...Done! (in 0.4096012 secs)

Consider a little bit of reordering and indenting to get to

Exporting 'airlines' in 50000 line chunks:
   ...Done! (in 0.005350828 secs)
Exporting 'airports' in 50000 line chunks:
   ...Done! (in 0.02871299 secs)
Exporting 'flights' in 50000 line chunks:
   [\] chunk 7...Done! (in 5.914837 secs)

(and so on)

How is the default for lines chosen? Is there some heuristic for best performance? My gut instinct is that as it increases in size, performance will get better, up to a point. 10K lines might well be at the point where increasing the chunk size has limited benefit, but I am curious whether you looked? (I think I settled around a similar number - perhaps 50K - in my application.)

In ark.R

  if (is(db_con, "SQLiteConnection") |
      is(db_con, "MySQLConnection") |
      Sys.getenv("arkdb_windowing") == "FALSE") {
    query <- paste("SELECT * FROM", tablename, "LIMIT", 
                   lines, "OFFSET", (start-1)*lines)
  } else {
  ## Postgres can do windowing
    query <- paste("SELECT * FROM", 
                   tablename, 
                    "WHERE rownum BETWEEN",
                   (start - 1) * lines, 
                   "AND", 
                   start * lines)
  }

(The | should be ||.) But this also surely is not a complete test? There will be other DBI wrappers (thinking here of MariaDB; there are at least two Postgres wrappers, so there may be similar duplication among MySQL drivers). I think the polarity of this test is also incorrect: every SQL dialect can do the first form, so that should be the fallback, not the Postgres extension. So rather than testing for "known not to support", test for "known to support". You might also include a package-cached capability test (keyed against class) for a SELECT * FROM table WHERE rownum BETWEEN 1 AND 2 - if you don't get a syntax error, it should be OK to use.
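A sketch of such a cached capability test (illustrative only; the cache and function names are made up):

.window_support <- new.env(parent = emptyenv())

has_window_support <- function(db_con, tablename) {
  # try the windowed query once per connection class and remember the answer
  key <- paste(class(db_con), collapse = "/")
  if (exists(key, envir = .window_support, inherits = FALSE)) {
    return(get(key, envir = .window_support))
  }
  ok <- tryCatch({
    DBI::dbGetQuery(db_con, paste("SELECT * FROM", tablename,
                                  "WHERE rownum BETWEEN 1 AND 2"))
    TRUE
  }, error = function(e) FALSE)
  assign(key, ok, envir = .window_support)
  ok
}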

Super minor comments

In unark.R:normalize_con

if(is(db_con, "src_dbi")){

I thought that inherits was still the way to test this?
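For comparison, a toy example (not arkdb code):

x <- structure(list(), class = c("src_dbi", "src_sql", "src"))
inherits(x, "src_dbi")      # TRUE; base R, no extra dependency
methods::is(x, "src_dbi")   # also TRUE, but pulls in the methods package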

Your Postgres windowed code is not tested. You might try that on Travis?

Your normalize_con code is not tested for the non-src_dbi path

I am confused by the point of the untested early exit in unark_file - when would this be wanted?

There are also a few error paths untested (see covr output)

In unark.R

    d <- reader()
    body <- paste0(c(header, d$data), "\n", collapse = "")
    p$tick()
    chunk <- readr::read_tsv(body, ...)
    DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)

I might be missing something, but surely one can compute the headers ahead of time (a 0th iteration) and then do:

    chunk <- reader()
    names(chunk) <- header
    p$tick()
    DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)

with modifications to read_chunked like

read_chunked <- function(con, n) {
  assert_connection(con)
  next_chunk <- readLines(con, n)
  if (length(next_chunk) == 0L) {
    # if we don't stop, we will hit an error!
    stop("connection has already been completely read")
  }
  function() {
    data <- next_chunk
    next_chunk <<- read_tsv(con, n, header = FALSE) # <-- change here
    complete <- length(next_chunk) == 0L # <-- probable change here too
    list(data = data, complete = complete)
  }
}

(This is quite possibly something I could do in my code that this comes from too!)

I would probably be inclined to split ark_chunk into a "read" and a "save" step.

lmullen commented 6 years ago

@richfitz and @expectopatronum Thank you both for your thorough reviews.

@cboettig: You've already started on Verena's review, of course, but could you please address these two reviews and make any necessary changes by July 30?

cboettig commented 6 years ago

@richfitz Thanks, this is excellent and the detail is much appreciated! It addresses a bunch of things that have puzzled or bugged me, so I really appreciate your expertise here.

Just skimming the review for now, but one point where I could use a bit of additional feedback is the hard-wiring of read_tsv(). I agree that should be abstracted out (it's issue #3), but I could use a bit more guidance/ideas on how to make it more general, and how general it is worth making it.

For instance, supporting the other delimited readers from readr should be simple, since they all take the same argument structure. I think you're right about base types too, though currently the read_ side of the code relies on readr's algorithm to determine the compression type. I could do this by parsing the file name instead; not sure I want to introduce a dependency on libmagic / wand just to detect the compression type by magic number...
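In case it's useful, a tiny sketch of the file-name-parsing option (just an illustration, not code from the package):

guess_compression <- function(path) {
  # map the final extension to a compression type instead of sniffing magic numbers
  switch(tolower(tools::file_ext(path)),
         bz2 = "bzip2",
         gz  = "gzip",
         xz  = "xz",
         "none")
}

guess_compression("flights.tsv.bz2")   # "bzip2"
guess_compression("airlines.tsv")      # "none"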

Other parsers get more tricky -- e.g., I wanted to support the fst package to considerably speed up the process, but it doesn't seem to support the notion of appending. I haven't looked into feather feasibility-wise, but I think its authors say it's not a stable/archival format anyway, which cuts against the spirit of ark (and it is meanwhile slower than other modern binary formats like fst). Of course compression is a bigger issue with many of these, particularly those that do not operate over the standard R connection interface. Ideas here are welcome, but I'd be tempted to punt on this until a later date...

A somewhat related challenge that's not directly raised by the reviews is the issue of type stability. While some of the binary formats I just mentioned support this out of the box, column types (dates, factors, integer vs numeric, etc.) can change over a write-out / read-in round trip. At the moment I feel this is an out-of-scope issue for a first release, but maybe a metadata file should be written with column types when archiving a database? (This could work within a family like readr; it's harder to support automatically across different engines -- the heterogeneity of file-based I/O really makes me appreciate the standardization provided by DBI / ODBC: on the database side the interoperability just works without me having to do anything!)
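For example, something as simple as the following could record column classes next to each archived table (purely a sketch of the idea, not an arkdb feature):

write_type_metadata <- function(df, path) {
  # write a sidecar file listing each column's class, so types could be
  # restored (or at least checked) when the table is read back in
  types <- vapply(df, function(col) class(col)[[1]], character(1))
  writeLines(paste(names(types), types, sep = "\t"), paste0(path, ".types.tsv"))
  invisible(types)
}

write_type_metadata(nycflights13::flights, file.path(tempdir(), "flights"))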

Up against a deadline now but should be able to get this done by July 30.

richfitz commented 6 years ago

@cboettig - I've sketched out a proof-of-concept for generalising away from tsv here https://github.com/cboettig/arkdb/pull/5

While doing that I noticed this section:

  compress <- match.arg(compress)
  rs <- DBI::dbSendQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))
  size <- DBI::dbFetch(rs)
  end <- size[[1]][[1]]
cboettig commented 6 years ago

@richfitz Thanks much! I'll take a look tonight.

Quick question: do you mean SELECT COUNT(1) FROM? My SQL is terrible, but my understanding is that they're the same to most interpreters (though I believe COUNT doesn't count NA/NULLs; not sure, but perhaps I avoid this a bit by selecting all columns?). Remarkable how difficult it is to ask a DB how many rows a table has -- apparently there are other DB-specific ways that are only approximate. I guess it comes from a world where DBs are constantly being written to and the number of rows in a table is a more ephemeral concept anyhow...

Re dbGetQuery: ... because I don't use DBI much and don't know how it works? Is that the same as dbSendQuery + dbFetch (i.e., for when no lazy evaluation of queries is desired, like here)?

richfitz commented 6 years ago

Don't listen to me on the SELECT 1 thing - I should know better than to suggest micro-optimisations without checking first:

microbenchmark::microbenchmark(
  carl = DBI::dbGetQuery(db$con, "SELECT COUNT(*) FROM flights"),
  rich = DBI::dbGetQuery(db$con, "SELECT COUNT(1) FROM flights"))
## Unit: milliseconds
##  expr      min       lq     mean   median        uq       max neval
##  carl 2.404747 2.536132 2.818021 2.687870  3.002489  4.136245   100
##  rich 8.938237 9.306043 9.851665 9.705794 10.161812 12.921496   100

We ran into this problem at work but the tables in question have something on the order of 4 billion rows - so the approximate counts get quite useful...

Re dbGetQuery - use that approach, as above; otherwise (from ?dbSendQuery):

The ‘dbSendQuery()’ method only submits and synchronously executes the SQL query to the database engine. It does not extract any records - for that you need to use the ‘dbFetch()’ method, and then you must call ‘dbClearResult()’ when you finish fetching the records you need. For interactive use, you should almost always prefer ‘dbGetQuery()’.

So you're missing the dbClearResult(), which would ideally go inside a tryCatch or withCallingHandlers at which point you're better off just using dbGetQuery...
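For illustration (a sketch, not the package's code), the long-hand form next to the dbGetQuery() equivalent:

count_rows_longhand <- function(db_con, tablename) {
  # dbSendQuery()/dbFetch() need an explicit dbClearResult(), guarded so it
  # still runs if fetching fails
  rs <- DBI::dbSendQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))
  on.exit(DBI::dbClearResult(rs), add = TRUE)
  DBI::dbFetch(rs)[[1]][[1]]
}

count_rows <- function(db_con, tablename) {
  # dbGetQuery() wraps the send/fetch/clear sequence in one call
  DBI::dbGetQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))[[1]][[1]]
}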

expectopatronum commented 6 years ago

I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?

As I come from software engineering, I can think of several cases that could be tested (some of which might be overkill). I would at least test what happens when the file/database connections are not working, though. Other tests could pass wrong parameters (e.g., names of tables that don't exist), ... But as I said, that might be overkill :)
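A couple of hypothetical testthat sketches of what I mean (the expected errors are assumptions about how arkdb should behave, not observed behaviour):

library(testthat)

test_that("unark errors on a file that does not exist", {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con))
  expect_error(unark("no-such-file.tsv.bz2", con))
})

test_that("unark errors on a closed database connection", {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbDisconnect(con)
  expect_error(unark("flights.tsv.bz2", con))
})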

This is why ark returns dir while unark returns db_con, and is also why ark takes the arguments in the order db_con, dir, while unark takes files first and db_con as the second argument. I could be talked out of this, as I can see how reversing the order is confusing, but I think it as: the first argument is the input (as required by piping), and the second is the destination. ark operates on an existing database and outputs files, unark operates on files. What do you think?

Yes, that's completely reasonable and fine with me!

Great job!

cboettig commented 6 years ago

@richfitz @expectopatronum @lmullen

Okay, I think I've hit just about all of these now.

@expectopatronum I have added error handling assertions for the argument types at the top of ark and unark.

@richfitz I've summarized my replies to your issues in https://github.com/cboettig/arkdb/issues/6. In particular, see https://github.com/cboettig/arkdb/pull/10, which implements your pluggable formats (supporting csv and tsv and base methods), and https://github.com/cboettig/arkdb/pull/8, which I think largely sidesteps the issue about the different routines trying to window the DB queries (a big thanks to @krlmlr for the pointers there). The original methods are still supported as an option. Messaging and error handling are also improved following @richfitz's suggestions.

Okay, I think this is a much better package now, though there are no doubt things that could be better still or new issues I've introduced. Over to you all now; thanks for all the input and looking forward to your feedback.

expectopatronum commented 6 years ago

@cboettig Perfect, I ticked the final checkbox. From my point of view the package is good to go!

lmullen commented 6 years ago

Glad to see these revisions, @cboettig.

@expectopatronum Thanks very much for completing your thorough review.

@richfitz Thanks for going back and forth to improve this package. Would you be able to look over Carl's changes within roughly one week?

richfitz commented 6 years ago

Looks good to me!

lmullen commented 6 years ago

Thanks, @richfitz and @expectopatronum. @cboettig, we will consider this package successfully onboarded. :rocket:

Carl, could you please work on the following?

To-dos:

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which covers guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

cboettig commented 6 years ago

Thanks @lmullen! I've ticked off the items above and transferred the package over to rOpenSci now. Badges / CI etc. seem to be working fine. Planning a CRAN release soon.

maelle commented 5 years ago

@lmullen can this issue be closed? 🙏 I'm asking because of my packages registry experiments, where I noticed the review badge of arkdb was still orange. 😉

lmullen commented 5 years ago

@maelle Yes, definitely. Closing now.