Parallel arkdb, with examples

1beb commented 2 years ago

This is a candidate example of writing a windowed database table out to files in parallel.

Backgrounder:

Currently, the only way to run in parallel is to process multiple tables at once. But what if you have a single, exceptionally large table? This pull request adds a new function, window_parallel, that lets you process one large table in parallel by splitting it into windows.

Key points:

Database connections must be created "within" the parallel sessions, so instead of passing a database object, one must pass a function that creates a database object.
Parquet files used to be named sequentially, but there is an edge case where several processes could write the same number. Naming has been changed to use the tmpfile() format.
This sample uses future.apply; however, any parallel function could be substituted (furrr, etc.).
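
To make the key points above concrete, here is a minimal sketch of the pattern, not the code in this pull request. It assumes DBI, RSQLite, future, and future.apply are installed. The names make_con_factory, write_window, and big_table are hypothetical, the LIMIT/OFFSET windowing is an assumption, and it writes CSV instead of parquet only to keep dependencies small. Each worker opens its own connection through the factory function, and each output file gets a tempfile()-style name so no two processes can pick the same one.

```r
library(DBI)

# Hypothetical setup: a small SQLite file standing in for a very large table.
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, "big_table", data.frame(id = 1:1000, value = rnorm(1000)))
dbDisconnect(con)

# Build a connection factory. The path lives in the closure, so the function
# can be shipped to a worker session and still know where to connect.
make_con_factory <- function(path) {
  force(path)
  function() DBI::dbConnect(RSQLite::SQLite(), path)
}
make_con <- make_con_factory(db_path)

# Export one window of rows to a uniquely named file and return its path.
write_window <- function(offset, window_size, make_con, out_dir) {
  con <- make_con()  # the connection is created inside the worker
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  sql <- sprintf(
    "SELECT * FROM big_table ORDER BY id LIMIT %d OFFSET %d",
    window_size, offset
  )
  chunk <- DBI::dbGetQuery(con, sql)
  # tempfile() names are unique per call, unlike a shared sequential counter
  out <- tempfile(pattern = "part-", tmpdir = out_dir, fileext = ".csv")
  utils::write.csv(chunk, out, row.names = FALSE)
  out
}

out_dir <- tempfile("export-")
dir.create(out_dir)

future::plan(future::multisession, workers = 2)
offsets <- seq.int(0L, 750L, by = 250L)
files <- future.apply::future_lapply(
  offsets, write_window,
  window_size = 250L, make_con = make_con, out_dir = out_dir
)
future::plan(future::sequential)
```

Passing the factory rather than an open connection matters because a connection object is tied to the session that created it and cannot be reused in another session. Swapping future.apply::future_lapply for another parallel map should only change the last call.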

TODO:

[x] Update documentation for arkdb
[x] Update tests
[x] Split tests into separate files allowing for more focused devtools::test_file() usage
[x] Strip message(nrow(data))
[x] Lints

Refs #21 because it supports parallelization at a different level.

1beb commented 2 years ago

@cboettig This is ready for review.

1beb commented 2 years ago

@cboettig ready for round 2

cboettig commented 2 years ago

Looks good! My only other thought is maybe to mention in the README as well that ark is automatically selecting streamable_parquet(); otherwise the reader might assume from the syntax that it is still writing out with the usual default.

1beb commented 2 years ago

I actually went the other way. Instead of making the decision for the user, I chose to stop() if they use window-parallel without streamable_parquet. I'm not sure what the best decision is here. I don't think I'd want to get parquet if I was expecting the default. I can't guarantee people will pay attention to that detail in a README, but they will attempt to fix whatever triggers a stop(). Maybe?
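
Roughly, the guard looks like the sketch below. This is an illustration of the idea and not the code in the pull request: check_parallel_method is a hypothetical helper, and the list with an extension field is an assumed stand-in for whatever a streamable table object really carries.

```r
# Hypothetical guard: fail loudly instead of silently switching the output format.
check_parallel_method <- function(method, streamable) {
  is_parquet <- identical(streamable$extension, "parquet")  # assumed field
  if (identical(method, "window-parallel") && !is_parquet) {
    stop(
      "method = 'window-parallel' requires streamable_parquet(); ",
      "pass it explicitly rather than relying on the default.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Assumed stand-ins for streamable table objects.
parquet_like <- list(extension = "parquet")
default_like <- list(extension = "tsv")

ok <- check_parallel_method("window-parallel", parquet_like)
err <- tryCatch(
  check_parallel_method("window-parallel", default_like),
  error = function(e) conditionMessage(e)
)
```

The error message names the fix, so a user who expected the default output learns about the requirement at call time instead of finding parquet files afterwards.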

cboettig commented 2 years ago

That makes sense to me.

The only place I think may still be confusing, at least to me, is "Strategy 1": it looks like this is not using streamable_parquet()? https://github.com/ropensci/arkdb/pull/48/files#diff-72778b58969c8ca8268402860b0e003e3d213a26c812bc9f9b928395c284c99fR139

cboettig commented 2 years ago

excellent, this looks good to me!
