tibbles and data.tables

eribul commented 4 years ago

Your examples like ex_people are tibbles, but when categorize() or codify() is passed a tibble, it returns a data.table. This would be a surprising behavior for people using these packages within a tidyverse workflow. I think data.table is a terrific package, but there's not a reason to surprise users with the data type if they're not accustomed to it. (And the fact that the example datasets are tibbles rather than data.frames or data.tables adds to the inconsistency a bit).

I recommend ending the functions with something like

# Where data was the argument passed in, and ret is what's about to be returned
if (tibble::is_tibble(data)) {
  ret <- tibble::as_tibble(ret)
}

This would mean that it returns a data.table when it's passed a data.frame or data.table, but a tibble if and only if it's passed a tibble. Admittedly, this requires adding an import for tibble (which perhaps is why it wasn't done), but since tibble is imported by 800 CRAN packages (including dplyr + ggplot2, each depended on by ~2000 packages) it's a fairly low-impact dependency. This also doesn't strike me as a utility package that will frequently be installed in production systems; it's a scientific package that would typically used with other data analysis tools. I think there are some useful thoughts on tibble dependencies here.

eribul commented 4 years ago

Review:

I have made the following changes:

included tibble as a dependency
all_classcodes() returns a tibble.
categorize() has been refactored into an S3-generic and returns data sets of the same class as the input (data.table/data.frame/tibble).
as.classcodes() returns as tibble with an additional class attribute.
codify() is often used to return large data sets which should be used in a following step by classify(). I therefore think that the data.table format is prefered here. I have refactored the function into S3-methods, however, to treat input as data.frame/data.table/tibble in a better way. I have also implemented a print.codified() method, which prints the first n rows as a tibble (possible to override with print(..., n = NULL), which will print the object as is [a data.table]). I think this might be a good compromise for most users.
classify() returns a matrix for effiency, since this object is always logical/boolean. I think this should be kept as is. The two methods as.data.frame.classified and as.data.table.classified should make it relatively simple to convert the output if desired (including tibbles inhereting from data.frame).
summary.classcodes() returns a tibble, which is also printed as such through print.summary.classcodes().
The vignettes have also been reformated with printed output from tibble objects.

eribul commented 4 years ago

[x] SKriv om codify() till S3 och anpassa output!
[x] codify
[x] codify.data.frame
[x] codify.data.table

ropensci / coder

tibbles and data.tables #121