Closed eribul closed 3 years ago
Review:
I have made the following changes:
all_classcodes()
returns a tibble. categorize()
has been refactored into an S3-generic and returns data sets of the same class as the input (data.table/data.frame/tibble). as.classcodes()
returns as tibble with an additional class attribute.codify()
is often used to return large data sets which should be used in a following step by classify()
. I therefore think that the data.table
format is prefered here. I have refactored the function into S3-methods, however, to treat input as data.frame/data.table/tibble in a better way. I have also implemented a print.codified()
method, which prints the first n
rows as a tibble (possible to override with print(..., n = NULL)
, which will print the object as is [a data.table]). I think this might be a good compromise for most users. classify()
returns a matrix for effiency, since this object is always logical/boolean. I think this should be kept as is. The two methods as.data.frame.classified
and as.data.table.classified
should make it relatively simple to convert the output if desired (including tibbles inhereting from data.frame). summary.classcodes()
returns a tibble, which is also printed as such through print.summary.classcodes()
. codify()
till S3 och anpassa output!codify
codify.data.frame
codify.data.table
Your examples like ex_people are tibbles, but when categorize() or codify() is passed a tibble, it returns a data.table. This would be a surprising behavior for people using these packages within a tidyverse workflow. I think data.table is a terrific package, but there's not a reason to surprise users with the data type if they're not accustomed to it. (And the fact that the example datasets are tibbles rather than data.frames or data.tables adds to the inconsistency a bit).
I recommend ending the functions with something like
This would mean that it returns a data.table when it's passed a data.frame or data.table, but a tibble if and only if it's passed a tibble. Admittedly, this requires adding an import for tibble (which perhaps is why it wasn't done), but since tibble is imported by 800 CRAN packages (including dplyr + ggplot2, each depended on by ~2000 packages) it's a fairly low-impact dependency. This also doesn't strike me as a utility package that will frequently be installed in production systems; it's a scientific package that would typically used with other data analysis tools. I think there are some useful thoughts on tibble dependencies here.