Open zackw opened 2 years ago
+1 for this. The present lack of documentation makes col_select rather harder to use than it needs to be. I checked and couldn't find examples in this repo, and searching Stack Overflow and elsewhere suggests actual use is mostly by column index, which isn't what the documentation describes.
I've also found the tidyselect integration prone to failure, and the tidyselect developer has indicated that many (most?) read_delim() use cases cannot be supported because they do not conform to tidyselect semantics. My testing of their proposed workarounds suggests those, too, are often broken. So it appears a team-to-team conversation within tidyverse would be valuable, so that the col_select documentation can clarify what is and isn't supported.
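For reference, here is a minimal sketch of the positional style mentioned above. It assumes readr 2.0 or later (for I() literal data and col_select), and the column names a, b, c are made up for illustration:

```r
library(readr)

# Hypothetical three-column CSV supplied inline via I().
csv <- I("a,b,c\n1,2,3\n4,5,6\n")

# Select the first and third columns by position rather than by name.
by_index <- read_csv(csv, col_select = c(1, 3), show_col_types = FALSE)
names(by_index)
```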
@zackw It would be super helpful to see a concrete example of this:
Use col_select and col_types together, give a cols() spec that covers only the columns I care about, use .default to avoid hardcoding parts of the name that might vary, and there won't be any junk warnings.
Having a real-world example (that perhaps needs to be simplified) gives a big head start on new examples.
@twest820 In a more general sense, I have the same comment / request for you. Sharing concrete examples where you get frustrated is very helpful to us.
@jennybc Sure thing! Suppose the file "canvas-scores.csv" conforms to the column specification from my original report; then something like
scores <- read_csv("canvas-scores.csv",
  col_select = Student | `SIS Login ID` | starts_with("Written"),
  col_types = cols(Student = col_character(), `SIS Login ID` = col_character(), .default = col_double())
)
is what I was talking about. (Look carefully at the column spec in the original post and notice that col_guess, left to its own devices, did not pick col_double for some of the "Written ..." columns. This is because of a couple rows of garbage in the export. Unfortunately I cannot share actual data as it's confidential (student raw scores on homework).)
The key thing that is not clear from the existing documentation is that if you use col_select then you only have to name the selected columns in col_types=cols(...).
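Since the real data can't be shared, here is a self-contained sketch of the same pattern using made-up data. The CSV contents and the Notes column are hypothetical stand-ins, and I() literal input assumes readr 2.0 or later. It uses c() for the selection rather than |, which should select the same columns:

```r
library(readr)

# Hypothetical stand-in for the Canvas export; Notes is a column we don't want.
csv <- I(paste(
  "Student,SIS Login ID,Written 1 (101),Written 2 (102),Notes",
  "Ada,ada1,9.5,8,ok",
  "Bob,bob2,7,10,late",
  sep = "\n"
))

scores <- read_csv(csv,
  # Only these columns are loaded.
  col_select = c(Student, `SIS Login ID`, starts_with("Written")),
  # Name the two text columns; .default covers every "Written ..." column.
  col_types = cols(
    Student = col_character(),
    `SIS Login ID` = col_character(),
    .default = col_double()
  )
)

str(scores)
```

Notes holds text, so parsing it as a double would fail. It is not selected, though, so per the behavior described above it should not produce parse warnings.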
I have a CSV file with many columns (exported from Canvas). Left to its own devices, read_csv produces the column spec shown at the end of this bug report. For some data crunching, I wanted to load only a subset of the columns, and many of the columns have similar names, so I wanted to do that with a general tidyselect expression. read_csv lets me do that with col_select=, great. But I also wanted to override some of the column types.
The documentation for col_types= and cols() made it sound like my choices were all bad:
- Use cols_only but write down all of the long clunky names for the columns I do care about.
- Use cols(..., .default=) and put up with lots of parser warnings.
In fact, there is a perfectly good fourth option:
- Use col_select and col_types together, give a cols() spec that covers only the columns I care about, use .default to avoid hardcoding parts of the name that might vary, and there won't be any junk warnings.
But this is not at all clear from the documentation. I only tried it as a gamble.
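To make the contrast concrete, here is a small sketch comparing the cols_only approach with the combined approach. The data is made up (the column names are hypothetical, not the real export), and I() literal input assumes readr 2.0 or later:

```r
library(readr)

# Hypothetical data; Notes is a column we do not want to load.
csv <- I("Student,Written 1 (101),Written 2 (102),Notes\nAda,9.5,8,ok\nBob,7,10,late\n")

# cols_only(): every wanted column must be spelled out in full.
spelled_out <- read_csv(csv, col_types = cols_only(
  Student = col_character(),
  `Written 1 (101)` = col_double(),
  `Written 2 (102)` = col_double()
))

# col_select + cols(.default = ...): the selection is a pattern,
# and .default supplies the type for whatever the pattern matches.
by_pattern <- read_csv(csv,
  col_select = c(Student, starts_with("Written")),
  col_types = cols(Student = col_character(), .default = col_double())
)

# Both approaches should load the same columns.
identical(names(spelled_out), names(by_pattern))
```

The second form should keep working if the parenthesized parts of the "Written ..." names change, which is the point of not hardcoding them.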
Please add some text and maybe also examples to the documentation, demonstrating how col_select and col_types can be used together.
Column spec