tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org
Other
1.01k stars 285 forks source link

Doc suggestion: demonstrate use of col_select with col_types #1377

Open zackw opened 2 years ago

zackw commented 2 years ago

I have a CSV file with many columns (exported from Canvas). Left to its own devices, read_csv produces the column spec shown at the end of this bug report. For some data crunching, I wanted to load only a subset of the columns, and many of the columns have similar names, so I wanted to do that with a general tidyselect expression. read_csv lets me do that with col_select=, great. But I also wanted to override some of the column types.

The documentation for col_types= and cols() made it sound like my choices were all bad:

In fact, there is a perfectly good fourth option:

But this is not at all clear from the documentation. I only tried it as a gamble.

Please add some text and maybe also examples to the documentation, demonstrating how col_select and col_types can be used together.

Column spec

cols(
  Student = col_character(),
  ID = col_double(),
  `SIS Login ID` = col_character(),
  Section = col_character(),
  `Written Assignment #1 (410713)` = col_character(),
  `Written Assignment #2 (412436)` = col_character(),
  `Written Assignment #3 (414305)` = col_double(),
  `Written Assignment #4 (416013)` = col_double(),
  `Written Assignment #5 (417893)` = col_double(),
  `Written Assignment #6 (419737)` = col_double(),
  `Written Assignment #7 (422534)` = col_double(),
  `Written Assignment #8 (424000)` = col_double(),
  `Written Assignment #11 (430113)` = col_double(),
  `Written Assignment #12 (431301)` = col_double(),
  `Written Assignment #9 (426527)` = col_double(),
  `Written Assignment #10 (428208)` = col_double(),
  `Day 2 Quiz (400398)` = col_double(),
  `Day 3 Quiz (400400)` = col_double(),
  `Day 4 Quiz (400411)` = col_double(),
  `Day 5 Quiz (400419)` = col_double(),
  `Day 6 Quiz (400414)` = col_double(),
  `Day 7 Quiz (400415)` = col_double(),
  `Day 8 Quiz (400410)` = col_double(),
  `Day 17 Quiz (400407)` = col_double(),
  `Day 9 Quiz (400413)` = col_double(),
  `Day 10 Quiz (400408)` = col_double(),
  `Day 11 Quiz (400401)` = col_double(),
  `Day 12 Quiz (400404)` = col_double(),
  `Day 13 Quiz (400421)` = col_double(),
  `Day 14 Quiz (400409)` = col_double(),
  `Day 15 Quiz (400405)` = col_double(),
  `Day 16 Quiz (400406)` = col_double(),
  `Day 18 Quiz (400402)` = col_double(),
  `Day 19 Quiz (400403)` = col_double(),
  `Day 20 Quiz (400397)` = col_double(),
  `Day 21 Quiz (400420)` = col_double(),
  `Day 22 Quiz (400422)` = col_double(),
  `Day 23 Quiz (400418)` = col_double(),
  `Day 24 Quiz (400423)` = col_double(),
  `Day 25 Quiz (400399)` = col_double(),
  `Assignments Current Score` = col_character(),
  `Assignments Unposted Current Score` = col_character(),
  `Assignments Final Score` = col_character(),
  `Assignments Unposted Final Score` = col_character(),
  `Imported Assignments Current Score` = col_character(),
  `Imported Assignments Unposted Current Score` = col_character(),
  `Imported Assignments Final Score` = col_character(),
  `Imported Assignments Unposted Final Score` = col_character(),
  `Current Score` = col_character(),
  `Unposted Current Score` = col_character(),
  `Final Score` = col_character(),
  `Unposted Final Score` = col_character()
)
twest820 commented 2 years ago

+1 for this. The present lack of documentation makes col_select rather harder to use than it needs to be. I checked and couldn't find examples in this repo and searching Stackoverflow and elsewhere suggest actual use is mostly by column indexes, which isn't what the documentation says.

I've also found the tidyselect integration prone to failure and the tidyselect developer is indicating many (most?) read_delim() use cases cannot be supported because they do not conform to tidyselect semantics. My testing of their proposed workarounds suggests those too are often broken. So it appears a team to team conversation within tidyverse would be valuable in enabling the col_select documentation to clarify what is and isn't supported.

jennybc commented 2 years ago

@zackw It would be super helpful to see a concrete example of this:

Use col_select and col_types together, give a cols() spec that covers only the columns I care about, use .default to avoid hardcoding parts of the name that might vary, and there won't be any junk warnings.

Having a read world example (that perhaps needs to be simplified) gives a big head start on new examples.

@twest820 In a more general, sense I have the same comment / request for you. Sharing concrete examples where you get frustrated is very helpful to us.

zackw commented 2 years ago

@jennybc Sure thing! Suppose the file "canvas-scores.csv" conforms to the the column specification from my original report, then something like

scores <- read_csv("canvas-scores.csv",
    col_select=Student | `SIS Login ID` | starts_with("Written"),
    col_types=cols(Student=col_character(), `SIS Login ID`=col_character(), .default=col_double())
)

is what I was talking about. (Look carefully at the column spec in the original post and notice that col_guess, left to its own devices, did not pick col_double for some of the "Written ..." columns. This is because of a couple rows of garbage in the export. Unfortunately I cannot share actual data as it's confidential (student raw scores on homework).)

The key thing that is not clear from the existing documentation is that if you use col_select then you only have to name the selected columns in col_types=cols(...).