tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Clarifying help doc for "Callback Classes" #510

Closed bbrewington closed 7 years ago

bbrewington commented 8 years ago

I wasn't able to exactly figure out how to use the callback classes (I'm in the middle of fiddling / googling to find the answer); here are some questions I had:

For reference, here's the "Examples" section from the help doc for "Callback Classes" from readr version 1.0.0:

## If given a regular function it is converted to a SideEffectChunkCallback

# view structure of each chunk
read_lines_chunked(readr_example("mtcars.csv"), str, chunk_size = 5)

# Print starting line of each chunk
f <- function(x, pos) print(pos)
read_lines_chunked(readr_example("mtcars.csv"), SideEffectChunkCallback$new(f), chunk_size = 5)

## If combined results are desired you can use the DataFrameCallback

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)

bbrewington commented 8 years ago

Found the answer to the "$new" question here: https://cran.r-project.org/web/packages/R6/vignettes/Introduction.html

May just want to call out that the ChunkCallback (methods?) are R6 classes, and refer the user to that link? Pretty sure the average user isn't going to understand that off the bat.
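
For readers new to R6, here is a minimal sketch of what `$new` does with the callback classes. It only uses names that appear in the help doc quoted above; the variable name `cb` is made up for illustration.

```r
library(readr)

# A plain function that takes a chunk (x) and its position (pos)
f <- function(x, pos) print(pos)

# SideEffectChunkCallback is an R6 class generator; $new() builds a
# callback object that wraps f
cb <- SideEffectChunkCallback$new(f)

# The resulting object is an R6 instance of the callback class
class(cb)

# The object is then passed wherever a callback is expected
read_lines_chunked(readr_example("mtcars.csv"), cb, chunk_size = 5)
```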

hadley commented 8 years ago

Yeah, they're definitely not currently for the average user! We'll document more in the next version.

pos stands for position
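
One way to see what `pos` holds is to collect it for every chunk. This is only a sketch: it assumes `ListCallback` is available in the installed readr version, and `positions` is a made-up name.

```r
library(readr)

# Return the position passed to the callback for each chunk;
# ListCallback gathers the return values into a list
positions <- read_lines_chunked(
  readr_example("mtcars.csv"),
  ListCallback$new(function(x, pos) pos),
  chunk_size = 5
)

# One entry per chunk: where that chunk starts in the file
str(positions)
```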

pgensler commented 7 years ago

Is there any new documentation on this feature? I am trying to read some data into R that is definitely chunked, and I'm a little lost on how to use this callback function. Any help would be appreciated. Here is some sample data:

wine/name: 1981 Ch&#226;teau de Beaucastel Ch&#226;teauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend
wine/year: 1981
review/points: 96
review/time: 1160179200
review/userId: 1
review/userName: Eric
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

wine/name: 1995 Ch&#226;teau Pichon-Longueville Baron
wine/wineId: 3495
wine/variant: Red Bordeaux Blend
wine/year: 1995
review/points: 93
review/time: 1063929600
review/userId: 1
review/userName: Eric
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.

hadley commented 7 years ago

That's not the type of chunking that this class refers to. readr won't help you much with this data.

pgensler commented 7 years ago

Isn't that the point of the read_delim_chunked function, though? At its core, all that is needed is to supply the chunk size, the file location, and the delimiter between chunks, for which I have seen some people use ------. Why is the DataFrameCallback a mandatory argument? I would imagine that code like this should work for a use case like mine:

 beeradvocate <- readr::read_delim_chunked(file = "Desktop/file.txt", delim = "\n",
                                           chunk_size = 13)

jimhester commented 7 years ago

@pgensler The point of the chunk functions is not to read data that is delimited in chunks; it is to read and process normally delimited data in chunks, typically because reading the full dataset would exceed the amount of available memory.
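
As a rough sketch of that intended use: the callback filters each chunk as it is read, so only the matching rows are kept. The bundled mtcars.csv stands in for a file too large to load at once, and the names `keep_three_gears` and `three_gears` are made up for illustration.

```r
library(readr)

# Callback: receives one chunk as a data frame (x) and the position of
# its first row (pos); returns only the rows worth keeping
keep_three_gears <- function(x, pos) {
  x[x$gear == 3, ]
}

# Only chunk_size rows are parsed at a time; DataFrameCallback row-binds
# whatever the callback returns for each chunk
three_gears <- read_csv_chunked(
  readr_example("mtcars.csv"),
  callback = DataFrameCallback$new(keep_three_gears),
  chunk_size = 5
)

nrow(three_gears)
```

The callback argument is what tells readr what to do with each chunk once it has been parsed, which is why the chunked readers need one.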

readr is primarily for parsing rectangular data; you will need to use other means to parse this data.
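
One possible "other means" is plain base R. The sketch below assumes records are separated by blank lines and that every line has the form `key: value`, as in the sample above. The `lines` vector is a shortened, made-up stand-in for `readLines()` on the real file.

```r
# Hypothetical stand-in for readLines("file.txt"): two short records
lines <- c(
  "wine/name: Example A",
  "wine/year: 1981",
  "review/points: 96",
  "",
  "wine/name: Example B",
  "wine/year: 1995",
  "review/points: 93"
)

# Blank lines separate records: number the records, then drop the blanks
record_id <- cumsum(lines == "") + 1
keep <- lines != ""

# Split each remaining line at the first ": " into a key and a value
key <- sub(": .*$", "", lines[keep])
value <- sub("^[^:]*: ", "", lines[keep])

# Group the key/value pairs by record and line the fields up as columns
records <- split(setNames(value, key), record_id[keep])
fields <- unique(key)
mat <- do.call(rbind, lapply(records, function(r) r[fields]))
colnames(mat) <- fields

# One row per review, one column per field (all columns are character)
reviews <- as.data.frame(mat, stringsAsFactors = FALSE)
```

All values come out as character, so columns such as the year or the points would still need converting afterwards.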

bbrewington commented 7 years ago

Here's some helpful info showing examples on the callback classes - http://readr.tidyverse.org/reference/callback.html

pgensler commented 6 years ago

Thanks @jimhester for the clarification, this is really helpful.

I do think that it would be helpful to provide a basic description like this:

By default, R will attempt to read all data into memory in most circumstances. Large files (>1GB) can run into memory issues, hence read_delim_chunked is made to process a file in 'chunks'. This allows for better memory management by applying a callback function to each 'chunk' of the file.

Even something like this would be more helpful, as it's a bit more descriptive around what exactly the purpose of the function is. Thoughts on this? I can put together a PR if that is easier for you, let me know.

It would be helpful to have a bit more documentation around when exactly to use this function.

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/