Datasets search - Githubissues

elinw commented 6 years ago

When I'm writing tutorials or documentation or when I'm teaching I often fall back on the same sample data sets over and over. At the same time, when I need something specific such as an ordered factor I have to search around to find one. I try to stick to the base datasets. I was thinking that it would be neat to have something (a package or a shiny app or a combination) that would let you search for a specific class of data structure (data frame, matrix, ts, dist, cube etc. (there are a lot)) and also for specific variable types for those structures that support multiple types. Maybe also experimental versus observational? https://vincentarelbundock.github.io/Rdatasets/datasets.html has a list of the data sets, but the purpose of that archive is more to put them all into csv format in a consistent manner.

An added bonus would be to be able to make the api generic enough to search other packages but my initial goal would be the ones in datasets.

boshek commented 6 years ago

Cool idea! Trying to figure if I understand correctly. Do you mean something like:

check_dataset(package = "datasets")
# A tibble: 8 x 6
  Package  Item          Title                                                        Rows  Cols Class     
  <chr>    <chr>         <chr>                                                       <int> <int> <chr>     
1 datasets ability.cov   Ability and Intelligence Tests                                  6     8 list      
2 datasets airmiles      Passenger Miles on Commercial US Airlines, 1937-1960           24     2 ts        
3 datasets AirPassengers Monthly Airline Passenger Numbers 1949-1960                   144     2 ts        
4 datasets airquality    New York Air Quality Measurements                             153     6 data.frame
5 datasets anscombe      Anscombe's Quartet of 'Identical' Simple Linear Regressions    11     8 data.frame
6 datasets attenu        The Joyner-Boore Attenuation Data                             182     5 data.frame
7 datasets attitude      The Chatterjee-Price Attitude Data                             30     7 data.frame
8 datasets austres       Quarterly Time Series of the Number of Australian Residents    89     2 ts

Then you narrow down if you are looking for data.frame, list, etc.? So a function that a) returns a tibble and b) accepts one or more packages as an argument?
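A minimal base-R sketch of a function along these lines. The name `list_datasets` is hypothetical, it returns a plain data frame rather than a tibble, and it assumes the package is already attached (as `datasets` is by default):

```r
# Hypothetical helper (not an existing function): one row per data set in an
# attached package, with its dimensions and first class.
list_datasets <- function(package = "datasets") {
  res <- as.data.frame(data(package = package)[["results"]],
                       stringsAsFactors = FALSE)
  # Item can contain extra words after a space; keep only the object name
  name <- gsub(" .*$", "", res$Item)
  # Assumption: the package is attached, so its data live in "package:<name>"
  env <- as.environment(paste0("package:", package))
  found <- vapply(name, exists, logical(1), envir = env, inherits = FALSE)
  objs <- lapply(name[found], get, envir = env)
  data.frame(
    Package = rep(package, sum(found)),
    Item = name[found],
    Title = res$Title[found],
    # Objects without dimensions (vectors, univariate ts) report their length
    Rows = vapply(objs, function(x) {
      as.integer(if (is.null(dim(x))) length(x) else nrow(x))
    }, integer(1)),
    Cols = vapply(objs, function(x) {
      if (is.null(dim(x))) NA_integer_ else as.integer(ncol(x))
    }, integer(1)),
    Class = vapply(objs, function(x) class(x)[1], character(1)),
    stringsAsFactors = FALSE
  )
}

ds <- list_datasets("datasets")
# Narrow down to one class of data structure
data_frames <- ds[ds$Class == "data.frame", ]
```

Filtering on the `Class` column is then ordinary data frame subsetting, so the same result could feed a shiny app or a tibble-based interface.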

mpadge commented 6 years ago

Concur here too: cool idea! It would also be pretty straightforward to integrate that within flipper. The mooted extension to trawling all /man directories is technically straightforward, and could very easily include functionality to trawl any @docType data to enable those to be text-searched, and to group by return type (@format).

elinw commented 6 years ago

Cool, yes, something similar to what @boshek has. I started playing a bit just to see what the complications would be. My basic idea would be to be able to:

1. Search for a data set of a particular type (e.g. data frame, ts, mts, matrix etc.).
2. Search (within data frames, I guess) for the presence of variables with specific classes. So if you take a package name as an argument and get all the information about the data into a tibble, you'd then be able to say "give me all the data frames with a factor".

So this is just a quick script for making a data frame from the core data. I wanted to see what some of the complications would be: the Item field can contain spaces plus extra words, and an object can have multiple classes.

# List the data sets that ship with the datasets package
dataset_list <- data(package = "datasets")
datasets_df <- as.data.frame(dataset_list[["results"]], stringsAsFactors = FALSE)
# Item can contain extra words after a space; keep only the object name
datasets_df$short <- gsub(" .*$", "", datasets_df$Item)

# Pre-allocate the columns filled in by the loop
datasets_df$class <- NA_character_
datasets_df$n_classes <- NA_integer_

for (i in seq_len(nrow(datasets_df))) {
  dataset <- get(datasets_df$short[i])
  # Keep the first class name when there is more than one
  datasets_df$class[i] <- class(dataset)[1]
  datasets_df$n_classes[i] <- length(class(dataset))
}

And then something like the below to get the classes but the question would be how to organize the information to make it most useful. For example maybe something like a set of logical variables: has_numeric, has_factor, has_logical, has_integer, has_character etc.

# Figure out what would work best for people in terms of searching
# (use the cleaned-up name in `short`, since Item can contain spaces)
unlist(lapply(get(datasets_df$short[i]), class))
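One possible shape for the logical-flag idea, as a sketch only. The function name `variable_flags` and the particular set of classes are assumptions, and the flags are based on `class()`, so an ordered factor sets both `has_ordered` and `has_factor`:

```r
# Hypothetical helper: one has_<class> flag per variable class of interest.
variable_flags <- function(df) {
  stopifnot(is.data.frame(df))
  # Collect every class of every column (a column can have more than one)
  cls <- unlist(lapply(df, class), use.names = FALSE)
  types <- c("numeric", "integer", "factor", "ordered", "logical", "character")
  setNames(types %in% cls, paste0("has_", types))
}

example_df <- data.frame(
  x = c(1.5, 2.5),
  n = 1:2,
  rating = factor(c("low", "high"), levels = c("low", "high"), ordered = TRUE),
  stringsAsFactors = FALSE
)
flags <- variable_flags(example_df)
iris_flags <- variable_flags(iris)
```

One row of flags per data frame could then be bound onto the listing, so that a search such as "all data frames with a factor" becomes a filter on `has_factor`.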

jtr13 commented 6 years ago

Love this idea. Beyond class, it would be helpful to have information about the data types. Often I need several categorical variables, and while I do love the Titanic dataset, some more diversity would be a good thing. When writing exams I searched through the Sleuth3 manual for particular criteria but it was very time-consuming.

noamross commented 6 years ago

A helpful starting point might be last year's project examining data packages on CRAN: https://github.com/ropenscilabs/data-packages.

elinw commented 6 years ago

@jtr13 That's what I mean by class. There are so few ordered factors! So actually it would be good to know the number of each type, e.g. 3 factors, 2 ordered factors, 5 numeric. I agree that it's the combinations that get really frustrating. When you want a simple example, having to convert types can be a distraction from the main lesson.
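Counting the variables of each type could look roughly like this. It is a sketch: `count_variable_classes` is a hypothetical name, and it tallies only the first class of each column, so ordered factors are counted as "ordered" rather than "factor":

```r
# Hypothetical helper: number of columns per (first) class in a data frame.
count_variable_classes <- function(df) {
  stopifnot(is.data.frame(df))
  table(vapply(df, function(x) class(x)[1], character(1)))
}

survey <- data.frame(
  group = factor(c("a", "b", "a")),
  answer = factor(c("too little", "about right", "too much"),
                  levels = c("too little", "about right", "too much"),
                  ordered = TRUE),
  score = c(1.2, 3.4, 5.6)
)
counts <- count_variable_classes(survey)
iris_counts <- count_variable_classes(iris)
```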

@noamross If packages were documented like that it would be cool and we could definitely include it in a dashboard. We could at least provide a url for the description (although we can also try to scrape them). The other thing is packages that wrap APIs for accessing data. The main thing is to make it automated.

Then maybe if we have a sense of what is there, that lets us think about what's missing.

jtr13 commented 6 years ago

@elinw Got it. When do you use ordered factors? I never do! (This probably isn't the right place to discuss this...)

elinw commented 6 years ago

All those “too much, too little, about right” questions for one …

Which leads to a whole other set of things.

One of the big issues for me in the base categorical data is that they are formatted as table classes, but I want my students to see them as a data frame, meaning a more realistic setting where there are variables of at least two types.
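For reference, a table-class data set can be turned into a data frame with base R alone. The sketch below uses `Titanic` purely as an illustration; the expansion to one row per case is one common approach, not the only one:

```r
# Titanic is stored as a 4-dimensional table of counts
titanic_counts <- as.data.frame(Titanic)  # one row per cell, plus a Freq column

# Repeat each row Freq times to get one row per person, then drop Freq
titanic_cases <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
  names(titanic_counts) != "Freq"
]
rownames(titanic_cases) <- NULL
```

The result still contains only factors, so it shows the conversion step but not the mix of variable types mentioned above.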


jtr13 commented 6 years ago

I've never had to use ordered factors for that kind of data for my purposes (usually visualization). I just order the levels of regular factors.
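To make the distinction concrete, here is a small sketch with made-up responses comparing a regular factor with explicitly ordered levels against an ordered factor:

```r
# Made-up survey responses, for illustration only
answers <- c("about right", "too little", "too much", "about right")
lvls <- c("too little", "about right", "too much")

f_regular <- factor(answers, levels = lvls)                  # levels in a chosen order
f_ordered <- factor(answers, levels = lvls, ordered = TRUE)  # ordered factor

# Both tabulate and plot in the same level order
same_levels <- identical(levels(f_regular), levels(f_ordered))

# Only the ordered factor supports comparisons between levels
below_too_much <- f_ordered < "too much"
```

For visualization the two behave alike because the level order is the same; the ordered class matters mainly for comparisons and for models that treat the variable as ordinal.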

laderast commented 6 years ago

Cool idea! One thought might be that oftentimes, when I'm looking for a teaching dataset, I'm looking for the presence of variable relationships in the data, such as smoking status (categorical) vs. BMI (continuous). So could this be another way of classifying the datasets?

elinw commented 6 years ago

Yes, so that's what I was trying to say about getting the classes of the variables for the data frames. https://github.com/elinw/dataestsearch/blob/master/R/datasetsearch.R is a concept but not that well coded (loops!!), and it doesn't handle getting the variable types for tibbles, but it does work for data frames. I mean, this is just a concept, but if we have a bunch of people we could make it really nice and figure out what is useful.

laderast commented 6 years ago

Ah, ok, that makes sense. I did something similar with a shiny workshop in identifying variables from a data.frame so that factor, character, and continuous variables would populate the right dropdowns for any dataset that was loaded into an app. It's the same idea as your code: https://github.com/laderast/gradual_shiny/blob/master/03_observe_update/helper.R

elinw commented 6 years ago

Summary: Build a way to search sample data sets in R packages to identify packages with different characteristics such as the format of the data set (e.g. data frame, matrix, dist, ts) and where appropriate the types of variables (e.g. factor, numeric, ts).

ropensci / unconf18

Datasets search #26