Open elinw opened 6 years ago
Cool idea! Trying to figure if I understand correctly. Do you mean something like:
check_dataset(package = "datasets")
# A tibble: 8 x 6
  Package  Item          Title                                                        Rows  Cols Class
  <chr>    <chr>         <chr>                                                       <int> <int> <chr>
1 datasets ability.cov   Ability and Intelligence Tests                                  6     8 list
2 datasets airmiles      Passenger Miles on Commercial US Airlines, 1937-1960           24     2 ts
3 datasets AirPassengers Monthly Airline Passenger Numbers 1949-1960                   144     2 ts
4 datasets airquality    New York Air Quality Measurements                             153     6 data.frame
5 datasets anscombe      Anscombe's Quartet of 'Identical' Simple Linear Regressions    11     8 data.frame
6 datasets attenu        The Joyner-Boore Attenuation Data                             182     5 data.frame
7 datasets attitude      The Chatterjee-Price Attitude Data                             30     7 data.frame
8 datasets austres       Quarterly Time Series of the Number of Australian Residents    89     2 ts
Then you narrow down if you are looking for data.frame, list, etc.? So a function that a) returns a tibble and b) accepts one or more packages as an argument?
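A minimal base-R sketch of what such a function might look like. The name check_dataset is borrowed from the call above, but the body is only an assumption about how it could work, not the code that produced that output. It returns a plain data frame (which could be wrapped with tibble::as_tibble()), it only handles packages that are already attached, such as datasets, and its Rows and Cols come from NROW() and NCOL(), so they may differ from the numbers printed above.

```r
# Hypothetical sketch, not the implementation behind the output shown above.
check_dataset <- function(package = "datasets") {
  per_package <- lapply(package, function(pkg) {
    # Assumes the package is attached, so its datasets are on the search path.
    env <- as.environment(paste0("package:", pkg))
    items <- as.data.frame(utils::data(package = pkg)$results,
                           stringsAsFactors = FALSE)
    # Drop trailing words such as " (beavers)" from the Item field.
    objects <- sub(" .*$", "", items$Item)
    values <- lapply(objects, get, envir = env)
    data.frame(
      Package = items$Package,
      Item = objects,
      Title = items$Title,
      Rows = vapply(values, function(x) as.integer(NROW(x)), integer(1)),
      Cols = vapply(values, function(x) as.integer(NCOL(x)), integer(1)),
      Class = vapply(values, function(x) class(x)[1], character(1)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, per_package)
}

all_datasets <- check_dataset("datasets")
# Narrow down to one class of object.
data_frames <- subset(all_datasets, Class == "data.frame")
```

Only the first class of each object is kept here, so objects with several classes would need extra handling.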
Concur here too: cool idea! It would also be pretty straightforward to integrate that within flipper. The mooted extension to trawling all /man directories is technically straightforward, and could very easily include functionality to trawl any @docType data entries so that they can be text-searched and grouped by return type (@format).
Cool, yes something similar to what @boshek has, I started playing a bit just to see what the complications would be. My basic idea would be to be able to
So this is just a quick script for making a data frame from the core data. I wanted to see what the complications would be; the main ones are spaces and extra words in the Item field, and objects with multiple classes.
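# List every dataset shipped in the datasets package.
dataset_list <- data(package = "datasets")
datasets_df <- as.data.frame(dataset_list[["results"]], stringsAsFactors = FALSE)
# Items such as "beaver1 (beavers)" carry extra words; keep only the object name.
datasets_df$short <- gsub(" .*$", "", datasets_df$Item)
# Preallocate the result columns before filling them in the loop.
datasets_df$class <- NA_character_
datasets_df$n_classes <- NA_integer_
for (i in seq_len(nrow(datasets_df))) {
  dataset <- get(datasets_df$short[i])
  class_names <- class(dataset)
  # Keep the first class name when there is more than one.
  datasets_df$class[i] <- class_names[1]
  datasets_df$n_classes[i] <- length(class_names)
}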
And then something like the below to get the classes but the question would be how to organize the information to make it most useful. For example maybe something like a set of logical variables: has_numeric, has_factor, has_logical, has_integer, has_character etc.
# Figure out what would work best for people in terms of searching.
# Use the cleaned-up name, since Item can contain spaces and extra words.
unlist(lapply(get(datasets_df$short[i]), class))
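One possible shape for the set of logical variables mentioned above, as a rough sketch. The function name variable_flags is hypothetical, and it uses inherits() on each column, so an ordered factor sets both has_factor and has_ordered.

```r
# Hypothetical helper: one logical flag per variable type for a data frame.
variable_flags <- function(df) {
  stopifnot(is.data.frame(df))
  # TRUE if at least one column inherits from the given class.
  has <- function(cls) any(vapply(df, inherits, logical(1), what = cls))
  c(
    has_numeric   = has("numeric"),
    has_integer   = has("integer"),
    has_factor    = has("factor"),
    has_ordered   = has("ordered"),
    has_logical   = has("logical"),
    has_character = has("character")
  )
}

iris_flags <- variable_flags(iris)
```

For iris this flags numeric and factor columns only; whether flags like these are the most useful thing to search on is still the open question.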
Love this idea. Beyond class, it would be helpful to have information about the data types. Often I need several categorical variables, and while I do love the Titanic dataset, some more diversity would be a good thing. When writing exams I searched through the Sleuth3 manual for particular criteria but it was very time-consuming.
A helpful starting point might be last year's project examining data packages on CRAN: https://github.com/ropenscilabs/data-packages.
@jtr13 That's what I mean by class. There are so few ordered factors! So actually it would be good to know the number of each type, e.g. 3 factors, 2 ordered factors, 5 numeric. I agree that it's the combinations that get really frustrating. When you want a simple example, having to convert types can be a distraction from the main lesson.
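Counting the number of each type could be sketched roughly like this. The helper name count_variable_types and the small example data frame are made up for illustration; only the first class of each column is counted, so an ordered factor is tallied as "ordered" rather than "factor".

```r
# Hypothetical helper: tabulate the first class of every column.
count_variable_types <- function(df) {
  stopifnot(is.data.frame(df))
  first_class <- vapply(df, function(col) class(col)[1], character(1))
  table(first_class)
}

# Made-up data frame with one column of each kind.
example_df <- data.frame(
  group  = factor(c("a", "b", "a")),
  rating = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  count  = 1:3,
  score  = c(1.5, 2.5, 3.5)
)

counts <- count_variable_types(example_df)
iris_counts <- count_variable_types(iris)
```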
@noamross If packages were documented like that it would be cool and we could definitely include them in a dashboard. We could at least provide a URL for the description (although we can also try to scrape them). The other thing is packages that wrap APIs for accessing data. The main thing is to make it automated.
Then maybe if we have a sense of what is there, that lets us think about what's missing.
@elinw Got it. When do you use ordered factors? I never do! (This probably isn't the right place to discuss this...)
All those “too much, too little, about right” questions for one …
Which leads to a whole other set of things.
One of the big issues for me with the base categorical data is that it is formatted as table-class objects, but I want my students to see it as a data frame, meaning a more realistic setting where there are variables of at least two types.
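As a small illustration of the conversion involved, here is a sketch using the Titanic table from datasets. as.data.frame() turns the table into one row per combination with a Freq count, and repeating each row Freq times gives one row per person; the variable names titanic_counts and titanic_people are just for this example.

```r
# Titanic is stored as a 4-dimensional table, not a data frame.
titanic_counts <- as.data.frame(Titanic)

# Expand the counts so that each row represents one person.
row_index <- rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq)
titanic_people <- titanic_counts[row_index, c("Class", "Sex", "Age", "Survived")]
rownames(titanic_people) <- NULL
```

The counts version already mixes factor and numeric columns, while the expanded version looks more like the raw data students would meet in practice.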
I've never had to use ordered factors for that kind of data for my purposes (usually visualization). I just order the levels of regular factors.
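For anyone comparing the two approaches, a short sketch with made-up survey answers. Setting levels on a regular factor controls the display order, while ordered = TRUE additionally allows comparisons between values.

```r
# Made-up responses to a "too much, too little, about right" question.
answers <- c("too much", "about right", "too little", "about right")
scale_levels <- c("too little", "about right", "too much")

# Regular factor with explicitly ordered levels (enough for plotting order).
relevelled <- factor(answers, levels = scale_levels)

# True ordered factor: same levels, but comparisons such as > are defined.
true_ordered <- factor(answers, levels = scale_levels, ordered = TRUE)
more_than <- true_ordered[1] > true_ordered[2]
```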
Cool idea! One thought might be that oftentimes, when I'm looking for a teaching dataset, I'm looking for the presence of variable relationships in the data, such as smoking status (categorical) vs. BMI (continuous). So could this be another way of classifying the datasets?
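A rough sketch of one way to search along those lines: list the data frames in datasets that have at least one factor and at least one numeric column, which is a precondition for a categorical vs. continuous example. It assumes datasets is attached (the default) and says nothing about whether the variables are actually related; the function name has_cat_and_cont is hypothetical.

```r
# Hypothetical predicate: a data frame with both a factor and a numeric column.
has_cat_and_cont <- function(obj) {
  is.data.frame(obj) &&
    any(vapply(obj, is.factor, logical(1))) &&
    any(vapply(obj, is.numeric, logical(1)))
}

# Assumes the datasets package is attached, as it is by default.
env <- as.environment("package:datasets")
object_names <- ls(env)
is_match <- vapply(
  object_names,
  function(nm) has_cat_and_cont(get(nm, envir = env)),
  logical(1)
)
candidates <- object_names[is_match]
```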
Yes, so that's what I was trying to say about getting the classes of the variables for the data frames. https://github.com/elinw/dataestsearch/blob/master/R/datasetsearch.R is a concept but not that well coded (loops!!), and it doesn't handle getting the variable types for tibbles, but it does work for data frames. I mean, this is just a concept, but if we have a bunch of people we could make it really nice and figure out what is useful.
Ah, ok, that makes sense. I did something similar with a shiny workshop in identifying variables from a data.frame so that factor, character, and continuous variables would populate the right dropdowns for any dataset that was loaded into an app. It's the same idea as your code: https://github.com/laderast/gradual_shiny/blob/master/03_observe_update/helper.R
Summary: Build a way to search sample data sets in R packages to identify packages with different characteristics such as the format of the data set (e.g. data frame, matrix, dist, ts) and where appropriate the types of variables (e.g. factor, numeric, ts).
When I'm writing tutorials or documentation, or when I'm teaching, I often fall back on the same sample data sets over and over. At the same time, when I need something specific such as an ordered factor, I have to search around to find one. I try to stick to the base datasets. I was thinking that it would be neat to have something (a package or a shiny app or a combination) that would let you search for a specific class of data structure (data frame, matrix, ts, dist, cube, etc.; there are a lot) and also for specific variable types, for those structures that support multiple types. Maybe also experimental versus observational? https://vincentarelbundock.github.io/Rdatasets/datasets.html has a list of the data sets, but the purpose of that archive is more to put them all into csv format in a consistent manner.
An added bonus would be to make the API generic enough to search other packages, but my initial goal would be the ones in datasets.