ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries #46

Open petebaker opened 8 years ago

petebaker commented 8 years ago

Subject: R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries

Statisticians and scientists producing primary data often have different needs from those scraping secondary and tertiary data off the web.

Often in medical, epidemiological, social, psychological and other scientific studies, researchers use codebooks to document data formats, variable labels, factor labels, ranges for continuous variables, details of measuring instruments, etc. Sometimes statisticians get a photocopied codebook or a PDF, but my preference (and that of aware researchers) is a spreadsheet, so that this metadata can actually be used.

For small data sets it's probably OK to manually set up factor labels, check for undefined factor levels, and identify out-of-range values for continuous variables. For data sets with hundreds of variables, or when there are many data files with a similar structure, it is probably better to automate these procedures.

A package for extracting information from codebooks and using the metadata to assist labelling and data cleaning would prove useful to statisticians and scientists at the coal face of producing primary data.
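For a sense of what such a package might do, here is a minimal base-R sketch. The codebook layout (one row per variable, with factor levels encoded as "1=Male;2=Female") and all names are hypothetical, chosen only to illustrate the idea of metadata-driven labelling and range checking:

```r
## Hypothetical codebook, as it might arrive via read.csv() from a spreadsheet:
## one row per variable; factor levels encoded as "code=label" pairs.
codebook <- data.frame(
  variable = c("sex", "age", "bmi"),
  type     = c("factor", "numeric", "numeric"),
  levels   = c("1=Male;2=Female", "", ""),
  min      = c(NA, 18, 10),
  max      = c(NA, 99, 60),
  stringsAsFactors = FALSE
)

## Apply the codebook: label factors and warn about out-of-range numeric values
apply_codebook <- function(data, cb) {
  for (i in seq_len(nrow(cb))) {
    v <- cb$variable[i]
    if (cb$type[i] == "factor") {
      pairs <- strsplit(strsplit(cb$levels[i], ";")[[1]], "=")
      data[[v]] <- factor(data[[v]],
                          levels = sapply(pairs, `[`, 1),
                          labels = sapply(pairs, `[`, 2))
    } else if (cb$type[i] == "numeric") {
      bad <- !is.na(data[[v]]) &
        (data[[v]] < cb$min[i] | data[[v]] > cb$max[i])
      if (any(bad))
        warning(sprintf("%s: %d value(s) out of range", v, sum(bad)))
    }
  }
  data
}

df <- data.frame(sex = c(1, 2, 2), age = c(34, 150, 57), bmi = c(22, 25, 31))
df <- suppressWarnings(apply_codebook(df, codebook))
## df$sex is now a labelled factor; age = 150 would have triggered a warning
```

A real package would need to handle missing-value codes, multiple codebook layouts, and reporting, but the core loop is this simple.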

ivanhanigan commented 8 years ago

Hi Peter, this is a direct hit on my main problem/use case for a new R tool! Do you think https://github.com/ropensci/auunconf/issues/11 and https://github.com/ropensci/auunconf/issues/18 also cover similar territory? I think a 'metadata' subgroup at the AU unconf would be wonderful. I am attending remotely from Canberra and have a serious interest in spending the next couple of days working on 'metadata-driven' tools to aid cleaning/checking/formatting/documenting our data.


petebaker commented 8 years ago

Hi Ivan. Sorry I didn't look at this until now. I think we are looking at opposite ends of the same problem. As I see it, the generic core is how to keep and use codebook info (metadata) with a data frame, df_table, or other object. This is great! My suggestion is at the other end: reading in codebooks to obtain that info.

Every project I work with (or at least, say, 90 of the last hundred - so I work with a lot of projects ;-) has a codebook, and they are pretty similar in structure. So they have a certain, if somewhat variable, structure. For instance, I just looked at the missingDataGUI manual, and an example data codebook is in RTF - it would be much better in .csv: http://ftp.cdc.gov/pub/data/brfss/codebook_09.rtf at http://www.cdc.gov/brfss/data_documentation/index.htm

A holistic approach to metadata sounds great. Pity you are in Canberra, but that's the beauty of the web. Suggestions welcome. Cheers, Peter

ivanhanigan commented 8 years ago

Yep. Did you see this? http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html I think the example codebook was broken and I had to fix it. I can attach it to a repo somehow if you want to explore. I still think/hope EML can do everything!

require(devtools)
source_gist(4497585)
## Sourcing https://gist.githubusercontent.com/jbryer/4497585/raw/ed0497d0283807b72aedbe3bcf752cd41dfd5ba0/parse.codebook.r

## SHA-1 hash of file is 39ed33b5c570e7c10e15e39842931e08ce11d89d
parse.codebook
## function(file, 
##                         var.names, 
##                         level.names,
##                         level.indent=c(' ','\t'),
##                         var.sep, 
##                         level.sep, 
##                         var.widths,
##                         level.widths,
##                         var.name = var.names[1],
##                         ...) {
##  stopifnot(var.name %in% var.names)
##  
##  stripWhite <- function(x) {
##      sub('([[:space:]]+$)', '', sub("(^[[:space:]]+)", '', x, perl=TRUE), perl=TRUE)
##  }
##  
##  codebook.raw <- readLines(file)
##  
##  #Remove blank lines
##  blanklines <- which(nchar(stripWhite(codebook.raw)) == 0)
##  linenums <- which(!(substr(codebook.raw, 1, 1) %in% level.indent))
##  linenums <- linenums[!linenums %in% blanklines]
##  linenums.levels <- which(substr(codebook.raw, 1, 1) %in% level.indent)
##  linenums.levels <- linenums.levels[!linenums.levels %in% blanklines]
##  if(length(blanklines) > 0) {
##      codebook.raw <- codebook.raw[-blanklines]       
##  }
##  
##  rows <- which(!(substr(codebook.raw, 1, 1) %in% level.indent))
##  rows.levels <- which(substr(codebook.raw, 1, 1) %in% level.indent)
##  rowmapping <- data.frame(pre=linenums, post=rows)
##  rowmapping.levels <- data.frame(pre=linenums.levels, post=rows.levels)
##  codebook <- codebook.raw[rows]
##  
##  if(!missing(var.sep)) { #Fields are delimited
##      split <- strsplit(codebook, var.sep, fixed=TRUE)
##      badrows <- codebook[sapply(split, length) != length(var.names)]
##      if(length(badrows) > 0) {
##          stop(paste("The following rows do not have ", length(var.names), 
##                     " values: ", paste(badrows, collapse=', '), sep=''))
##      }
##      codebook <- as.data.frame(matrix(stripWhite( 
##                      unlist(strsplit(codebook, var.sep, fixed=TRUE))),
##                      ncol=length(var.names), byrow=TRUE, 
##                      dimnames=list(1:length(rows), var.names)), 
##                      stringsAsFactors=FALSE)
##      codebook$linenum <- rows
##  } else if(!missing(var.widths)) { #Fields are fixed width
##      stopifnot(length(var.names) == length(var.widths))
##      left <- 1
##      cb <- data.frame(linenum=linenums)
##      for(i in seq_along(var.widths)) {
##          cb[,var.names[i]] <- sapply(
##              codebook, FUN=function(x) {
##                  stripWhite(substr(x, left, min(nchar(x), (left + var.widths[i]))))
##              })
##          left <- left + var.widths[i]
##      }
##      codebook <- cb
##      rm(cb)
##  } else {
##      stop("Must specify either var.sep or var.widths")
##  }
##  
##  varsWithFactors <- which(sapply(1:(length(rows)-1), 
##                      FUN=function(x) { rows[x] + 1 != rows[x+1] }))
##  varlevels <- list()
##  for(i in seq_along(rows[varsWithFactors])) {
##      start <- rows[varsWithFactors][i]
##      end <- rows[which(rows == start) + 1]
##      levels.raw <- codebook.raw[ (start + 1):(end - 1) ]
##      if(!missing(level.widths)) { #Fixed-width levels
##          levels.raw <- lapply(levels.raw, FUN=function(x) {
##              left <- 1
##              lc <- character()
##              for(i in seq_along(level.widths)) {
##                  lc <- c(lc, 
##                          substr(x, left, min(nchar(x), (left + level.widths[[i]])))
##                  )
##                  left <- left + level.widths[i]
##              }
##              return(lc)
##          })
##      } else if(!missing(level.sep)) { #Delimited levels
##          levels.raw <- strsplit(sub('(^[[:space:]]+)', '', levels.raw, perl=TRUE), 
##                                 level.sep, fixed=TRUE)
##      } else {
##          stop('Must specify either level.sep or level.widths')
##      }
##      levels.df <- data.frame(linenum=rowmapping.levels[rowmapping.levels$post > start &
##                                             rowmapping.levels$post < end, 'pre'])
##      for(i in seq_along(level.names)) {
##          levels.df[,level.names[i]] <- sapply(levels.raw, FUN=function(x) { stripWhite(x[i]) })
##      }
##      var <- codebook[codebook$linenum == rowmapping[start == rowmapping$post,'pre'], var.name]
##      varlevels[[var]] <- levels.df
##  }
##  
##  codebook$isfactor <- codebook[[var.name]] %in% names(varlevels)
##  
##  attr(codebook, 'levels') <- varlevels
##  class(codebook) <- c('codebook', 'data.frame')
##  return(codebook)
## }
"
http://jason.bryer.org/codebooks/ccdCodebook.txt
The codebook provided is in plain text and required two modifications: One, general file information at the top of the file was deleted, and two, any descriptions that spanned lines need to be modified so the are on only one line.
"
## [1] "\nhttp://jason.bryer.org/codebooks/ccdCodebook.txt\nThe codebook provided is in plain text and required two modifications: One, general file information at the top of the file was deleted, and two, any descriptions that spanned lines need to be modified so the are on only one line.\n"
setwd("~/tools/disentangle/codebooks_jbryer")
ccd.codebook <- parse.codebook(
  file        = "ccdCodebook.txt",
  var.names   = c("variable", "order", "type", "description"),
  level.names = c("level", "label"),
  level.sep   = "=",
  var.widths  = c(13, 7, 7, Inf)
)

head(ccd.codebook)
##   linenum variable order type
## 1       1 SURVYEAR     1   AN
## 2       3  NCESSCH     2   AN
## 3       5    FIPST     3   AN
## 4      17     IVAN     4   AN
##                                                                                      description
## 1                                                           Year corresponding to survey record.
## 2 Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO).
## 3                                      American National Standards Institute (ANSI) state code..
## 4                                                                                           Name
##   isfactor
## 1    FALSE
## 2    FALSE
## 3     TRUE
## 4    FALSE
str(ccd.codebook)
## Classes 'codebook' and 'data.frame': 4 obs. of  6 variables:
##  $ linenum    : int  1 3 5 17
##  $ variable   : chr  "SURVYEAR" "NCESSCH" "FIPST" "IVAN"
##  $ order      : chr  "1" "2" "3" "4"
##  $ type       : chr  "AN" "AN" "AN" "AN"
##  $ description: chr  "Year corresponding to survey record." "Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO)." "American National Standards Institute (ANSI) state code.." "Name"
##  $ isfactor   : logi  FALSE FALSE TRUE FALSE
##  - attr(*, "levels")=List of 1
##   ..$ FIPST:'data.frame':    9 obs. of  3 variables:
##   .. ..$ linenum: int  7 8 9 10 11 12 13 14 15
##   .. ..$ level  : chr  "01" "02" "04" "05" ...
##   .. ..$ label  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
ccd.var.levels <- attr(ccd.codebook, 'levels')
names(ccd.var.levels)
## [1] "FIPST"
ccd.var.levels[['FIPST']]
##   linenum level                label
## 1       7    01              Alabama
## 2       8    02               Alaska
## 3       9    04              Arizona
## 4      10    05             Arkansas
## 5      11    06           California
## 6      12    08             Colorado
## 7      13    09          Connecticut
## 8      14    10             Delaware
## 9      15    11 District of Columbia
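Once a level table like the FIPST one above has been extracted, attaching it to the raw data is a one-liner in base R. A small self-contained sketch (the data frame and the subset of states here are made up for illustration, in the level/label shape that parse.codebook() returns):

```r
## Raw data with coded state values, as they might appear in a CCD extract
dat <- data.frame(FIPST = c("01", "04", "11"), stringsAsFactors = FALSE)

## Level table in the shape parse.codebook() produces (level/label columns)
fipst <- data.frame(level = c("01", "02", "04", "11"),
                    label = c("Alabama", "Alaska", "Arizona",
                              "District of Columbia"),
                    stringsAsFactors = FALSE)

## Recode: codes become human-readable factor labels
dat$FIPST <- factor(dat$FIPST, levels = fipst$level, labels = fipst$label)
as.character(dat$FIPST)
## [1] "Alabama"              "Arizona"              "District of Columbia"
```

Codes absent from the level table would become NA, which is itself a useful check for undefined factor levels.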
petebaker commented 8 years ago

Will explore after lunch. I've been exploring my codebook/dryworkflow stuff on GitHub, which I haven't looked at for a year - too much on, unfortunately, but this is the perfect excuse. Cheers

TanyaMurphy commented 7 years ago

Hi, Very excited to find this thread. I had thrown together a note for myself a few days ago, which I included below. I will look at your posts more closely in a couple of weeks and try to make connections. In the meantime, you may find these resources interesting (if you haven't already seen them).

Regarding my Excel reference: Not a fan of it for my solo work, but it's popular with research coordinators and serves as a common tool for many research teams.


Could there be convergence between the data-cleaning systems of

- Edwin de Jonge and Mark van der Loo (https://github.com/data-cleaning)
- Win-Vector's "parametrized" data cleaning, e.g.
  - Using replyr::let to Parameterize dplyr Expressions - Win-Vector Blog (http://www.win-vector.com/blog/2016/12/using-replyrlet-to-parameterize-dplyr-expressions/) #rlang #data-prep
  - variable treatment - Win-Vector Blog (http://www.win-vector.com/blog/tag/variable-treatment/) #rlang #data-prep
  - Parametric variable names and dplyr - Win-Vector Blog (http://www.win-vector.com/blog/2016/12/parametric-variable-names-and-dplyr/)
- the tidyverse

In particular:

- facilitate using and enhancing an Excel-based data dictionary/codebook that contains (obviously) variable names, labels, descriptions, units (if applicable), flagging/cleaning rules, and factoring/categorization rules for individual variables, plus flagging/cleaning and categorization rules for joint distributions - ideally used as inputs to emerging functions that replace cumbersome nested ifelse statements.

Rules-based programming - SE: http://softwareengineering.stackexchange.com/questions/103659/how-can-one-manage-thousands-of-if-then-else-rules

https://github.com/tidyverse/tidyr/issues/183

https://github.com/hadley/dplyr/issues/631 (incidentally, Edwin de Jonge started this thread but did not comment again - are they taking a different path with their deducorrect and data-cleaning packages?)

Is this relevant? ropenscilabs/assertr: Assertive programming for R analysis pipelines (https://github.com/ropenscilabs/assertr)

I look forward to following the developments on this topic. Cheers!

ivanhanigan commented 7 years ago

Hi Tanya, good to hear from you. Also check out the EML package (http://ropensci.github.io/EML/, on CRAN), which lets you define the attributes of data objects - geographic coverage, temporal coverage, type (e.g. data.frame, image, spatial polygon), missing values, enumerated value lists and numeric ranges - along with copious amounts of relevant metadata such as context of who and why, IP, licensing, etc.
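Tanya's point about replacing cumbersome nested ifelse statements with rule tables can be sketched in a few lines of base R. This is only an illustration of the rule-table idea, not the API of the validate or assertr packages; all names and rules here are made up:

```r
## Each rule is a name plus a predicate returning TRUE for rows that pass
rules <- list(
  age_nonneg = function(d) d$age >= 0,
  age_plaus  = function(d) d$age <= 120,
  bmi_range  = function(d) d$bmi >= 10 & d$bmi <= 60
)

## Evaluate every rule against the data; report violations per rule
check_rules <- function(data, rules) {
  res <- vapply(rules, function(r) r(data), logical(nrow(data)))
  colSums(!res)  # one count of failing rows per rule
}

df <- data.frame(age = c(34, -2, 57), bmi = c(22, 25, 99))
check_rules(df, rules)
## age_nonneg  age_plaus  bmi_range
##          1          0          1
```

Because the rules live in a data structure rather than in code, they could equally be read from a codebook spreadsheet, which is exactly the convergence this thread is after.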