tidyverse / haven

Read SPSS, Stata and SAS files from R
https://haven.tidyverse.org

write_dta needs to check for valid Stata variable names #132

Closed wbuchanan closed 8 years ago

wbuchanan commented 8 years ago
d3urls <- list("Selections" = "https://github.com/mbostock/d3/wiki/Selections",
         "Transitions" = "https://github.com/mbostock/d3/wiki/Transitions",
         "Arrays" = "https://github.com/mbostock/d3/wiki/Arrays",
         "Requests" = "https://github.com/mbostock/d3/wiki/Requests",
         "Formatting" = "https://github.com/mbostock/d3/wiki/Formatting",
         "Localization" = "https://github.com/mbostock/d3/wiki/Localization",
         "Colors" = "https://github.com/mbostock/d3/wiki/Colors",
         "Namespaces" = "https://github.com/mbostock/d3/wiki/Namespaces",
         "Math" = "https://github.com/mbostock/d3/wiki/Math", 
         "Internals" = "https://github.com/mbostock/d3/wiki/Internals",
         "Behaviors - Drag" = "https://github.com/mbostock/d3/wiki/Drag-Behavior",
         "Behaviors - Zoom" = "https://github.com/mbostock/d3/wiki/Zoom-Behavior",
         "Geo - Paths" = "https://github.com/mbostock/d3/wiki/Geo-Paths", 
         "Geo - Projections" = "https://github.com/mbostock/d3/wiki/Geo-Projections", 
         "Geo - Streams" = "https://github.com/mbostock/d3/wiki/Geo-Streams",
         "Geom - Voronoi" = "https://github.com/mbostock/d3/wiki/Voronoi-Geom", 
         "Geom - Hull" = "https://github.com/mbostock/d3/wiki/Hull-Geom",
         "Geom - Polygon" = "https://github.com/mbostock/d3/wiki/Polygon-Geom", 
         "Geom - Quadtree" = "https://github.com/mbostock/d3/wiki/Quadtree-Geom", 
         "Layouts - Bundle" = "https://github.com/mbostock/d3/wiki/Bundle-Layout", 
         "Layouts - Chord" = "https://github.com/mbostock/d3/wiki/Chord-Layout", 
         "Layouts - Cluster" = "https://github.com/mbostock/d3/wiki/Cluster-Layout", 
         "Layouts - Force" = "https://github.com/mbostock/d3/wiki/Force-Layout", 
         "Layouts - Hierarchy" = "https://github.com/mbostock/d3/wiki/Hierarchy-Layout", 
         "Layouts - Histogram" = "https://github.com/mbostock/d3/wiki/Histogram-Layout", 
         "Layouts - Pack" = "https://github.com/mbostock/d3/wiki/Pack-Layout", 
         "Layouts - Partition" = "https://github.com/mbostock/d3/wiki/Partition-Layout", 
         "Layouts - Pie" = "https://github.com/mbostock/d3/wiki/Pie-Layout", 
         "Layouts - Stack" = "https://github.com/mbostock/d3/wiki/Stack-Layout", 
         "Layouts - Tree" = "https://github.com/mbostock/d3/wiki/Tree-Layout", 
         "Layouts - Treemap" = "https://github.com/mbostock/d3/wiki/Treemap-Layout", 
         "Scales - Quantitative" = "https://github.com/mbostock/d3/wiki/Quantitative-Scales",
         "Scales - Ordinal" = "https://github.com/mbostock/d3/wiki/Ordinal-Scales",
         "Scales - Timeseries" = "https://github.com/mbostock/d3/wiki/Time-Scales",
         "SVG - Shapes" = "https://github.com/mbostock/d3/wiki/SVG-Shapes",
         "SVG - Axes" = "https://github.com/mbostock/d3/wiki/SVG-Axes",
         "SVG - Controls" = "https://github.com/mbostock/d3/wiki/SVG-Controls",
         "Time - Formatting" = "https://github.com/mbostock/d3/wiki/Time-Formatting",
         "Time - Scales" = "https://github.com/mbostock/d3/wiki/Time-Scales",
         "Time - Intervals" = "https://github.com/mbostock/d3/wiki/Time-Intervals") 

library(magrittr)
colnm <- names(d3urls)

# Scrape the first page: keep the paragraphs that start with "#" (method entries)
d3x <- xml2::read_html(d3urls[[1]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
d3x <- d3x[grepl("^#.*", d3x)]
d3x <- gsub("# ", "", d3x)
r <- seq_along(d3x)
d3x <- as.data.frame(cbind(r, d3x), stringsAsFactors = FALSE)
names(d3x) <- c("id", colnm[1])

# Repeat for the remaining pages and join each one on as a new column
for (i in seq_along(d3urls)[-1]) {
    x <- xml2::read_html(d3urls[[i]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
    x <- x[grepl("^#.*", x)]
    x <- gsub("# ", "", x)
    r <- seq_along(x)
    x <- as.data.frame(cbind(r, x), stringsAsFactors = FALSE)
    names(x) <- c("id", colnm[i])
    d3x <- dplyr::full_join(d3x, x, by = "id")
}

rm(x, r)

haven::write_dta(d3x, "~/Desktop/d3Methods.dta")

Then I load the file in Stata 14.1MP8 using:

use ~/Desktop/d3Methods.dta, clear

The problem occurs when I run the Stata command 'compress', which optimizes the file's on-disk storage by downcasting each variable to the smallest type that loses no precision (for example, 1.00000000000000000000000 would be stored as a 1-byte integer rather than a float/double). In this case, I think the problem lies in the writing functions and how they insert binary zeros around the strings in the data frame (Stata pads string columns with binary zeros so that every record reserves the same amount of storage for that column).

If I write the same data out to a csv:

write.csv(d3x, "~/Desktop/d3Methods.csv", row.names = FALSE)

Then load the same data in Stata:

. import delimited using ~/Desktop/d3Methods.csv, delim(",") varn(1) clear 
(41 vars, 102 obs)

. compress
  (0 bytes saved)

The issue goes away. I couldn't capture the other error because it crashed Stata each time. I can post the .dta files saved in version 13 and 14 formats if you'd like to compare them to the output from haven.

wbuchanan commented 8 years ago

Figured out the issue here. It seems that nothing in the underlying C library checks for valid Stata names, so the file is being written with variable (column) names like "Behaviors - Drag", which are illegal in Stata. To follow the usual Stata conventions, any delimiters should be replaced by a single underscore and names converted to lowercase. It is fine to have "Behaviors - Drag" as a variable label, but not as a variable name.
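As a workaround on the R side, the renaming convention described above could be sketched roughly as follows. `to_stata_name()` is a hypothetical helper, not part of haven or ReadStat. It assumes the conservative ASCII rule for Stata names (1 to 32 characters; letters, digits and underscores only; no leading digit) and does not check Stata's reserved words or guard against duplicate names after truncation.

```r
# Hypothetical helper: rewrite column names to lowercase, underscore-delimited
# names that satisfy a conservative ASCII-only version of Stata's naming rules.
to_stata_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # each run of delimiters becomes one underscore
  x <- gsub("^_+|_+$", "", x)      # drop leading and trailing underscores
  x <- ifelse(grepl("^[0-9]", x), paste0("_", x), x)  # names may not start with a digit
  substr(x, 1, 32)                 # Stata names are at most 32 characters
}

to_stata_name(c("id", "Behaviors - Drag", "Geo - Paths"))
# "id" "behaviors_drag" "geo_paths"
```

Applying `names(d3x) <- to_stata_name(names(d3x))` before the `haven::write_dta()` call would then give Stata legal variable names; the original names could be kept as variable labels instead.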

hadley commented 8 years ago

Could you please point me to the rules for determining valid Stata variable names?

evanmiller commented 8 years ago

See also https://github.com/WizardMac/ReadStat/issues/46

hadley commented 8 years ago

I've been burnt too many times by R's helpful auto-renaming rules, so I've opted to be strict here and throw an error.
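For illustration only, a strict check of this kind might look like the sketch below. `check_stata_names()` is a hypothetical function, not haven's actual implementation, and the regular expression is an assumption: it encodes a conservative ASCII-only rule (1 to 32 characters, letters, digits and underscores, no leading digit) and ignores reserved words.

```r
# Hypothetical strict validator: refuse to proceed rather than silently rename.
check_stata_names <- function(df) {
  nms <- names(df)
  # Assumed rule: letter or underscore first, then up to 31 letters/digits/underscores
  bad <- !grepl("^[A-Za-z_][A-Za-z0-9_]{0,31}$", nms)
  if (any(bad)) {
    stop(
      "Invalid Stata variable names: ",
      paste0("`", nms[bad], "`", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

df <- data.frame(id = 1:2, x = c("a", "b"))
names(df)[2] <- "Behaviors - Drag"
res <- tryCatch(check_stata_names(df), error = function(e) conditionMessage(e))
res
# "Invalid Stata variable names: `Behaviors - Drag`"
```

Failing loudly leaves the choice of replacement names to the caller, which avoids the surprises that automatic renaming can cause.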