tidyverse / haven

Read SPSS, Stata and SAS files from R
https://haven.tidyverse.org

write_sav fails when column names are Chinese #689

Closed jk1420 closed 2 years ago

jk1420 commented 2 years ago
library(haven)

info <- data.frame("流水号" = 1:2, "性别" = factor(c("男", "女")), "年龄" = c(25, 34))
info
#  流水号 性别 年龄
#1      1   男   25
#2      2   女   34

tmp <- tempfile(fileext=".sav")
write_sav(info, tmp)
#! Variables in `data` must have valid SPSS variable names.
#✖ Problems: `流水号`, `性别`, and `年龄`
#Run `rlang::last_error()` to see where the error occurred.
gorcha commented 2 years ago

Hi @jk1420, thanks for the report!

As it stands, the SPSS variable name validation doesn't correctly identify non-Latin alphabetic characters.

@hadley this appears to be because the regex is using perl = TRUE. You mentioned on the original PR (#660) that this is safer when it comes to Unicode ranges. Is there a specific issue with non-Perl-compatible regexes you had in mind, or are we OK to change this?

hadley commented 2 years ago

@gorcha what exactly is the definition of "alphabetical variable" that SPSS uses? [:alnum:] does what I expect, which is to allow only ASCII characters, which I thought is what we were matching in SPSS.

hadley commented 2 years ago

(To fix this, I'd prefer to switch to using the appropriate \p{} class with perl = TRUE rather than changing the regexp engine.)
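For readers following along, here is a minimal sketch of the difference being discussed. It assumes a standard R build whose PCRE has Unicode property support. The variable names are only illustrative, and the test string is written with \u escapes so that R marks it as UTF-8 regardless of locale.

```r
# "性别" written with escapes so the string is UTF-8 marked in any locale.
name <- "\u6027\u522b"

# With perl = TRUE, POSIX classes such as [:alpha:] match only ASCII
# characters unless the pattern opts in to Unicode properties via (*UCP).
posix_match <- grepl("^[[:alpha:]]", name, perl = TRUE)

# The Unicode property class \p{L} matches a letter in any script.
prop_match <- grepl("^\\p{L}", name, perl = TRUE)

posix_match  # FALSE
prop_match   # TRUE
```

This is only a demonstration of the regex behaviour, not the code haven uses for validation.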

gorcha commented 2 years ago

The SPSS documentation is a little vague about the exact character classes supported:

Variable names can be up to 64 bytes long, and the first character must be a letter or one of the characters @, #, or $. Subsequent characters can be any combination of letters, numbers, nonpunctuation characters, and a period (.)

Note: Letters include any nonpunctuation characters used in writing ordinary words in the languages supported in the platform's character set.

I think just using the Unicode Letter (\p{L}) and Number (\p{N}) classes should be enough to capture everything needed (possibly allowing marks, \p{M}, as well?). There may be some characters that SPSS technically supports that are missed by this, but I would rather be more restrictive and certain that the file can be read by SPSS than try to support every possible edge case.
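A rough sketch of what such a check could look like, based only on the rules quoted above. The function name is_valid_spss_name is hypothetical and this is not haven's actual implementation. Allowing marks, the underscore, and @, #, $ after the first character are assumptions, and the 64-byte limit is measured on the UTF-8 encoding.

```r
# Hypothetical helper, not part of haven: a conservative validity check.
is_valid_spss_name <- function(x) {
  x <- enc2utf8(x)

  # First character: a letter, or one of @, # or $.
  # Later characters: letters, marks, numbers, '.', '_', '@', '#', '$'
  # (marks, the underscore and the symbols are assumptions here).
  pattern <- "^[\\p{L}@#$][\\p{L}\\p{M}\\p{N}._@#$]*$"
  ok_chars <- grepl(pattern, x, perl = TRUE)

  # Names may be at most 64 bytes long, not 64 characters.
  ok_length <- nchar(x, type = "bytes") <= 64

  ok_chars & ok_length
}

# "流水号", a plain ASCII name, a name starting with a digit, a name with a space.
checks <- is_valid_spss_name(c("\u6d41\u6c34\u53f7", "age", "1abc", "a b"))
checks
```

Measuring the length in bytes matters for names like these: each of the Chinese characters above takes three bytes in UTF-8, so the limit is reached after about 21 such characters.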