jk1420 closed 2 years ago
Hi @jk1420, thanks for the report!
As it stands, the SPSS variable name validation isn't correctly identifying non-Latin alphabetic characters.
@hadley this appears to be because the regex is using `perl = TRUE`. You had mentioned on the original PR (#660) that this is safer when it comes to Unicode ranges. Is there a specific issue with non-Perl-compatible regexes you had in mind, or are we OK to change this?
@gorcha what exactly is the definition of "alphabetical variable" that SPSS uses? `[:alnum:]` does what I expect, which is to only allow ASCII characters, which I thought is what we were matching in SPSS.
(To fix this, I'd prefer to switch to using the appropriate `\p{}` class with `perl = TRUE` rather than changing the regexp engine.)
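For anyone following along, here is a minimal sketch of the difference being discussed. The example names are made up for illustration, and the patterns are not the ones used in the package. It only compares a POSIX class with Unicode property classes under `perl = TRUE`:

```r
# Hypothetical variable names, written with \u escapes so the script
# does not depend on the encoding of the source file.
latin    <- "age"
cyrillic <- "\u0432\u043e\u0437\u0440\u0430\u0441\u0442"  # Cyrillic word for "age"
nms <- c(latin, cyrillic)

# POSIX class under PCRE: as described above, this only accepts ASCII
# letters and digits, so the Cyrillic name is rejected.
posix_ok <- grepl("^[[:alnum:]]+$", nms, perl = TRUE)

# Unicode property classes under PCRE: \p{L} is any letter and \p{N} is
# any number, so both names are accepted.
unicode_ok <- grepl("^[\\p{L}\\p{N}]+$", nms, perl = TRUE)

print(data.frame(name = nms, posix_ok, unicode_ok))
```

The regex engine stays the same in both calls; only the character class changes.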
The SPSS documentation is a little vague about the exact character classes supported:
Variable names can be up to 64 bytes long, and the first character must be a letter or one of the characters @, #, or $. Subsequent characters can be any combination of letters, numbers, nonpunctuation characters, and a period (.)
Note: Letters include any nonpunctuation characters used in writing ordinary words in the languages supported in the platform's character set.
I think just using the Letter and Number classes should be enough to capture everything needed (possibly allow marks as well?). There may be some characters that SPSS technically supports that are missed by this but I would rather be more restrictive and certain that the file can be read by SPSS than try and support every possible edge case.
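As a rough sketch of what that could look like (not the actual implementation): `is_valid_spss_name()`, `allow_marks`, and `extra` are hypothetical names, and allowing `_`, `@`, `#`, and `$` after the first character is an assumption on my part, since the quoted rule doesn't spell those out.

```r
# Hypothetical validator, for illustration only.
# - First character: a letter or one of @, #, $.
# - Later characters: letters, numbers, a period, optionally marks,
#   plus `extra` (an assumption, not taken from the quoted rule).
# - At most 64 bytes when encoded as UTF-8.
is_valid_spss_name <- function(x, allow_marks = FALSE, extra = "_@#$") {
  x <- enc2utf8(x)
  rest <- paste0("\\p{L}\\p{N}", if (allow_marks) "\\p{M}", ".", extra)
  pattern <- paste0("^[\\p{L}@#$][", rest, "]*$")
  grepl(pattern, x, perl = TRUE) & nchar(x, type = "bytes") <= 64
}

candidates <- c(
  "age",
  "\u0432\u043e\u0437\u0440\u0430\u0441\u0442",  # Cyrillic letters only
  "1abc",                                        # starts with a digit
  "@x.1",
  "a b",                                         # contains a space
  strrep("a", 65)                                # 65 bytes, over the limit
)
valid <- is_valid_spss_name(candidates)
```

Erring on the restrictive side here means some names that SPSS would technically accept get rejected, but anything that passes should be readable by SPSS.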