read_Sav regression "Unable to convert string to the requested encoding (invalid byte sequence)"

pdbailey0 commented 3 years ago

In haven 2.4.0, 2.4.1 (and 2.4.1.9000), I get an error when reading the International Association for the Evaluation of Educational Achievement's ePIRLS data available here.

After unzipping that to ~/ePIRLS/2016/ I do

x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav")
# Error: Failed to parse [snip]/ePIRLS/2016/asausae1.sav: Unable to convert string to the requested encoding (invalid byte sequence).

However, with haven 2.3.1 it reads in without errors. If I write it out with haven 2.3.1's write_sav, haven 2.4.1 can then read that file cleanly.

pdbailey0 commented 3 years ago

Maybe this is addressed in the read_dta/read_stata documentation which reads,

If you encounter an error such as "Unable to convert string to the requested encoding", try encoding = "latin1"

and this runs cleanly (no error on exit)

x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav", encoding="latin1")
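A minimal sketch of how this workaround could be wrapped so that the retry is visible rather than silent. The helper name `read_sav_fallback` and its `reader` parameter are hypothetical (not part of haven); only `haven::read_sav()` and its `encoding` argument come from the package. Note that a latin1 retry may mis-decode strings that really are UTF-8, so the result should be checked.

```r
# Hypothetical helper: try the default encoding first, then retry with a
# fallback encoding only when the encoding-conversion error is raised.
read_sav_fallback <- function(path, fallback = "latin1",
                              reader = haven::read_sav) {
  tryCatch(
    reader(path),
    error = function(e) {
      # Re-raise anything that is not the encoding error from this thread
      if (!grepl("requested encoding", conditionMessage(e), fixed = TRUE)) {
        stop(e)
      }
      warning("Default encoding failed; retrying with encoding = \"",
              fallback, "\". Check non-ASCII strings in the result.",
              call. = FALSE)
      reader(path, encoding = fallback)
    }
  )
}
```

Usage would be `x <- read_sav_fallback("~/ePIRLS/2016/asausae1.sav")`; the warning records that the fallback encoding was used.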

pdbailey0 commented 3 years ago

Nevertheless, it's odd that the NEWS since 2.1.0 mentions nothing about read_sav even though its behavior has changed. Is there a way to see what encoding is being used?

hadley commented 3 years ago

Unfortunately there's no easy way to track down what went wrong here 😞

Deleetdk commented 3 years ago

The file here https://worldsofjournalism.org/data-d79/data-and-key-tables-2012-2016/ produces the same error, and it is solved by @pdbailey0's suggestion above.

> woj2 = read_sav("data/WJS2 open V4-02 030517.sav")
Error: Failed to parse /science/projects/world of journalism/data/WJS2 open V4-02 030517.sav: Unable to convert string to the requested encoding (invalid byte sequence).
> woj2 = read_sav("data/WJS2 open V4-02 030517.sav", encoding = "latin1")

pdbailey0 commented 3 years ago

It would be really great to be able to see what encoding is being used when one is not set.

sam-crawley commented 3 years ago

Hi,

The problem with the suggested workaround is that it can break encoding. e.g. loading Afrobarometer Wave 6 (available here: https://afrobarometer.org/data/merged-round-6-data-36-countries-2016), in haven 2.3.1:

afb6 <- read_sav("merged_r6_data_2016_36countries2.sav")

levels(haven::as_factor(unique(afb6$COUNTRY)))[[25]]
[1] "São Tomé and Príncipe"

But in haven 2.4.3:

afb6 <- read_sav("merged_r6_data_2016_36countries2.sav", encoding = "latin1")

levels(haven::as_factor(unique(afb6$COUNTRY)))[[25]]
[1] "SÃ£o TomÃ© and PrÃncipe"

The file is UTF-8, and may have something broken about it, but loading it as latin1 may not be the solution?
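The garbling shown above can be reproduced in base R without the data file. This is only an illustrative sketch using the UTF-8 bytes for the first word of the country name (the variable names are made up for the example): decoding those bytes as latin1 turns each byte of a two-byte character into a separate character.

```r
# UTF-8 bytes for "Sao" with a tilde on the a: 53 c3 a3 6f
utf8_text <- rawToChar(as.raw(c(0x53, 0xc3, 0xa3, 0x6f)))

# Misreading: treat every byte as one latin1 character
misread <- iconv(utf8_text, from = "latin1", to = "UTF-8")
misread_points <- utf8ToInt(misread)    # S, U+00C3, U+00A3, o

# Correct reading: declare the same bytes as UTF-8
Encoding(utf8_text) <- "UTF-8"
correct_points <- utf8ToInt(utf8_text)  # S, U+00E3, o
```

The misread string has four characters instead of three, which matches the extra characters visible in the haven 2.4.3 output.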

hadley commented 3 years ago

@gorcha does this ring any bells with you? I don't think anything has changed in haven relating to this, so it might be a readstat bug?

pdbailey0 commented 3 years ago

haven only started documenting ReadStat version numbers in the NEWS in haven 2.4.0 (which uses ReadStat 1.1.5), but the 2.3.0 NEWS mentions SAS "any" encoding, so that was probably ReadStat 1.1.2.

gorcha commented 3 years ago

Hey @hadley, definitely nothing haven related.

I've done a bit of digging (thanks @pdbailey0 for the version tip!) and it's because of this change in ReadStat 1.1.4 WizardMac/Readstat@a8b04663ad399159b8ac710ed629295a40290c65 - reverting this line to the old code loads this file successfully with the default encoding.

So it looks like something is falling over in iconv, but not sure what exactly. I'll have a poke around and see if I can find something definitive.

gorcha commented 3 years ago

This is pretty obscure, but the short version is SPSS is probably not our friend and doesn't encode UTF-8 properly.

The issue is that SPSS (or at least the version that produced the offending files) appears to store multi-byte Unicode characters using the code point (a single byte) instead of code units (which can be 1 to 4 bytes).

The string that's causing the issue in the AfroBarometer file is "VOTAÇÃO", which shows up in row 39619. Having a look at the raw SPSS file, the hex representation is 56 4f 54 41 c7 c3 4f - the c7 and c3 represent the Ç and Ã characters respectively.

The problem is that, using Ã as an example, C3 is the "code point" representation, but the correct UTF-8 encoding is two bytes - C3 83 (see https://en.wikipedia.org/wiki/%C3%83).

So SPSS uses the correct "code point" representations of these two characters, but they're not the correct binary encoding for UTF-8. They should both be stored as multi-byte characters in a correct UTF-8 encoding.

I'm not deep enough in the ReadStat code to know why it was working fine, but it's failing now because it's being forced through iconv (for other very necessary reasons) under the totally fair but incorrect assumption that SPSS was encoding things in the way that it said it was.

You can get it down to the exact offending cell using:

read_sav("~/Downloads/merged_r6_data_2016_36countries2.sav", col_select = "Q29B", skip = 39618, n_max = 1)

@evanmiller can you please have a look?
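The bytes quoted above can be checked in base R without the file. This is a small sketch (variable names are illustrative): it confirms that the sequence is not valid UTF-8, that a correct UTF-8 encoding of U+00C3 is the two bytes C3 83, and that reading the bytes as single-byte characters recovers the intended text.

```r
# Bytes quoted for the offending cell: 56 4f 54 41 c7 c3 4f
cell <- rawToChar(as.raw(c(0x56, 0x4f, 0x54, 0x41, 0xc7, 0xc3, 0x4f)))

# validUTF8() looks at the bytes directly, ignoring any declared encoding
is_valid <- validUTF8(cell)

# What a correct UTF-8 encoding of U+00C3 looks like at the byte level
utf8_bytes_c3 <- charToRaw(intToUtf8(0xC3))

# Decoding the stored bytes one byte per character gives the intended string
decoded <- iconv(cell, from = "latin1", to = "UTF-8")
code_points <- utf8ToInt(decoded)
```

Here `is_valid` is `FALSE` because `c7` starts a two-byte sequence but is not followed by a continuation byte.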

gorcha commented 3 years ago

@pdbailey0, any idea what version of SPSS produced these files? It could be a version-specific thing.

evanmiller commented 3 years ago

Just FYI what you are calling the "code point" representation is actually Latin-1, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Is SPSS producing files containing both UTF-8 and Latin-1 data?

evanmiller commented 3 years ago

@pdbailey0 If you download the standalone readstat utility, it will report the file's self-reported encoding.

$ readstat binlfp2.sav
Format: SPSS binary file (SAV)
Columns: 8
Rows: 753
Table label: binlfp2
Format version: 2
Text encoding: UTF-8
Byte order: little-endian
Timestamp: 28 Oct 2015 14:34

pdbailey0 commented 3 years ago

@gorcha I didn't write it. You would have to ask Boston College. Maybe @sam-crawley knows what version wrote his?

sam-crawley commented 3 years ago

I was not involved in creating the Afrobarometer file either.

FWIW, I opened the file in SPSS 27. It happily opens the file, but shows broken/box characters for that one field.

I guess somehow a string with broken encoding got inserted in that field when the file was created (which SPSS perhaps should have detected and thrown an error about?)

However, since messy/broken data is a fact of life, perhaps the right approach is to warn about the problem, rather than to throw an error?

Thanks to everyone for their time on this issue so far.

gorcha commented 3 years ago

Oh my mistake, thanks @evanmiller!

I've had another look and it looks like in the Afrobarometer file there are just a handful of records with latin1-encoded characters. For example, the Ã shows up mostly as UTF-8 (C3 83) but a few times as latin1 (C3).

So I think you're right @sam-crawley, some funky characters have crept in at some point and SPSS doesn't properly enforce the encoding.

@evanmiller how would you feel about ReadStat copying over invalid bytes unedited rather than throwing an error? Obviously not ideal, but consistent with what SPSS does at least. I've hacked together something along those lines and it fixes this error, but I'm not sure what other nasty flow-on effects there might be or how this would interact with the other systems that ReadStat supports.
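For a file that mixes UTF-8 and latin1 strings as described above, one possible R-side cleanup after reading with encoding = "latin1" is to repair each string individually. This is a rough sketch, not something haven or ReadStat provides: the function name `repair_latin1_read` is hypothetical, and it assumes that any string whose original bytes form valid UTF-8 was meant to be UTF-8, which can occasionally be wrong.

```r
# Hypothetical helper for character vectors that were decoded as latin1.
# Strings whose original bytes are valid UTF-8 are re-declared as UTF-8;
# all other strings keep their latin1 interpretation.
repair_latin1_read <- function(x) {
  # Recover the original file bytes (NA if a character is outside latin1)
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  was_utf8 <- !is.na(bytes) & validUTF8(bytes)

  fixed <- bytes[was_utf8]
  Encoding(fixed) <- "UTF-8"  # same bytes, now interpreted as UTF-8

  out <- x
  out[was_utf8] <- fixed
  out
}
```

It could be applied to each character column of the data frame, for example with `lapply()`; value labels and variable labels would need the same treatment separately.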

skalteis commented 3 years ago

Hi @gorcha, would you mind sharing your patch for ReadStat/haven to deal with this problem? I could not find anything in your forked ReadStat-repo about it and I have a broken SPSS-file here that I have to deal with. I'd very much appreciate it and would be very happy. :-)

Thank you & best regards, Simon

gorcha commented 3 years ago

Hey @skalteis,

Of course, always happy to help! 🙂 I've pushed the change to the invalid-bytes branch on my ReadStat fork if you want to have a look.

deschen1 commented 2 years ago

I have the same issue (same error message) with one of my data sets. I didn't understand all of what was said above, but wanted to check if there's a fix or workaround to prevent/solve this issue?

I'm currently using an older version of haven, but this does not work well with other packages (i.e. the labelled package).

gorcha commented 2 years ago

Hi @deschen1, a fix is in progress (requiring some changes in the underlying ReadStat library).

Unfortunately there's no simple workaround in the meantime, but hoping to get this fixed soon!

deschen1 commented 2 years ago

Thanks for the update nonetheless. And thanks for working on this bug/issue.

deschen1 commented 2 years ago

FWIW, I have opened a bug report with SPSS, in case they might be able to do something about the behaviour. Here's their response. Not sure if it helps to solve the issue, though. I highlighted in bold two potentially helpful pieces.

The discussion here https://github.com/tidyverse/haven/issues/615#issuecomment-893584816 is about end users employing a third-party product, ReadStat, to read an existing SPSS Statistics system file (*.sav) into their application (not SPSS Statistics). So immediately we have questions: Did SPSS Statistics produce this system file? If it did, did it warn upon loading these data? Was SPSS Statistics correctly set up when it built this file? (We have many troubles with our customers randomly switching back and forth between Codepage mode and Unicode mode.)

See the continued discussion in your first link: https://github.com/tidyverse/haven/issues/615#issuecomment-893928250 The problem is a mix of UTF-8 and Latin1 (Codepage) characters in the example data file. SPSS Statistics will treat whatever you put into the system file as valid. In this case, the file was created with garbage text. I suspect, if it was created in SPSS Statistics, a warning was thrown when the original data was ingested prior to saving as '.sav'.

gorcha commented 2 years ago

Thanks @deschen1! Good to know, it confirms that SPSS doesn't enforce the specified character encoding.

aito123 commented 1 month ago

A solution that worked for me was to turn off Unicode mode in SPSS. Open SPSS, create a new syntax file, and run this code:

SET UNICODE OFF.

After that, open the dataset and save it again. Credit for this solution goes to this post: https://stackoverflow.com/questions/3136293/read-spss-file-into-r

I wish there were a more automatic solution inside R...
