sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

'utf-8' codec can't decode byte 0xc3 : saspy SPDE dataset #456

Closed AnandReddy23 closed 2 years ago

AnandReddy23 commented 2 years ago

Hi Tom, I am working with the saspy package and encountered the error below. Could you help me with a resolution, please? My underlying SAS data set is in SPDS format.

Code:

import saspy
import pandas as pd
from datetime import datetime

sas = saspy.SASsession()
sas.saslib('SS', 'SPDE', 'hdfs path which stores spds data sets', 'hdfshost=default')
db = sas.sasdata('table', 'SS').to_df()

Error:

'utf-8' codec can't decode byte 0xc3 in position 254179: invalid continuation byte
sasdata2dataframe was interupted. Trying to return the saslog instead of a data frame.

sas Access Method = SSH
SAS Config name = ssh
SAS Config file = /python3.6/site-packages/saspy/sascfg.py
WORK Path = <>
SAS Version = 9.04.01M7P08052020
SASPy Version = 3.7.2
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf-8
SAS process Pid value = 1101

tomweber-sas commented 2 years ago

Great, thanks! So, the most likely cause of this is that the data in your SPDS data set doesn't contain valid utf-8 character(s). Do you know how the data set was created? Unfortunately, SAS doesn't do anything to validate characters in its own data sets. The problem only shows up when the data has to be transferred to another system (Python in this case; it could be a database or something else) and the I/O access method (or the other system) has to transcode the data. That's when things like this pop up.

A very easy way for this to happen (one I just tracked down internally the other day) is using proc import to read a csv file and save it to a SAS data set (an SPDS data set is the same). Proc import corrupted the data because, by default, it simply guesses at the data types and lengths by reading the first handful of rows; it got the lengths wrong (too short) for some rows later in the file and simply truncated them, chopping off parts of a multibyte character. This looks just like that case. 0xC3 is the first byte of a 2-byte sequence in utf-8. My guess is that the data in your data set isn't valid, probably truncated.
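For reference, here is a minimal sketch of that proc import scenario run through saspy's submit() (the CSV path and output table name are hypothetical); GUESSINGROWS=MAX forces proc import to scan the whole file before settling on character lengths, at the cost of a much slower import:

import saspy

sas = saspy.SASsession()

# Hypothetical import: without GUESSINGROWS=MAX, proc import guesses column
# lengths from the first rows only and can silently truncate longer values
# later in the file, chopping a multibyte utf-8 character in half.
res = sas.submit("""
proc import datafile='/path/to/input.csv'
            out=work.imported dbms=csv replace;
    guessingrows=max;
run;
""")
print(res['LOG'])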

So, back to - do you know how that data set got created?

Tom

AnandReddy23 commented 2 years ago

Hi Tom, the initial data set is an AVRO-format Hive table, which is then transformed to a SAS data set, and that SAS data set is now required in the Python environment. Also, the error log doesn't say which column caused the error. Adding the error below from print(sas.lastlog()):

ERROR: Broken pipe.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program. Aborted during the EXECUTION phase.

tomweber-sas commented 2 years ago

Ok, do you know how the data was loaded into SAS? Again, if it was something like proc import, it can corrupt the data without even telling you. If it's an engine, that shouldn't corrupt the data, as it should be transcoding as necessary, but it still depends. For the proc import case you have to use GUESSINGROWS=MAX; for it to get the metadata correct, but then it takes a long time to run if the data is large.

The other case is if I'm causing it. I see you have 3.7.2, which is about a year old, so I've been looking back to see if I happened to have fixed anything like this since then. But I believe sd2df() (which you're getting the error on) is the same as the current version, even back in that release, so I don't believe that's the case either.

The error is a Python error while trying to transcode the data being streamed over for the data frame. The stream isn't in a format where Python would know the row or column, so it can't really report that. As tedious as it is, I tracked down that same problem here by using obs= and firstobs= to find the row with the corrupted data: the read worked unless that row was included, so I just kept adjusting the rows until I tracked down the one with the truncated data. If we can see the code used to create the data set, that may help.
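For illustration, a single probe of a row range with those data set options might look like the sketch below (the table/libref names come from the original post; the row numbers are arbitrary placeholders):

import saspy

sas = saspy.SASsession()

# Read only rows 100001-200000: firstobs= and obs= are the SAS data set
# options, so obs= is the last row number to read, not a row count.
chunk = sas.sasdata('table', 'SS',
                    dsopts={'firstobs': 100001, 'obs': 200000}).to_df()
print(type(chunk))   # a DataFrame only if that range transcoded cleanly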

AnandReddy23 commented 2 years ago

Thanks Tom.

We are using SAS DI to connect to the Hive schema and load the data to a SAS table. I think this error is because of a user free-text column, which is huge and can contain any data (including special literals). Could you help me specify the list of column names that should be part of my df from the SAS data set in the call below, please?

db=sas.sasdata('table','SS').to_df()

tomweber-sas commented 2 years ago

Can you show me the log from the DI step that did this? That would help. I'm not sure I understand what you are asking about a column list. What are you trying to do? Are you trying to drop that column you think has the problem, to see if that's it? Or are you just looking to see what the columns are of the data set? Or something else?

AnandReddy23 commented 2 years ago

Hi Tom - I don't have access to that code, I am sorry about that. Yes, I will drop just that one column which I think has the problem and see if the rest of the table gets loaded into the df without error.


tomweber-sas commented 2 years ago

Ok, I understand. FWIW, I believe the internal user I was helping the other day to find this same problem was using a UI (probably DI too), so I'm even more concerned that the data set you're using has corrupt data in it. The log in that case showed proc import being used without guessingrows=max; adding it did fix the problem, since the data was no longer being truncated arbitrarily.

To drop a column, you can simply use the dsopts dictionary: https://sassoftware.github.io/saspy/api.html#sas-data-object So, your code would just be something like this:

db=sas.sasdata('table','SS', dsopts={'drop' : 'bad_column'}).to_df()
tomweber-sas commented 2 years ago

You can also use obs and firstobs in that dict to subset rows, to see if you can find the corrupt one(s). SASPy handles any valid characters, so a text field with special chars shouldn't be a problem, as long as they are valid characters in the encoding the data is supposed to be in. The case I suspect is that the data is being truncated, and the truncation happens to cut a multi-byte character in half, which makes it invalid; it's not a valid character anymore.
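A rough sketch of automating that hunt with those dsopts keys (not a saspy feature; the table/libref names come from this thread, and nrows is a total row count you would supply yourself, e.g. from proc contents). Depending on the saspy version, a failed transcode may raise or may return the SAS log string instead of a DataFrame, so the check below handles both:

import pandas as pd
import saspy

def reads_ok(sas, table, libref, first, last):
    # Try to pull rows first..last; True only if a DataFrame comes back.
    try:
        res = sas.sasdata(table, libref,
                          dsopts={'firstobs': first, 'obs': last}).to_df()
    except UnicodeDecodeError:
        return False
    return isinstance(res, pd.DataFrame)

def find_bad_row(sas, table, libref, nrows):
    # Binary-search for one corrupted row between 1 and nrows.
    lo, hi = 1, nrows
    while lo < hi:
        mid = (lo + hi) // 2
        if reads_ok(sas, table, libref, lo, mid):
            lo = mid + 1    # lower half is clean, so a bad row is above mid
        else:
            hi = mid        # at least one bad row is at or below mid
    return lo

sas = saspy.SASsession()
print(find_bad_row(sas, 'table', 'SS', nrows=1000000))  # nrows is hypothetical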

AnandReddy23 commented 2 years ago

dsopts={'drop' : 'bad_column'}

Thanks Tom. Yes, that column was the issue. :) Also, from the link you shared, I added 'encoding' : 'latin9', but it still returned the same error. Shouldn't the encoding option resolve my issue?

tomweber-sas commented 2 years ago

Well, it's good that you now know the column. Changing that encoding value won't fix the data. What you showed says SAS is running in utf-8, and that's what the encoding= value means: the encoding SAS is running in, so saspy can transcode to Python's utf-8 correctly. I believe the data in that column (for at least one row; there's no way to know how many) is truncated. Trying to interpret the bytes as a different encoding won't make them right, and if you actually did override the encoding value for saspy, you would get transcode errors in other places, or simply wrong values for any characters that weren't 7-bit ascii.

I still believe the data in that column was truncated when it was created, leaving invalid bytes (parts of a multibyte character) that cause this error when the data is processed outside of SAS (inside SAS it's just a fixed-length byte array and isn't really processed as characters). Are you able to interact with whatever process (or whoever) created this data set?
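As a small, saspy-independent illustration of that point: a truncated two-byte utf-8 sequence still fails to decode as utf-8, and reinterpreting it as a single-byte encoding "works" but yields the wrong character, not the original data:

# 'é' is U+00E9, encoded in utf-8 as the two bytes 0xC3 0xA9.
good = 'café'.encode('utf-8')       # b'caf\xc3\xa9'
truncated = good[:4] + b'X'         # chop off the continuation byte; data continues

try:
    truncated.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                        # "can't decode byte 0xc3 ... invalid continuation byte"

print(truncated.decode('latin-1'))  # no error, but prints 'cafÃX', not 'café'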

AnandReddy23 commented 2 years ago

I tried another way. First I created a Hive table using proc sql. It executed without any error, but the user free text broke into new records and the data was corrupted in the destination. Then I tried the step below, and the counts matched between source and destination:

data hivesch.tbl(drop=user_text_2 dbcreate_table_opts='stored as avro');
    set saslib.tbl;
    user_text_2=compress(user_text,,'kw');
    user_text=user_text_2;
run;

Could we add the above logic while converting data from SAS to a DF?

tomweber-sas commented 2 years ago

I'm not sure I understand. You took the data set that is believed to have truncated data in it (corrupting some multibyte characters) and used the compress function to remove some characters, as a way to try to get around this? If the process that creates this data set from other data causes truncation/corruption, you have to go back and fix that process so your output data is correct. You can't fix truncated data after the fact and have it be right; even if you clean up the partial characters, the data still isn't right. Maybe I'm missing what you're trying to do?

AnandReddy23 commented 2 years ago

Oh, I didn't realize 'kw' truncates data instead of just compressing and removing special characters. I will get length(user_text) for all rows in both the Hive and SAS tables to see if there are any differences, and change my code accordingly.
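One way to do that on the SAS side is a quick proc sql summary submitted through the existing sas session from earlier (a sketch only; the libref, table, and column names are the ones used in this thread), which could then be compared against the same query on the Hive source:

res = sas.submit("""
proc sql;
    select min(length(user_text)) as min_len,
           max(length(user_text)) as max_len,
           count(*)               as n_rows
    from SS.table;
quit;
""", results='TEXT')
print(res['LST'])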

tomweber-sas commented 2 years ago

I've never used that function; I had to look it up. It says: Returns a character string with specified characters removed from the original string. Again, you can't uncorrupt data after the fact; you have to fix the process so it doesn't get corrupted to begin with.

AnandReddy23 commented 2 years ago


Thanks Tom. We have fixed the corrupt data and are able to read the data now :)

tomweber-sas commented 2 years ago

Glad to hear that. I've been trying to find a function or macro in the SAS NLS stuff to actually validate the character data in a variable, but I haven't found anything that does that. Did you fix it at import time? Can you let me know what you found, even by email instead of posting here if that matters? I'm curious, and it may help others if I understand what actually happened. Thanks! Tom
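In the meantime, a generic Python-side check (nothing to do with saspy or SAS NLS; the export path is hypothetical) that scans raw bytes, e.g. a delimited dump of the suspect column, and reports where they stop being valid utf-8:

def bad_utf8_offsets(raw: bytes):
    # Yield the byte offset of each spot where the data fails to decode as utf-8.
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode('utf-8')
            return                  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as e:
            yield pos + e.start     # absolute offset of the first bad byte
            pos += e.end            # skip past the bad sequence and keep scanning

with open('/path/to/exported_column.txt', 'rb') as f:   # hypothetical export file
    for off in bad_utf8_offsets(f.read()):
        print('invalid utf-8 byte at offset', off)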

AnandReddy23 commented 2 years ago

Hi Tom, during ingestion to SAS from Hive the data is cleansed and then inserted, and this has resolved the issue. I am sorry, but as I am not from that team I am unable to share the code :(

tomweber-sas commented 2 years ago

I understand. Thanks for getting back to me :)