sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy
Other
373 stars 150 forks source link

ValueError: invalid literal for int() with base 10: when reading a dataframe with a SPA character #472

Closed zandbak closed 2 years ago

zandbak commented 2 years ago

Describe the bug the SAS session crashes on dataframe2sasdata when reading a SPA character.

(3) Spark Jobs Using SAS Config named: default SAS Connection established. Subprocess id is 15174

+----------+ |testcolumn| +----------+ | –| +----------+

No SAS process attached. SAS process has terminated unexpectedly. ValueError: invalid literal for int() with base 10: 'No SAS process attached. SAS process has terminated unexpectedly.'

ValueError Traceback (most recent call last)

in 300 301 if __name__ == "__main__": # pragma: no cover --> 302 main() in main() 294 df = spark_df.toPandas() 295 --> 296 sas.dataframe2sasdata( 297 df, 'test', 'work', encode_errors='replace') 298 /databricks/python/lib/python3.8/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep, colrep, datetimes, outfmts, labels, outdsopts, encode_errors, char_lengths, **kwargs) 1533 datetimes, outfmts, labels, outdsopts, encode_errors, char_lengths, **kwargs) 1534 if rc is None: -> 1535 if self.exist(table, libref): 1536 dsopts = {} 1537 if outencoding: /databricks/python/lib/python3.8/site-packages/saspy/sasbase.py in exist(self, table, libref) 913 :rtype: bool 914 """ --> 915 return self._io.exist(table, libref) 916 917 def sasets(self) -> 'SASets': /databricks/python/lib/python3.8/site-packages/saspy/sasioiom.py in exist(self, table, libref) 1156 l2 = ll['LOG'].rpartition("TABLE_EXISTS= ") 1157 l2 = l2[2].partition("\n") -> 1158 exists = int(l2[0]) 1159 1160 return bool(exists) ValueError: invalid literal for int() with base 10: 'No SAS process attached. SAS process has terminated unexpectedly.' **To Reproduce** spark_df = spark.createDataFrame([ Row(testcolumn='\x96') ]) spark_df.show() df = spark_df.toPandas() sas.dataframe2sasdata( df, 'test', 'work', encode_errors='replace') **Expected behavior** a) That is works, as designed. b) if an error is given, it would be nice if the errormessage could be a little more insightful to what the actual problem could be. **Desktop (please complete the following information):** - OS: Azure Databricks - SAS Version: 9.4M6 **Additional context** The SAS server is configured with Latin1 encoding.
zandbak commented 2 years ago

N.B. it also fails with other special chars, like \x97, \x98.

tomweber-sas commented 2 years ago

Hey, sorry for not responding earlier, I was on vacation last week. Just taking a look at this now. Can you submit your SASsession object so I can see the details from it; like this:

>>> sas = saspy.SASsession()
SAS Connection established. Subprocess id is 2102

>>> sas
Access Method         = IOM
SAS Config name       = iomj
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_work1D460000380D_tom64-5/SAS_workA0C30000380D_tom64-5/
SAS Version           = 9.04.01M7D08052020
SASPy Version         = 4.3.1
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 14349

Also, can you run the following on the dataframe you're trying to use. I created a dataframe w/ this and it all works, but it uses the real utf-8 character, not a single byte invalid utf8 byte. I don't know what's in the dataframe you have since I'm creating mine differently.

>>> df
   0
0  Â
>>> df[0]
0    Â
Name: 0, dtype: object
>>> df[0][0]
'\x96'
>>> df[0][0].encode()
b'\xc2\x96'
>>> print(df[0][0])
Â
>>>

Here's the rest of what I tried for this. It works as expected, but I don't have enough info to see what's different for your case yet.


>>> sd = sas.df2sd(df)
>>> sd
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = Pandas

>>> sd.head()
   0
0  Â
>>> sd = sas.df2sd(df, results='text')
sd.>>> sd.head()
                                                           The SAS System                      Monday, July 18, 2022 02:16:00 PM   1

                                                             Obs    0

                                                              1     Â

>>>

Given the traceback, it seems as if the SAS session terminated in the middle of the df2sd() method. I've seen IOM do that when it get's invalid data (I think it's trying to transcode and if that fails it just dies). That's my best guess at the moment, but I need more info to go on. My guess it that in your dataframe you don't have a valid utf-8 char, rather only a single byte which isn't valid utf-8. So, need to see what you really have in there.

Thanks, Tom

zandbak commented 2 years ago

Hi Tom,

Thank you for your reply. Hope you had a good holiday. This is the output of the sassession object:

_Access Method = IOM SAS Config name = default SAS Config file = /databricks/python/lib/python3.8/site-packages/saspy/sascfg.py WORK Path = /sas/work/SAS_workA41E01EE01C8_sasgridcomputeo2.infra.local/SAS_workEDD701EE01C8_sasgridcomputeo2.infra.local/ SAS Version = 9.04.01M6P11072018 SASPy Version = 4.3.1 Teach me SAS = False Batch = True Results = Pandas SAS Session Encoding = latin1 Python Encoding value = latin1 SAS process Pid value = 32375240

To make it a bit easier to reproduce, I swapped the spark code with pandas code:

import pandas as pd

    lst = ['\x96']

    df = pd.DataFrame(lst)   
    print(df)
    print(df[0][0].encode())
    print(df[0][0])

    sas.dataframe2sasdata(df, 'test', 'work', encode_errors='replace')

This is the output: 0 0 – b'\xc2\x96' – No SAS process attached. SAS process has terminated unexpectedly.

We have (temporally) solved it by using a regex expression to remove all control characters, but it would of course be nice to not have to run this additional step.

replaced_df = (reduce(
        lambda df, col_name: df.withColumn(
            col_name, regexp_replace(col(col_name), '\p{C}', '')),
        string_cols,
        spark_df
    ))
tomweber-sas commented 2 years ago

Hey, thank you, I had a great vacation!

Thanks for the info above. I see now that you're running SAS in Latin1, which is why this is failing for you but I wasn't seeing it in my test case (I was running in utf-8). I've been able to reproduce this (same as you're seeing), and it is a case where the IOM java client code is terminating due to a transcode error down in the IOM client code; thus the 'SAS process terminated' failure. So that's what I was thinking before, though that's not a complete answer.

FWIW, x96 (\U0096) isn't a character in the Latin1 codepage. That and the others you mentioned aren't valid code points either.

I've been debugging this, in both the Python code and in the java code, as it's kind a matter of the SAS IOM layer getting a transcode failure that ends up terminating the session. I'm trying to see if I can catch that and at least continue instead of failing like it does. Though, that still won't make invalid data get over to SAS since this failure is happening way below me in the IOM layer.

So, trying this with a valid latin1 char, say xA6 instead of an invalid char (x96), works as expected since there's no transcoding failure in the java transcode layer. I can't change the behavior of the layers in Java that are doing this.

So, I now have to ask about why you're trying to send invalid chars to SAS, or can you run SAS in an encoding that supports what you're trying to do? This works fine when the chars your sending exist in the codepage you're trying to write to. And, I'm not seeing a way for me to address it in my code, really. Other than handling the failure better, which, I've just pushed code to do! It's catching this failure and getting the error message from IOM and returning that back to Python. At least there's that.

So, can you pull main from github and at least see if you then are getting what I expect with whatever you're real case is? I assume trying to send this one byte over isn't the real case you're working with, but rather what you narrowed it down to for me :) The new code will produce the following, without terminating the SAS session, and keep going; though the failure caused the data set not to have the complete data in it (none in this case since only one chunk of data was sent), since and it failed transcoding.

>>> import pandas
>>> df = pandas.DataFrame(['\u00a6','\u00a6','\u00b6','\u00a6'])
>>> df
   0
0  ¦
1  ¦
2  ¶
3  ¦
>>> sd = sas.df2sd(df, results='text', encode_errors='replace')
>>> sd
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = text

>>> sd.head()

                                                           The SAS System                           15:27 Tuesday, July 19, 2022   1

                                                              Obs    0

                                                               1     ¦
                                                               2     ¦
                                                               3     ¶
                                                               4     ¦

>>> sd.obs()
4
>>>
>>> df = pandas.DataFrame(['\u00a6','\u00a6','\u0096','\u00a6'])
>>> df
   0
0  ¦
1  ¦
2  ?
3  ¦
>>> sd = sas.df2sd(df, results='text', encode_errors='replace')
Failure in the IOM client code, likely a transcoding error encountered. Data transfer stopped on or before row 4
Rendering the error from the Java layer:

We failed in Submit
Failed to transcode data from U_UTF8_CE to U_LATIN1_CE encoding because it contained characters which are not supported by your SAS session encoding.  Please review your encoding= and locale= SAS system options to ensure that they can accommodate the data 
that you want to process.  A portion of the source string, in hex representation is:
>>> sd
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = text

>>> sd.head()

>>> sd.obs()
0
>>>

pip uninstall -y saspy pip install git+https://git@github.com/sassoftware/saspy.git

Thanks, Tom

zandbak commented 2 years ago

Hi Tom,

Thank you for your quick reply. Great that you have been able to reproduce the error.

The code that we wrote is just an intermediary between an external datasource and our SAS environment. Unfortunately I can neither prevent these characters to be in the source-data nor can I change the encoding of the SAS environment.

Indeed it took a bit of time to narrow the problem down to this specific char ;-)

I presumed the _encodeerrors='replace' option would do what we prefer to do in this case; just replace the invalid chars in a string with nothing or a space. I will try to test the new version as soon as I am back from my holiday, but from what I understand from your message it will not be the solution for us, as understand it stops processing, instead of ignoring/replacing the specific char?

tomweber-sas commented 2 years ago

Yeah, I appreciate you narrowing down the test case, that really helped!

And, correct, this change won't make it work. Unfortunately Python doesn't have issue transcoding between utf-8 and latin-1 either direction, even for chars that don't exist in latin-1. It just blindly goes between \u0000-\u00FF <-> \xC200-\xC2FF. It doesn't appear to care that there are 2 blocks of non-characters in Latin-1 (x00-x1F and x7F-x9F). So, using encode_erors='replace' isn't replacing anything because it doesn't consider any of these an error.

Unfortunately, as it happens, the transcode code in the IOM Java Client code, does have issue w/ these 2 blocks of non-characters and throws an exception. The change I pushed catches this exception and returns the error and keeps saspy and SAS running, but I have no way to make that code 'replace' instead of throw an exception (that I know of). The 'right' way to support this is to have SAS run with an encoding that supports the data you're trying to store, but I understand if you can't make that change. The IOM error that's now being returned sort of states this: Failed to transcode data from U_UTF8_CE to U_LATIN1_CE encoding because it contained characters which are not supported by your SAS session encoding. Please review your encoding= and locale= SAS system options to ensure that they can accommodate the data that you want to process.

I will try to see if there's anything else I can do on my end. In the mean time, have a great holiday! Catch up when you get back.

Thanks, Tom

tomweber-sas commented 2 years ago

following up on this; I can't make IOM change it's behavior, but the fix to catch the Java exception and at least keep SAS and Java and Python going is in the current production release now. Running SAS in an encoding that supports the data you're trying to load is the only real solution for this, I'm afraid. So, if an admin can configure a session for you that uses an encoding that supports the data, that would be the best answer. Else, knowing you have this specific issue, you can clean the input data w/ your work around so that IOM doesn't fail with its transcoding error.

tomweber-sas commented 2 years ago

Given I can't change the java IOM client code, and you have a workaround in python to replace these bytes before trying to transfer them (but can't run SAS in a compatible encoding), I'm wondering if we can close this issue? I don't see a better answer for this situation. Thanks! Tom

zandbak commented 2 years ago

Hi Tom,

Thank you again for your response. This issue can be closed.

Kind regards, Rob

On Tue, Sep 20, 2022 at 5:09 PM Tom Weber @.***> wrote:

Given I can't change the java IOM client code, and you have a workaround in python to replace these bytes before trying to transfer them (but can't run SAS in a compatible encoding), I'm wondering if we can close this issue? I don't see a better answer for this situation. Thanks! Tom

— Reply to this email directly, view it on GitHub https://github.com/sassoftware/saspy/issues/472#issuecomment-1252502538, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJKJ575CT3GG6FNTQ6AWF63V7HHTPANCNFSM53HMTTKQ . You are receiving this because you authored the thread.Message ID: @.***>

tomweber-sas commented 2 years ago

Thanks Rob. Let me know if there's anything else you need!

Thanks, Tom