sasdata2parquet - Githubissues

The current approach to staging large tables (that do not fit into memory) from SAS into parquet files would be to paginate through them, i.e. using the sasdata2dataframe method. This is relatively inefficient as the overhead of opening the connection and potentially reapplying filters is considerable.

Proposed solution:

I have looked into the actual IO classes and modified the sasdata2dataframeDISK method from the SASsessionIOM class such that instead of reading the entire stream into a pandas dataframe, the stream is read in chunks into a pyarrow table and written out into a parquet file.

See a first working draft below:

import io
from saspy.sasioiom import SASsessionIOM,_read_sock,logger
import pyarrow as pa
import pyarrow.csv as pc
import pyarrow.parquet as pq

def sasdata2parquet(self: SASsessionIOM, 
                     parquet_file_path:str,
                     table: str, 
                     libref: str ='',
                     pa_schema:pa.schema = None,
                     dsopts: dict = None,
                     rowsep: str = '\n',#pyarrow needs \n #'\x01', 
                     colsep: str = '\x02',
                     rowrep: str = ' ',   
                       colrep: str = ' ', 
                     **kwargs
                     ) -> None:

   """
   This method exports the SAS Data Set to a Parquet file.
   table    - the name of the SAS Data Set you want to export to a Parquet file
   libref   - the libref for the SAS Data Set.
   parquet_file_path - path of the parquet file
   pa_schema - an optional pyarrow schema that overrides the default schema
   dsopts   - data set options for the input SAS Data Set
   rowsep   - the row seperator character to use; defaults to '\x01'
   colsep   - the column seperator character to use; defaults to '\x02'
   rowrep  - the char to convert to for any embedded rowsep chars, defaults to  ' '
   colrep  - the char to convert to for any embedded colsep chars, defaults to  ' '

   These two options are for advanced usage. They override how saspy imports data. For more info
   see https://sassoftware.github.io/saspy/advanced-topics.html#advanced-sd2df-and-df2sd-techniques

   my_fmts - bool: if True, overrides the formats saspy would use, using those on the data set or in dsopts=
   """

   tmp = kwargs.pop('tempfile', None)
   tmp = kwargs.pop('tempkeep', None)

   dsopts = dsopts if dsopts is not None else {}

   logf     = b''
   lstf     = b''
   logn     = self._logcnt()
   logcodei = "%put E3969440A681A24088859985" + logn + ";"
   lstcodeo =   "E3969440A681A24088859985" + logn
   logcodeo = "\nE3969440A681A24088859985" + logn
   logcodeb = logcodeo.encode()

   if libref:
      tabname = libref+".'"+table.strip().replace("'", "''")+"'n "
   else:
      tabname = "'"+table.strip().replace("'", "''")+"'n "

   code  = "data work.sasdata2dataframe / view=work.sasdata2dataframe; set "+tabname+self._sb._dsopts(dsopts)+";run;\n"
   code += "data _null_; file LOG; d = open('work.sasdata2dataframe');\n"
   code += "length var $256;\n"
   code += "lrecl = attrn(d, 'LRECL'); nvars = attrn(d, 'NVARS');\n"
   code += "lr='LRECL='; vn='VARNUMS='; vl='VARLIST='; vt='VARTYPE=';\n"
   code += "put lr lrecl; put vn nvars; put vl;\n"
   code += "do i = 1 to nvars; var = compress(varname(d, i), '00'x); put var; end;\n"
   code += "put vt;\n"
   code += "do i = 1 to nvars; var = vartype(d, i); put var; end;\n"
   code += "run;"

   ll = self.submit(code, "text")

   try:
      l2 = ll['LOG'].rpartition("LRECL= ")
      l2 = l2[2].partition("\n")
      lrecl = int(l2[0])

      l2 = l2[2].partition("VARNUMS= ")
      l2 = l2[2].partition("\n")
      nvars = int(l2[0])

      l2 = l2[2].partition("\n")
      varlist = l2[2].split("\n", nvars)
      del varlist[nvars]

      dvarlist = list(varlist)
      for i in range(len(varlist)):
         varlist[i] = varlist[i].replace("'", "''")

      l2 = l2[2].partition("VARTYPE=")
      l2 = l2[2].partition("\n")
      vartype = l2[2].split("\n", nvars)
      del vartype[nvars]
   except Exception as e:
      logger.error("Invalid output produced durring sasdata2dataframe step. Step failed.\
      \nPrinting the error: {}\nPrinting the SASLOG as diagnostic\n{}".format(str(e), ll['LOG']))
      # return None

   topts = dict(dsopts)
   topts.pop('firstobs', None)
   topts.pop('obs', None)

   code  = "proc delete data=work.sasdata2dataframe(memtype=view);run;\n"
   code += "data work._n_u_l_l_;output;run;\n"
   code += "data _null_; set work._n_u_l_l_ "+tabname+self._sb._dsopts(topts)+";put 'FMT_CATS=';\n"

   for i in range(nvars):
      code += "_tom = vformatn('"+varlist[i]+"'n);put _tom;\n"
   code += "stop;\nrun;\nproc delete data=work._n_u_l_l_;run;"

   ll = self.submit(code, "text")

   try:
      l2 = ll['LOG'].rpartition("FMT_CATS=")
      l2 = l2[2].partition("\n")
      varcat = l2[2].split("\n", nvars)
      del varcat[nvars]
   except Exception as e:
      logger.error("Invalid output produced durring sasdata2dataframe step. Step failed.\
      \nPrinting the error: {}\nPrinting the SASLOG as diagnostic\n{}".format(str(e), ll['LOG']))
      # return None

   rdelim = "'"+'%02x' % ord(rowsep.encode(self.sascfg.encoding))+"'x"
   cdelim = "'"+'%02x' % ord(colsep.encode(self.sascfg.encoding))+"'x "

   idx_col = kwargs.pop('index_col', False)
   eng     = kwargs.pop('engine',    'c')
   my_fmts = kwargs.pop('my_fmts',   False)
   k_dts   = kwargs.pop('dtype',     None)
   if k_dts is None and my_fmts:
      logger.warning("my_fmts option only valid when dtype= is specified. Ignoring and using necessary formatting for data transfer.")
      my_fmts = False

   code = "data _null_; set "+tabname+self._sb._dsopts(dsopts)+";\n"

   if not my_fmts:
      for i in range(nvars):
         if vartype[i] == 'N':
            code += "format '"+varlist[i]+"'n "
            if varcat[i] in self._sb.sas_date_fmts:
               code += 'E8601DA10.'
            else:
               if varcat[i] in self._sb.sas_time_fmts:
                  code += 'E8601TM15.6'
               else:
                  if varcat[i] in self._sb.sas_datetime_fmts:
                     code += 'E8601DT26.6'
                  else:
                     code += 'best32.'
            code += '; '
            if i % 10 == 9:
               code +='\n'

   lreclx = max(self.sascfg.lrecl, (lrecl + nvars + 1))

   miss  = {}
   code += "\nfile "+self._tomods1.decode()+" lrecl="+str(lreclx)+" dlm="+cdelim+" recfm=v termstr=NL encoding='utf-8';\n"
   for i in range(nvars):
      if vartype[i] != 'N':
         code += "'"+varlist[i]+"'n = translate('"
         code +=     varlist[i]+"'n, '{}'x, '{}'x); ".format(   \
                     '%02x%02x' %                               \
                     (ord(rowrep.encode(self.sascfg.encoding)), \
                        ord(colrep.encode(self.sascfg.encoding))),
                     '%02x%02x' %                               \
                     (ord(rowsep.encode(self.sascfg.encoding)), \
                        ord(colsep.encode(self.sascfg.encoding))))
         miss[dvarlist[i]] = ' '
      else:
         code += "if missing('"+varlist[i]+"'n) then '"+varlist[i]+"'n = .; "
         miss[dvarlist[i]] = '.'
      if i % 10 == 9:
         code +='\n'
   code += "\nput "
   for i in range(nvars):
      code += " '"+varlist[i]+"'n "
      if i % 10 == 9:
         code +='\n'
   code += rdelim+";\nrun;"

   if k_dts is None:
      dts = {}
      for i in range(nvars):
         if vartype[i] == 'N':
            if varcat[i] not in self._sb.sas_date_fmts + self._sb.sas_time_fmts + self._sb.sas_datetime_fmts:
               dts[dvarlist[i]] = 'float'
            else:
               dts[dvarlist[i]] = 'str'
         else:
            dts[dvarlist[i]] = 'str'
   else:
      dts = k_dts

   quoting = kwargs.pop('quoting', 3)

   ll = self._asubmit(code, "text")
   self.stdin[0].send(b'\ntom says EOL='+logcodeb)
   #self.stdin[0].send(b'\n'+logcodei.encode()+b'\n'+b'tom says EOL='+logcodeb)
   # 
   # try:
   try:
      print("Getting stream")
      sockout = _read_sock(io=self, method='DISK', rsep=(colsep+rowsep+'\n').encode(), rowsep=rowsep.encode(),
                           lstcodeo=lstcodeo.encode(), logcodeb=logcodeb)

      if os.path.exists(parquet_file_path):
         os.remove(parquet_file_path)

      #### process stream ####
      parquet_writer = None
      parsed_chunk = None
      if os.path.exists(parquet_file_path):
         os.remove(parquet_file_path)
      parse_options = pc.ParseOptions(delimiter=colsep)
      read_options=pc.ReadOptions(column_names=dvarlist)
      if pa_schema:
         convert_options = pc.ConvertOptions(column_types=pa_schema)
      else:
         convert_options = pc.ConvertOptions(column_types=dts)

         while True:
            # pyarrow needs \n
            # parsed_chunk = sockout.read(4096*1000).replace(rowsep,'\n').encode()
            # instead set rowsep in the socket method
            parsed_chunk = sockout.read(4096*1000).encode()
            if parsed_chunk == b'':
               print("Done")
               break
            # print(parsed_chunk[:10])

            table = pc.read_csv(
               io.BytesIO(parsed_chunk),
               parse_options=parse_options,
               read_options=read_options,
               convert_options=convert_options)

            if not parquet_writer:
               parquet_writer = pq.ParquetWriter(parquet_file_path,schema=table.schema,use_deprecated_int96_timestamps=True,write_statistics = False)

            # Write the table chunk to the Parquet file
            parquet_writer.write_table(table)
   except:
      if os.name == 'nt':
         try:
            rc = self.pid.wait(0)
            self.pid = None
            self._sb.SASpid = None
            logger.fatal('\nSAS process has terminated unexpectedly. RC from wait was: '+str(rc))
            # return None
         except:
            pass
      else:
         rc = os.waitpid(self.pid, os.WNOHANG)
         if rc[1]:
               self.pid = None
               self._sb.SASpid = None
               logger.fatal('\nSAS process has terminated unexpectedly. RC from wait was: '+str(rc))
               # return None
      raise
   finally:
      if parquet_writer:
               parquet_writer.close()

import saspy
table="test"
libref="test"
parquet_file_path="test.parquet"
sas = saspy.SASsession()
sasdata2parquet(sas._io,table=table,libref=libref,parquet_file_path=parquet_file_path)

Results

With my small benchmark table <1GB, the speed compared to pagination is already >3x faster. I expect the difference to be even higher for very large tables that apply filters.

Conclusion

I think this would really be a worthwhile addition to the saspy project. Staging tables from SAS into data lakes is a common issue, and this would help reduce loading times for a lot of users.

I would like to know if there is any immediate feedback on this draft with points I missed. Please be critical.

If you think this would be a good addition, I could open a pull request with the initial draft method for the SASsessionIOM class. However, carefully integrating this into the other IOM classes and beyond is not in my current capacity. I would need to leave that to you.

@rainermensing Thanks for this POC! I will look into this and provide feedback. I can see the value of this for working with large data that you can't instantiate into a dataframe all at once. I will need to try things out and see what I find. It's the details of parsing the delimited data that is critical. I see there are some differences between the parsing options between pandas' and arrow's read_csv() methods, so I need to investigate those in more detail. I have a few things ahead of this at the moment, but I will dig in this week and see what I find, including how this would work in other access methods, should be very similar but I need to double check.

Thanks! Tom

HI @tomweber-sas , a small update from my side. Pyarrow does not support all the parameters of pandas read_csv like lineterminator,na_values or quoting. I have now reverted back to first using pandas to parse the chunk and then converting it into pyarrow for writing:

df = pd.read_csv(io.StringIO(chunk), index_col=idx_col, engine=eng, header=None, names=dvarlist,
      sep=colsep, lineterminator=rowsep, dtype=dts, na_values=miss,
      encoding='utf-8', quoting=quoting,dtype_backend='pyarrow', **kwargs)
df = df.astype(dts) # if a column is completely empty, it will be cast as null, so we need to set it (again) here
table = pa.Table.from_pandas(df)

Note the dtype_backend='pyarrow' for compatibility purposes.

Here is another update. I forgot to include the timestamp conversion that you do in the end. This now works properly. I also found that the dtype_backend parameter is not really necessary. I removed it since it might cause backward compatibility issues since it was added in a relatively recent version of pandas.

 df = pd.read_csv(io.StringIO(chunk), index_col=idx_col, engine=eng, header=None, names=dvarlist,
       sep=colsep, lineterminator=rowsep, dtype=dts, na_values=miss,# dtype_backend='pyarrow',
       encoding='utf-8', quoting=quoting, **kwargs) 
 df = df.astype(dts) # if a column is completely empty, it will be cast as null, so we need to set it (again) here
 if k_dts is None:  # don't override these if user provided their own dtypes
    for i in range(nvars):
       if vartype[i] == 'N':
          if varcat[i] in self._sb.sas_date_fmts + self._sb.sas_time_fmts + self._sb.sas_datetime_fmts:
             df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], errors='coerce',unit='ms')#pandas default ns unit is deprecated for parquet
 table = pa.Table.from_pandas(df,schema=pa_schema)

Thanks @rainermensing , I'm finishing up my other priorities and plan to spend time on this tomorrow! Having just glanced ant it all, this sounds promising. I was thinking of using pandas read_csv also, as I saw the other might not be able to handle the format being transferred. I had also saw that pandas has a to_parquet() method, which seemed like it might skip the steps of converting to arrow to then write out as parquet. But I haven't gotten further yet to look at that in any detail, or try this our at all - yet.

Tomorrow!

Tom

Hi Tom, I'm using pyarrow only because this enables me to write chunk by chunk into a single parquet file using the writer. With the pandas method you would have to write each chunk into separate partitions. This can be fine if you don't need a single file and are able to read them as a single file later but that might be an issue for some use cases.

Best Rainer

From: Tom Weber @.> Sent: Wednesday, May 8, 2024 10:03:57 PM To: sassoftware/saspy @.> Cc: Rainer Mensing @.>; Mention @.> Subject: Re: [sassoftware/saspy] sasdata2parquet (Issue #600)

Thanks @rainermensinghttps://github.com/rainermensing , I'm finishing up my other priorities and plan to spend time on this tomorrow! Having just glanced ant it all, this sounds promising. I was thinking of using pandas read_csv also, as I saw the other might not be able to handle the format being transferred. I had also saw that pandas has a to_parquet() method, which seemed like it might skip the steps of converting to arrow to then write out as parquet. But I haven't gotten further yet to look at that in any detail, or try this our at all - yet.

Tomorrow!

Tom

— Reply to this email directly, view it on GitHubhttps://github.com/sassoftware/saspy/issues/600#issuecomment-2101334564, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHRLJFYCTICSP4RGEWSE6I3ZBKAK3AVCNFSM6AAAAABHFZLQ2SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBRGMZTINJWGQ. You are receiving this because you were mentioned.Message ID: @.***>

Ok, I have the first working prototype for this. Not all the way there but it's working for the first time here. One thing is that it doesn't return a dataframe, so I don't think rolling it into the sd2df methods, as another method=, makes sense. It's really sd2pf() or something: sasdata2parquetfile(). Since it's just streaming the SAS data set to a parquet file via the same streaming as to_df.

I'm going to see about integrating it into the other access methods next, to be sure there's nothing out of the ordinary in the others, as well as add the base method to call (I'm directly calling the AM right now), then try to do some performance assessments. I'll post again when I have all of that going. Here's a bit from what I have with this first pass:

sas._io.sasdata2dataframePARQET('cars','sashelp',tempfile='parquet1')

>>> pa.parquet.read_table('parquet1')
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>>
>>> df = pd.read_parquet('parquet1')
>>> df
      Make                    Model   Type  Origin DriveTrain     MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0    Acura                      MDX    SUV    Asia        All  36945.0  33337.0         3.5        6.0       265.0      17.0         23.0  4451.0      106.0   189.0
1    Acura           RSX Type S 2dr  Sedan    Asia      Front  23820.0  21761.0         2.0        4.0       200.0      24.0         31.0  2778.0      101.0   172.0
2    Acura                  TSX 4dr  Sedan    Asia      Front  26990.0  24647.0         2.4        4.0       200.0      22.0         29.0  3230.0      105.0   183.0
3    Acura                   TL 4dr  Sedan    Asia      Front  33195.0  30299.0         3.2        6.0       270.0      20.0         28.0  3575.0      108.0   186.0
4    Acura               3.5 RL 4dr  Sedan    Asia      Front  43755.0  39014.0         3.5        6.0       225.0      18.0         24.0  3880.0      115.0   197.0
..     ...                      ...    ...     ...        ...      ...      ...         ...        ...         ...       ...          ...     ...        ...     ...
423  Volvo  C70 LPT convertible 2dr  Sedan  Europe      Front  40565.0  38203.0         2.4        5.0       197.0      21.0         28.0  3450.0      105.0   186.0
424  Volvo  C70 HPT convertible 2dr  Sedan  Europe      Front  42565.0  40083.0         2.3        5.0       242.0      20.0         26.0  3450.0      105.0   186.0
425  Volvo               S80 T6 4dr  Sedan  Europe      Front  45210.0  42573.0         2.9        6.0       268.0      19.0         26.0  3653.0      110.0   190.0
426  Volvo                      V40  Wagon  Europe      Front  26135.0  24641.0         1.9        4.0       170.0      22.0         29.0  2822.0      101.0   180.0
427  Volvo                     XC70  Wagon  Europe        All  35145.0  33112.0         2.5        5.0       208.0      20.0         27.0  3823.0      109.0   186.0

[428 rows x 15 columns]
>>>

@tomweber-sas thanks a lot for taking this further! Yes, I was just piggybacking on the 2dataframe method because this was the fastest way for me to get a POC without having to understand much of how the stream is constructed. I am sure you can work out a more integrated way. How are you parsing the stream now? Using pandas still?

I would be grateful if you could share your progress, Maybe you can open a feature branch? We want to integrate this approach into our project asap since it would mean a lot of performance gains. I think we could also give you valuable feedback by testing it in out pipeline.

ok, I have this in a new branch called parquet. You can take a look and start trying it out, especially to see what kind of performance you get. I have it in all of the access methods, though I know you're only using IOM, that's fine. I've added the following methods:

SASsession object: sd2pq() sasdata2parquet()

SASdata object: to_pq()

They all have the following signature:

    def sd2pq(self, table: str, libref: str ='', dsopts: dict = None,
                        parquetfile: str=None, pa_schema: 'pa.schema' = None,
                        rowsep: str = '\x01', colsep: str = '\x02',
                        rowrep: str = ' ',    colrep: str = ' ',
                        **kwargs) -> '<Pandas Data Frame object>':

So, you can user any of those to try it out. I haven't tried that pa.schema, and I'm thinking it's not really something to allow, but I'm not sure yet. Since I determine the data types of the dataframe based upon SAS metadata, and I allow overrides with the following extra parameters:

       These two options are for advanced usage. They override how saspy imports data. For more info
       see https://sassoftware.github.io/saspy/advanced-topics.html#advanced-sd2df-and-df2sd-techniques

       dtype   - this is the parameter to Pandas read_csv, overriding what saspy generates and uses
       my_fmts - bool: if True, overrides the formats saspy would use, using those on the data set or in dsopts=

I don't think that having another schema to use that may not match what I'm generating would be a good idea. But, I haven't looked into that yet.

Here's my test source that I was using to try this out. It's working for a first pass.

import pyarrow as pa
import saspy

sasi = saspy.SASsession(cfgname='iomj');   sasi
sass = saspy.SASsession(cfgname='sdssas'); sass
sash = saspy.SASsession(cfgname='dotav');  sash

sasi.sasdata2parquet('cars','sashelp', parquetfile='parqueti')
sass.sasdata2parquet('cars','sashelp', parquetfile='parquets')
sash.sasdata2parquet('cars','sashelp', parquetfile='parqueth')

sasi.sd2pq('cars','sashelp', parquetfile='parqueti')
sass.sd2pq('cars','sashelp', parquetfile='parquets')
sash.sd2pq('cars','sashelp', parquetfile='parqueth')

ti = pa.parquet.read_table('parqueti')
ts = pa.parquet.read_table('parquets')
th = pa.parquet.read_table('parqueth')

dfi = ti.to_pandas()
dfs = ts.to_pandas()
dfh = th.to_pandas()

ti
ts
th

dfi
dfs
dfh

and the output I get for that code is

tom64-7> python3
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import saspy
>>> sasi = saspy.SASsession(cfgname='iomj');   sasi
SAS Connection established. Subprocess id is 1494487

Access Method         = IOM
SAS Config name       = iomj
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_workBE29001627CA_tom64-7/SAS_work1083001627CA_tom64-7/
SAS Version           = 9.04.01M8P01182023
SASPy Version         = 5.12.0
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 1451978

>>> sass = saspy.SASsession(cfgname='sdssas'); sass
SAS Connection established. Subprocess id is 1494518

Access Method         = STDIO
SAS Config name       = sdssas
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_workB6E90016CE13_tom64-7/
SAS Version           = 9.04.01M8D01182023
SASPy Version         = 5.12.0
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = latin1
Python Encoding value = latin_1
SAS process Pid value = 1494547

>>> sash = saspy.SASsession(cfgname='dotav');  sash
SAS server started using Context Data Mining compute context with SESSION_ID=fee344ef-3290-442d-8184-6068bf141890-ses0000
Access Method         = HTTP
SAS Config name       = dotav
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /opt/sas/viya/config/var/tmp/compsrv/default/fee344ef-3290-442d-8184-6068bf141890/SAS_work2341000001E1_sas-compute-server-92eda991-4439-4729-a284-cf1ddbf52d6e-4/
SAS Version           = V.04.00M0P05012024
SASPy Version         = 5.12.0
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 481

>>> sasi.sasdata2parquet('cars','sashelp', parquetfile='parqueti')
'parqueti'
>>> sass.sasdata2parquet('cars','sashelp', parquetfile='parquets')
'parquets'
>>> sash.sasdata2parquet('cars','sashelp', parquetfile='parqueth')
'parqueth'
>>> sasi.sd2pq('cars','sashelp', parquetfile='parqueti')
'parqueti'
>>> sass.sd2pq('cars','sashelp', parquetfile='parquets')
'parquets'
>>> sash.sd2pq('cars','sashelp', parquetfile='parqueth')
'parqueth'
>>> ti = pa.parquet.read_table('parqueti')
>>> ts = pa.parquet.read_table('parquets')
>>> th = pa.parquet.read_table('parqueth')
>>> dfi = ti.to_pandas()
>>> dfs = ts.to_pandas()
>>> dfh = th.to_pandas()
>>> ti
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>> ts
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>> th
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>> dfi
      Make                    Model   Type  Origin DriveTrain     MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0    Acura                      MDX    SUV    Asia        All  36945.0  33337.0         3.5        6.0       265.0      17.0         23.0  4451.0      106.0   189.0
1    Acura           RSX Type S 2dr  Sedan    Asia      Front  23820.0  21761.0         2.0        4.0       200.0      24.0         31.0  2778.0      101.0   172.0
2    Acura                  TSX 4dr  Sedan    Asia      Front  26990.0  24647.0         2.4        4.0       200.0      22.0         29.0  3230.0      105.0   183.0
3    Acura                   TL 4dr  Sedan    Asia      Front  33195.0  30299.0         3.2        6.0       270.0      20.0         28.0  3575.0      108.0   186.0
4    Acura               3.5 RL 4dr  Sedan    Asia      Front  43755.0  39014.0         3.5        6.0       225.0      18.0         24.0  3880.0      115.0   197.0
..     ...                      ...    ...     ...        ...      ...      ...         ...        ...         ...       ...          ...     ...        ...     ...
423  Volvo  C70 LPT convertible 2dr  Sedan  Europe      Front  40565.0  38203.0         2.4        5.0       197.0      21.0         28.0  3450.0      105.0   186.0
424  Volvo  C70 HPT convertible 2dr  Sedan  Europe      Front  42565.0  40083.0         2.3        5.0       242.0      20.0         26.0  3450.0      105.0   186.0
425  Volvo               S80 T6 4dr  Sedan  Europe      Front  45210.0  42573.0         2.9        6.0       268.0      19.0         26.0  3653.0      110.0   190.0
426  Volvo                      V40  Wagon  Europe      Front  26135.0  24641.0         1.9        4.0       170.0      22.0         29.0  2822.0      101.0   180.0
427  Volvo                     XC70  Wagon  Europe        All  35145.0  33112.0         2.5        5.0       208.0      20.0         27.0  3823.0      109.0   186.0

[428 rows x 15 columns]
>>> dfs
      Make                    Model   Type  Origin DriveTrain     MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0    Acura                      MDX    SUV    Asia        All  36945.0  33337.0         3.5        6.0       265.0      17.0         23.0  4451.0      106.0   189.0
1    Acura           RSX Type S 2dr  Sedan    Asia      Front  23820.0  21761.0         2.0        4.0       200.0      24.0         31.0  2778.0      101.0   172.0
2    Acura                  TSX 4dr  Sedan    Asia      Front  26990.0  24647.0         2.4        4.0       200.0      22.0         29.0  3230.0      105.0   183.0
3    Acura                   TL 4dr  Sedan    Asia      Front  33195.0  30299.0         3.2        6.0       270.0      20.0         28.0  3575.0      108.0   186.0
4    Acura               3.5 RL 4dr  Sedan    Asia      Front  43755.0  39014.0         3.5        6.0       225.0      18.0         24.0  3880.0      115.0   197.0
..     ...                      ...    ...     ...        ...      ...      ...         ...        ...         ...       ...          ...     ...        ...     ...
423  Volvo  C70 LPT convertible 2dr  Sedan  Europe      Front  40565.0  38203.0         2.4        5.0       197.0      21.0         28.0  3450.0      105.0   186.0
424  Volvo  C70 HPT convertible 2dr  Sedan  Europe      Front  42565.0  40083.0         2.3        5.0       242.0      20.0         26.0  3450.0      105.0   186.0
425  Volvo               S80 T6 4dr  Sedan  Europe      Front  45210.0  42573.0         2.9        6.0       268.0      19.0         26.0  3653.0      110.0   190.0
426  Volvo                      V40  Wagon  Europe      Front  26135.0  24641.0         1.9        4.0       170.0      22.0         29.0  2822.0      101.0   180.0
427  Volvo                     XC70  Wagon  Europe        All  35145.0  33112.0         2.5        5.0       208.0      20.0         27.0  3823.0      109.0   186.0

[428 rows x 15 columns]
>>> dfh
      Make                    Model   Type  Origin DriveTrain     MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0    Acura                      MDX    SUV    Asia        All  36945.0  33337.0         3.5        6.0       265.0      17.0         23.0  4451.0      106.0   189.0
1    Acura           RSX Type S 2dr  Sedan    Asia      Front  23820.0  21761.0         2.0        4.0       200.0      24.0         31.0  2778.0      101.0   172.0
2    Acura                  TSX 4dr  Sedan    Asia      Front  26990.0  24647.0         2.4        4.0       200.0      22.0         29.0  3230.0      105.0   183.0
3    Acura                   TL 4dr  Sedan    Asia      Front  33195.0  30299.0         3.2        6.0       270.0      20.0         28.0  3575.0      108.0   186.0
4    Acura               3.5 RL 4dr  Sedan    Asia      Front  43755.0  39014.0         3.5        6.0       225.0      18.0         24.0  3880.0      115.0   197.0
..     ...                      ...    ...     ...        ...      ...      ...         ...        ...         ...       ...          ...     ...        ...     ...
423  Volvo  C70 LPT convertible 2dr  Sedan  Europe      Front  40565.0  38203.0         2.4        5.0       197.0      21.0         28.0  3450.0      105.0   186.0
424  Volvo  C70 HPT convertible 2dr  Sedan  Europe      Front  42565.0  40083.0         2.3        5.0       242.0      20.0         26.0  3450.0      105.0   186.0
425  Volvo               S80 T6 4dr  Sedan  Europe      Front  45210.0  42573.0         2.9        6.0       268.0      19.0         26.0  3653.0      110.0   190.0
426  Volvo                      V40  Wagon  Europe      Front  26135.0  24641.0         1.9        4.0       170.0      22.0         29.0  2822.0      101.0   180.0
427  Volvo                     XC70  Wagon  Europe        All  35145.0  33112.0         2.5        5.0       208.0      20.0         27.0  3823.0      109.0   186.0

[428 rows x 15 columns]
>>>

@rainermensing have you had a chance to try this out? Is it behaving, performing any better?

Hi @tomweber-sas , yes performance has definitly improved significantly. We ran a test with all our staging job this weekend. There was just one new error that did not occur in sasdata2dataframe. It happened in sockout.read(chunk_size) method: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 98312: invalid continuation byte. I will investigate this more tomorrow.

I also openend a pull request with a couple features that are very central for our case that would certainly also benefit other users. Feel free to provide your feedback.

@tomweber-sas the encoding error was on me, we had an Exception handling for this that I missed. It's simply that some of our tables (which usually should) do not use utf-8.

@rainermensing , that's good to hear, thanks for investigating that!. I saw the PR, and should be able to check it out today.

@rainermensing I was able to integrate these changes, from the PR into the code base. I have it all in here, for all 3 access methods. I tweaked the signatures of the methods to have the parquet path first followed by the table/libref/dsopts and then all of the parquet options you've added. That way the first few are positional and you don't need to code the keywords (more how you had it). sas.sasdata2parquet('parquet_file1','cars','sashelp', partitioned=True) sas.sd2pq('parquet_file2','cars','sashelp') for instance.

Give it a spin and see that it's all working as you expect. I haven't been able to test it fully, but it's working for the use cases I had before. So, so far so good!

Tom

@tomweber-sas Thank you a lot! I hope I am not chasing you around too much haha. I just opened a new issue with a problem that we have had to build a workaround for in the past and that was not on my radar when I developed this method. Other than that I have also made a few other minor tweaks while testing 2parquet more this week. I would integrate this into my fork and make a pull request when I think I am ready some time this week, hopefully.

All good :) Yes I se the other issue. Will look when I get a chance. I'll check out the new tweaks when you get them ready.

ok @rainermensing, I've integrated your latest PR changes into the other access methods as well as into sasbase and sasdata as methods off of both SASsession and SASdata objects (including the aliases, sd2pq and to_pq). I had to change the doc strings to get my doc to build; the API doc is generated from those in the code. There are some other minor tweaks too, so if you will, take the code that's in this branch and then run your tests with it so that you're testing what I'm looking to merge in to main. Once you're satisfied with it, I can merge it in and create a new production version with it. I haven't tested out all of the various parameters that there are now, just the basics, so I'll need you to vet all of that.

Thanks! Tom

Hi @tomweber-sas, I just tested your branch and it seems to work flawlessly, thank you a lot! I guess this will be the last part I had to play in all thi. Any future bugs are yours to deal with I'm afraid. So, thank you again for your fantastic and swift support! All the best, Rainer

LOL! I haven't merged it into main yet. I can still @ sign you to help with bugs that are reported :) Seriously, I will spend more time validating this before merging in. But thank you for enhancing SASPy with this, I think it's a significant enhancement!

Well, today is my last day at the customer for which I developed this feature. After this, I won't have access to any SAS Server so I won't be able to test and hence fix anything. But feel free to tag me in any issues and I will try to help where I can!

Ok! I just published V5.15.0 with this method in it. It's here, on PyPI and will be on conda-forge when it's bot runs. Thanks again for contributing to make SASPy better! I enjoyed working on this with you. I'll close this, and I'll try to get back to the other issue with the out of bounds datatimes next!

Thanks again, Tom

sassoftware / saspy

sasdata2parquet #600

Proposed solution:

Results

Conclusion

607