@rainermensing Thanks for this POC! I will look into this and provide feedback. I can see the value of this for working with large data that you can't instantiate into a dataframe all at once. I will need to try things out and see what I find. It's the details of parsing the delimited data that are critical. I see there are some differences in the parsing options between pandas' and pyarrow's read_csv() methods, so I need to investigate those in more detail. I have a few things ahead of this at the moment, but I will dig in this week and see what I find, including how this would work in the other access methods; it should be very similar, but I need to double check.
Thanks! Tom
Hi @tomweber-sas , a small update from my side. Pyarrow's read_csv does not support all the parameters of pandas' read_csv, like lineterminator, na_values, or quoting. I have now reverted to first using pandas to parse the chunk and then converting it into pyarrow for writing:
df = pd.read_csv(io.StringIO(chunk), index_col=idx_col, engine=eng, header=None, names=dvarlist,
                 sep=colsep, lineterminator=rowsep, dtype=dts, na_values=miss,
                 encoding='utf-8', quoting=quoting, dtype_backend='pyarrow', **kwargs)
df = df.astype(dts)  # if a column is completely empty, it will be cast as null, so we need to set it (again) here
table = pa.Table.from_pandas(df)
Note the dtype_backend='pyarrow' parameter, for compatibility purposes.
Here is another update. I forgot to include the timestamp conversion that you do at the end; this now works properly. I also found that the dtype_backend parameter is not really necessary. I removed it, since it might cause backward compatibility issues given that it was only added in a relatively recent version of pandas.
df = pd.read_csv(io.StringIO(chunk), index_col=idx_col, engine=eng, header=None, names=dvarlist,
                 sep=colsep, lineterminator=rowsep, dtype=dts, na_values=miss,  # dtype_backend='pyarrow',
                 encoding='utf-8', quoting=quoting, **kwargs)
df = df.astype(dts)  # if a column is completely empty, it will be cast as null, so we need to set it (again) here
if k_dts is None:  # don't override these if user provided their own dtypes
    for i in range(nvars):
        if vartype[i] == 'N':
            if varcat[i] in self._sb.sas_date_fmts + self._sb.sas_time_fmts + self._sb.sas_datetime_fmts:
                df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], errors='coerce', unit='ms')  # pandas' default ns unit is deprecated for parquet
table = pa.Table.from_pandas(df, schema=pa_schema)
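For context, pa_schema here is built from the SAS column metadata elsewhere in the method. A minimal hypothetical sketch of such a construction (date_time_fmts is an assumed stand-in for the union of the sas_*_fmts lists, not an actual saspy variable):
import pyarrow as pa

fields = []
for name, vtype, vcat in zip(dvarlist, vartype, varcat):
    if vtype == 'N':
        # date/time-formatted numerics get timestamp('ms') to match the
        # to_datetime(unit='ms') conversion above; other numerics are doubles
        if vcat in date_time_fmts:
            fields.append(pa.field(name, pa.timestamp('ms')))
        else:
            fields.append(pa.field(name, pa.float64()))
    else:
        # character columns map to string
        fields.append(pa.field(name, pa.string()))
pa_schema = pa.schema(fields)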
Thanks @rainermensing , I'm finishing up my other priorities and plan to spend time on this tomorrow! Having just glanced at it all, this sounds promising. I was thinking of using pandas read_csv also, as I saw the other might not be able to handle the format being transferred. I had also seen that pandas has a to_parquet() method, which seemed like it might skip the steps of converting to arrow to then write out as parquet. But I haven't gotten further yet to look at that in any detail, or try this out at all - yet.
Tomorrow!
Tom
Hi Tom, I'm using pyarrow only because it enables me to write chunk by chunk into a single parquet file using the writer. With the pandas method you would have to write each chunk into a separate partition. That can be fine if you don't need a single file and are able to read the partitions as a single dataset later, but it might be an issue for some use cases. The sketch below illustrates the difference.
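A minimal sketch of that difference, with hypothetical in-memory chunk tables (not saspy code):
import pyarrow as pa
import pyarrow.parquet as pq

chunks = [pa.table({'x': [1, 2]}), pa.table({'x': [3, 4]})]  # hypothetical chunks

# pandas route: each chunk becomes its own partition file
for i, t in enumerate(chunks):
    t.to_pandas().to_parquet(f'part-{i}.parquet')

# pyarrow route: one ParquetWriter appends every chunk to a single file
with pq.ParquetWriter('single.parquet', chunks[0].schema) as writer:
    for t in chunks:
        writer.write_table(t)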
Best Rainer
Ok, I have the first working prototype for this. It's not all the way there, but it's working for the first time here. One thing is that it doesn't return a dataframe, so I don't think rolling it into the sd2df methods, as another method=, makes sense. It's really sd2pf() or something: sasdata2parquetfile(), since it's just streaming the SAS data set to a parquet file via the same streaming as to_df.
I'm going to see about integrating it into the other access methods next, to be sure there's nothing out of the ordinary in the others, as well as add the base method to call (I'm directly calling the AM right now), then try to do some performance assessments. I'll post again when I have all of that going. Here's a bit from what I have with this first pass:
sas._io.sasdata2dataframePARQET('cars','sashelp',tempfile='parquet1')
>>> pa.parquet.read_table('parquet1')
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>>
>>> df = pd.read_parquet('parquet1')
>>> df
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 Volvo C70 LPT convertible 2dr Sedan Europe Front 40565.0 38203.0 2.4 5.0 197.0 21.0 28.0 3450.0 105.0 186.0
424 Volvo C70 HPT convertible 2dr Sedan Europe Front 42565.0 40083.0 2.3 5.0 242.0 20.0 26.0 3450.0 105.0 186.0
425 Volvo S80 T6 4dr Sedan Europe Front 45210.0 42573.0 2.9 6.0 268.0 19.0 26.0 3653.0 110.0 190.0
426 Volvo V40 Wagon Europe Front 26135.0 24641.0 1.9 4.0 170.0 22.0 29.0 2822.0 101.0 180.0
427 Volvo XC70 Wagon Europe All 35145.0 33112.0 2.5 5.0 208.0 20.0 27.0 3823.0 109.0 186.0
[428 rows x 15 columns]
>>>
@tomweber-sas thanks a lot for taking this further! Yes, I was just piggybacking on the 2dataframe method because this was the fastest way for me to get a POC without having to understand much of how the stream is constructed. I am sure you can work out a more integrated way. How are you parsing the stream now? Still using pandas?
I would be grateful if you could share your progress. Maybe you can open a feature branch? We want to integrate this approach into our project asap since it would mean a lot of performance gains. I think we could also give you valuable feedback by testing it in our pipeline.
ok, I have this in a new branch called parquet. You can take a look and start trying it out, especially to see what kind of performance you get. I have it in all of the access methods, though I know you're only using IOM, that's fine. I've added the following methods:
SASsession object: sd2pq() sasdata2parquet()
SASdata object: to_pq()
They all have the following signature:
def sd2pq(self, table: str, libref: str = '', dsopts: dict = None,
          parquetfile: str = None, pa_schema: 'pa.schema' = None,
          rowsep: str = '\x01', colsep: str = '\x02',
          rowrep: str = ' ', colrep: str = ' ',
          **kwargs) -> '<Pandas Data Frame object>':
So, you can use any of those to try it out. I haven't tried that pa_schema parameter, and I'm thinking it's not really something to allow, but I'm not sure yet, since I determine the data types of the dataframe based upon the SAS metadata, and I allow overrides with the following extra parameters:
These two options are for advanced usage. They override how saspy imports data. For more info
see https://sassoftware.github.io/saspy/advanced-topics.html#advanced-sd2df-and-df2sd-techniques
dtype - this is the parameter to Pandas read_csv, overriding what saspy generates and uses
my_fmts - bool: if True, overrides the formats saspy would use, using those on the data set or in dsopts=
I don't think that having another schema to use that may not match what I'm generating would be a good idea. But, I haven't looked into that yet.
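For example, using the dtype= override (a hedged sketch; the column choice is hypothetical, and dtype is passed straight through to pandas read_csv as described above):
# force Cylinders to a nullable integer instead of the double
# saspy would derive from the SAS metadata
sas.sd2pq('cars', 'sashelp', parquetfile='cars.parquet',
          dtype={'Cylinders': 'Int64'})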
Here's my test source that I was using to try this out. It's working for a first pass.
import pyarrow as pa
import saspy
sasi = saspy.SASsession(cfgname='iomj'); sasi
sass = saspy.SASsession(cfgname='sdssas'); sass
sash = saspy.SASsession(cfgname='dotav'); sash
sasi.sasdata2parquet('cars','sashelp', parquetfile='parqueti')
sass.sasdata2parquet('cars','sashelp', parquetfile='parquets')
sash.sasdata2parquet('cars','sashelp', parquetfile='parqueth')
sasi.sd2pq('cars','sashelp', parquetfile='parqueti')
sass.sd2pq('cars','sashelp', parquetfile='parquets')
sash.sd2pq('cars','sashelp', parquetfile='parqueth')
ti = pa.parquet.read_table('parqueti')
ts = pa.parquet.read_table('parquets')
th = pa.parquet.read_table('parqueth')
dfi = ti.to_pandas()
dfs = ts.to_pandas()
dfh = th.to_pandas()
ti
ts
th
dfi
dfs
dfh
and the output I get for that code is
tom64-7> python3
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import saspy
>>> sasi = saspy.SASsession(cfgname='iomj'); sasi
SAS Connection established. Subprocess id is 1494487
Access Method = IOM
SAS Config name = iomj
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /sastmp/SAS_workBE29001627CA_tom64-7/SAS_work1083001627CA_tom64-7/
SAS Version = 9.04.01M8P01182023
SASPy Version = 5.12.0
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 1451978
>>> sass = saspy.SASsession(cfgname='sdssas'); sass
SAS Connection established. Subprocess id is 1494518
Access Method = STDIO
SAS Config name = sdssas
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /sastmp/SAS_workB6E90016CE13_tom64-7/
SAS Version = 9.04.01M8D01182023
SASPy Version = 5.12.0
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = latin1
Python Encoding value = latin_1
SAS process Pid value = 1494547
>>> sash = saspy.SASsession(cfgname='dotav'); sash
SAS server started using Context Data Mining compute context with SESSION_ID=fee344ef-3290-442d-8184-6068bf141890-ses0000
Access Method = HTTP
SAS Config name = dotav
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /opt/sas/viya/config/var/tmp/compsrv/default/fee344ef-3290-442d-8184-6068bf141890/SAS_work2341000001E1_sas-compute-server-92eda991-4439-4729-a284-cf1ddbf52d6e-4/
SAS Version = V.04.00M0P05012024
SASPy Version = 5.12.0
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 481
>>> sasi.sasdata2parquet('cars','sashelp', parquetfile='parqueti')
'parqueti'
>>> sass.sasdata2parquet('cars','sashelp', parquetfile='parquets')
'parquets'
>>> sash.sasdata2parquet('cars','sashelp', parquetfile='parqueth')
'parqueth'
>>> sasi.sd2pq('cars','sashelp', parquetfile='parqueti')
'parqueti'
>>> sass.sd2pq('cars','sashelp', parquetfile='parquets')
'parquets'
>>> sash.sd2pq('cars','sashelp', parquetfile='parqueth')
'parqueth'
>>> ti = pa.parquet.read_table('parqueti')
>>> ts = pa.parquet.read_table('parquets')
>>> th = pa.parquet.read_table('parqueth')
>>> dfi = ti.to_pandas()
>>> dfs = ts.to_pandas()
>>> dfh = th.to_pandas()
>>> ti
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>> ts
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>> th
pyarrow.Table
Make: string
Model: string
Type: string
Origin: string
DriveTrain: string
MSRP: double
Invoice: double
EngineSize: double
Cylinders: double
Horsepower: double
MPG_City: double
MPG_Highway: double
Weight: double
Wheelbase: double
Length: double
----
Make: [["Acura","Acura","Acura","Acura","Acura",...,"Volvo","Volvo","Volvo","Volvo","Volvo"]]
Model: [["MDX","RSX Type S 2dr","TSX 4dr","TL 4dr","3.5 RL 4dr",...,"C70 LPT convertible 2dr","C70 HPT convertible 2dr","S80 T6 4dr","V40","XC70"]]
Type: [["SUV","Sedan","Sedan","Sedan","Sedan",...,"Sedan","Sedan","Sedan","Wagon","Wagon"]]
Origin: [["Asia","Asia","Asia","Asia","Asia",...,"Europe","Europe","Europe","Europe","Europe"]]
DriveTrain: [["All","Front","Front","Front","Front",...,"Front","Front","Front","Front","All"]]
MSRP: [[36945,23820,26990,33195,43755,...,40565,42565,45210,26135,35145]]
Invoice: [[33337,21761,24647,30299,39014,...,38203,40083,42573,24641,33112]]
EngineSize: [[3.5,2,2.4,3.2,3.5,...,2.4,2.3,2.9,1.9,2.5]]
Cylinders: [[6,4,4,6,6,...,5,5,6,4,5]]
Horsepower: [[265,200,200,270,225,...,197,242,268,170,208]]
...
>>> dfi
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 Volvo C70 LPT convertible 2dr Sedan Europe Front 40565.0 38203.0 2.4 5.0 197.0 21.0 28.0 3450.0 105.0 186.0
424 Volvo C70 HPT convertible 2dr Sedan Europe Front 42565.0 40083.0 2.3 5.0 242.0 20.0 26.0 3450.0 105.0 186.0
425 Volvo S80 T6 4dr Sedan Europe Front 45210.0 42573.0 2.9 6.0 268.0 19.0 26.0 3653.0 110.0 190.0
426 Volvo V40 Wagon Europe Front 26135.0 24641.0 1.9 4.0 170.0 22.0 29.0 2822.0 101.0 180.0
427 Volvo XC70 Wagon Europe All 35145.0 33112.0 2.5 5.0 208.0 20.0 27.0 3823.0 109.0 186.0
[428 rows x 15 columns]
>>> dfs
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 Volvo C70 LPT convertible 2dr Sedan Europe Front 40565.0 38203.0 2.4 5.0 197.0 21.0 28.0 3450.0 105.0 186.0
424 Volvo C70 HPT convertible 2dr Sedan Europe Front 42565.0 40083.0 2.3 5.0 242.0 20.0 26.0 3450.0 105.0 186.0
425 Volvo S80 T6 4dr Sedan Europe Front 45210.0 42573.0 2.9 6.0 268.0 19.0 26.0 3653.0 110.0 190.0
426 Volvo V40 Wagon Europe Front 26135.0 24641.0 1.9 4.0 170.0 22.0 29.0 2822.0 101.0 180.0
427 Volvo XC70 Wagon Europe All 35145.0 33112.0 2.5 5.0 208.0 20.0 27.0 3823.0 109.0 186.0
[428 rows x 15 columns]
>>> dfh
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 Volvo C70 LPT convertible 2dr Sedan Europe Front 40565.0 38203.0 2.4 5.0 197.0 21.0 28.0 3450.0 105.0 186.0
424 Volvo C70 HPT convertible 2dr Sedan Europe Front 42565.0 40083.0 2.3 5.0 242.0 20.0 26.0 3450.0 105.0 186.0
425 Volvo S80 T6 4dr Sedan Europe Front 45210.0 42573.0 2.9 6.0 268.0 19.0 26.0 3653.0 110.0 190.0
426 Volvo V40 Wagon Europe Front 26135.0 24641.0 1.9 4.0 170.0 22.0 29.0 2822.0 101.0 180.0
427 Volvo XC70 Wagon Europe All 35145.0 33112.0 2.5 5.0 208.0 20.0 27.0 3823.0 109.0 186.0
[428 rows x 15 columns]
>>>
@rainermensing have you had a chance to try this out? Is it behaving, performing any better?
Hi @tomweber-sas , yes, performance has definitely improved significantly. We ran a test with all our staging jobs this weekend. There was just one new error that did not occur in sasdata2dataframe. It happened in the sockout.read(chunk_size) method:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 98312: invalid continuation byte
I will investigate this more tomorrow.
I also opened a pull request with a couple of features that are very central to our use case and that would certainly also benefit other users. Feel free to provide your feedback.
@tomweber-sas the encoding error was on me; we had exception handling for this that I missed. It's simply that some of our tables (which normally should be utf-8) are not.
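For anyone hitting the same error, a generic sketch (not saspy's actual code) of that kind of guard, assuming a hypothetical bytes-level chunk iterator:
def decode_chunks(raw_chunks, encoding='utf-8', fallback='latin-1'):
    # tolerate tables whose bytes are not valid utf-8 instead of
    # aborting the whole transfer
    for raw in raw_chunks:
        try:
            yield raw.decode(encoding)
        except UnicodeDecodeError:
            yield raw.decode(fallback, errors='replace')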
@rainermensing , that's good to hear, thanks for investigating that! I saw the PR, and should be able to check it out today.
@rainermensing I was able to integrate these changes from the PR into the code base. I have it all in here, for all 3 access methods. I tweaked the signatures of the methods to have the parquet path first, followed by the table/libref/dsopts, and then all of the parquet options you've added. That way the first few are positional and you don't need to code the keywords (more like how you had it).
sas.sasdata2parquet('parquet_file1','cars','sashelp', partitioned=True)
sas.sd2pq('parquet_file2','cars','sashelp')
for instance.
Give it a spin and see that it's all working as you expect. I haven't been able to test it fully, but it's working for the use cases I had before. So, so far so good!
Tom
@tomweber-sas Thank you a lot! I hope I am not chasing you around too much haha. I just opened a new issue about a problem that we have had to build a workaround for in the past and that was not on my radar when I developed this method. Other than that, I have also made a few other minor tweaks while testing 2parquet more this week. I will integrate these into my fork and make a pull request when I think I am ready, hopefully sometime this week.
All good :) Yes, I see the other issue. I will look when I get a chance. I'll check out the new tweaks when you get them ready.
ok @rainermensing, I've integrated your latest PR changes into the other access methods as well as into sasbase and sasdata as methods on both the SASsession and SASdata objects (including the aliases, sd2pq and to_pq). I had to change the doc strings to get my doc to build; the API doc is generated from those in the code. There are some other minor tweaks too, so if you will, take the code that's in this branch and then run your tests with it, so that you're testing what I'm looking to merge into main. Once you're satisfied with it, I can merge it in and create a new production version with it. I haven't tested out all of the various parameters that there are now, just the basics, so I'll need you to vet all of that.
Thanks! Tom
Hi @tomweber-sas, I just tested your branch and it seems to work flawlessly, thank you a lot! I guess this was the last part I had to play in all this. Any future bugs are yours to deal with, I'm afraid. So, thank you again for your fantastic and swift support! All the best, Rainer
LOL! I haven't merged it into main yet. I can still @ sign you to help with bugs that are reported :) Seriously, I will spend more time validating this before merging in. But thank you for enhancing SASPy with this, I think it's a significant enhancement!
Well, today is my last day at the customer for which I developed this feature. After this, I won't have access to any SAS Server so I won't be able to test and hence fix anything. But feel free to tag me in any issues and I will try to help where I can!
Ok! I just published V5.15.0 with this method in it. It's here, on PyPI, and will be on conda-forge when its bot runs. Thanks again for contributing to make SASPy better! I enjoyed working on this with you. I'll close this, and I'll try to get back to the other issue with the out of bounds datetimes next!
Thanks again, Tom
The current approach to staging large tables (ones that do not fit into memory) from SAS into parquet files is to paginate through them, i.e. using the sasdata2dataframe method. This is relatively inefficient, as the overhead of opening the connection and potentially reapplying filters is considerable.
Proposed solution:
I have looked into the actual IO classes and modified the sasdata2dataframeDISK method of the SASsessionIOM class such that, instead of reading the entire stream into a pandas dataframe, the stream is read in chunks into a pyarrow table and written out to a parquet file.
See a first working draft below:
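A minimal sketch of the approach (illustrative only, not the actual patch), assuming a chunk iterator over the socket stream and the pandas parsing shown earlier in the thread:
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def stream_to_parquet(chunks, parquetfile, dvarlist, dts, pa_schema,
                      colsep='\x02', rowsep='\x01'):
    # parse each streamed CSV chunk with pandas and append it to a
    # single parquet file via pyarrow's ParquetWriter
    with pq.ParquetWriter(parquetfile, pa_schema) as writer:
        for chunk in chunks:  # hypothetical iterator over the socket stream
            df = pd.read_csv(io.StringIO(chunk), header=None, names=dvarlist,
                             sep=colsep, lineterminator=rowsep, dtype=dts)
            writer.write_table(pa.Table.from_pandas(df, schema=pa_schema))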
Results
With my small benchmark table (<1GB), this is already >3x faster than pagination. I expect the difference to be even greater for very large tables that apply filters.
Conclusion
I think this would really be a worthwhile addition to the saspy project. Staging tables from SAS into data lakes is a common issue, and this would help reduce loading times for a lot of users.
I would like to know if there is any immediate feedback on this draft, including points I missed. Please be critical.
If you think this would be a good addition, I could open a pull request with the initial draft method for the SASsessionIOM class. However, carefully integrating this into the other IO classes and beyond is not within my current capacity; I would need to leave that to you.