Closed AurelienThiery closed 1 year ago
hey @AurelienThiery, I won't be back in the office till Monday. But I'll take a look at this then! Thanks, Tom
Hey @AurelienThiery, I'm back and have been looking at the traceback and my code to see if I can account for any kind of memory leak, but I don't see anything. Since this is just filling up mem until it terminates the process, we can't see anything that might help. Have you tried anything diagnostic? Like using obs to only load half or a quarter of the table and then (assuming that worked), look at the log to see if there was anything out of the ordinary? And see how much memory Pandas uses for however much you loaded?
To explain the traceback a little, In my sd2df() once I've figured out all the metadata about the SAS dataset, and I start to stream the data, I call pandas read_csv (really, read delimited data with a ton of options you control) and pass it a filelike object (my socket reading object), and pandas just calls my read() (as if it were read() from a file), reading the stream and creating the dataframe from the data it's reading. I'm not seeing anything in my code that would be keeping any of the data around (helping fill memory while pandas is trying to build the df). So, I'm not sure why pandas is running out of memory. Though I have no idea how it uses mem while building the df, or how much mem or the objects take up for the df. I'll try to see if I can figure a way to debug the memory consumption. But, I think the next best thig is to get a better idea of if there's something going wrong for your case; like the other where there was an error in SAS but all the data transfer after filled the SASLOG and that caused the OOM.
So, can you try running this with less that the whole table so we can see the log after and see how much memory is really used for pandas to create the df? Work it from the other direction and see what we see. Start small and build up, seeing if the transient mem grows unusually or consistently. And see the SASLOG and see if the final mem for the df is less than the transient mem used while creating it.
Thanks, Tom
I was able to use tracemalloc to get a look at the memory consumption for the sd2df() call, for a test case I'm playing with. Pandas consumes 2 times the mem of the final dataframe while building it (I've seen it hit 3 times consumption too), while the memory accounted to saspy is less then 1M. This is with a sas dataset that's about 1G on disk (115 numeric columns and 1M rows). I'll see about running it on bigger data too, but here's the results from what I was doing. You can try the same on your test case (smaller data that will finish instead of abend the python process).
import tracemalloc
>>> sd = sas.sasdata('saspy1g','x')
>>> sd; sd.obs()
Libref = x
Table = saspy1g
Dsopts = {}
Results = Pandas
1000000
>>> tracemalloc.start()
>>> df = sd.to_df()
>>>
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>> tot_saspy = 0
>>> for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_saspy += stat.size * stat.count
...
>>> print("saspy = ", str(tot_saspy))
saspy = 830705
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_pandas += stat.size * stat.count
...
>>> print("pandas = ", str(tot_pandas))
pandas = 1844859927
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 115 entries, xIn1 to yBinary
dtypes: float64(115)
memory usage: 877.4 MB
>>>
>>> tot_all = 0
>>> for stat in top_stats:
... #print(stat)
... tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all = 1873113836
>>>
Ok, so I'm learning about this tracemalloc module as I go. Clearly, it's not keeping a running total of memory consumption; it's each malloc (with no subtracting for free's). In this next test case I ran w/ a different 1G dataset that has a mix of char and num and datetimes. This made an enormous difference in the stats. but clearly the 'total' I'm getting from the trace records isn't a cumulative number, sine it shows 18,165,368,718,822 as sum of all the alloc's. Clearly, it was freeing most of this along the way too. I don't see that this module can to a running total, so that's still best observed at the OS layer looking at process statistics.
I did run the 4G version of my previous dataset (all doubles), and it took 9G (total allocs over the duration), though I observed over 2G for the process, just by watching top from another terminal for that python process.
None of this is definitive, but it is showing significant, and kinda strange, memory usage creating the dataframe. In this case below, with different data coming over, saspy had 1.3G of allocations over the duration, while pandas had 18,165G of allocations over the duration. So, there's no telling from this how much at any one time, but clearly, pandas is allocating and reallocating no end of objects and mem while creating the dataframe.
So, again, I think trying this on your data, with reduced number of observations; incrementing and looking at memory consumption, while trying increasingly larger number of observations (and look at the SAS log in case there's something going on there) is the course to try. How pandas uses memory seems to be all over the place from just these couple tries I ran with different kinds of data. The mixed data (not just all doubles) seemed to explode the amount of allocations.
tom64-7> python3
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> import tracemalloc
>>>
>>> sas = saspy.SASsession(cfgname='iomj', iomhost='localhost', authkey='unix')
SAS Connection established. Subprocess id is 113946
>>> sas
Access Method = IOM
SAS Config name = iomj
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /sastmp/SAS_work1A5D0001BB01_tom64-7/SAS_workC33A0001BB01_tom64-7/
SAS Version = 9.04.01M8P10232022
SASPy Version = 4.3.5
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 113409
>>>
>>> sas.saslib('x', path='/opt/tom')
7 The SAS System Monday, October 24, 2022 12:47:00 PM
34
35 libname x '/opt/tom' ;
NOTE: Libref X was successfully assigned as follows:
Engine: V9
Physical Name: /opt/tom
36
37
38
8 The SAS System Monday, October 24, 2022 12:47:00 PM
39
>>> sd = sas.sasdata('chris','x')
>>> sd; sd.obs()
Libref = x
Table = chris
Dsopts = {}
Results = Pandas
50000
>>> sd.contents('text')
The SAS System Monday, October 24, 2022 12:19:00 PM 4
The CONTENTS Procedure
Data Set Name X.CHRIS Observations 50000
Member Type DATA Variables 139
Engine V9 Indexes 0
Created 10/04/2017 10:34:35 Observation Length 20368
Last Modified 10/04/2017 10:34:35 Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding latin1 Western (ISO)
Engine/Host Dependent Information
Data Set Page Size 245760
Number of Data Set Pages 4167
First Data Page 1
Max Obs per Page 12
Obs in First Data Page 10
Number of Data Set Repairs 0
Filename /opt/tom/chris.sas7bdat
Release Created 9.0401M4
Host Created X64_10PRO
Inode Number 412417536
Access Permission rwxr-xr-x
Owner Name sastpw
File Size 977MB
File Size (bytes) 1024327680
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
22 ACCRUED_MKT_ADJ_PCT Num 8 4. 4. ACCRUED_MKT_ADJ_PCT
23 ACCRUED_PERCENT Num 8 4. 4. ACCRUED_PERCENT
63 ADDRESS1 Char 100 $100. $100. ADDRESS1
64 ADDRESS2 Char 100 $100. $100. ADDRESS2
65 ADDRESS3 Char 100 $100. $100. ADDRESS3
66 ADDRESS4 Char 100 $100. $100. ADDRESS4
106 AIR Char 12 $12. $12. AIR
14 ALL_GRADES Char 1000 $1000. $1000. ALL_GRADES
99 ALL_INTERIOR_FINISHES Char 500 $500. $500. ALL_INTERIOR_FINISHES
18 ALL_LAND_DESCS Char 1000 $1000. $1000. ALL_LAND_DESCS
16 ALL_LAND_RATES Char 1000 $1000. $1000. ALL_LAND_RATES
101 ALL_MEZZANINE_FINISHES Char 500 $500. $500. ALL_MEZZANINE_FINISHES
108 ALL_ROOF_FLOOR_SYSTEM Char 500 $500. $500. ALL_ROOF_FLOOR_SYSTEM
127 ALL_ZONINGS Char 500 $500. $500. ALL_ZONINGS
95 ATTIC_FINISH Char 12 $12. $12. ATTIC_FINISH
96 ATTIC_FINISH_PERCENT Num 8 4. 4. ATTIC_FINISH_PERCENT
3 AUT_SNAPSHOT_DATE Num 8 DATETIME20. DATETIME20. AUT_SNAPSHOT_DATE
92 BASEMENT Char 12 $12. $12. BASEMENT
94 BASEMENT_FINISH Char 25 $25. $25. BASEMENT_FINISH
93 BASEMENT_PERCENT Num 8 4. 4. BASEMENT_PERCENT
128 BATHS Char 25 $25. $25. BATHS
68 BILLING_CLASS Num 8 BILLING_CLASS
69 BILLING_CLASS_DESC Char 20 $20. $20. BILLING_CLASS_DESC
52 BLDG_USE Char 8 $8. $8. BLDG_USE
53 BLDG_USE_DESC Char 20 $20. $20. BLDG_USE_DESC
37 BLDG_VALUE_ASSD Num 8 13. 13. BLDG_VALUE_ASSD
21 BMC_ADJUSTMENTS Char 1000 $1000. $1000. BMC_ADJUSTMENTS
126 CONDO_FACTORS Char 100 $100. $100. CONDO_FACTORS
41 CORP_LIMIT Char 8 $8. $8. CORP_LIMIT
42 CORP_LIMIT_DESC Char 50 $50. $50. CORP_LIMIT_DESC
39 DEEDED_ACREAGE Num 8 12.4 12.4 DEEDED_ACREAGE
57 DEED_BOOK Char 6 $6. $6. DEED_BOOK
59 DEED_DATE Num 8 DATETIME20. DATETIME20. DEED_DATE
58 DEED_PAGE Char 6 $6. $6. DEED_PAGE
27 DEPRECIATION_RATING Char 8 $8. $8. DEPRECIATION_RATING
54 DESIGN_STYLE Char 8 $8. $8. DESIGN_STYLE
55 DESIGN_STYLE_DESC Char 20 $20. $20. DESIGN_STYLE_DESC
71 DIRECTIONAL_PREFIX Char 2 $2. $2. DIRECTIONAL_PREFIX
74 DIRECTIONAL_SUFFIX Char 2 $2. $2. DIRECTIONAL_SUFFIX
8 EFFECTIVE_YEAR Num 8 5. 5. EFFECTIVE_YEAR
134 EXCLUSION_VALUE Num 8 13. 13. EXCLUSION_VALUE
48 EXEMPT_STATUS Char 8 $8. $8. EXEMPT_STATUS
49 EXEMPT_STATUS_DESC Char 20 $20. $20. EXEMPT_STATUS_DESC
135 EXEMPT_VALUE Num 8 13. 13. EXEMPT_VALUE
97 EXTERIOR_FINISH Char 12 $12. $12. EXTERIOR_FINISH
129 FIREPLACE_TYPE1 Num 8 4. 4. FIREPLACE_TYPE1
130 FIREPLACE_TYPE2 Num 8 4. 4. FIREPLACE_TYPE2
46 FIRE_DISTRICT Char 8 $8. $8. FIRE_DISTRICT
62 FULL_ADDRESS Char 1000 $1000. $1000. FULL_ADDRESS
13 GRADE Char 8 $8. $8. GRADE
105 HEAT Char 12 $12. $12. HEAT
137 HISTDEF_VALUE Num 8 13. 13. HISTDEF_VALUE
110 IGNORE_UNITS Char 3 $3. $3. IGNORE_UNITS
102 INCOME_INDICATED_VALUE Num 8 14.2 14.2 INCOME_INDICATED_VALUE
104 INCOME_PERS_PROPERTY_VALUE Num 8 13.2 13.2 INCOME_PERS_PROPERTY_VALUE
103 INCOME_RESIDUAL_LAND_VALUE Num 8 13.2 13.2 INCOME_RESIDUAL_LAND_VALUE
98 INTERIOR_FINISH Char 25 $25. $25. INTERIOR_FINISH
136 LANDDEF_VALUE Num 8 13. 13. LANDDEF_VALUE
50 LAND_CLASS Char 8 $8. $8. LAND_CLASS
51 LAND_CLASS_DESC Char 20 $20. $20. LAND_CLASS_DESC
19 LAND_CONDITIONS Char 1000 $1000. $1000. LAND_CONDITIONS
17 LAND_DESC Char 8 $8. $8. LAND_DESC
20 LAND_INFLUENCES Char 1000 $1000. $1000. LAND_INFLUENCES
15 LAND_RATE Num 8 16.2 16.2 LAND_RATE
11 LAND_SALE_DATE Num 8 DATETIME20. DATETIME20. LAND_SALE_DATE
9 LAND_SALE_PRICE Num 8 18.2 18.2 LAND_SALE_PRICE
83 LAND_SALE_PRICE_PER_ACRE Num 8 11. 11. LAND_SALE_PRICE_PER_ACRE
80 LAND_SALE_PRICE_PER_SQ_FT Num 8 12.2 12.2 LAND_SALE_PRICE_PER_SQ_FT
84 LAND_SALE_RATIO Num 8 12.2 12.2 LAND_SALE_RATIO
86 LAND_TO_BLDG_RATIO Num 8 12.2 12.2 LAND_TO_BLDG_RATIO
36 LAND_VALUE_ASSD Num 8 13. 13. LAND_VALUE_ASSD
82 LAND_VALUE_PER_ACRE Num 8 11. 11. LAND_VALUE_PER_ACRE
81 LAND_VALUE_PER_SQ_FT Num 8 12.2 12.2 LAND_VALUE_PER_SQ_FT
90 LATEST_LAND_SALE_DATE Num 8 DATETIME20. DATETIME20. LATEST_LAND_SALE_DATE
88 LATEST_LAND_SALE_PRICE Num 8 18.2 18.2 LATEST_LAND_SALE_PRICE
91 LATEST_PACKAGE_SALE_DATE Num 8 DATETIME20. DATETIME20. LATEST_PACKAGE_SALE_DATE
89 LATEST_PACKAGE_SALE_PRICE Num 8 18.2 18.2 LATEST_PACKAGE_SALE_PRICE
77 LOCID Num 8 7. 7. LOCID
25 MAP_BOOK Char 6 $6. $6. MAP_BOOK
26 MAP_PAGE Char 6 $6. $6. MAP_PAGE
100 MEZZANINE_FINISH Char 25 $25. $25. MEZZANINE_FINISH
139 NET_BILLABLE_VALUE Num 8 13. 13. NET_BILLABLE_VALUE
34 OLD_PARCEL_BLOCK Char 5 $5. $5. OLD_PARCEL_BLOCK
35 OLD_PARCEL_LOT Char 4 $4. $4. OLD_PARCEL_LOT
33 OLD_PARCEL_MAP Char 5 $5. $5. OLD_PARCEL_MAP
112 OUTBUILDING_INFO Char 2500 $2500. $2500. OUTBUILDING_INFO
113 OUTBUILDING_TYPE_AND_GRADE Char 2000 $2000. $2000. OUTBUILDING_TYPE_AND_GRADE
114 OUTBUILDING_TYPE_AND_RATE Char 2000 $2000. $2000. OUTBUILDING_TYPE_AND_RATE
115 OUTBUILDING_TYPE_AND_VALUE Char 2000 $2000. $2000. OUTBUILDING_TYPE_AND_VALUE
60 OWNER1 Char 50 $50. $50. OWNER1
61 OWNER2 Char 50 $50. $50. OWNER2
12 PACKAGE_SALE_DATE Num 8 DATETIME20. DATETIME20. PACKAGE_SALE_DATE
79 PACKAGE_SALE_PER_SQ_FT Num 8 11. 11. PACKAGE_SALE_PER_SQ_FT
10 PACKAGE_SALE_PRICE Num 8 18.2 18.2 PACKAGE_SALE_PRICE
85 PACKAGE_SALE_RATIO Num 8 12.2 12.2 PACKAGE_SALE_RATIO
1 PARCEL_NUM Char 20 $20. $20. PARCEL_NUM
2 PARCEL_PK Num 8 11. 11. PARCEL_PK
32 PARCEL_STATUS Char 8 $8. $8. PARCEL_STATUS
5 PIN_EXT Char 4 $4. $4. PIN_EXT
28 PIN_MAP Char 10 $10. $10. PIN_MAP
4 PIN_NUM Char 14 $14. $14. PIN_NUM
43 PLANNING_JURISDICTION Char 8 $8. $8. PLANNING_JURISDICTION
132 PLANNING_LANDUSE_ABBREV Char 3 $3. $3. PLANNING_LANDUSE_ABBREV
131 PLANNING_LANDUSE_CODE Char 1 $1. $1. PLANNING_LANDUSE_CODE
133 PLANNING_LANDUSE_DESC Char 80 $80. $80. PLANNING_LANDUSE_DESC
67 PROPERTY_DESC Char 250 $250. $250. PROPERTY_DESC
87 PROPERTY_INDEX Char 75 $75. $75. PROPERTY_INDEX
119 REVAL_ESTIMATED_VALUE Num 8 11. 11. REVAL_ESTIMATED_VALUE
118 REVAL_FINAL_REVIEW_STATUS Char 3 $3. $3. REVAL_FINAL_REVIEW_STATUS
120 REVAL_REVIEWED_BY Char 10 $10. $10. REVAL_REVIEWED_BY
121 REVAL_REVIEWED_DATE Num 8 DATETIME20. DATETIME20. REVAL_REVIEWED_DATE
122 REVAL_TOTAL_PROP_VALUE Num 8 11. 11. REVAL_TOTAL_PROP_VALUE
123 REVAL_VCS_REVIEW_STATUS Char 250 $250. $250. REVAL_VCS_REVIEW_STATUS
107 ROOF_FLOOR_SYSTEM Char 12 $12. $12. ROOF_FLOOR_SYSTEM
47 SPECIAL_DISTRICT Char 8 $8. $8. SPECIAL_DISTRICT
111 SPECIAL_PROPERTY_FLAGS Char 500 $500. $500. SPECIAL_PROPERTY_FLAGS
31 STORY_HEIGHT Char 8 $8. $8. STORY_HEIGHT
75 STREET_MISC Char 6 $6. $6. STREET_MISC
72 STREET_NAME Char 25 $25. $25. STREET_NAME
70 STREET_NUM Num 8 7. 7. STREET_NUM
73 STREET_TYPE Char 4 $4. $4. STREET_TYPE
138 TAXRELIEF_VALUE Num 8 13. 13. TAXRELIEF_VALUE
7 TOTAL_BUILDINGS Num 8 4. 4. TOTAL_BUILDINGS
29 TOTAL_HEATED_AREA Num 8 TOTAL_HEATED_AREA
116 TOTAL_LANDLINES Num 8 4. 4. TOTAL_LANDLINES
30 TOTAL_LIVING_UNITS Num 8 TOTAL_LIVING_UNITS
56 TOTAL_OBLDG_VALUE_ASSD Num 8 10. 10. TOTAL_OBLDG_VALUE_ASSD
117 TOTAL_OUT_BUILDINGS Num 8 4. 4. TOTAL_OUT_BUILDINGS
38 TOTAL_PROP_VALUE_ASSD Num 8 13. 13. TOTAL_PROP_VALUE_ASSD
78 TOTAL_VALUE_PER_SQ_FT Num 8 11. 11. TOTAL_VALUE_PER_SQ_FT
125 TOWNHOUSE_DETACHED Char 3 $3. $3. TOWNHOUSE_DETACHED
124 TOWNHOUSE_END_UNIT Char 3 $3. $3. TOWNHOUSE_END_UNIT
44 TOWNSHIP Char 8 $8. $8. TOWNSHIP
45 TOWNSHIP_DESC Char 50 $50. $50. TOWNSHIP_DESC
6 VCS Char 7 $7. $7. VCS
40 YEAR_BUILT Num 8 5. 5. YEAR_BUILT
76 ZIP Char 5 $5. $5. ZIP
24 ZONING Char 10 $10. $10. ZONING
109 ZONING_CATEGORY Char 20 $20. $20. ZONING_CATEGORY
>>> sd = sas.sasdata('chris','x')
>>> sd; sd.obs()
Libref = x
Table = chris
Dsopts = {}
Results = Pandas
50000
>>>
>>> tracemalloc.start()
>>> df = sd.to_df()
>>>
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>>
>>>
>>> tot_saspy = 0
>>> for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_saspy += stat.size * stat.count
...
>>> print("saspy = ", str(tot_saspy))
saspy = 1351015
>>>
>>>
>>>
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_pandas += stat.size * stat.count
...
>>> print("pandas = ", str(tot_pandas))
pandas = 18165160304046
>>>
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 139 entries, PARCEL_NUM to NET_BILLABLE_VALUE
dtypes: datetime64[ns](7), float64(47), object(85)
memory usage: 53.0+ MB
>>>
>>>
>>> tot_all = 0
>>> for stat in top_stats:
... #print(stat)
... tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all = 18165776279127
>>>
>>>
Hi Tom, Ok i send you my SAS table today. Thanks for these explanations ! Aurélien
De : Tom Weber @.> Envoyé : lundi 24 octobre 2022 19:30 À : sassoftware/saspy @.> Cc : THIERY Aurelien @.>; Mention @.> Objet : Re: [sassoftware/saspy] Convert big SAS Data to pandas dataframe take all RAM (Issue #494)
Ok, so I'm learning about this tracemalloc module as I go. Clearly, it's not keeping a running total of memory consumption; it's each malloc (with no subtracting for free's). In this next test case I ran w/ a different 1G dataset that has a mix of char and num and datetimes. This made an enormous difference in the stats. but clearly the 'total' I'm getting from the trace records isn't a cumulative number, sine it shows 18,165,368,718,822 as sum of all the alloc's. Clearly, it was freeing most of this along the way too. I don't see that this module can to a running total, so that's still best observed at the OS layer looking at process statistics.
I did run the 4G version of my previous dataset (all doubles), and it took 9G (total allocs over the duration), though I observed over 2G for the process, just by watching top from another terminal for that python process.
None of this is definitive, but it is showing significant, and kinda strange, memory usage creating the dataframe. In this case below, with different data coming over, saspy had 1.3G of allocations over the duration, while pandas had 18,165G of allocations over the duration. So, there's no telling from this how much at any one time, but clearly, pandas is allocating and reallocating no end of objects and mem while creating the dataframe.
So, again, I think trying this on your data, with reduced number of observations; incrementing and looking at memory consumption, while trying increasingly larger number of observations (and look at the SAS log in case there's something going on there) is the course to try. How pandas uses memory seems to be all over the place from just these couple tries I ran with different kinds of data. The mixed data (not just all doubles) seemed to explode the amount of allocations.
tom64-7> python3
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
import saspy
import tracemalloc
sas = saspy.SASsession(cfgname='iomj', iomhost='localhost', authkey='unix')
SAS Connection established. Subprocess id is 113946
sas
Access Method = IOM
SAS Config name = iomj
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /sastmp/SAS_work1A5D0001BB01_tom64-7/SAS_workC33A0001BB01_tom64-7/
SAS Version = 9.04.01M8P10232022
SASPy Version = 4.3.5
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 113409
sas.saslib('x', path='/opt/tom')
7 The SAS System Monday, October 24, 2022 12:47:00 PM
34
35 libname x '/opt/tom' ;
NOTE: Libref X was successfully assigned as follows:
Engine: V9
Physical Name: /opt/tom
36
37
38
8 The SAS System Monday, October 24, 2022 12:47:00 PM
39
sd = sas.sasdata('chris','x')
sd; sd.obs()
Libref = x
Table = chris
Dsopts = {}
Results = Pandas
50000
sd.contents('text')
The SAS System Monday, October 24, 2022 12:19:00 PM 4
The CONTENTS Procedure
Data Set Name X.CHRIS Observations 50000
Member Type DATA Variables 139
Engine V9 Indexes 0
Created 10/04/2017 10:34:35 Observation Length 20368
Last Modified 10/04/2017 10:34:35 Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding latin1 Western (ISO)
Engine/Host Dependent Information
Data Set Page Size 245760
Number of Data Set Pages 4167
First Data Page 1
Max Obs per Page 12
Obs in First Data Page 10
Number of Data Set Repairs 0
Filename /opt/tom/chris.sas7bdat
Release Created 9.0401M4
Host Created X64_10PRO
Inode Number 412417536
Access Permission rwxr-xr-x
Owner Name sastpw
File Size 977MB
File Size (bytes) 1024327680
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
22 ACCRUED_MKT_ADJ_PCT Num 8 4. 4. ACCRUED_MKT_ADJ_PCT
23 ACCRUED_PERCENT Num 8 4. 4. ACCRUED_PERCENT
63 ADDRESS1 Char 100 $100. $100. ADDRESS1
64 ADDRESS2 Char 100 $100. $100. ADDRESS2
65 ADDRESS3 Char 100 $100. $100. ADDRESS3
66 ADDRESS4 Char 100 $100. $100. ADDRESS4
106 AIR Char 12 $12. $12. AIR
14 ALL_GRADES Char 1000 $1000. $1000. ALL_GRADES
99 ALL_INTERIOR_FINISHES Char 500 $500. $500. ALL_INTERIOR_FINISHES
18 ALL_LAND_DESCS Char 1000 $1000. $1000. ALL_LAND_DESCS
16 ALL_LAND_RATES Char 1000 $1000. $1000. ALL_LAND_RATES
101 ALL_MEZZANINE_FINISHES Char 500 $500. $500. ALL_MEZZANINE_FINISHES
108 ALL_ROOF_FLOOR_SYSTEM Char 500 $500. $500. ALL_ROOF_FLOOR_SYSTEM
127 ALL_ZONINGS Char 500 $500. $500. ALL_ZONINGS
95 ATTIC_FINISH Char 12 $12. $12. ATTIC_FINISH
96 ATTIC_FINISH_PERCENT Num 8 4. 4. ATTIC_FINISH_PERCENT
3 AUT_SNAPSHOT_DATE Num 8 DATETIME20. DATETIME20. AUT_SNAPSHOT_DATE
92 BASEMENT Char 12 $12. $12. BASEMENT
94 BASEMENT_FINISH Char 25 $25. $25. BASEMENT_FINISH
93 BASEMENT_PERCENT Num 8 4. 4. BASEMENT_PERCENT
128 BATHS Char 25 $25. $25. BATHS
68 BILLING_CLASS Num 8 BILLING_CLASS
69 BILLING_CLASS_DESC Char 20 $20. $20. BILLING_CLASS_DESC
52 BLDG_USE Char 8 $8. $8. BLDG_USE
53 BLDG_USE_DESC Char 20 $20. $20. BLDG_USE_DESC
37 BLDG_VALUE_ASSD Num 8 13. 13. BLDG_VALUE_ASSD
21 BMC_ADJUSTMENTS Char 1000 $1000. $1000. BMC_ADJUSTMENTS
126 CONDO_FACTORS Char 100 $100. $100. CONDO_FACTORS
41 CORP_LIMIT Char 8 $8. $8. CORP_LIMIT
42 CORP_LIMIT_DESC Char 50 $50. $50. CORP_LIMIT_DESC
39 DEEDED_ACREAGE Num 8 12.4 12.4 DEEDED_ACREAGE
57 DEED_BOOK Char 6 $6. $6. DEED_BOOK
59 DEED_DATE Num 8 DATETIME20. DATETIME20. DEED_DATE
58 DEED_PAGE Char 6 $6. $6. DEED_PAGE
27 DEPRECIATION_RATING Char 8 $8. $8. DEPRECIATION_RATING
54 DESIGN_STYLE Char 8 $8. $8. DESIGN_STYLE
55 DESIGN_STYLE_DESC Char 20 $20. $20. DESIGN_STYLE_DESC
71 DIRECTIONAL_PREFIX Char 2 $2. $2. DIRECTIONAL_PREFIX
74 DIRECTIONAL_SUFFIX Char 2 $2. $2. DIRECTIONAL_SUFFIX
8 EFFECTIVE_YEAR Num 8 5. 5. EFFECTIVE_YEAR
134 EXCLUSION_VALUE Num 8 13. 13. EXCLUSION_VALUE
48 EXEMPT_STATUS Char 8 $8. $8. EXEMPT_STATUS
49 EXEMPT_STATUS_DESC Char 20 $20. $20. EXEMPT_STATUS_DESC
135 EXEMPT_VALUE Num 8 13. 13. EXEMPT_VALUE
97 EXTERIOR_FINISH Char 12 $12. $12. EXTERIOR_FINISH
129 FIREPLACE_TYPE1 Num 8 4. 4. FIREPLACE_TYPE1
130 FIREPLACE_TYPE2 Num 8 4. 4. FIREPLACE_TYPE2
46 FIRE_DISTRICT Char 8 $8. $8. FIRE_DISTRICT
62 FULL_ADDRESS Char 1000 $1000. $1000. FULL_ADDRESS
13 GRADE Char 8 $8. $8. GRADE
105 HEAT Char 12 $12. $12. HEAT
137 HISTDEF_VALUE Num 8 13. 13. HISTDEF_VALUE
110 IGNORE_UNITS Char 3 $3. $3. IGNORE_UNITS
102 INCOME_INDICATED_VALUE Num 8 14.2 14.2 INCOME_INDICATED_VALUE
104 INCOME_PERS_PROPERTY_VALUE Num 8 13.2 13.2 INCOME_PERS_PROPERTY_VALUE
103 INCOME_RESIDUAL_LAND_VALUE Num 8 13.2 13.2 INCOME_RESIDUAL_LAND_VALUE
98 INTERIOR_FINISH Char 25 $25. $25. INTERIOR_FINISH
136 LANDDEF_VALUE Num 8 13. 13. LANDDEF_VALUE
50 LAND_CLASS Char 8 $8. $8. LAND_CLASS
51 LAND_CLASS_DESC Char 20 $20. $20. LAND_CLASS_DESC
19 LAND_CONDITIONS Char 1000 $1000. $1000. LAND_CONDITIONS
17 LAND_DESC Char 8 $8. $8. LAND_DESC
20 LAND_INFLUENCES Char 1000 $1000. $1000. LAND_INFLUENCES
15 LAND_RATE Num 8 16.2 16.2 LAND_RATE
11 LAND_SALE_DATE Num 8 DATETIME20. DATETIME20. LAND_SALE_DATE
9 LAND_SALE_PRICE Num 8 18.2 18.2 LAND_SALE_PRICE
83 LAND_SALE_PRICE_PER_ACRE Num 8 11. 11. LAND_SALE_PRICE_PER_ACRE
80 LAND_SALE_PRICE_PER_SQ_FT Num 8 12.2 12.2 LAND_SALE_PRICE_PER_SQ_FT
84 LAND_SALE_RATIO Num 8 12.2 12.2 LAND_SALE_RATIO
86 LAND_TO_BLDG_RATIO Num 8 12.2 12.2 LAND_TO_BLDG_RATIO
36 LAND_VALUE_ASSD Num 8 13. 13. LAND_VALUE_ASSD
82 LAND_VALUE_PER_ACRE Num 8 11. 11. LAND_VALUE_PER_ACRE
81 LAND_VALUE_PER_SQ_FT Num 8 12.2 12.2 LAND_VALUE_PER_SQ_FT
90 LATEST_LAND_SALE_DATE Num 8 DATETIME20. DATETIME20. LATEST_LAND_SALE_DATE
88 LATEST_LAND_SALE_PRICE Num 8 18.2 18.2 LATEST_LAND_SALE_PRICE
91 LATEST_PACKAGE_SALE_DATE Num 8 DATETIME20. DATETIME20. LATEST_PACKAGE_SALE_DATE
89 LATEST_PACKAGE_SALE_PRICE Num 8 18.2 18.2 LATEST_PACKAGE_SALE_PRICE
77 LOCID Num 8 7. 7. LOCID
25 MAP_BOOK Char 6 $6. $6. MAP_BOOK
26 MAP_PAGE Char 6 $6. $6. MAP_PAGE
100 MEZZANINE_FINISH Char 25 $25. $25. MEZZANINE_FINISH
139 NET_BILLABLE_VALUE Num 8 13. 13. NET_BILLABLE_VALUE
34 OLD_PARCEL_BLOCK Char 5 $5. $5. OLD_PARCEL_BLOCK
35 OLD_PARCEL_LOT Char 4 $4. $4. OLD_PARCEL_LOT
33 OLD_PARCEL_MAP Char 5 $5. $5. OLD_PARCEL_MAP
112 OUTBUILDING_INFO Char 2500 $2500. $2500. OUTBUILDING_INFO
113 OUTBUILDING_TYPE_AND_GRADE Char 2000 $2000. $2000. OUTBUILDING_TYPE_AND_GRADE
114 OUTBUILDING_TYPE_AND_RATE Char 2000 $2000. $2000. OUTBUILDING_TYPE_AND_RATE
115 OUTBUILDING_TYPE_AND_VALUE Char 2000 $2000. $2000. OUTBUILDING_TYPE_AND_VALUE
60 OWNER1 Char 50 $50. $50. OWNER1
61 OWNER2 Char 50 $50. $50. OWNER2
12 PACKAGE_SALE_DATE Num 8 DATETIME20. DATETIME20. PACKAGE_SALE_DATE
79 PACKAGE_SALE_PER_SQ_FT Num 8 11. 11. PACKAGE_SALE_PER_SQ_FT
10 PACKAGE_SALE_PRICE Num 8 18.2 18.2 PACKAGE_SALE_PRICE
85 PACKAGE_SALE_RATIO Num 8 12.2 12.2 PACKAGE_SALE_RATIO
1 PARCEL_NUM Char 20 $20. $20. PARCEL_NUM
2 PARCEL_PK Num 8 11. 11. PARCEL_PK
32 PARCEL_STATUS Char 8 $8. $8. PARCEL_STATUS
5 PIN_EXT Char 4 $4. $4. PIN_EXT
28 PIN_MAP Char 10 $10. $10. PIN_MAP
4 PIN_NUM Char 14 $14. $14. PIN_NUM
43 PLANNING_JURISDICTION Char 8 $8. $8. PLANNING_JURISDICTION
132 PLANNING_LANDUSE_ABBREV Char 3 $3. $3. PLANNING_LANDUSE_ABBREV
131 PLANNING_LANDUSE_CODE Char 1 $1. $1. PLANNING_LANDUSE_CODE
133 PLANNING_LANDUSE_DESC Char 80 $80. $80. PLANNING_LANDUSE_DESC
67 PROPERTY_DESC Char 250 $250. $250. PROPERTY_DESC
87 PROPERTY_INDEX Char 75 $75. $75. PROPERTY_INDEX
119 REVAL_ESTIMATED_VALUE Num 8 11. 11. REVAL_ESTIMATED_VALUE
118 REVAL_FINAL_REVIEW_STATUS Char 3 $3. $3. REVAL_FINAL_REVIEW_STATUS
120 REVAL_REVIEWED_BY Char 10 $10. $10. REVAL_REVIEWED_BY
121 REVAL_REVIEWED_DATE Num 8 DATETIME20. DATETIME20. REVAL_REVIEWED_DATE
122 REVAL_TOTAL_PROP_VALUE Num 8 11. 11. REVAL_TOTAL_PROP_VALUE
123 REVAL_VCS_REVIEW_STATUS Char 250 $250. $250. REVAL_VCS_REVIEW_STATUS
107 ROOF_FLOOR_SYSTEM Char 12 $12. $12. ROOF_FLOOR_SYSTEM
47 SPECIAL_DISTRICT Char 8 $8. $8. SPECIAL_DISTRICT
111 SPECIAL_PROPERTY_FLAGS Char 500 $500. $500. SPECIAL_PROPERTY_FLAGS
31 STORY_HEIGHT Char 8 $8. $8. STORY_HEIGHT
75 STREET_MISC Char 6 $6. $6. STREET_MISC
72 STREET_NAME Char 25 $25. $25. STREET_NAME
70 STREET_NUM Num 8 7. 7. STREET_NUM
73 STREET_TYPE Char 4 $4. $4. STREET_TYPE
138 TAXRELIEF_VALUE Num 8 13. 13. TAXRELIEF_VALUE
7 TOTAL_BUILDINGS Num 8 4. 4. TOTAL_BUILDINGS
29 TOTAL_HEATED_AREA Num 8 TOTAL_HEATED_AREA
116 TOTAL_LANDLINES Num 8 4. 4. TOTAL_LANDLINES
30 TOTAL_LIVING_UNITS Num 8 TOTAL_LIVING_UNITS
56 TOTAL_OBLDG_VALUE_ASSD Num 8 10. 10. TOTAL_OBLDG_VALUE_ASSD
117 TOTAL_OUT_BUILDINGS Num 8 4. 4. TOTAL_OUT_BUILDINGS
38 TOTAL_PROP_VALUE_ASSD Num 8 13. 13. TOTAL_PROP_VALUE_ASSD
78 TOTAL_VALUE_PER_SQ_FT Num 8 11. 11. TOTAL_VALUE_PER_SQ_FT
125 TOWNHOUSE_DETACHED Char 3 $3. $3. TOWNHOUSE_DETACHED
124 TOWNHOUSE_END_UNIT Char 3 $3. $3. TOWNHOUSE_END_UNIT
44 TOWNSHIP Char 8 $8. $8. TOWNSHIP
45 TOWNSHIP_DESC Char 50 $50. $50. TOWNSHIP_DESC
6 VCS Char 7 $7. $7. VCS
40 YEAR_BUILT Num 8 5. 5. YEAR_BUILT
76 ZIP Char 5 $5. $5. ZIP
24 ZONING Char 10 $10. $10. ZONING
109 ZONING_CATEGORY Char 20 $20. $20. ZONING_CATEGORY
sd = sas.sasdata('chris','x')
sd; sd.obs()
Libref = x
Table = chris
Dsopts = {}
Results = Pandas
50000
tracemalloc.start()
df = sd.to_df()
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
tot_saspy = 0
for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_saspy += stat.size * stat.count
...
print("saspy = ", str(tot_saspy))
saspy = 1351015
tot_pandas = 0
for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_pandas += stat.size * stat.count
...
print("pandas = ", str(tot_pandas))
pandas = 18165160304046
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 139 entries, PARCEL_NUM to NET_BILLABLE_VALUE
dtypes: datetime64ns, float64(47), object(85)
memory usage: 53.0+ MB
tot_all = 0
for stat in top_stats:
... #print(stat)
... tot_all += stat.size * stat.count
...
print("all = ", str(tot_all))
all = 18165776279127
— Reply to this email directly, view it on GitHubhttps://github.com/sassoftware/saspy/issues/494#issuecomment-1289360697, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD5A4KARM5UDTZWD4MJS2LLWE3BRLANCNFSM6AAAAAARJO2FPQ. You are receiving this because you were mentioned.Message ID: @.**@.>>
Got it! I'm in an all day thing today at work, but I will use this to investigate and see what I can find asap. Thanks! Tom
Well, I spent some time last night running some comparisons, using that tracemalloc in python and using top -d1 to observe cpu and memory consumption at the process level. tracemalloc output, from what I was trying, shows all of the allocations from the various code running but not free's, so the numbers, when added up, can only be interpreted relative to themselves as an indication of how much overall work (thrashing?) the method is doing with regards to allocating and freeing memory - object creation and reclamation. Seeing the numbers add up to Terabytes and Petabytes would seem crazy if you didn't take this into account.
The table below shows a comparison of using 1 million rows from your data set. I used sas.sd2df() to read it in (which was successful since I didn't run out of mem trying), then I used pandas to write the df out to a csv file (with it's defaults). I then read that csv file in, using pandas to compare the consumption that pandas used without me in the equation. That's about as equal as I could make this. I also ran sas.sd2df() and pandas.read_csv() from a fresh Python session, of course, so they would be as equal as can be also.
What's shown by this is that the dataframe, regardless of how it was created ends up taking 2.1G of memory once it's finished being created. In both cases, the highest resident memory consumed while creating the dataframe was over 11G. The total number of allocation by pandas (again, ignoring the free's) was .8/.9 Petabytes for both cases. I don't know pandas enough to do anything but speculate, but what this says to me is that there is a lot of object creation and re-creation going on while pandas is parsing the data and generating the dataframe. Also, it shows that it's pretty much equal regardless of if saspy is in the mix or not.
1M rows from mvtpra4 as either SAS dataset or csv file
one using saspy other only pandas by itself w/ it's own csv (created via: df.to_csv() )
these are from tracemalloc
final DF mem process resident mem saspy pandas other
(df.info()) (highest observed)
sas.sd2df() 2.1G 11.2G 4.6G 834.5T 5.2G
pandas.read_csv() 2.1G 11.4G N/A 976.1T 49.418P
The next test I was going to do was to try to use tracemalloc in my code and try to do some of the example to try and get relative memory consumption at the pandas read level, hoping to try and get better info from that tool. However, since I observed the same memory consumption by pandas regardless of if it was from using sas.sd2df or pandas.read_csv, it seems moot. Pandas is consuming the memory, many times the final size of the dataframe. saspy made no appreciable difference in that.
So, I'm not sure that there's anything I can do to make this behave better. Just seems to be how pandas behaves. Here's the runs I did to get this information, for the table above.
tom64-7> python3
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> import tracemalloc
>>> sas = saspy.SASsession(cfgname='iomj', iomhost='localhost', authkey='unix')
SAS Connection established. Subprocess id is 128294
>>> sas
Access Method = IOM
SAS Config name = iomj
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /sastmp/SAS_workBE2B0001F4D1_tom64-7/SAS_work009A0001F4D1_tom64-7/
SAS Version = 9.04.01M8P10252022
SASPy Version = 4.3.5
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 128209
>>> sas.saslib('x', path='/opt/tom')
7 The SAS System Tuesday, October 25, 2022 09:11:00 PM
34
35 libname x '/opt/tom' ;
NOTE: Libref X was successfully assigned as follows:
Engine: V9
Physical Name: /opt/tom
36
37
38
8 The SAS System Tuesday, October 25, 2022 09:11:00 PM
39
>>> sd = sas.sasdata('mvtpra4','x')
>>> sd; sd.obs()
Libref = x
Table = mvtpra4
Dsopts = {}
Results = Pandas
7905149
>>> sd.dsopts={'obs':1000000}
>>> sd; sd.obs()
Libref = x
Table = mvtpra4
Dsopts = {'obs': 1000000}
Results = Pandas
1000000
>>> tracemalloc.start()
>>> df = sd.to_df()
11.1G res
>>> #df = pandas.read_csv('/opt/tom/mvtpra4_1m.csv')
>>>
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>>
>>> 11.1G res
File "<stdin>", line 1
11.1G res
^
SyntaxError: invalid syntax
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 281 entries, SCT to MAJO_FRANC_CLIENT
dtypes: datetime64[ns](32), float64(62), object(187)
memory usage: 2.1+ GB
>>>
>>> tot_saspy = 0
>>> for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_saspy += stat.size * stat.count
...
>>> print("saspy = ", str(tot_saspy))
saspy = 4637584
>>>
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
... #print(stat)
... tot_pandas += stat.size * stat.count
...
>>> print("pandas = ", str(tot_pandas))
pandas = 834484755739372
>>>
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 281 entries, SCT to MAJO_FRANC_CLIENT
dtypes: datetime64[ns](32), float64(62), object(187)
memory usage: 2.1+ GB
>>>
>>> tot_all = 0
>>> for stat in top_stats:
... #print(stat)
... tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all = 834537123615182
>>>
>>>
>>> 834537123615182 - 834484755739372
52367875810
>>> 52,367,875,810
(52, 367, 875, 810)
>>> 834,484,755,739,372
(834, 484, 755, 739, 372)
>>>
>>> df
[...]
[1000000 rows x 281 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 281 entries, SCT to MAJO_FRANC_CLIENT
dtypes: datetime64[ns](32), float64(62), object(187)
memory usage: 2.1+ GB
>>>
>>> df.to_csv("/opt/tom/mvtpra4_1m.csv") # use pandas to write it out to it's .csv file to read back in next
>>>
tom64-7> ll /opt/tom/mvtpra4_1m.csv
-rw-r--r-- 1 sastpw r&d 1101778094 Oct 25 23:43 /opt/tom/mvtpra4_1m.csv
tom64-7>
tom64-7> ll -h /opt/tom/mvtpra4_1m.csv
-rw-r--r-- 1 sastpw r&d 1.1G Oct 25 23:43 /opt/tom/mvtpra4_1m.csv
tom64-7>
tom64-7> python3
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import tracemalloc
>>> tracemalloc.start()
>>> #df = sd.to_df()
>>> df = pandas.read_csv('/opt/tom/mvtpra4_1m.csv')
<stdin>:1: DtypeWarning: Columns (11,21,29,39,57,58,59,62,63,67,73,79,81,82,83,88,89,91,94,96,
102,104,109,110,111,118,132,134,135,136,137,146,160,161,162,165,183,209,210,211,212,213,214,215,
216,217,221,222,227,237,261) have mixed types. Specify dtype option on import or set low_memory=False.
at 6G
>>>
>>> snapshot = tracemalloc.take_snapshot()
11.4G
>>> top_stats = snapshot.statistics('lineno')
>>>
>>> at 6G
File "<stdin>", line 1
at 6G
^
SyntaxError: invalid syntax
>>> 11.4G
File "<stdin>", line 1
11.4G
^
SyntaxError: invalid syntax
>>>
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
#print(stat)
tot_pandas += stat.size * stat.count
... #print(stat)
... tot_pandas += stat.size * stat.count
...
print("pandas = ", str(tot_pandas))
>>> print("pandas = ", str(tot_pandas))
tot_all = 0
pandas = 976121237536279
>>>
for stat in top_stats:
#print(stat)
>>> df.info()
tot_all += stat.size * stat.count
print("all = ", str(tot_all))
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 282 entries, Unnamed: 0 to MAJO_FRANC_CLIENT
dtypes: float64(91), int64(28), object(163)
memory usage: 2.1+ GB
>>>
>>>
>>> tot_all = 0
>>> for stat in top_stats:
... #print(stat)
... tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all = 50394662513462493
>>>
>>>
>>> 50,394,662,513,462,493
(50, 394, 662, 513, 462, 493)
>>>
>>> 50394662513462493 - 976121237536279
49418541275926214
>>> 49,418,541,275,926,214
(49, 418, 541, 275, 926, 214)
>>>
>>> 976,121,237,536,279
(976, 121, 237, 536, 279)
>>>
@AurelienThiery we're you able to see the same thing on your end? I'm at a loss as to how I can influence Pandas to consume less memory while it instantiates the dataframe. I did try specifying different engine= values. Python made it way worse, C is what I use, and the default. And the new experimental pyarrow did keep memory way down but ran so long I just killed it. I didn't see anything else that would help.
Any thoughts? I'm not sure what I can do.
Thanks, Tom
Hey @AurelienThiery, I'm gonna close this, just to clean things up. I can't see anything I can do to make Pandas not use so much memory, and that's the same when saspy isn't even involved. Let me know if you have any other thoughts, or want to investigate more.
Thanks, Tom
Hi Tom, I 'm coming back to you with other bug (still present in last version). When i convert "big" SAS Data (6 GB on the disk : 7 905 149 lines, 281 columns) to pandas dataframe, the operation takes all the server's RAM.
I run this code (with docker on linux server):
_import pandas as pd import saspy import time print('Ouverture de la session SAS sur le serveur...') sas = saspy.SASsession(cfgfile='/opt/sascfg_personal.py') print('Allocation de la bibliothèque...') sas.saslib(libref='mylib', path='H:\Infocentre\DSIALI\01_Datawares\Date\data2022\data0922') print('Chargement de la table SAS dans un dataframe...') start = time.time() sd = sas.sasdata('mvtpra4', 'mylib') print("Conversion de l'objet SAS Data en Dataframe...") df = sd.todf() print('Temps de chargement = ', time.time()-start) print(df.head())
The step « df = sd.to_df() » (to convert sas data in pandas dataframe object) finished in error and all server RAM was full !! (64GB).
I put in this ticket the log file and a file showing RAM evolution (parallel session on the same server).
Thanks. Aurélien
log.txt RAM_evolution.txt