sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy
Other
367 stars 149 forks source link

Convert big SAS Data to pandas dataframe take all RAM #494

Closed AurelienThiery closed 1 year ago

AurelienThiery commented 1 year ago

Hi Tom, I 'm coming back to you with other bug (still present in last version). When i convert "big" SAS Data (6 GB on the disk : 7 905 149 lines, 281 columns) to pandas dataframe, the operation takes all the server's RAM.

I run this code (with docker on linux server):

_import pandas as pd import saspy import time print('Ouverture de la session SAS sur le serveur...') sas = saspy.SASsession(cfgfile='/opt/sascfg_personal.py') print('Allocation de la bibliothèque...') sas.saslib(libref='mylib', path='H:\Infocentre\DSIALI\01_Datawares\Date\data2022\data0922') print('Chargement de la table SAS dans un dataframe...') start = time.time() sd = sas.sasdata('mvtpra4', 'mylib') print("Conversion de l'objet SAS Data en Dataframe...") df = sd.todf() print('Temps de chargement = ', time.time()-start) print(df.head())

The step « df = sd.to_df() » (to convert sas data in pandas dataframe object) finished in error and all server RAM was full !! (64GB).

I put in this ticket the log file and a file showing RAM evolution (parallel session on the same server).

Thanks. Aurélien

log.txt RAM_evolution.txt

tomweber-sas commented 1 year ago

hey @AurelienThiery, I won't be back in the office till Monday. But I'll take a look at this then! Thanks, Tom

tomweber-sas commented 1 year ago

Hey @AurelienThiery, I'm back and have been looking at the traceback and my code to see if I can account for any kind of memory leak, but I don't see anything. Since this is just filling up mem until it terminates the process, we can't see anything that might help. Have you tried anything diagnostic? Like using obs to only load half or a quarter of the table and then (assuming that worked), look at the log to see if there was anything out of the ordinary? And see how much memory Pandas uses for however much you loaded?

To explain the traceback a little, In my sd2df() once I've figured out all the metadata about the SAS dataset, and I start to stream the data, I call pandas read_csv (really, read delimited data with a ton of options you control) and pass it a filelike object (my socket reading object), and pandas just calls my read() (as if it were read() from a file), reading the stream and creating the dataframe from the data it's reading. I'm not seeing anything in my code that would be keeping any of the data around (helping fill memory while pandas is trying to build the df). So, I'm not sure why pandas is running out of memory. Though I have no idea how it uses mem while building the df, or how much mem or the objects take up for the df. I'll try to see if I can figure a way to debug the memory consumption. But, I think the next best thig is to get a better idea of if there's something going wrong for your case; like the other where there was an error in SAS but all the data transfer after filled the SASLOG and that caused the OOM.

So, can you try running this with less that the whole table so we can see the log after and see how much memory is really used for pandas to create the df? Work it from the other direction and see what we see. Start small and build up, seeing if the transient mem grows unusually or consistently. And see the SASLOG and see if the final mem for the df is less than the transient mem used while creating it.

Thanks, Tom

tomweber-sas commented 1 year ago

I was able to use tracemalloc to get a look at the memory consumption for the sd2df() call, for a test case I'm playing with. Pandas consumes 2 times the mem of the final dataframe while building it (I've seen it hit 3 times consumption too), while the memory accounted to saspy is less then 1M. This is with a sas dataset that's about 1G on disk (115 numeric columns and 1M rows). I'll see about running it on bigger data too, but here's the results from what I was doing. You can try the same on your test case (smaller data that will finish instead of abend the python process).

import tracemalloc
>>> sd = sas.sasdata('saspy1g','x')
>>> sd; sd.obs()
Libref  = x
Table   = saspy1g
Dsopts  = {}
Results = Pandas

1000000
>>> tracemalloc.start()
>>> df = sd.to_df()
>>>
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>> tot_saspy = 0
>>> for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
...    #print(stat)
...    tot_saspy += stat.size * stat.count
...
>>> print("saspy = ", str(tot_saspy))
saspy =  830705
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
...    #print(stat)
...    tot_pandas += stat.size * stat.count
...
>>> print("pandas = ", str(tot_pandas))
pandas =  1844859927
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 115 entries, xIn1 to yBinary
dtypes: float64(115)
memory usage: 877.4 MB
>>>
>>> tot_all = 0
>>> for stat in top_stats:
...    #print(stat)
...    tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all =  1873113836
>>>
tomweber-sas commented 1 year ago

Ok, so I'm learning about this tracemalloc module as I go. Clearly, it's not keeping a running total of memory consumption; it's each malloc (with no subtracting for free's). In this next test case I ran w/ a different 1G dataset that has a mix of char and num and datetimes. This made an enormous difference in the stats. but clearly the 'total' I'm getting from the trace records isn't a cumulative number, sine it shows 18,165,368,718,822 as sum of all the alloc's. Clearly, it was freeing most of this along the way too. I don't see that this module can to a running total, so that's still best observed at the OS layer looking at process statistics.

I did run the 4G version of my previous dataset (all doubles), and it took 9G (total allocs over the duration), though I observed over 2G for the process, just by watching top from another terminal for that python process.

None of this is definitive, but it is showing significant, and kinda strange, memory usage creating the dataframe. In this case below, with different data coming over, saspy had 1.3G of allocations over the duration, while pandas had 18,165G of allocations over the duration. So, there's no telling from this how much at any one time, but clearly, pandas is allocating and reallocating no end of objects and mem while creating the dataframe.

So, again, I think trying this on your data, with reduced number of observations; incrementing and looking at memory consumption, while trying increasingly larger number of observations (and look at the SAS log in case there's something going on there) is the course to try. How pandas uses memory seems to be all over the place from just these couple tries I ran with different kinds of data. The mixed data (not just all doubles) seemed to explode the amount of allocations.

tom64-7> python3
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> import tracemalloc
>>>
>>> sas = saspy.SASsession(cfgname='iomj', iomhost='localhost', authkey='unix')
SAS Connection established. Subprocess id is 113946

>>> sas
Access Method         = IOM
SAS Config name       = iomj
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_work1A5D0001BB01_tom64-7/SAS_workC33A0001BB01_tom64-7/
SAS Version           = 9.04.01M8P10232022
SASPy Version         = 4.3.5
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 113409

>>>
>>> sas.saslib('x', path='/opt/tom')

7                                                          The SAS System                       Monday, October 24, 2022 12:47:00 PM

34
35         libname x    '/opt/tom'  ;
NOTE: Libref X was successfully assigned as follows:
      Engine:        V9
      Physical Name: /opt/tom
36
37
38

8                                                          The SAS System                       Monday, October 24, 2022 12:47:00 PM

39
>>> sd = sas.sasdata('chris','x')
>>> sd; sd.obs()
Libref  = x
Table   = chris
Dsopts  = {}
Results = Pandas

50000

>>> sd.contents('text')

                                                           The SAS System                   Monday, October 24, 2022 12:19:00 PM   4

                                                       The CONTENTS Procedure

             Data Set Name        X.CHRIS                                                  Observations          50000
             Member Type          DATA                                                     Variables             139
             Engine               V9                                                       Indexes               0
             Created              10/04/2017 10:34:35                                      Observation Length    20368
             Last Modified        10/04/2017 10:34:35                                      Deleted Observations  0
             Protection                                                                    Compressed            NO
             Data Set Type                                                                 Sorted                NO
             Label
             Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
             Encoding             latin1  Western (ISO)

                                                 Engine/Host Dependent Information

                                        Data Set Page Size          245760
                                        Number of Data Set Pages    4167
                                        First Data Page             1
                                        Max Obs per Page            12
                                        Obs in First Data Page      10
                                        Number of Data Set Repairs  0
                                        Filename                    /opt/tom/chris.sas7bdat
                                        Release Created             9.0401M4
                                        Host Created                X64_10PRO
                                        Inode Number                412417536
                                        Access Permission           rwxr-xr-x
                                        Owner Name                  sastpw
                                        File Size                   977MB
                                        File Size (bytes)           1024327680

                                            Alphabetic List of Variables and Attributes

             #    Variable                      Type     Len    Format         Informat       Label

            22    ACCRUED_MKT_ADJ_PCT           Num        8    4.             4.             ACCRUED_MKT_ADJ_PCT
            23    ACCRUED_PERCENT               Num        8    4.             4.             ACCRUED_PERCENT
            63    ADDRESS1                      Char     100    $100.          $100.          ADDRESS1
            64    ADDRESS2                      Char     100    $100.          $100.          ADDRESS2
            65    ADDRESS3                      Char     100    $100.          $100.          ADDRESS3
            66    ADDRESS4                      Char     100    $100.          $100.          ADDRESS4
           106    AIR                           Char      12    $12.           $12.           AIR
            14    ALL_GRADES                    Char    1000    $1000.         $1000.         ALL_GRADES
            99    ALL_INTERIOR_FINISHES         Char     500    $500.          $500.          ALL_INTERIOR_FINISHES
            18    ALL_LAND_DESCS                Char    1000    $1000.         $1000.         ALL_LAND_DESCS
            16    ALL_LAND_RATES                Char    1000    $1000.         $1000.         ALL_LAND_RATES
           101    ALL_MEZZANINE_FINISHES        Char     500    $500.          $500.          ALL_MEZZANINE_FINISHES
           108    ALL_ROOF_FLOOR_SYSTEM         Char     500    $500.          $500.          ALL_ROOF_FLOOR_SYSTEM
           127    ALL_ZONINGS                   Char     500    $500.          $500.          ALL_ZONINGS
            95    ATTIC_FINISH                  Char      12    $12.           $12.           ATTIC_FINISH
            96    ATTIC_FINISH_PERCENT          Num        8    4.             4.             ATTIC_FINISH_PERCENT
             3    AUT_SNAPSHOT_DATE             Num        8    DATETIME20.    DATETIME20.    AUT_SNAPSHOT_DATE
            92    BASEMENT                      Char      12    $12.           $12.           BASEMENT
            94    BASEMENT_FINISH               Char      25    $25.           $25.           BASEMENT_FINISH
            93    BASEMENT_PERCENT              Num        8    4.             4.             BASEMENT_PERCENT
           128    BATHS                         Char      25    $25.           $25.           BATHS
            68    BILLING_CLASS                 Num        8                                  BILLING_CLASS
            69    BILLING_CLASS_DESC            Char      20    $20.           $20.           BILLING_CLASS_DESC
            52    BLDG_USE                      Char       8    $8.            $8.            BLDG_USE
            53    BLDG_USE_DESC                 Char      20    $20.           $20.           BLDG_USE_DESC
            37    BLDG_VALUE_ASSD               Num        8    13.            13.            BLDG_VALUE_ASSD
            21    BMC_ADJUSTMENTS               Char    1000    $1000.         $1000.         BMC_ADJUSTMENTS
           126    CONDO_FACTORS                 Char     100    $100.          $100.          CONDO_FACTORS
            41    CORP_LIMIT                    Char       8    $8.            $8.            CORP_LIMIT
            42    CORP_LIMIT_DESC               Char      50    $50.           $50.           CORP_LIMIT_DESC
            39    DEEDED_ACREAGE                Num        8    12.4           12.4           DEEDED_ACREAGE
            57    DEED_BOOK                     Char       6    $6.            $6.            DEED_BOOK
            59    DEED_DATE                     Num        8    DATETIME20.    DATETIME20.    DEED_DATE
            58    DEED_PAGE                     Char       6    $6.            $6.            DEED_PAGE
            27    DEPRECIATION_RATING           Char       8    $8.            $8.            DEPRECIATION_RATING
            54    DESIGN_STYLE                  Char       8    $8.            $8.            DESIGN_STYLE
            55    DESIGN_STYLE_DESC             Char      20    $20.           $20.           DESIGN_STYLE_DESC
            71    DIRECTIONAL_PREFIX            Char       2    $2.            $2.            DIRECTIONAL_PREFIX
            74    DIRECTIONAL_SUFFIX            Char       2    $2.            $2.            DIRECTIONAL_SUFFIX
             8    EFFECTIVE_YEAR                Num        8    5.             5.             EFFECTIVE_YEAR
           134    EXCLUSION_VALUE               Num        8    13.            13.            EXCLUSION_VALUE
            48    EXEMPT_STATUS                 Char       8    $8.            $8.            EXEMPT_STATUS
            49    EXEMPT_STATUS_DESC            Char      20    $20.           $20.           EXEMPT_STATUS_DESC
           135    EXEMPT_VALUE                  Num        8    13.            13.            EXEMPT_VALUE
            97    EXTERIOR_FINISH               Char      12    $12.           $12.           EXTERIOR_FINISH
           129    FIREPLACE_TYPE1               Num        8    4.             4.             FIREPLACE_TYPE1
           130    FIREPLACE_TYPE2               Num        8    4.             4.             FIREPLACE_TYPE2
            46    FIRE_DISTRICT                 Char       8    $8.            $8.            FIRE_DISTRICT
            62    FULL_ADDRESS                  Char    1000    $1000.         $1000.         FULL_ADDRESS
            13    GRADE                         Char       8    $8.            $8.            GRADE
           105    HEAT                          Char      12    $12.           $12.           HEAT
           137    HISTDEF_VALUE                 Num        8    13.            13.            HISTDEF_VALUE
           110    IGNORE_UNITS                  Char       3    $3.            $3.            IGNORE_UNITS
           102    INCOME_INDICATED_VALUE        Num        8    14.2           14.2           INCOME_INDICATED_VALUE
           104    INCOME_PERS_PROPERTY_VALUE    Num        8    13.2           13.2           INCOME_PERS_PROPERTY_VALUE
           103    INCOME_RESIDUAL_LAND_VALUE    Num        8    13.2           13.2           INCOME_RESIDUAL_LAND_VALUE
            98    INTERIOR_FINISH               Char      25    $25.           $25.           INTERIOR_FINISH
           136    LANDDEF_VALUE                 Num        8    13.            13.            LANDDEF_VALUE
            50    LAND_CLASS                    Char       8    $8.            $8.            LAND_CLASS
            51    LAND_CLASS_DESC               Char      20    $20.           $20.           LAND_CLASS_DESC
            19    LAND_CONDITIONS               Char    1000    $1000.         $1000.         LAND_CONDITIONS
            17    LAND_DESC                     Char       8    $8.            $8.            LAND_DESC
            20    LAND_INFLUENCES               Char    1000    $1000.         $1000.         LAND_INFLUENCES
            15    LAND_RATE                     Num        8    16.2           16.2           LAND_RATE
            11    LAND_SALE_DATE                Num        8    DATETIME20.    DATETIME20.    LAND_SALE_DATE
             9    LAND_SALE_PRICE               Num        8    18.2           18.2           LAND_SALE_PRICE
            83    LAND_SALE_PRICE_PER_ACRE      Num        8    11.            11.            LAND_SALE_PRICE_PER_ACRE
            80    LAND_SALE_PRICE_PER_SQ_FT     Num        8    12.2           12.2           LAND_SALE_PRICE_PER_SQ_FT
            84    LAND_SALE_RATIO               Num        8    12.2           12.2           LAND_SALE_RATIO
            86    LAND_TO_BLDG_RATIO            Num        8    12.2           12.2           LAND_TO_BLDG_RATIO
            36    LAND_VALUE_ASSD               Num        8    13.            13.            LAND_VALUE_ASSD
            82    LAND_VALUE_PER_ACRE           Num        8    11.            11.            LAND_VALUE_PER_ACRE
            81    LAND_VALUE_PER_SQ_FT          Num        8    12.2           12.2           LAND_VALUE_PER_SQ_FT
            90    LATEST_LAND_SALE_DATE         Num        8    DATETIME20.    DATETIME20.    LATEST_LAND_SALE_DATE
            88    LATEST_LAND_SALE_PRICE        Num        8    18.2           18.2           LATEST_LAND_SALE_PRICE
            91    LATEST_PACKAGE_SALE_DATE      Num        8    DATETIME20.    DATETIME20.    LATEST_PACKAGE_SALE_DATE
            89    LATEST_PACKAGE_SALE_PRICE     Num        8    18.2           18.2           LATEST_PACKAGE_SALE_PRICE
            77    LOCID                         Num        8    7.             7.             LOCID
            25    MAP_BOOK                      Char       6    $6.            $6.            MAP_BOOK
            26    MAP_PAGE                      Char       6    $6.            $6.            MAP_PAGE
           100    MEZZANINE_FINISH              Char      25    $25.           $25.           MEZZANINE_FINISH
           139    NET_BILLABLE_VALUE            Num        8    13.            13.            NET_BILLABLE_VALUE
            34    OLD_PARCEL_BLOCK              Char       5    $5.            $5.            OLD_PARCEL_BLOCK
            35    OLD_PARCEL_LOT                Char       4    $4.            $4.            OLD_PARCEL_LOT
            33    OLD_PARCEL_MAP                Char       5    $5.            $5.            OLD_PARCEL_MAP
           112    OUTBUILDING_INFO              Char    2500    $2500.         $2500.         OUTBUILDING_INFO
           113    OUTBUILDING_TYPE_AND_GRADE    Char    2000    $2000.         $2000.         OUTBUILDING_TYPE_AND_GRADE
           114    OUTBUILDING_TYPE_AND_RATE     Char    2000    $2000.         $2000.         OUTBUILDING_TYPE_AND_RATE
           115    OUTBUILDING_TYPE_AND_VALUE    Char    2000    $2000.         $2000.         OUTBUILDING_TYPE_AND_VALUE
            60    OWNER1                        Char      50    $50.           $50.           OWNER1
            61    OWNER2                        Char      50    $50.           $50.           OWNER2
            12    PACKAGE_SALE_DATE             Num        8    DATETIME20.    DATETIME20.    PACKAGE_SALE_DATE
            79    PACKAGE_SALE_PER_SQ_FT        Num        8    11.            11.            PACKAGE_SALE_PER_SQ_FT
            10    PACKAGE_SALE_PRICE            Num        8    18.2           18.2           PACKAGE_SALE_PRICE
            85    PACKAGE_SALE_RATIO            Num        8    12.2           12.2           PACKAGE_SALE_RATIO
             1    PARCEL_NUM                    Char      20    $20.           $20.           PARCEL_NUM
             2    PARCEL_PK                     Num        8    11.            11.            PARCEL_PK
            32    PARCEL_STATUS                 Char       8    $8.            $8.            PARCEL_STATUS
             5    PIN_EXT                       Char       4    $4.            $4.            PIN_EXT
            28    PIN_MAP                       Char      10    $10.           $10.           PIN_MAP
             4    PIN_NUM                       Char      14    $14.           $14.           PIN_NUM
            43    PLANNING_JURISDICTION         Char       8    $8.            $8.            PLANNING_JURISDICTION
           132    PLANNING_LANDUSE_ABBREV       Char       3    $3.            $3.            PLANNING_LANDUSE_ABBREV
           131    PLANNING_LANDUSE_CODE         Char       1    $1.            $1.            PLANNING_LANDUSE_CODE
           133    PLANNING_LANDUSE_DESC         Char      80    $80.           $80.           PLANNING_LANDUSE_DESC
            67    PROPERTY_DESC                 Char     250    $250.          $250.          PROPERTY_DESC
            87    PROPERTY_INDEX                Char      75    $75.           $75.           PROPERTY_INDEX
           119    REVAL_ESTIMATED_VALUE         Num        8    11.            11.            REVAL_ESTIMATED_VALUE
           118    REVAL_FINAL_REVIEW_STATUS     Char       3    $3.            $3.            REVAL_FINAL_REVIEW_STATUS
           120    REVAL_REVIEWED_BY             Char      10    $10.           $10.           REVAL_REVIEWED_BY
           121    REVAL_REVIEWED_DATE           Num        8    DATETIME20.    DATETIME20.    REVAL_REVIEWED_DATE
           122    REVAL_TOTAL_PROP_VALUE        Num        8    11.            11.            REVAL_TOTAL_PROP_VALUE
           123    REVAL_VCS_REVIEW_STATUS       Char     250    $250.          $250.          REVAL_VCS_REVIEW_STATUS
           107    ROOF_FLOOR_SYSTEM             Char      12    $12.           $12.           ROOF_FLOOR_SYSTEM
            47    SPECIAL_DISTRICT              Char       8    $8.            $8.            SPECIAL_DISTRICT
           111    SPECIAL_PROPERTY_FLAGS        Char     500    $500.          $500.          SPECIAL_PROPERTY_FLAGS
            31    STORY_HEIGHT                  Char       8    $8.            $8.            STORY_HEIGHT
            75    STREET_MISC                   Char       6    $6.            $6.            STREET_MISC
            72    STREET_NAME                   Char      25    $25.           $25.           STREET_NAME
            70    STREET_NUM                    Num        8    7.             7.             STREET_NUM
            73    STREET_TYPE                   Char       4    $4.            $4.            STREET_TYPE
           138    TAXRELIEF_VALUE               Num        8    13.            13.            TAXRELIEF_VALUE
             7    TOTAL_BUILDINGS               Num        8    4.             4.             TOTAL_BUILDINGS
            29    TOTAL_HEATED_AREA             Num        8                                  TOTAL_HEATED_AREA
           116    TOTAL_LANDLINES               Num        8    4.             4.             TOTAL_LANDLINES
            30    TOTAL_LIVING_UNITS            Num        8                                  TOTAL_LIVING_UNITS
            56    TOTAL_OBLDG_VALUE_ASSD        Num        8    10.            10.            TOTAL_OBLDG_VALUE_ASSD
           117    TOTAL_OUT_BUILDINGS           Num        8    4.             4.             TOTAL_OUT_BUILDINGS
            38    TOTAL_PROP_VALUE_ASSD         Num        8    13.            13.            TOTAL_PROP_VALUE_ASSD
            78    TOTAL_VALUE_PER_SQ_FT         Num        8    11.            11.            TOTAL_VALUE_PER_SQ_FT
           125    TOWNHOUSE_DETACHED            Char       3    $3.            $3.            TOWNHOUSE_DETACHED
           124    TOWNHOUSE_END_UNIT            Char       3    $3.            $3.            TOWNHOUSE_END_UNIT
            44    TOWNSHIP                      Char       8    $8.            $8.            TOWNSHIP
            45    TOWNSHIP_DESC                 Char      50    $50.           $50.           TOWNSHIP_DESC
             6    VCS                           Char       7    $7.            $7.            VCS
            40    YEAR_BUILT                    Num        8    5.             5.             YEAR_BUILT
            76    ZIP                           Char       5    $5.            $5.            ZIP
            24    ZONING                        Char      10    $10.           $10.           ZONING
           109    ZONING_CATEGORY               Char      20    $20.           $20.           ZONING_CATEGORY

>>> sd = sas.sasdata('chris','x')
>>> sd; sd.obs()
Libref  = x
Table   = chris
Dsopts  = {}
Results = Pandas

50000
>>>
>>> tracemalloc.start()
>>> df = sd.to_df()
>>>
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>>
>>>
>>> tot_saspy = 0
>>> for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
...    #print(stat)
...    tot_saspy += stat.size * stat.count
...
>>> print("saspy = ", str(tot_saspy))
saspy =  1351015
>>>
>>>
>>>
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
...    #print(stat)
...    tot_pandas += stat.size * stat.count
...
>>> print("pandas = ", str(tot_pandas))
pandas =  18165160304046
>>>
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 139 entries, PARCEL_NUM to NET_BILLABLE_VALUE
dtypes: datetime64[ns](7), float64(47), object(85)
memory usage: 53.0+ MB
>>>
>>>
>>> tot_all = 0
>>> for stat in top_stats:
...    #print(stat)
...    tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all =  18165776279127
>>>
>>>
AurelienThiery commented 1 year ago

Hi Tom, Ok i send you my SAS table today. Thanks for these explanations ! Aurélien

De : Tom Weber @.> Envoyé : lundi 24 octobre 2022 19:30 À : sassoftware/saspy @.> Cc : THIERY Aurelien @.>; Mention @.> Objet : Re: [sassoftware/saspy] Convert big SAS Data to pandas dataframe take all RAM (Issue #494)

Ok, so I'm learning about this tracemalloc module as I go. Clearly, it's not keeping a running total of memory consumption; it's each malloc (with no subtracting for free's). In this next test case I ran w/ a different 1G dataset that has a mix of char and num and datetimes. This made an enormous difference in the stats. but clearly the 'total' I'm getting from the trace records isn't a cumulative number, sine it shows 18,165,368,718,822 as sum of all the alloc's. Clearly, it was freeing most of this along the way too. I don't see that this module can to a running total, so that's still best observed at the OS layer looking at process statistics.

I did run the 4G version of my previous dataset (all doubles), and it took 9G (total allocs over the duration), though I observed over 2G for the process, just by watching top from another terminal for that python process.

None of this is definitive, but it is showing significant, and kinda strange, memory usage creating the dataframe. In this case below, with different data coming over, saspy had 1.3G of allocations over the duration, while pandas had 18,165G of allocations over the duration. So, there's no telling from this how much at any one time, but clearly, pandas is allocating and reallocating no end of objects and mem while creating the dataframe.

So, again, I think trying this on your data, with reduced number of observations; incrementing and looking at memory consumption, while trying increasingly larger number of observations (and look at the SAS log in case there's something going on there) is the course to try. How pandas uses memory seems to be all over the place from just these couple tries I ran with different kinds of data. The mixed data (not just all doubles) seemed to explode the amount of allocations.

tom64-7> python3

Python 3.9.12 (main, Apr 5 2022, 06:56:58)

[GCC 7.5.0] :: Anaconda, Inc. on linux

Type "help", "copyright", "credits" or "license" for more information.

import saspy

import tracemalloc

sas = saspy.SASsession(cfgname='iomj', iomhost='localhost', authkey='unix')

SAS Connection established. Subprocess id is 113946

sas

Access Method = IOM

SAS Config name = iomj

SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py

WORK Path = /sastmp/SAS_work1A5D0001BB01_tom64-7/SAS_workC33A0001BB01_tom64-7/

SAS Version = 9.04.01M8P10232022

SASPy Version = 4.3.5

Teach me SAS = False

Batch = False

Results = Pandas

SAS Session Encoding = utf-8

Python Encoding value = utf_8

SAS process Pid value = 113409

sas.saslib('x', path='/opt/tom')

7 The SAS System Monday, October 24, 2022 12:47:00 PM

34

35 libname x '/opt/tom' ;

NOTE: Libref X was successfully assigned as follows:

  Engine:        V9

  Physical Name: /opt/tom

36

37

38

8 The SAS System Monday, October 24, 2022 12:47:00 PM

39

sd = sas.sasdata('chris','x')

sd; sd.obs()

Libref = x

Table = chris

Dsopts = {}

Results = Pandas

50000

sd.contents('text')

                                                       The SAS System                   Monday, October 24, 2022 12:19:00 PM   4

                                                   The CONTENTS Procedure

         Data Set Name        X.CHRIS                                                  Observations          50000

         Member Type          DATA                                                     Variables             139

         Engine               V9                                                       Indexes               0

         Created              10/04/2017 10:34:35                                      Observation Length    20368

         Last Modified        10/04/2017 10:34:35                                      Deleted Observations  0

         Protection                                                                    Compressed            NO

         Data Set Type                                                                 Sorted                NO

         Label

         Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64

         Encoding             latin1  Western (ISO)

                                             Engine/Host Dependent Information

                                    Data Set Page Size          245760

                                    Number of Data Set Pages    4167

                                    First Data Page             1

                                    Max Obs per Page            12

                                    Obs in First Data Page      10

                                    Number of Data Set Repairs  0

                                    Filename                    /opt/tom/chris.sas7bdat

                                    Release Created             9.0401M4

                                    Host Created                X64_10PRO

                                    Inode Number                412417536

                                    Access Permission           rwxr-xr-x

                                    Owner Name                  sastpw

                                    File Size                   977MB

                                    File Size (bytes)           1024327680

                                        Alphabetic List of Variables and Attributes

         #    Variable                      Type     Len    Format         Informat       Label

        22    ACCRUED_MKT_ADJ_PCT           Num        8    4.             4.             ACCRUED_MKT_ADJ_PCT

        23    ACCRUED_PERCENT               Num        8    4.             4.             ACCRUED_PERCENT

        63    ADDRESS1                      Char     100    $100.          $100.          ADDRESS1

        64    ADDRESS2                      Char     100    $100.          $100.          ADDRESS2

        65    ADDRESS3                      Char     100    $100.          $100.          ADDRESS3

        66    ADDRESS4                      Char     100    $100.          $100.          ADDRESS4

       106    AIR                           Char      12    $12.           $12.           AIR

        14    ALL_GRADES                    Char    1000    $1000.         $1000.         ALL_GRADES

        99    ALL_INTERIOR_FINISHES         Char     500    $500.          $500.          ALL_INTERIOR_FINISHES

        18    ALL_LAND_DESCS                Char    1000    $1000.         $1000.         ALL_LAND_DESCS

        16    ALL_LAND_RATES                Char    1000    $1000.         $1000.         ALL_LAND_RATES

       101    ALL_MEZZANINE_FINISHES        Char     500    $500.          $500.          ALL_MEZZANINE_FINISHES

       108    ALL_ROOF_FLOOR_SYSTEM         Char     500    $500.          $500.          ALL_ROOF_FLOOR_SYSTEM

       127    ALL_ZONINGS                   Char     500    $500.          $500.          ALL_ZONINGS

        95    ATTIC_FINISH                  Char      12    $12.           $12.           ATTIC_FINISH

        96    ATTIC_FINISH_PERCENT          Num        8    4.             4.             ATTIC_FINISH_PERCENT

         3    AUT_SNAPSHOT_DATE             Num        8    DATETIME20.    DATETIME20.    AUT_SNAPSHOT_DATE

        92    BASEMENT                      Char      12    $12.           $12.           BASEMENT

        94    BASEMENT_FINISH               Char      25    $25.           $25.           BASEMENT_FINISH

        93    BASEMENT_PERCENT              Num        8    4.             4.             BASEMENT_PERCENT

       128    BATHS                         Char      25    $25.           $25.           BATHS

        68    BILLING_CLASS                 Num        8                                  BILLING_CLASS

        69    BILLING_CLASS_DESC            Char      20    $20.           $20.           BILLING_CLASS_DESC

        52    BLDG_USE                      Char       8    $8.            $8.            BLDG_USE

        53    BLDG_USE_DESC                 Char      20    $20.           $20.           BLDG_USE_DESC

        37    BLDG_VALUE_ASSD               Num        8    13.            13.            BLDG_VALUE_ASSD

        21    BMC_ADJUSTMENTS               Char    1000    $1000.         $1000.         BMC_ADJUSTMENTS

       126    CONDO_FACTORS                 Char     100    $100.          $100.          CONDO_FACTORS

        41    CORP_LIMIT                    Char       8    $8.            $8.            CORP_LIMIT

        42    CORP_LIMIT_DESC               Char      50    $50.           $50.           CORP_LIMIT_DESC

        39    DEEDED_ACREAGE                Num        8    12.4           12.4           DEEDED_ACREAGE

        57    DEED_BOOK                     Char       6    $6.            $6.            DEED_BOOK

        59    DEED_DATE                     Num        8    DATETIME20.    DATETIME20.    DEED_DATE

        58    DEED_PAGE                     Char       6    $6.            $6.            DEED_PAGE

        27    DEPRECIATION_RATING           Char       8    $8.            $8.            DEPRECIATION_RATING

        54    DESIGN_STYLE                  Char       8    $8.            $8.            DESIGN_STYLE

        55    DESIGN_STYLE_DESC             Char      20    $20.           $20.           DESIGN_STYLE_DESC

        71    DIRECTIONAL_PREFIX            Char       2    $2.            $2.            DIRECTIONAL_PREFIX

        74    DIRECTIONAL_SUFFIX            Char       2    $2.            $2.            DIRECTIONAL_SUFFIX

         8    EFFECTIVE_YEAR                Num        8    5.             5.             EFFECTIVE_YEAR

       134    EXCLUSION_VALUE               Num        8    13.            13.            EXCLUSION_VALUE

        48    EXEMPT_STATUS                 Char       8    $8.            $8.            EXEMPT_STATUS

        49    EXEMPT_STATUS_DESC            Char      20    $20.           $20.           EXEMPT_STATUS_DESC

       135    EXEMPT_VALUE                  Num        8    13.            13.            EXEMPT_VALUE

        97    EXTERIOR_FINISH               Char      12    $12.           $12.           EXTERIOR_FINISH

       129    FIREPLACE_TYPE1               Num        8    4.             4.             FIREPLACE_TYPE1

       130    FIREPLACE_TYPE2               Num        8    4.             4.             FIREPLACE_TYPE2

        46    FIRE_DISTRICT                 Char       8    $8.            $8.            FIRE_DISTRICT

        62    FULL_ADDRESS                  Char    1000    $1000.         $1000.         FULL_ADDRESS

        13    GRADE                         Char       8    $8.            $8.            GRADE

       105    HEAT                          Char      12    $12.           $12.           HEAT

       137    HISTDEF_VALUE                 Num        8    13.            13.            HISTDEF_VALUE

       110    IGNORE_UNITS                  Char       3    $3.            $3.            IGNORE_UNITS

       102    INCOME_INDICATED_VALUE        Num        8    14.2           14.2           INCOME_INDICATED_VALUE

       104    INCOME_PERS_PROPERTY_VALUE    Num        8    13.2           13.2           INCOME_PERS_PROPERTY_VALUE

       103    INCOME_RESIDUAL_LAND_VALUE    Num        8    13.2           13.2           INCOME_RESIDUAL_LAND_VALUE

        98    INTERIOR_FINISH               Char      25    $25.           $25.           INTERIOR_FINISH

       136    LANDDEF_VALUE                 Num        8    13.            13.            LANDDEF_VALUE

        50    LAND_CLASS                    Char       8    $8.            $8.            LAND_CLASS

        51    LAND_CLASS_DESC               Char      20    $20.           $20.           LAND_CLASS_DESC

        19    LAND_CONDITIONS               Char    1000    $1000.         $1000.         LAND_CONDITIONS

        17    LAND_DESC                     Char       8    $8.            $8.            LAND_DESC

        20    LAND_INFLUENCES               Char    1000    $1000.         $1000.         LAND_INFLUENCES

        15    LAND_RATE                     Num        8    16.2           16.2           LAND_RATE

        11    LAND_SALE_DATE                Num        8    DATETIME20.    DATETIME20.    LAND_SALE_DATE

         9    LAND_SALE_PRICE               Num        8    18.2           18.2           LAND_SALE_PRICE

        83    LAND_SALE_PRICE_PER_ACRE      Num        8    11.            11.            LAND_SALE_PRICE_PER_ACRE

        80    LAND_SALE_PRICE_PER_SQ_FT     Num        8    12.2           12.2           LAND_SALE_PRICE_PER_SQ_FT

        84    LAND_SALE_RATIO               Num        8    12.2           12.2           LAND_SALE_RATIO

        86    LAND_TO_BLDG_RATIO            Num        8    12.2           12.2           LAND_TO_BLDG_RATIO

        36    LAND_VALUE_ASSD               Num        8    13.            13.            LAND_VALUE_ASSD

        82    LAND_VALUE_PER_ACRE           Num        8    11.            11.            LAND_VALUE_PER_ACRE

        81    LAND_VALUE_PER_SQ_FT          Num        8    12.2           12.2           LAND_VALUE_PER_SQ_FT

        90    LATEST_LAND_SALE_DATE         Num        8    DATETIME20.    DATETIME20.    LATEST_LAND_SALE_DATE

        88    LATEST_LAND_SALE_PRICE        Num        8    18.2           18.2           LATEST_LAND_SALE_PRICE

        91    LATEST_PACKAGE_SALE_DATE      Num        8    DATETIME20.    DATETIME20.    LATEST_PACKAGE_SALE_DATE

        89    LATEST_PACKAGE_SALE_PRICE     Num        8    18.2           18.2           LATEST_PACKAGE_SALE_PRICE

        77    LOCID                         Num        8    7.             7.             LOCID

        25    MAP_BOOK                      Char       6    $6.            $6.            MAP_BOOK

        26    MAP_PAGE                      Char       6    $6.            $6.            MAP_PAGE

       100    MEZZANINE_FINISH              Char      25    $25.           $25.           MEZZANINE_FINISH

       139    NET_BILLABLE_VALUE            Num        8    13.            13.            NET_BILLABLE_VALUE

        34    OLD_PARCEL_BLOCK              Char       5    $5.            $5.            OLD_PARCEL_BLOCK

        35    OLD_PARCEL_LOT                Char       4    $4.            $4.            OLD_PARCEL_LOT

        33    OLD_PARCEL_MAP                Char       5    $5.            $5.            OLD_PARCEL_MAP

       112    OUTBUILDING_INFO              Char    2500    $2500.         $2500.         OUTBUILDING_INFO

       113    OUTBUILDING_TYPE_AND_GRADE    Char    2000    $2000.         $2000.         OUTBUILDING_TYPE_AND_GRADE

       114    OUTBUILDING_TYPE_AND_RATE     Char    2000    $2000.         $2000.         OUTBUILDING_TYPE_AND_RATE

       115    OUTBUILDING_TYPE_AND_VALUE    Char    2000    $2000.         $2000.         OUTBUILDING_TYPE_AND_VALUE

        60    OWNER1                        Char      50    $50.           $50.           OWNER1

        61    OWNER2                        Char      50    $50.           $50.           OWNER2

        12    PACKAGE_SALE_DATE             Num        8    DATETIME20.    DATETIME20.    PACKAGE_SALE_DATE

        79    PACKAGE_SALE_PER_SQ_FT        Num        8    11.            11.            PACKAGE_SALE_PER_SQ_FT

        10    PACKAGE_SALE_PRICE            Num        8    18.2           18.2           PACKAGE_SALE_PRICE

        85    PACKAGE_SALE_RATIO            Num        8    12.2           12.2           PACKAGE_SALE_RATIO

         1    PARCEL_NUM                    Char      20    $20.           $20.           PARCEL_NUM

         2    PARCEL_PK                     Num        8    11.            11.            PARCEL_PK

        32    PARCEL_STATUS                 Char       8    $8.            $8.            PARCEL_STATUS

         5    PIN_EXT                       Char       4    $4.            $4.            PIN_EXT

        28    PIN_MAP                       Char      10    $10.           $10.           PIN_MAP

         4    PIN_NUM                       Char      14    $14.           $14.           PIN_NUM

        43    PLANNING_JURISDICTION         Char       8    $8.            $8.            PLANNING_JURISDICTION

       132    PLANNING_LANDUSE_ABBREV       Char       3    $3.            $3.            PLANNING_LANDUSE_ABBREV

       131    PLANNING_LANDUSE_CODE         Char       1    $1.            $1.            PLANNING_LANDUSE_CODE

       133    PLANNING_LANDUSE_DESC         Char      80    $80.           $80.           PLANNING_LANDUSE_DESC

        67    PROPERTY_DESC                 Char     250    $250.          $250.          PROPERTY_DESC

        87    PROPERTY_INDEX                Char      75    $75.           $75.           PROPERTY_INDEX

       119    REVAL_ESTIMATED_VALUE         Num        8    11.            11.            REVAL_ESTIMATED_VALUE

       118    REVAL_FINAL_REVIEW_STATUS     Char       3    $3.            $3.            REVAL_FINAL_REVIEW_STATUS

       120    REVAL_REVIEWED_BY             Char      10    $10.           $10.           REVAL_REVIEWED_BY

       121    REVAL_REVIEWED_DATE           Num        8    DATETIME20.    DATETIME20.    REVAL_REVIEWED_DATE

       122    REVAL_TOTAL_PROP_VALUE        Num        8    11.            11.            REVAL_TOTAL_PROP_VALUE

       123    REVAL_VCS_REVIEW_STATUS       Char     250    $250.          $250.          REVAL_VCS_REVIEW_STATUS

       107    ROOF_FLOOR_SYSTEM             Char      12    $12.           $12.           ROOF_FLOOR_SYSTEM

        47    SPECIAL_DISTRICT              Char       8    $8.            $8.            SPECIAL_DISTRICT

       111    SPECIAL_PROPERTY_FLAGS        Char     500    $500.          $500.          SPECIAL_PROPERTY_FLAGS

        31    STORY_HEIGHT                  Char       8    $8.            $8.            STORY_HEIGHT

        75    STREET_MISC                   Char       6    $6.            $6.            STREET_MISC

        72    STREET_NAME                   Char      25    $25.           $25.           STREET_NAME

        70    STREET_NUM                    Num        8    7.             7.             STREET_NUM

        73    STREET_TYPE                   Char       4    $4.            $4.            STREET_TYPE

       138    TAXRELIEF_VALUE               Num        8    13.            13.            TAXRELIEF_VALUE

         7    TOTAL_BUILDINGS               Num        8    4.             4.             TOTAL_BUILDINGS

        29    TOTAL_HEATED_AREA             Num        8                                  TOTAL_HEATED_AREA

       116    TOTAL_LANDLINES               Num        8    4.             4.             TOTAL_LANDLINES

        30    TOTAL_LIVING_UNITS            Num        8                                  TOTAL_LIVING_UNITS

        56    TOTAL_OBLDG_VALUE_ASSD        Num        8    10.            10.            TOTAL_OBLDG_VALUE_ASSD

       117    TOTAL_OUT_BUILDINGS           Num        8    4.             4.             TOTAL_OUT_BUILDINGS

        38    TOTAL_PROP_VALUE_ASSD         Num        8    13.            13.            TOTAL_PROP_VALUE_ASSD

        78    TOTAL_VALUE_PER_SQ_FT         Num        8    11.            11.            TOTAL_VALUE_PER_SQ_FT

       125    TOWNHOUSE_DETACHED            Char       3    $3.            $3.            TOWNHOUSE_DETACHED

       124    TOWNHOUSE_END_UNIT            Char       3    $3.            $3.            TOWNHOUSE_END_UNIT

        44    TOWNSHIP                      Char       8    $8.            $8.            TOWNSHIP

        45    TOWNSHIP_DESC                 Char      50    $50.           $50.           TOWNSHIP_DESC

         6    VCS                           Char       7    $7.            $7.            VCS

        40    YEAR_BUILT                    Num        8    5.             5.             YEAR_BUILT

        76    ZIP                           Char       5    $5.            $5.            ZIP

        24    ZONING                        Char      10    $10.           $10.           ZONING

       109    ZONING_CATEGORY               Char      20    $20.           $20.           ZONING_CATEGORY

sd = sas.sasdata('chris','x')

sd; sd.obs()

Libref = x

Table = chris

Dsopts = {}

Results = Pandas

50000

tracemalloc.start()

df = sd.to_df()

snapshot = tracemalloc.take_snapshot()

top_stats = snapshot.statistics('lineno')

tot_saspy = 0

for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:

... #print(stat)

... tot_saspy += stat.size * stat.count

...

print("saspy = ", str(tot_saspy))

saspy = 1351015

tot_pandas = 0

for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:

... #print(stat)

... tot_pandas += stat.size * stat.count

...

print("pandas = ", str(tot_pandas))

pandas = 18165160304046

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 50000 entries, 0 to 49999

Columns: 139 entries, PARCEL_NUM to NET_BILLABLE_VALUE

dtypes: datetime64ns, float64(47), object(85)

memory usage: 53.0+ MB

tot_all = 0

for stat in top_stats:

... #print(stat)

... tot_all += stat.size * stat.count

...

print("all = ", str(tot_all))

all = 18165776279127

— Reply to this email directly, view it on GitHubhttps://github.com/sassoftware/saspy/issues/494#issuecomment-1289360697, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD5A4KARM5UDTZWD4MJS2LLWE3BRLANCNFSM6AAAAAARJO2FPQ. You are receiving this because you were mentioned.Message ID: @.**@.>>

tomweber-sas commented 1 year ago

Got it! I'm in an all day thing today at work, but I will use this to investigate and see what I can find asap. Thanks! Tom

tomweber-sas commented 1 year ago

Well, I spent some time last night running some comparisons, using that tracemalloc in python and using top -d1 to observe cpu and memory consumption at the process level. tracemalloc output, from what I was trying, shows all of the allocations from the various code running but not free's, so the numbers, when added up, can only be interpreted relative to themselves as an indication of how much overall work (thrashing?) the method is doing with regards to allocating and freeing memory - object creation and reclamation. Seeing the numbers add up to Terabytes and Petabytes would seem crazy if you didn't take this into account.

The table below shows a comparison of using 1 million rows from your data set. I used sas.sd2df() to read it in (which was successful since I didn't run out of mem trying), then I used pandas to write the df out to a csv file (with it's defaults). I then read that csv file in, using pandas to compare the consumption that pandas used without me in the equation. That's about as equal as I could make this. I also ran sas.sd2df() and pandas.read_csv() from a fresh Python session, of course, so they would be as equal as can be also.

What's shown by this is that the dataframe, regardless of how it was created ends up taking 2.1G of memory once it's finished being created. In both cases, the highest resident memory consumed while creating the dataframe was over 11G. The total number of allocation by pandas (again, ignoring the free's) was .8/.9 Petabytes for both cases. I don't know pandas enough to do anything but speculate, but what this says to me is that there is a lot of object creation and re-creation going on while pandas is parsing the data and generating the dataframe. Also, it shows that it's pretty much equal regardless of if saspy is in the mix or not.

1M rows from mvtpra4 as either SAS dataset or csv file
one using saspy other only pandas by itself w/ it's own csv (created via: df.to_csv() )

                                                                these are from tracemalloc
                      final DF mem    process resident mem      saspy    pandas     other
                      (df.info())     (highest observed)
sas.sd2df()           2.1G            11.2G                     4.6G     834.5T     5.2G

pandas.read_csv()     2.1G            11.4G                     N/A      976.1T     49.418P

The next test I was going to do was to try to use tracemalloc in my code and try to do some of the example to try and get relative memory consumption at the pandas read level, hoping to try and get better info from that tool. However, since I observed the same memory consumption by pandas regardless of if it was from using sas.sd2df or pandas.read_csv, it seems moot. Pandas is consuming the memory, many times the final size of the dataframe. saspy made no appreciable difference in that.

So, I'm not sure that there's anything I can do to make this behave better. Just seems to be how pandas behaves. Here's the runs I did to get this information, for the table above.

tom64-7> python3
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> import tracemalloc
>>> sas = saspy.SASsession(cfgname='iomj', iomhost='localhost', authkey='unix')
SAS Connection established. Subprocess id is 128294

>>> sas
Access Method         = IOM
SAS Config name       = iomj
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_workBE2B0001F4D1_tom64-7/SAS_work009A0001F4D1_tom64-7/
SAS Version           = 9.04.01M8P10252022
SASPy Version         = 4.3.5
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 128209

>>> sas.saslib('x', path='/opt/tom')

7                                                          The SAS System                      Tuesday, October 25, 2022 09:11:00 PM

34
35         libname x    '/opt/tom'  ;
NOTE: Libref X was successfully assigned as follows:
      Engine:        V9
      Physical Name: /opt/tom
36
37
38

8                                                          The SAS System                      Tuesday, October 25, 2022 09:11:00 PM

39
>>> sd = sas.sasdata('mvtpra4','x')
>>> sd; sd.obs()
Libref  = x
Table   = mvtpra4
Dsopts  = {}
Results = Pandas

7905149
>>> sd.dsopts={'obs':1000000}
>>> sd; sd.obs()
Libref  = x
Table   = mvtpra4
Dsopts  = {'obs': 1000000}
Results = Pandas

1000000
>>> tracemalloc.start()
>>> df = sd.to_df()
11.1G res
>>> #df = pandas.read_csv('/opt/tom/mvtpra4_1m.csv')
>>>
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>>
>>> 11.1G res
  File "<stdin>", line 1
    11.1G res
        ^
SyntaxError: invalid syntax
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 281 entries, SCT to MAJO_FRANC_CLIENT
dtypes: datetime64[ns](32), float64(62), object(187)
memory usage: 2.1+ GB
>>>
>>> tot_saspy = 0
>>> for stat in [s for s in top_stats if 'saspy' in s.traceback._frames[0][0]]:
...    #print(stat)
...    tot_saspy += stat.size * stat.count
...
>>> print("saspy = ", str(tot_saspy))
saspy =  4637584
>>>
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
...    #print(stat)
...    tot_pandas += stat.size * stat.count
...
>>> print("pandas = ", str(tot_pandas))
pandas =  834484755739372
>>>
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 281 entries, SCT to MAJO_FRANC_CLIENT
dtypes: datetime64[ns](32), float64(62), object(187)
memory usage: 2.1+ GB
>>>
>>> tot_all = 0
>>> for stat in top_stats:
...    #print(stat)
...    tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all =  834537123615182
>>>
>>>
>>> 834537123615182 - 834484755739372
52367875810
>>> 52,367,875,810
(52, 367, 875, 810)
>>> 834,484,755,739,372
(834, 484, 755, 739, 372)
>>>

>>> df
[...]
[1000000 rows x 281 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 281 entries, SCT to MAJO_FRANC_CLIENT
dtypes: datetime64[ns](32), float64(62), object(187)
memory usage: 2.1+ GB
>>>
>>> df.to_csv("/opt/tom/mvtpra4_1m.csv")     # use pandas to write it out to it's .csv file to read back in next
>>>
tom64-7> ll /opt/tom/mvtpra4_1m.csv
-rw-r--r-- 1 sastpw r&d 1101778094 Oct 25 23:43 /opt/tom/mvtpra4_1m.csv
tom64-7>
tom64-7> ll -h /opt/tom/mvtpra4_1m.csv
-rw-r--r-- 1 sastpw r&d 1.1G Oct 25 23:43 /opt/tom/mvtpra4_1m.csv
tom64-7>

tom64-7> python3
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import tracemalloc
>>> tracemalloc.start()
>>> #df = sd.to_df()
>>> df = pandas.read_csv('/opt/tom/mvtpra4_1m.csv')
<stdin>:1: DtypeWarning: Columns (11,21,29,39,57,58,59,62,63,67,73,79,81,82,83,88,89,91,94,96,
102,104,109,110,111,118,132,134,135,136,137,146,160,161,162,165,183,209,210,211,212,213,214,215,
216,217,221,222,227,237,261) have mixed types. Specify dtype option on import or set low_memory=False.
at 6G
>>>
>>> snapshot = tracemalloc.take_snapshot()
11.4G
>>> top_stats = snapshot.statistics('lineno')
>>>
>>> at 6G
  File "<stdin>", line 1
    at 6G
       ^
SyntaxError: invalid syntax
>>> 11.4G
  File "<stdin>", line 1
    11.4G
        ^
SyntaxError: invalid syntax
>>>
>>> tot_pandas = 0
>>> for stat in [s for s in top_stats if 'pandas' in s.traceback._frames[0][0]]:
   #print(stat)
   tot_pandas += stat.size * stat.count
...    #print(stat)
...    tot_pandas += stat.size * stat.count
...
print("pandas = ", str(tot_pandas))

>>> print("pandas = ", str(tot_pandas))
tot_all = 0
pandas =  976121237536279
>>>
for stat in top_stats:
   #print(stat)
>>> df.info()
   tot_all += stat.size * stat.count

print("all = ", str(tot_all))

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 282 entries, Unnamed: 0 to MAJO_FRANC_CLIENT
dtypes: float64(91), int64(28), object(163)
memory usage: 2.1+ GB
>>>
>>>
>>> tot_all = 0
>>> for stat in top_stats:
...    #print(stat)
...    tot_all += stat.size * stat.count
...
>>> print("all = ", str(tot_all))
all =  50394662513462493
>>>
>>>
>>> 50,394,662,513,462,493
(50, 394, 662, 513, 462, 493)
>>>
>>> 50394662513462493 - 976121237536279
49418541275926214
>>> 49,418,541,275,926,214
(49, 418, 541, 275, 926, 214)
>>>
>>> 976,121,237,536,279
(976, 121, 237, 536, 279)
>>>
tomweber-sas commented 1 year ago

@AurelienThiery we're you able to see the same thing on your end? I'm at a loss as to how I can influence Pandas to consume less memory while it instantiates the dataframe. I did try specifying different engine= values. Python made it way worse, C is what I use, and the default. And the new experimental pyarrow did keep memory way down but ran so long I just killed it. I didn't see anything else that would help.

Any thoughts? I'm not sure what I can do.

Thanks, Tom

tomweber-sas commented 1 year ago

Hey @AurelienThiery, I'm gonna close this, just to clean things up. I can't see anything I can do to make Pandas not use so much memory, and that's the same when saspy isn't even involved. Let me know if you have any other thoughts, or want to investigate more.

Thanks, Tom