os-climate / data-platform-demo

Apache License 2.0

suggest using tmp files instead of io.BytesIO #31

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

While working with EPA data (which has a 1.8 GB FACILITY file and a 960 MB ORGANIZATION file), I discovered the downside of using io.BytesIO: it holds the entire download in memory, which can be a problem for notebooks limited to 4 GB of RAM. The workaround I found was to download to a temporary file instead:

    # Download straight to disk rather than into an in-memory buffer
    bObj = bucket.Object(f'EPA/national_combined-20211104/NATIONAL_{name}_FILE.CSV')
    bObj.download_file(f'/tmp/foo{timestamp}.csv')
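
One refinement worth considering (a sketch, not code from the repo): let the tempfile module pick a unique path rather than hand-building one from a timestamp, which avoids collisions between concurrent notebook runs. The stand-in write below replaces the S3 download so the sketch is self-contained.

```python
import os
import tempfile

# NamedTemporaryFile with delete=False hands back a unique path and leaves
# the file in place for download_file to fill.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    local_path = tmp.name

# In the notebook this step would be:
#   bObj.download_file(local_path)
# Here a one-line write stands in so the sketch runs anywhere.
with open(local_path, "w") as f:
    f.write("REGISTRY_ID\n110000000001\n")

downloaded = os.path.getsize(local_path) > 0
os.unlink(local_path)  # remove the temp file when done
```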

With the big files kept out of memory, the restricted versions of the tables I need for processing fit comfortably, even without deleting the processed dataframes after loading them into Trino:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4860615 entries, 0 to 4860614
Data columns (total 17 columns):
 #   Column                Dtype         
---  ------                -----         
 0   REGISTRY_ID           string        
 1   PGM_SYS_ACRNM         string        
 2   INTEREST_TYPE         string        
 3   AFFILIATION_TYPE      string        
 4   START_DATE            datetime64[ns]
 5   END_DATE              datetime64[ns]
 6   ORG_NAME              string        
 7   ORG_TYPE              string        
 8   DUNS_NUMBER           string        
 9   DIVISION_NAME         string        
 10  EIN                   string        
 11  MAILING_ADDRESS       string        
 12  SUPPLEMENTAL_ADDRESS  string        
 13  CITY_NAME             string        
 14  STATE_CODE            string        
 15  POSTAL_CODE           string        
 16  COUNTRY_NAME          string        
dtypes: datetime64[ns](2), string(15)
memory usage: 630.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4510170 entries, 0 to 4510169
Data columns (total 12 columns):
 #   Column            Dtype  
---  ------            -----  
 0   REGISTRY_ID       string 
 1   PRIMARY_NAME      string 
 2   LOCATION_ADDRESS  string 
 3   CITY_NAME         string 
 4   COUNTY_NAME       string 
 5   STATE_CODE        string 
 6   COUNTRY_NAME      string 
 7   POSTAL_CODE       string 
 8   HUC_CODE          string 
 9   PGM_SYS_ACRNMS    string 
 10  LATITUDE83        float32
 11  LONGITUDE83       float32
dtypes: float32(2), string(10)
memory usage: 378.5 MB
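
As a synthetic illustration (not the EPA data) of the dtype choices in those listings: float32 halves a column's footprint versus pandas' float64 default, and dtypes like these can be requested up front via pd.read_csv(..., dtype={...}) when reading the temp file.

```python
import numpy as np
import pandas as pd

n = 1_000_000
lat64 = pd.Series(np.zeros(n))        # pandas default float64: 8 bytes/value
lat32 = lat64.astype(np.float32)      # as LATITUDE83/LONGITUDE83: 4 bytes/value

# Per-column footprint in MB, excluding the index
mb64 = lat64.memory_usage(index=False) / 1e6
mb32 = lat32.memory_usage(index=False) / 1e6
```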

I suggest we encourage using tempfiles and then deleting them:

    os.unlink(f'/tmp/foo{timestamp}.csv')
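
Going one step further (again a sketch, not code from the repo): wrapping the processing in try/finally makes the unlink unconditional, so a failed download or parse doesn't leave a multi-gigabyte file behind in /tmp. The stand-in write replaces the S3 fetch so this runs anywhere.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    path = tmp.name
try:
    # bObj.download_file(path)   # S3 fetch, as in the snippet above
    with open(path, "w") as f:   # stand-in write for this sketch
        f.write("data\n")
    # ... parse into pandas, load into Trino ...
finally:
    os.unlink(path)  # always runs, even if the steps above raise

cleaned = not os.path.exists(path)
```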