unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/

NBEATS :: RAM consumption goes on increasing while predicting #2166

Closed (hberande closed this issue 7 months ago)

hberande commented 7 months ago

I am using the NBEATS model for forecasting. Since we are in a trial phase, we feed in past data, generate a forecast from it, and repeat these steps in a for-loop. We are predicting one year of data at a 15-minute interval, so the code picks up roughly 35,000 CSV files in the loop and forecasts from each. While doing this, my workstation's RAM usage keeps increasing; it reaches 100% after predicting approximately 10,000 files and the code gets interrupted.

I am using Python 3.11.5 and darts 0.27.1. System configuration: 12th Gen Intel(R) Core(TM) i7-12700K @ 3.60 GHz, 64.0 GB RAM (63.7 GB usable).

Additional context: I am facing this issue only with the NBEATS model, not with the TCN model.

dennisbader commented 7 months ago

Hi @hberande, we'd need some more information on how you perform prediction. Could you provide a minimal reproducible example?

hberande commented 7 months ago

Please refer to the code below, which is what we use in our forecasting project.

  1. Training the data: the training data is selected up to 2022-12-31 19:15:00.
  2. Once training is complete, we read one CSV file containing data from 2022-12-31 19:15:00 to 2022-12-31 20:45:00, i.e. the next 1.5 hrs of data at a 15-minute interval.
  3. The above CSV file is appended to the main data file and a prediction/forecast is generated for 2022-12-31 21:00:00 to 2022-12-31 22:30:00.
  4. In the next iteration, the 2nd file from the directory is selected, covering 2022-12-31 21:00:00 to 2022-12-31 22:30:00, and a prediction/forecast is generated for 2022-12-31 22:30:00 to 2023-01-01 00:00:00; the cycle repeats until 2023-12-31 23:45:00.
  5. Hence we get a forecast every 1.5 hrs.

Code:

import os

import numpy as np
import pandas as pd

from darts import TimeSeries
from darts.models import TCNModel, NBEATSModel, NHiTSModel
from darts.dataprocessing.transformers import Scaler, MissingValuesFiller
from darts.utils.timeseries_generation import datetime_attribute_timeseries

site_info = {'Amidyala': [56700,108]}

dir_path = r"N:\WRD\W&SCCC\Forecast\Harshad_FCST\0.WindFarm_SCADA_data\2.SCADA_15min"

df_1 = pd.DataFrame()

for file_name in os.listdir(dir_path):

    forecast_tcn   = pd.DataFrame()
    df_forecast    = pd.DataFrame()

    # skip files whose site prefix is not in site_info
    if file_name.split('_')[0] not in site_info:
        continue

    ###########################################( Data Reading) ###################################################
    df_scada   = pd.read_excel(dir_path + "\\" + file_name) 

    train_date = int(df_scada[df_scada['Date'] == "2022-12-31 19:15:00"].index[0])

    df_scada   = df_scada.loc[:train_date]

    ###########################################(Data PreProcessing) ##############################################
    value_filler  = MissingValuesFiller()
    series        = value_filler.transform(TimeSeries.from_dataframe(df_scada, 'Date', ['SCADA_WS_Avg',
                    'SCADA_Power_Avg [kwh]','SCADA_Power_Sum [kwh]','No.Of Turbines','SCADA_Temp_Avg',
                    'SCADA_Dir_Avg'],freq='15T'))

    df_1            = series.pd_dataframe()
    df_1.reset_index('Date',inplace=True)

    ###########################################(Feature Selection) ##############################################
    target        = df_1[['Date','SCADA_Power_Sum [kwh]']]
    WS_cov        = df_1[['Date','SCADA_WS_Avg','SCADA_Dir_Avg','SCADA_Temp_Avg']]

    series        = TimeSeries.from_dataframe(target, time_col='Date',
                    value_cols=['SCADA_Power_Sum [kwh]'],freq='15T')

    WS_series     = TimeSeries.from_dataframe(WS_cov, time_col='Date',
                    value_cols=['SCADA_WS_Avg','SCADA_Dir_Avg','SCADA_Temp_Avg'],freq='15T')

    def encoder_year(idx):
        return (idx.year - 1950)/50

    add_encoders = {'cyclic'  :{'past' : ['month']}     ,'datetime_attribute':{'past':['hour','month']},
                    'position':{'past' : ['relative']   ,'future'            :['relative'] },
                    'custom'  :{'past' : [encoder_year]},'transformer'       :Scaler()}

    ###########################################(Feature Scaling) ##############################################

    scaler_power  = Scaler()
    scaler_WS     = Scaler()
    series_scaled = scaler_power.fit_transform(series)
    cov_scaled    = scaler_WS.fit_transform(WS_series)

    train_scaled  = series_scaled
    past_train_cov= cov_scaled

    ######################################(Model Calling & Training) ##########################################

    model_nbeats = NBEATSModel(input_chunk_length = 96 , output_chunk_length = 24, generic_architecture = True,
                               num_stacks         = 6  , num_blocks          = 10, num_layers           = 10,
                               layer_widths       = 512, n_epochs            = 100 , batch_size           = 800,
                               add_encoders       = add_encoders )

    model_nbeats.fit(series = train_scaled, past_covariates = past_train_cov, verbose = True)

    ###########################################(Model Saving) ################################################

    os.chdir(r'D:\4. Forecast Results\1. NBEATS\FCST_NB_2023\Trained Model')
    model_nbeats.save((file_name.split('_')[0] + ".pt"))

    ###################################(Specify Testing Files Path) ##########################################

    dir_path_1 = r"D:\3. SCADA 5 Years data\Testing Files_2023"

    ########################(For Loop To Read Testing Files from Specified Path)###############################
    for file_name1 in os.listdir(dir_path_1):
            # same site prefix as the training file (Python chains this as:
            # prefixes equal AND the prefix is in site_info)
            if file_name1.split('_')[0] == file_name.split('_')[0] in site_info:

            ###################################### (Read Testing File) #######################################

            test_file    = pd.read_csv(dir_path_1 + "\\" + file_name1)

            ############################# (Concat testing file with Earlier SCADA) ###########################
            df_1         = pd.concat([df_1,test_file])

            df_1['Date'] = pd.to_datetime(df_1.Date, format = "%Y-%m-%d %H:%M:%S")

            df_1         = df_1[['Date','SCADA_WS_Avg','SCADA_Power_Avg [kwh]','SCADA_Power_Sum [kwh]'
                            ,'No.Of Turbines','SCADA_Temp_Avg','SCADA_Dir_Avg']]

            df_1.drop_duplicates(subset = 'Date', keep='last',inplace=True)

            ###################################### (Data Preprocessing) ######################################
            value_filler  = MissingValuesFiller()
            series        = value_filler.transform(TimeSeries.from_dataframe(df_1, 'Date', ['SCADA_WS_Avg',
                            'SCADA_Power_Avg [kwh]','SCADA_Power_Sum [kwh]','No.Of Turbines','SCADA_Temp_Avg',
                            'SCADA_Dir_Avg'],freq='15T'))
            df_2          = series.pd_dataframe()
            df_2.reset_index('Date',inplace=True)

            target        = df_2[['Date','SCADA_Power_Sum [kwh]']]

            WS_cov        = df_2[['Date','SCADA_WS_Avg','SCADA_Dir_Avg','SCADA_Temp_Avg']]

            series        = TimeSeries.from_dataframe(target, time_col='Date',
                            value_cols=['SCADA_Power_Sum [kwh]'],freq='15T')

            WS_series     = TimeSeries.from_dataframe(WS_cov, time_col='Date',
                            value_cols=['SCADA_WS_Avg','SCADA_Dir_Avg','SCADA_Temp_Avg'],freq='15T')

            ###########################################(Feature Scaling) ######################################

            scaler_power  = Scaler()
            scaler_WS     = Scaler()
            series_scaled = scaler_power.fit_transform(series)
            cov_scaled    = scaler_WS.fit_transform(WS_series)

            ###########################################(Forecasting) #########################################

            pred          = model_nbeats.predict(series = series_scaled, n = 18, past_covariates = cov_scaled)
            pred          = scaler_power.inverse_transform(pred)
            df_pred       = pred.pd_dataframe()
            df_pred.reset_index('Date',inplace=True)
            df_pred.rename(columns={'SCADA_Power_Sum [kwh]':'FCST_Sum_NBEATS'},inplace=True)

            forecast_tcn  = pd.concat([forecast_tcn,df_pred.iloc[0:].tail(int(1.5*4))])

            os.chdir(r'D:\4. Forecast Results\1. NBEATS\FCST_NB_2023\1.5 Hrs')

            ###########################################(Saving Forecast Files) ###############################

            df_pred.to_csv(file_name.split('_')[0] + "_" + df_pred['Date'].iloc[12].strftime('%Y%m%d%H%M') + "_NB.csv")

            ###########################################(Saving File Merged With SCADA File) ##################
    os.chdir(r'D:\4. Forecast Results\1. NBEATS\FCST_NB_2023\Merged Files')
    df_forecast = pd.merge(df_scada,forecast_tcn,on='Date')
    df_forecast.to_excel(file_name.split('_')[0] + "_NBEATS_2023" + ".xlsx")

dennisbader commented 7 months ago

This is not really a minimal reproducible example ;)

At first glance, I suspect it comes from this line:

df_1         = pd.concat([df_1,test_file])

This iteratively grows your prediction dataset, so each call works on a longer series, and you also recompute predictions that you already generated in previous iterations. Instead, you can generate predictions for test_file alone and then concatenate only the predictions.
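A minimal sketch of that idea (the variable names and the fixed window are illustrative; 96 matches the model's input_chunk_length above):

    # Build the prediction input from a fixed-length window instead of an
    # ever-growing df_1, so predict() sees the same amount of data each time.
    window_df = pd.concat([df_1.tail(96), test_file])
    window_df['Date'] = pd.to_datetime(window_df['Date'])
    window_df.drop_duplicates(subset='Date', keep='last', inplace=True)

    series_in = TimeSeries.from_dataframe(window_df, time_col='Date',
                    value_cols=['SCADA_Power_Sum [kwh]'], freq='15T')
    cov_in    = TimeSeries.from_dataframe(window_df, time_col='Date',
                    value_cols=['SCADA_WS_Avg','SCADA_Dir_Avg','SCADA_Temp_Avg'],
                    freq='15T')

    # ... scale, predict with n=18, inverse-transform, and append only the
    # new forecast rows to forecast_tcn; df_1 itself is never extended.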

Also, you re-fit the data transformers (scalers) on data they have not seen during model training. This is bad practice: the values transformed at prediction time then do not lie in the same value range as the values the model saw during training.

You should instead keep the transformer fitted on the training data and only call transform() on the prediction input.
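A sketch of that pattern, reusing scalers fitted once on the training series (series_in and cov_in refer to the prediction inputs built in the earlier sketch):

    # Fit once, on training data only (before the prediction loop):
    scaler_power = Scaler()
    scaler_WS    = Scaler()
    train_scaled = scaler_power.fit_transform(series)      # training target
    cov_scaled   = scaler_WS.fit_transform(WS_series)      # training covariates

    # At prediction time, only transform with the already-fitted scalers:
    pred = model_nbeats.predict(n=18,
                                series=scaler_power.transform(series_in),
                                past_covariates=scaler_WS.transform(cov_in))
    pred = scaler_power.inverse_transform(pred)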