
[BUG]: time_series.prediction() returns NaN #2940

Closed: turkalpmd closed this issue 1 year ago

turkalpmd commented 2 years ago


Issue Description

The data source does not contain any NaN values, yet some of the per-ID DataFrames come back from prediction with NaN values while others do not. I cannot understand why.

Example output below: each entry corresponds to one iteration. I used .isna().sum() to count the missing values in each prediction.

[array([234]), array([0]), array([19]), array([181]), array([0]), array([0]), array([225]), array([0]), array([302]), array([0]), array([0]), array([0]), array([248]), array([0]), array([267]), array([0]), array([107]), array([53]), array([0]), array([0]), array([308]), array([0]), array([0]), array([235]), array([257]), array([0]), array([279]), array([0]), array([0]), array([0]), array([0]), array([223]), array([0]), array([0]), array([238]), array([199]), array([301]), array([0]), array([0]), array([73]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0])]
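
Roughly, the counts above were collected like this (a sketch only; predictions stands for the list of per-ID prediction DataFrames produced by the loop shown under Expected Behavior):

# Sketch only: 'predictions' is assumed to hold the DataFrames returned by
# predict_model() for each unique ID in the loop below.
nan_counts = [pred.isna().sum().values for pred in predictions]
print(nan_counts)   # e.g. [array([234]), array([0]), array([19]), ...]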

Thank you.

Reproducible Example

Data source: https://www.kaggle.com/competitions/tabular-playground-series-sep-2022

Expected Behavior

* import libraries

!pip install --pre pycaret

import pandas as pd
from tqdm import tqdm
from pycaret.time_series import *

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', 500)

* Data Prep.

train = pd.read_csv("../input/tabular-playground-series-sep-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2022/test.csv")
sub = pd.read_csv("../input/tabular-playground-series-sep-2022/sample_submission.csv")

# Creating the concatenated group to forecast
train['ID'] = train['store'] + '-' + train['product'] + '-' + train['country']
test['ID'] = test['store'] + '-' + test['product'] + '-' + test['country']
train.drop(columns=["country","store","product"],inplace=True)
test.drop(columns=["country","store","product"],inplace=True)

* Main Code

# Create an empty DataFrame to collect the results

result_df = pd.DataFrame()

# We created the unique IDs above; we will iterate over this list
object_list = list(train['ID'].unique())

for i in tqdm(range(len(object_list))):
    # Build the per-ID dataset

    unique_df = train[train.ID == object_list[i]]

    # Validation split - before submitting, I hold out 2020 to try this code

    val_df = unique_df[unique_df["date"] >= '2020-01-01']
    train_df = unique_df[unique_df["date"] < '2020-01-01']

    # Maybe this is wrong, but we expect 366 prediction values for each unique_df,
    # so I use the validation length as the forecasting horizon for predict_model
    result_length = val_df.shape[0]

    # When set to 365, the setup module returns an error about size; I also tried 1, 3, 7, 15, 25
    setup_fh = 7

    # Setup for pycaret
    setup(data=train_df,
          target="num_sold",
          verbose=False,                      # --> another bug, but it is okay
          index="date",                       # very clever: no need to convert to datetime or set the index myself
          ignore_features=["row_id", "ID"],   # also clever
          transform_target="box-cox",         # very nice
          fold=1,                             # to save time :D
          fh=setup_fh,                        # default is 1, but I keep it modular because I will wrap this code in a function
          session_id=42)                      # meaning of the universe :)
    # Maybe the best approach is to compare models, pick the best 3 and blend them, but I want to use prophet
    # Hyperparameter tuning
    tuned_model = tune_model(create_model('prophet',
                                          cross_validation=True,
                                          fold=1,
                                          verbose=False),
                             fold=1,
                             optimize="SMAPE",          # for the competition metric
                             search_algorithm="random",
                             choose_better=True,        # I don't want overfitting, but first my code should run properly
                             verbose=False)

    finalized_model = finalize_model(tuned_model)

    result = predict_model(finalized_model, fh=result_length)

    #result = result[setup_fh:]

    # Needed for indexing and verification
    result["index"] = val_df['row_id'].values
    result["ID"] = val_df['ID'].values
    result["date"] = val_df['date'].values
    result.sort_values(by="index", inplace=True)

    # Concatenate with the results from previous iterations
    result_df = pd.concat([result_df, result])

Actual Results

My problem is different

Installed Versions

System:
    python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
    executable: /opt/conda/bin/python
    machine: Linux-5.10.133+-x86_64-with-debian-bullseye-sid

PyCaret required dependencies:
    pip: 22.1.2
    setuptools: 59.8.0
    pycaret: 3.0.0.rc3
    IPython: 7.33.0
    ipywidgets: 7.7.1
    tqdm: 4.64.0
    numpy: 1.21.6
    pandas: 1.3.5
    jinja2: 3.1.2
    scipy: 1.7.3
    joblib: 1.1.0
    sklearn: 1.0.2
    pyod: Installed but version unavailable
    imblearn: 0.9.0
    category_encoders: 2.5.0
    lightgbm: 3.3.2
    numba: 0.55.2
    requests: 2.28.1
    matplotlib: 3.5.3
    scikitplot: 0.3.7
    yellowbrick: 1.5
    plotly: 5.10.0
    kaleido: 0.2.1
    statsmodels: 0.13.2
    sktime: 0.11.4
    tbats: Installed but version unavailable
    pmdarima: 2.0.1
    psutil: 5.9.1
ngupta23 commented 2 years ago

@turkalpmd Can you produce a minimal set of data and code to reproduce the problem? It is hard to debug with the entire code base and dataset. Thanks!

ngupta23 commented 2 years ago

Also, please break down create_model, tune_model, and finalize_model into individual steps so that it is easier for you to debug as well and to get to the root cause faster (i.e., to see which step is causing the issue).
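
Something along these lines would work: a rough sketch that reuses the variables and argument values from your loop above (setup_fh, result_length, train_df) and checks for NaNs after every stage, so the failing step is obvious.

# Sketch only: arguments mirror the ones in the original loop above.
model = create_model('prophet', cross_validation=True, fold=1, verbose=False)
print(predict_model(model, fh=setup_fh).isna().sum())                  # after create_model

tuned_model = tune_model(model,
                         fold=1,
                         optimize="SMAPE",
                         search_algorithm="random",
                         choose_better=True,
                         verbose=False)
print(predict_model(tuned_model, fh=setup_fh).isna().sum())            # after tune_model

finalized_model = finalize_model(tuned_model)
print(predict_model(finalized_model, fh=result_length).isna().sum())   # after finalize_model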

turkalpmd commented 2 years ago

Dear Nikhil,

I solved the problem. The transformation feature does not work properly: box-cox or yeo-johnson seems to create infinity or NaN values. When I turn it off, my problem is solved.
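
For reference, a minimal sketch of the workaround: the same setup() call as in my code above, only without the target transformation (every other argument unchanged).

# Sketch of the workaround: identical setup() call, but without the
# target transformation that seemed to produce NaN/inf predictions.
setup(data=train_df,
      target="num_sold",
      verbose=False,
      index="date",
      ignore_features=["row_id", "ID"],
      transform_target=None,    # was "box-cox"
      fold=1,
      fh=setup_fh,
      session_id=42)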

ngupta23 commented 2 years ago

@turkalpmd I would like to fix the issue with the transformations if it truly exists. Will you be able to provide a simplified code example with sample data to reproduce the issue? Thanks!

turkalpmd commented 2 years ago

I can share the notebook with you. Is that possible?

ngupta23 commented 2 years ago

I can share the notebook with you. Is that possible?

@turkalpmd Yes, you can attach the zip here. Thanks!

turkalpmd commented 2 years ago

Earlier I was thinking that the problem might be related to transform_target. Today I saw that the problem is actually related to scale_target. I am trying both of them in the notebook. Scaling the target may not actually be appropriate; I could not figure out why I was using it. Above all, thank you very much for developing pycaret.
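
To isolate which option is responsible, I am trying combinations along these lines in the notebook (a sketch only; the "zscore" value is just illustrative, and the other arguments follow my setup() call above).

# Sketch: try the combinations of scaling / transforming the target and
# count NaNs in the predictions, to see which option introduces them.
for scale, transform in [(None, None), ("zscore", None),
                         (None, "box-cox"), ("zscore", "box-cox")]:
    setup(data=train_df,
          target="num_sold",
          index="date",
          ignore_features=["row_id", "ID"],
          scale_target=scale,
          transform_target=transform,
          fold=1,
          fh=setup_fh,
          session_id=42,
          verbose=False)
    model = create_model('prophet', cross_validation=True, fold=1, verbose=False)
    preds = predict_model(model, fh=result_length)
    print(scale, transform, int(preds.isna().sum().sum()))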

I have also shared it on Kaggle with you.

pycaret_time.zip

ngupta23 commented 1 year ago

@turkalpmd I cannot debug the issue with such a big notebook. Can you create a minimal example? In any case, I see you are using pycaret-ts-alpha, which has been deprecated. You should switch to the pre-release of 3.0.0. Please see here for instructions:

https://github.com/pycaret/pycaret/issues/3018#issuecomment-1272334059
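
(For reference, the pre-release can be installed with the same command used at the top of this issue; see the linked comment for the full instructions.)

pip install --pre pycaret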

ngupta23 commented 1 year ago

@turkalpmd I am going to close this for now. If you can provide a minimal reproducible example that I can use for debugging, please feel free to reopen this. Thanks!