jessegrabowski / tsdisagg

Tools for converting low-frequency time series data to high frequency
MIT License

Chow-lin backcasting issue #9

Open steveshaoucsb opened 2 months ago

steveshaoucsb commented 2 months ago

Suppose the high-frequency and low-frequency data are on the same frequency, and I want to use Chow-Lin to backcast to an earlier period where less data is available, leveraging the later data to get a more precise estimate of beta. For example, the high-frequency data is monthly from 1987-12-01 to 2021-12-01; the low-frequency data is monthly on 1987-12-01 and from 2001-12-01 to 2021-12-01 (inclusive), with no data from 1988-01-01 to 2001-11-01 (inclusive). In this case the package does not work, because inferred_freq returns None. The "tempdisagg" package in R handles this case.
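The index described above can be reproduced with synthetic dates only (no real data), as a minimal sketch. It shows that pandas cannot infer a single frequency once the index has one early point followed by a long gap. Whether tsdisagg reads the index's `inferred_freq` attribute or calls `pd.infer_freq` internally is an assumption here; the variable names are illustrative.

```python
import pandas as pd

# Synthetic low-frequency index: one early observation, then a contiguous monthly run.
early = pd.DatetimeIndex(["1987-12-01"])
late = pd.date_range("2001-12-01", "2021-12-01", freq="MS")
lf_index = early.append(late)

# The jump from 1987-12 to 2001-12 breaks the regular monthly spacing,
# so no single frequency describes the whole index.
print(lf_index.inferred_freq)

# The contiguous part on its own is still recognised as month-start data.
print(late.inferred_freq)
```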

jessegrabowski commented 2 months ago

If you do low_freq_df.resample('MS').first() (this will put it into monthly starting and fill the missing values with NaNs) then pass that in, does it work?
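As a rough illustration of what that call does, here is a sketch on a small synthetic frame (the column name `y` and the dates are made up for the example). It only demonstrates the pandas behaviour: the result has a regular month-start index, and the months without observations become NaN. It says nothing about how tsdisagg treats those NaNs.

```python
import numpy as np
import pandas as pd

# Synthetic frame with one early observation and a short later run.
idx = pd.DatetimeIndex(["1987-12-01"]).append(
    pd.date_range("2001-12-01", "2002-03-01", freq="MS")
)
low_freq_df = pd.DataFrame({"y": np.arange(len(idx), dtype=float)}, index=idx)

# Resample to month-start; empty months have no first value and become NaN.
resampled = low_freq_df.resample("MS").first()

print(resampled.index.freqstr)          # frequency carried by the new index
print(resampled["y"].notna().sum())     # number of original observations kept
print(resampled["y"].isna().sum())      # number of months filled with NaN
```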

steveshaoucsb commented 2 months ago

The program will crash due to the following error:

Cell In[17], line 72
     70 hf_data = data2[['S7_Total']][disagg_start_date:disagg_end_date]
     71 lf_data = lf_data.resample('MS').first()
---> 72 disagg_result = disaggregate_series(
     73                     lf_data,
     74                     hf_data.assign(intercept=1),
     75                     method="chow-lin",
     76                     agg_func="first",
     77                     optimizer_kwargs={"method": "powell"},
     78                     ).to_frame(name='K')
     79 data2.loc[disagg_result.index, 'K'] = disagg_result.values.squeeze()

File /opt/anaconda3/lib/python3.12/site-packages/tsdisagg/ts_disagg.py:335, in disaggregate_series(low_freq_df, high_freq_df, target_freq, target_column, agg_func, method, criterion, h, optimizer_kwargs, verbose, return_optimizer_result)
    332 target_column = target_column or low_freq_df.columns[0]
    333 target_idx = np.flatnonzero(low_freq_df.columns == target_column)[0]
--> 335 df, C_mask, time_conversion_factor = prepare_input_dataframes(
    336     low_freq_df, high_freq_df, target_freq, method
    337 )
    339 y = df.iloc[:, target_idx].dropna().values
    340 X = df.drop(columns=df.columns[target_idx]).values

File /opt/anaconda3/lib/python3.12/site-packages/tsdisagg/ts_disagg.py:163, in prepare_input_dataframes(df1, df2, target_freq, method)
    158     raise ValueError(
...
--> 163     raise ValueError("low_freq_df has missing values.")
    165 if df2 is not None:
    166     if not isinstance(df2.index, pd.core.indexes.datetimes.DatetimeIndex):

ValueError: low_freq_df has missing values.

jessegrabowski commented 2 months ago

OK, great. Let me look into what the R package does in this case. My assumption is that it masks out the missing values, fits the model, then fills in the missing values with predictions from the fitted model, but I'll need to check.
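The mask, fit, and fill idea can be sketched on synthetic data as follows. This is only an illustration of the general pattern, under loud assumptions: it uses plain ordinary least squares via NumPy rather than the Chow-Lin GLS estimator with AR(1) errors, and it is not taken from tempdisagg or tsdisagg. All names and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic indicator (high-frequency regressor) with an intercept column.
n = 60
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

# Synthetic target on the same frequency, with a gap after the first point.
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=n)
y[1:30] = np.nan

# Step 1: mask out the missing target values.
observed = ~np.isnan(y)

# Step 2: fit on the observed rows only (plain OLS, not Chow-Lin GLS).
beta, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)

# Step 3: keep observed values and fill the gap with fitted predictions.
y_filled = np.where(observed, y, X @ beta)
```

A Chow-Lin version would additionally distribute the AR(1) residuals across the gap, so the filled values here should be read as the regression part of the backcast only.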

jessegrabowski commented 2 months ago

This might work in 1.3, but I don't have a specific test for it. Give it a try and let me know.

steveshaoucsb commented 2 months ago

Just tested; it isn't fixed. If I do lf_data = lf_data.resample('MS').first(), the program reports that there are missing values and cannot proceed. If I remove the missing values, this error occurs: ValueError: Low frequency dataframe does not have a valid time index with frequency information

jessegrabowski commented 2 months ago

Ok thanks for checking. I think I know how to handle it.

steveshaoucsb commented 2 months ago

Any updates, or do you need any more explanation of the issue?