Open jessegrabowski opened 2 months ago
Attention: Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.
Project coverage is 65.53%. Comparing base (5d7bf61) to head (d5e08f6).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| tsdisagg/ts_disagg.py | 50.00% | 5 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
Let me have a look tomorrow! (FYI: my time zone is Singapore time.)
I just ran my test with my low-frequency data, which has its frequency set to "MS" and contains some missing values that need to be interpolated. It runs into the following error:
```
File ~/Desktop/Data_science/tsdisagg/tsdisagg/ts_disagg.py:393, in disaggregate_series(low_freq_df, high_freq_df, target_freq, target_column, agg_func, method, criterion, h, optimizer_kwargs, verbose, return_optim_res)
    390 target_column = target_column or low_freq_df.columns[0]
    391 target_idx = np.flatnonzero(low_freq_df.columns == target_column)[0]
--> 393 df, low_freq_df, high_freq_df, time_conversion_factor = prepare_input_dataframes(
    394     low_freq_df, high_freq_df, target_freq, method
    395 )
    397 C = build_conversion_matrix(low_freq_df, high_freq_df, time_conversion_factor, agg_func)
    398 drop_rows = np.all(C == 0, axis=1) | low_freq_df.isna().values.ravel()

File ~/Desktop/Data_science/tsdisagg/tsdisagg/ts_disagg.py:274, in prepare_input_dataframes(low_freq_df, high_freq_df, target_freq, method)
    272 high_name = get_frequency_name(high_freq)
    273 low_name = get_frequency_name(low_freq)
--> 274 time_conversion_factor = FREQ_CONVERSION_FACTORS[low_name][high_name]
    276 var_name, low_freq_name, high_freq_name = make_names_from_frequencies(
    277     low_freq_df_out, high_freq
    278 )
    280 if isinstance(low_freq_df_out, pd.Series):

KeyError: 'monthly'
```
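For reference, here is a minimal sketch of the kind of call that triggers this for me (the values and the method string are placeholders; both of my inputs happen to be monthly, and the low-frequency one has NaNs):

```python
import numpy as np
import pandas as pd

from tsdisagg.ts_disagg import disaggregate_series

# Two monthly ("MS") series; the "low-frequency" one simply has gaps (NaN)
idx = pd.date_range("1994-12-01", periods=24, freq="MS")
high_freq_df = pd.DataFrame({"Value": np.linspace(7000.0, 11000.0, 24)}, index=idx)

low_freq_df = high_freq_df.copy()
low_freq_df.iloc[1:12] = np.nan  # missing values at the earlier dates

# prepare_input_dataframes looks up FREQ_CONVERSION_FACTORS['monthly'][...],
# and 'monthly' is not a key, so this raises KeyError: 'monthly'
disaggregate_series(low_freq_df, high_freq_df, target_freq="MS", method="chow-lin")
```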
Monthly low-frequency data isn't supported
Aight, I think I didn't make my issue clear enough. In my data, the high-frequency and low-frequency series share the same frequency, but the low-frequency data has missing values at the earlier dates. An example of my data:

High frequency:
period,Value
1994/12/1,7012.132445
1995/1/1,7156.293844
1995/2/1,7355.283345
1995/3/1,7437.310734
1995/4/1,7416.313527
1995/5/1,7460.807609
1995/6/1,7566.993484
1995/7/1,7708.174705
1995/8/1,7900.14917
1995/9/1,8078.825404
1995/10/1,7838.857323
1995/11/1,7847.556166
1995/12/1,8000.121333
1996/1/1,8564.460807
1996/2/1,8739.937466
1996/3/1,8928.61237
1996/4/1,9164.480996
1996/5/1,9602.522731
1996/6/1,9818.89395
1996/7/1,9978.272751
1996/8/1,10481.70579
1996/9/1,10385.91853
1996/10/1,10491.30451
1996/11/1,10485.10533
1996/12/1,10888.85163
1997/1/1,11187.31193
1997/2/1,11485.87222
1997/3/1,11493.17125
1997/4/1,11751.63687
1997/5/1,12381.35311
1997/6/1,12528.13358
1997/7/1,12170.98109
1997/8/1,12356.15646
1997/9/1,12512.33568
1997/10/1,12622.52103
1997/11/1,12577.92696
1997/12/1,13005.77005
1998/1/1,12991.97189
1998/2/1,13387.51927
1998/3/1,13496.70475
1998/4/1,13723.97452
1998/5/1,13953.04405
1998/6/1,14105.22381
1998/7/1,14403.68411
1998/8/1,14474.27472
1998/9/1,14564.84967
1998/10/1,14605.31229
1998/11/1,14487.43697
1998/12/1,14859.83943
1999/1/1,14955.74368
1999/2/1,15230.26816
1999/3/1,16369.66361
1999/4/1,16987.51142
1999/5/1,16572.25266
1999/6/1,16279.75457
1999/7/1,16456.35707
1999/8/1,16511.55473
1999/9/1,16603.18554
1999/10/1,17124.46821
1999/11/1,17734.04512
1999/12/1,18216.00002
2000/1/1,18626.44142
2000/2/1,18921.88113
2000/3/1,19779.07011
2000/4/1,20434.60391
2000/5/1,20947.98963
2000/6/1,21106.76351
2000/7/1,22049.31613
2000/8/1,23105.41266
2000/9/1,23532.77481
2000/10/1,24208.14698
2000/11/1,24441.57493
2000/12/1,24409.54919
2001/1/1,24246.62886
2001/2/1,24862.04368
2001/3/1,25465.56195
2001/4/1,26131.34622
2001/5/1,27058.2102
2001/6/1,27464.30456
2001/7/1,28003.41849
2001/8/1,28651.27279
2001/9/1,29436.77722
2001/10/1,30107.91949
2001/11/1,31195.0912
2001/12/1,34940
Low-frequency data:
period,Value
1994/12/1,7012.132445
1995/1/1,
1995/2/1,
1995/3/1,
1995/4/1,
1995/5/1,
1995/6/1,
1995/7/1,
1995/8/1,
1995/9/1,
1995/10/1,
1995/11/1,
1995/12/1,8000.121333
1996/1/1,8564.460807
1996/2/1,8739.937466
1996/3/1,8928.61237
1996/4/1,9164.480996
1996/5/1,9602.522731
1996/6/1,9818.89395
1996/7/1,9978.272751
1996/8/1,10481.70579
1996/9/1,10385.91853
1996/10/1,10491.30451
1996/11/1,10485.10533
1996/12/1,10888.85163
1997/1/1,11187.31193
1997/2/1,11485.87222
1997/3/1,11493.17125
1997/4/1,11751.63687
1997/5/1,12381.35311
1997/6/1,12528.13358
1997/7/1,12170.98109
1997/8/1,12356.15646
1997/9/1,12512.33568
1997/10/1,12622.52103
1997/11/1,12577.92696
1997/12/1,13005.77005
1998/1/1,12991.97189
1998/2/1,13387.51927
1998/3/1,13496.70475
1998/4/1,13723.97452
1998/5/1,13953.04405
1998/6/1,14105.22381
1998/7/1,14403.68411
1998/8/1,14474.27472
1998/9/1,14564.84967
1998/10/1,14605.31229
1998/11/1,14487.43697
1998/12/1,14859.83943
1999/1/1,14955.74368
1999/2/1,15230.26816
1999/3/1,16369.66361
1999/4/1,16987.51142
1999/5/1,16572.25266
1999/6/1,16279.75457
1999/7/1,16456.35707
1999/8/1,16511.55473
1999/9/1,16603.18554
1999/10/1,17124.46821
1999/11/1,17734.04512
1999/12/1,18216.00002
2000/1/1,18626.44142
2000/2/1,18921.88113
2000/3/1,19779.07011
2000/4/1,20434.60391
2000/5/1,20947.98963
2000/6/1,21106.76351
2000/7/1,22049.31613
2000/8/1,23105.41266
2000/9/1,23532.77481
2000/10/1,24208.14698
2000/11/1,24441.57493
2000/12/1,24409.54919
2001/1/1,24246.62886
2001/2/1,24862.04368
2001/3/1,25465.56195
2001/4/1,26131.34622
2001/5/1,27058.2102
2001/6/1,27464.30456
2001/7/1,28003.41849
2001/8/1,28651.27279
2001/9/1,29436.77722
2001/10/1,30107.91949
2001/11/1,31195.0912
2001/12/1,34940
I want the later dates in my low-frequency data to be used together with the high-frequency data in the Chow-Lin model, in order to get a better beta for the regression, and then to backcast the gaps in the low-frequency data. The tempdisagg package can handle this case. Here is the Chow-Lin interpolated result for the data above:
Value
6943.73317
7101.2483
7318.67027
7408.29589
7566.99348
7433.96929
7549.9912
8078.8254
7914.00722
8109.2343
8000.12133
7856.54251
8023.23984
8928.61237
8831.58542
9037.73735
9818.89395
9774.07228
10010.486
10385.9185
10734.6944
10630.0343
10888.8516
10738.4088
11179.5543
11493.1713
11831.8771
11839.8523
12528.1336
12810.3068
12970.6836
12512.3357
12782.7761
12953.4223
13005.7701
13025.0893
13492.5637
13496.7048
13909.6745
14028.9738
14105.2238
14527.584
14693.8602
14564.8497
15097.0965
15196.0614
14859.8394
15111.4779
15518.3762
16369.6636
15923.1179
17168.0559
16279.7546
17389.4109
17069.8186
16603.1855
17323.0905
17423.2091
18216
18658.8206
19185.4192
19779.0701
19956.6863
20893.277
21106.7635
22170.474
22343.9552
23532.7748
24527.7399
24994.6888
24409.5492
25987.6714
25952.6791
25465.562
26447.0883
27106.5107
27464.3046
28846.6866
29290.3976
29436.7772
30587.3152
31445.5812
34940
33366.7684
37458.5695
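For completeness, this is roughly how I load the two files and the call I would like to be able to make (the file names and the method string are just what I have been trying, not necessarily the intended API):

```python
import pandas as pd

from tsdisagg.ts_disagg import disaggregate_series

# Both files are monthly; the low-frequency one has NaNs for 1995
high_freq_df = pd.read_csv("high_freq.csv", index_col="period", parse_dates=True).asfreq("MS")
low_freq_df = pd.read_csv("low_freq.csv", index_col="period", parse_dates=True).asfreq("MS")

# Desired behaviour: estimate beta on the dates where the low-frequency
# series is observed, then backcast the 1995 gap, as tempdisagg does in R
result = disaggregate_series(
    low_freq_df,
    high_freq_df,
    target_freq="MS",
    method="chow-lin",
)
```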
Any updates, or do you need any more explanation of the problem?
Any updates, or do you need any more explanation of the issue?
Hi,
1) I haven't had any time to work on this over the past few weeks, and that's unlikely to change for at least another two weeks. I'm happy to look at a PR if you open one to add the functionality you need.
2) I consider this a low priority, because it is just a standard regression task with correlated standard errors. You can accomplish what you are trying to do with plain statsmodels (a rough sketch is below), since you are not doing any disaggregation in this case. I am not convinced this functionality should even be in the package.
3) Please don't spam the repository with identical comments.
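Roughly what I mean, as an untested sketch with placeholder variable names: regress the observed low-frequency values on the high-frequency indicator with AR(1) errors, then backcast the missing dates from the fitted regression.

```python
import statsmodels.api as sm

# low_freq_df / high_freq_df: the two monthly series you posted, indexed by
# date, with NaNs in low_freq_df for the 1995 gap
X = sm.add_constant(high_freq_df["Value"])
y = low_freq_df["Value"]
observed = y.notna()

# Regression with AR(1) errors (the Chow-Lin error structure)
model = sm.GLSAR(y[observed], X[observed], rho=1)
results = model.iterative_fit(maxiter=50)

# Backcast the missing values from the fitted regression
filled = y.copy()
filled[~observed] = results.predict(X[~observed])
```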
Thanks for letting me know. I will take a look at this in the next couple of days and maybe create a separate pull request for it, since there are some other higher-priority issues mentioned in this pull request. Apologies for sending the same message twice.
Closes #9
@steveshaoucsb could you install from this branch and let me know if it's doing the right thing? Missing values in the low-frequency data are now allowed, and I'm testing that it does the right thing, but I don't know if it matches the R output.