jessegrabowski / tsdisagg

Tools for converting low frequency time series data to high frequency
MIT License

Allow missing data in low frequency data #17

Open jessegrabowski opened 2 months ago

jessegrabowski commented 2 months ago

Closes #9

@steveshaoucsb could you install from this branch and let me know if it's doing the right thing? You are now allowed to have missing values in the low-frequency data, and I'm testing that it does the right thing, but I don't know if it matches the R output.
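
Roughly, a test could look like the sketch below (the branch name is a placeholder, and the column names and `method`/`agg_func` values are illustrative assumptions to check against the README before running):

```python
# Hypothetical smoke test for low-frequency NaN handling.
# Install from the PR branch first, e.g.:
#   pip install "git+https://github.com/jessegrabowski/tsdisagg@<branch-name>"
import numpy as np
import pandas as pd

from tsdisagg import disaggregate_series

# Quarterly series with a missing (NaN) observation in the middle
low_freq_df = pd.DataFrame(
    {"gdp": [100.0, np.nan, 110.0, 115.0]},
    index=pd.date_range("2000-01-01", periods=4, freq="QS"),
)

# Monthly indicator covering the same span
high_freq_df = pd.DataFrame(
    {"indicator": np.linspace(30.0, 45.0, 12)},
    index=pd.date_range("2000-01-01", periods=12, freq="MS"),
)

# Argument names follow the signature in ts_disagg.py; values are illustrative
result = disaggregate_series(
    low_freq_df,
    high_freq_df,
    target_freq="MS",
    method="chow-lin",
    agg_func="sum",
)
print(result)
```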

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 65.53%. Comparing base (5d7bf61) to head (d5e08f6).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| tsdisagg/ts_disagg.py | 50.00% | 5 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
+ Coverage   64.74%   65.53%   +0.78%
==========================================
  Files           4        4
  Lines         729      734       +5
==========================================
+ Hits          472      481       +9
+ Misses        257      253       -4
```


steveshaoucsb commented 2 months ago

Let me have a look tomorrow! (FYI: my time zone is Singapore time.)

steveshaoucsb commented 2 months ago

Just ran my test with my low-frequency data, which has its frequency set to "MS" and contains some missing values that need to be interpolated. It runs into the following error:

File ~/Desktop/Data_science/tsdisagg/tsdisagg/ts_disagg.py:393, in disaggregate_series(low_freq_df, high_freq_df, target_freq, target_column, agg_func, method, criterion, h, optimizer_kwargs, verbose, return_optim_res)
    390 target_column = target_column or low_freq_df.columns[0]
    391 target_idx = np.flatnonzero(low_freq_df.columns == target_column)[0]
--> 393 df, low_freq_df, high_freq_df, time_conversion_factor = prepare_input_dataframes(
    394     low_freq_df, high_freq_df, target_freq, method
    395 )
    397 C = build_conversion_matrix(low_freq_df, high_freq_df, time_conversion_factor, agg_func)
    398 drop_rows = np.all(C == 0, axis=1) | low_freq_df.isna().values.ravel()

File ~/Desktop/Data_science/tsdisagg/tsdisagg/ts_disagg.py:274, in prepare_input_dataframes(low_freq_df, high_freq_df, target_freq, method)
    272 high_name = get_frequency_name(high_freq)
    273 low_name = get_frequency_name(low_freq)
--> 274 time_conversion_factor = FREQ_CONVERSION_FACTORS[low_name][high_name]
    276 var_name, low_freq_name, high_freq_name = make_names_from_frequencies(
    277     low_freq_df_out, high_freq
    278 )
    280 if isinstance(low_freq_df_out, pd.Series):

KeyError: 'monthly'
jessegrabowski commented 2 months ago

Monthly low-frequency data isn't supported

steveshaoucsb commented 2 months ago

Aight, I think I didn't make my issue clear enough. In my data, the high-frequency and low-frequency series share the same frequency, but the low-frequency data is missing some values at the earlier dates. An example of my data:

High frequency:

period,Value
1994/12/1,7012.132445
1995/1/1,7156.293844
1995/2/1,7355.283345
1995/3/1,7437.310734
1995/4/1,7416.313527
1995/5/1,7460.807609
1995/6/1,7566.993484
1995/7/1,7708.174705
1995/8/1,7900.14917
1995/9/1,8078.825404
1995/10/1,7838.857323
1995/11/1,7847.556166
1995/12/1,8000.121333
1996/1/1,8564.460807
1996/2/1,8739.937466
1996/3/1,8928.61237
1996/4/1,9164.480996
1996/5/1,9602.522731
1996/6/1,9818.89395
1996/7/1,9978.272751
1996/8/1,10481.70579
1996/9/1,10385.91853
1996/10/1,10491.30451
1996/11/1,10485.10533
1996/12/1,10888.85163
1997/1/1,11187.31193
1997/2/1,11485.87222
1997/3/1,11493.17125
1997/4/1,11751.63687
1997/5/1,12381.35311
1997/6/1,12528.13358
1997/7/1,12170.98109
1997/8/1,12356.15646
1997/9/1,12512.33568
1997/10/1,12622.52103
1997/11/1,12577.92696
1997/12/1,13005.77005
1998/1/1,12991.97189
1998/2/1,13387.51927
1998/3/1,13496.70475
1998/4/1,13723.97452
1998/5/1,13953.04405
1998/6/1,14105.22381
1998/7/1,14403.68411
1998/8/1,14474.27472
1998/9/1,14564.84967
1998/10/1,14605.31229
1998/11/1,14487.43697
1998/12/1,14859.83943
1999/1/1,14955.74368
1999/2/1,15230.26816
1999/3/1,16369.66361
1999/4/1,16987.51142
1999/5/1,16572.25266
1999/6/1,16279.75457
1999/7/1,16456.35707
1999/8/1,16511.55473
1999/9/1,16603.18554
1999/10/1,17124.46821
1999/11/1,17734.04512
1999/12/1,18216.00002
2000/1/1,18626.44142
2000/2/1,18921.88113
2000/3/1,19779.07011
2000/4/1,20434.60391
2000/5/1,20947.98963
2000/6/1,21106.76351
2000/7/1,22049.31613
2000/8/1,23105.41266
2000/9/1,23532.77481
2000/10/1,24208.14698
2000/11/1,24441.57493
2000/12/1,24409.54919
2001/1/1,24246.62886
2001/2/1,24862.04368
2001/3/1,25465.56195
2001/4/1,26131.34622
2001/5/1,27058.2102
2001/6/1,27464.30456
2001/7/1,28003.41849
2001/8/1,28651.27279
2001/9/1,29436.77722
2001/10/1,30107.91949
2001/11/1,31195.0912
2001/12/1,34940

Low-frequency data:

period,Value
1994/12/1,7012.132445
1995/1/1,
1995/2/1,
1995/3/1,
1995/4/1,
1995/5/1,
1995/6/1,
1995/7/1,
1995/8/1,
1995/9/1,
1995/10/1,
1995/11/1,
1995/12/1,8000.121333
1996/1/1,8564.460807
1996/2/1,8739.937466
1996/3/1,8928.61237
1996/4/1,9164.480996
1996/5/1,9602.522731
1996/6/1,9818.89395
1996/7/1,9978.272751
1996/8/1,10481.70579
1996/9/1,10385.91853
1996/10/1,10491.30451
1996/11/1,10485.10533
1996/12/1,10888.85163
1997/1/1,11187.31193
1997/2/1,11485.87222
1997/3/1,11493.17125
1997/4/1,11751.63687
1997/5/1,12381.35311
1997/6/1,12528.13358
1997/7/1,12170.98109
1997/8/1,12356.15646
1997/9/1,12512.33568
1997/10/1,12622.52103
1997/11/1,12577.92696
1997/12/1,13005.77005
1998/1/1,12991.97189
1998/2/1,13387.51927
1998/3/1,13496.70475
1998/4/1,13723.97452
1998/5/1,13953.04405
1998/6/1,14105.22381
1998/7/1,14403.68411
1998/8/1,14474.27472
1998/9/1,14564.84967
1998/10/1,14605.31229
1998/11/1,14487.43697
1998/12/1,14859.83943
1999/1/1,14955.74368
1999/2/1,15230.26816
1999/3/1,16369.66361
1999/4/1,16987.51142
1999/5/1,16572.25266
1999/6/1,16279.75457
1999/7/1,16456.35707
1999/8/1,16511.55473
1999/9/1,16603.18554
1999/10/1,17124.46821
1999/11/1,17734.04512
1999/12/1,18216.00002
2000/1/1,18626.44142
2000/2/1,18921.88113
2000/3/1,19779.07011
2000/4/1,20434.60391
2000/5/1,20947.98963
2000/6/1,21106.76351
2000/7/1,22049.31613
2000/8/1,23105.41266
2000/9/1,23532.77481
2000/10/1,24208.14698
2000/11/1,24441.57493
2000/12/1,24409.54919
2001/1/1,24246.62886
2001/2/1,24862.04368
2001/3/1,25465.56195
2001/4/1,26131.34622
2001/5/1,27058.2102
2001/6/1,27464.30456
2001/7/1,28003.41849
2001/8/1,28651.27279
2001/9/1,29436.77722
2001/10/1,30107.91949
2001/11/1,31195.0912
2001/12/1,34940

I want the later dates in my low-frequency data to be included with the high-frequency data in the Chow-Lin model in order to get a better beta for the regression, and then backcast to fill the gap in the low-frequency data. The R tempdisagg package can handle this case. Here is the Chow-Lin interpolated result for the data above:

Value
6943.73317
7101.2483
7318.67027
7408.29589
7566.99348
7433.96929
7549.9912
8078.8254
7914.00722
8109.2343
8000.12133
7856.54251
8023.23984
8928.61237
8831.58542
9037.73735
9818.89395
9774.07228
10010.486
10385.9185
10734.6944
10630.0343
10888.8516
10738.4088
11179.5543
11493.1713
11831.8771
11839.8523
12528.1336
12810.3068
12970.6836
12512.3357
12782.7761
12953.4223
13005.7701
13025.0893
13492.5637
13496.7048
13909.6745
14028.9738
14105.2238
14527.584
14693.8602
14564.8497
15097.0965
15196.0614
14859.8394
15111.4779
15518.3762
16369.6636
15923.1179
17168.0559
16279.7546
17389.4109
17069.8186
16603.1855
17323.0905
17423.2091
18216
18658.8206
19185.4192
19779.0701
19956.6863
20893.277
21106.7635
22170.474
22343.9552
23532.7748
24527.7399
24994.6888
24409.5492
25987.6714
25952.6791
25465.562
26447.0883
27106.5107
27464.3046
28846.6866
29290.3976
29436.7772
30587.3152
31445.5812
34940
33366.7684
37458.5695
steveshaoucsb commented 2 months ago

Any updates, or do you need any more explanation of the problem?

steveshaoucsb commented 2 months ago

Any updates, or do you need any more explanation of the issue?

jessegrabowski commented 2 months ago

Hi,

1) I haven't had any time to work on this over the past few weeks, and that's unlikely to change for at least another two weeks. I'm happy to look at a PR if you open one to add the functionality you need.
2) I consider this a low priority, because it is just a standard regression task with correlated errors. You can accomplish what you are trying to do with plain statsmodels, since you are not doing any disaggregation in this case. I am not convinced this functionality should even be in the package.
3) Please don't spam the repository with identical comments.
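
To sketch what I mean by point 2 (file names and column handling here are illustrative assumptions, and this is just one way to set up the regression with statsmodels, not something tsdisagg provides):

```python
# Rough sketch of the "just use statsmodels" suggestion: regress the observed
# low-frequency values on the same-frequency indicator with AR(1) errors
# (in the spirit of Chow-Lin), then predict over the missing dates.
import pandas as pd
import statsmodels.api as sm

# Hypothetical file names for the two series posted above
low = pd.read_csv("low_freq.csv", index_col="period", parse_dates=True)
high = pd.read_csv("high_freq.csv", index_col="period", parse_dates=True)

y = low["Value"]
X = sm.add_constant(high["Value"].rename("indicator")).reindex(y.index)

observed = y.notna()

# GLSAR fits OLS, estimates the AR(1) coefficient from the residuals, and
# iterates; the observed sample is treated as contiguous for the AR(1) fit.
model = sm.GLSAR(y[observed], X[observed], rho=1)
res = model.iterative_fit(maxiter=10)

# Backcast the gap: predict wherever the low-frequency value is missing
missing_idx = y[~observed].index
backcast = pd.Series(res.predict(X.loc[missing_idx]), index=missing_idx)
filled = y.fillna(backcast)
print(filled.head(15))
```

This fills the gap with fitted regression values only, so it won't reproduce tempdisagg's Chow-Lin numbers exactly, but it is the same idea.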

steveshaoucsb commented 2 months ago

Thanks for letting me know. I will take a look at this in the next couple of days and maybe open a separate pull request for it, since there are other higher-priority issues addressed in this pull request. Apologies for sending the same message twice.