jessegrabowski / tsdisagg

Tools for converting low-frequency time series data to high frequency
MIT License

Quarter-to-monthly interpolation: always errors when the high-frequency data has one or more quarters of data available to interpolate #6

Closed steveshaoucsb closed 2 months ago

steveshaoucsb commented 3 months ago

Because of the calculation on line 22 of ts_disagg.py, excess = n - i_len * nl, the program always generates an extra period in my test scenario, which causes it to crash. In my test case, the low-frequency quarterly data starts on 1995/6/1 and ends on 2001/12/1; the high-frequency monthly data starts on 1995/3/1 and ends on 2001/12/1. The program raises the following error: IndexError: boolean index did not match indexed array along dimension 0; dimension is 28 but corresponding boolean dimension is 27

steveshaoucsb commented 3 months ago

Further note: if I extend the data back one more quarter, to 1994/12/1, the error becomes this: IndexError: boolean index did not match indexed array along dimension 0; dimension is 29 but corresponding boolean dimension is 27
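The dimensions in the two error messages are consistent with simple counting: 27 quarterly observations, against a monthly series that spans 28 (or 29) quarters once rounded up. The sketch below only reproduces that arithmetic with pandas. It assumes month-start timestamps and three months per quarter, and the names n_obs and MONTHS_PER_QUARTER are made up for illustration. It is not a trace of what ts_disagg.py does internally.

```python
import math

import pandas as pd

MONTHS_PER_QUARTER = 3  # assumption: quarterly-to-monthly disaggregation


def n_obs(start, end, freq):
    """Count month-start timestamps from start to end, inclusive."""
    return len(pd.date_range(start=start, end=end, freq=freq))


# Dates taken from the report above.
n_low = n_obs("1995-06-01", "2001-12-01", "3MS")         # quarterly series
n_high = n_obs("1995-03-01", "2001-12-01", "MS")         # monthly series
n_high_earlier = n_obs("1994-12-01", "2001-12-01", "MS")  # one quarter earlier

# Number of quarters the monthly data touches, rounding partial quarters up.
quarters_spanned = math.ceil(n_high / MONTHS_PER_QUARTER)
quarters_spanned_earlier = math.ceil(n_high_earlier / MONTHS_PER_QUARTER)

print(n_low, quarters_spanned, quarters_spanned_earlier)  # 27 28 29
```

Under these assumptions, each extra quarter of monthly data before the first quarterly observation widens the gap by one, which matches the 28-vs-27 and 29-vs-27 mismatches reported.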

jessegrabowski commented 3 months ago

Sounds like that logic is broken. If you want to open a PR that just adds your test (that doesn't pass), that'd be a good first step for fixing it. Otherwise I'll try my best to get to it by the end of the week

steveshaoucsb commented 3 months ago

The data that I used to reproduce the error has been uploaded to the latest pull request!

jessegrabowski commented 3 months ago

Thanks! I started looking at it over the weekend. I think the problem happens before the C matrix. I'm doing something quite dumb with this C_mask business, so I need to rethink how to check if there are too many/too few observations.

In your case, the time series aren't "aligned" -- there are too many high-frequency observations before the first low-frequency observation. So we need some logic to work out how to align them. For yearly data I check that the set of years is equal in the high- and low-frequency data, but that doesn't work for quarterly data, since there are only 4 quarters. Maybe check the set of year-quarter pairs?
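A minimal sketch of the year-quarter idea, assuming calendar quarters and pandas DatetimeIndex inputs. The helper names year_quarter_set and unmatched_quarters are hypothetical and are not part of tsdisagg; this only illustrates the check and is not the fix that was eventually released.

```python
import pandas as pd


def year_quarter_set(index):
    """Return the set of (year, calendar quarter) pairs present in a DatetimeIndex."""
    return {(ts.year, ts.quarter) for ts in index}


def unmatched_quarters(low_index, high_index):
    """Year-quarter pairs covered by the high-frequency data but not the low-frequency data."""
    return sorted(year_quarter_set(high_index) - year_quarter_set(low_index))


# Dates from the report: quarterly from 1995/6/1, monthly from 1995/3/1.
low_index = pd.date_range("1995-06-01", "2001-12-01", freq="3MS")
high_index = pd.date_range("1995-03-01", "2001-12-01", freq="MS")

extra = unmatched_quarters(low_index, high_index)
print(extra)  # [(1995, 1)]
```

Comparing (year, quarter) pairs instead of bare quarter numbers avoids the problem that every multi-year series contains all four quarters. Here it flags March 1995 as high-frequency data with no matching quarterly observation.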

I'm a bit swamped at work this week but I'll do my best to have a look. If you feel inspired, feel free to dig in as well. I think the problem function is actually handle_endpoint_differences.

steveshaoucsb commented 3 months ago

Regarding your point about observations being available earlier than the starting date of the low-frequency data: when I did the same task in R, R did the interpolation properly and interpolated the low-frequency data back to the starting point of the high-frequency data. Your Python package actually works for annual-to-monthly cases where the monthly high-frequency data starts earlier than the annual data. So I think this problem is more about how to make backcasting of the low-frequency data work in the quarterly-to-monthly case when the high-frequency data covers earlier dates.

For annual-to-quarterly cases, I don't think I have encountered this issue so far, but I will give it a try. I am working with a huge amount of data that needs to be interpolated with your package this week, so I will pay attention to that case as well. If I have time, I will create a separate case to test it and see whether it works!

steveshaoucsb commented 3 months ago

Tested the annual-to-quarterly case today. Backcasting works in this case.

jessegrabowski commented 3 months ago

Nice! I'm glad something is working. Would you be willing to make a PR with the test you ran so it can be included in the testing suite? It will be useful if we start tinkering with the Q->M case, to make sure we don't break what currently works.

steveshaoucsb commented 3 months ago

The year-to-quarter results and test cases have been added, and they are available in the pull request that I opened a few days ago.

steveshaoucsb commented 3 months ago

I was sick over the weekend so I didn't take a look, but are there any updates on how to fix this issue?

jessegrabowski commented 3 months ago

Hey, thanks for the poke (and the PR!)

I have been overloaded with stuff the last week or so, but I'm hoping to get to this soon.

steveshaoucsb commented 2 months ago

We are aiming to make a public release of some economic data next week (I hope to share it with you once it is published), and it would be highly appreciated if you could publish the fixes for both issues I raised as early as you can, as your package plays a vital role in interpolating our dataset. I understand that you might have a packed schedule, so take your time fixing these bugs. I am currently overwhelmed by the rest of the work on the dataset, but hopefully I can wrap it up early and work on the PR for the fix!

jessegrabowski commented 2 months ago

I'll put a couple of hours into this today; hopefully I can make some headway.

jessegrabowski commented 2 months ago

I cut a new release that I think fixes this and maybe also #9. Can you do pip install tsdisagg --upgrade and open a new issue if you hit more errors?