kmedved closed this issue 4 years ago
Hey @kmedved,
Thanks for your input; this is indeed a scenario that we have not considered much so far! You're right that auto_fillna
is not ideal with this sort of data, since usually that function would be used in cases where missing data is the exception rather than the norm. Since Darts forecasting models require continuous data at steady intervals, there is not much our library can do right now to make good use of data with large missing chunks. However, if the missing data in a time series occurs with a constant frequency and duration (such as always during the same months of the year for instance), then you could use the darts.utils.missing_values.fillna
function to assign a constant value to these indices. In theory, many of the implemented forecasting models should be able to pick up on this type of seasonality.
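To make the idea concrete, here is a minimal sketch of that constant-fill step. It assumes fillna accepts a fill value (please check the signature in your installed Darts version; newer releases expose the same functionality as fill_missing_values), and the toy series and the 1.0 filler are only placeholders:

```python
import pandas as pd
from darts import TimeSeries
from darts.utils.missing_values import fillna  # newer Darts versions: fill_missing_values

# Toy monthly series with a recurring "offseason" gap encoded as NaN.
dates = pd.date_range("2018-01-01", periods=12, freq="M")
values = [1.2, 1.4, 1.6, None, None, None, 1.5, 1.7, 1.9, None, None, None]
series = TimeSeries.from_series(pd.Series(values, index=dates))

# Replace every NaN with the same constant filler value. A nonzero constant is
# used here because MAPE (used later for evaluation) divides by the actual values.
filled = fillna(series, fill=1.0)
```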
That being said, this does not address your concern about evaluating a model's performance given this large amount of 'artificial' data. A quick workaround that comes to mind would be to set all values of your predicted time series that correspond to missing data in the input dataset to some constant value, for example the same constant value that you used with the 'fillna' function. After this replacement, the predicted time series has the same 'filler' values as the ground truth series, so if you take the MAPE between the two, only the predictions for values that were present in the original dataset will affect the score when comparing different models. (Of course, more missing values will lead to a lower MAPE, but this is dataset-specific and the effect stays the same as long as you compare models on the same dataset.) To perform this replacement, you could use the TimeSeries.update
function before passing the prediction to the MAPE function.
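For example, here is a rough sketch of that replacement, going through pandas rather than TimeSeries.update (whose exact signature differs between Darts versions). The function name, the FILL_VALUE constant, and the assumption that both series share the same time index are mine, not part of the Darts API:

```python
from darts import TimeSeries
from darts.metrics import mape

FILL_VALUE = 1.0  # the same constant used to fill the gaps in the ground truth

def mape_ignoring_filled(actual: TimeSeries, pred: TimeSeries) -> float:
    """MAPE where points that were filler in the ground truth contribute zero error."""
    actual_pd = actual.pd_series()
    pred_pd = pred.pd_series()
    # Wherever the ground truth holds the filler value, overwrite the prediction
    # with that same value, so these points add 0% error to the average.
    # (Assumes no genuine observation happens to equal FILL_VALUE exactly.)
    mask = actual_pd == FILL_VALUE
    pred_pd[mask] = FILL_VALUE
    return mape(actual, TimeSeries.from_series(pred_pd))
```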
I hope that I understood your problem correctly and that the workaround I'm suggesting is understandable. If not, please don't hesitate to ask further questions. Also, if you can think of better ways to deal with this type of data we would be happy to hear your ideas! We welcome outside contributions to our code as well :)
Thanks @pennfranc - you've understood the problem exactly. That's a good idea to swap in identical data when performing the MAPE calculations over the 'artificial' periods, although it obviously doesn't fix the issue entirely.
One other idea is sample weights. Do any of the models in Darts support 'sample_weights' for observations perhaps? I haven't seen that in the documentation, but I'm not sure if I'm missing anything.
I appreciate the quick feedback and help with this somewhat esoteric issue. It's a common one in sports data (offseasons), but I appreciate it's not the main use case for time series forecasting. I'll look into the TimeSeries.update
functionality you mention in the meantime. Thanks.
@kmedved At this point none of the models support 'sample_weights' functionality. To me this does not seem like an easy feature to add, since training a forecasting model usually requires temporally continuous data points at a regular frequency. But if you can think of any solutions, please let us know! On the other hand, I could see 'sample_weights' being added as an optional parameter to metric functions. I will put this on our radar!
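For what it's worth, here is a sketch of what such an optional weighting could look like on the metrics side, computed by hand with numpy. weighted_mape is a hypothetical helper, not an existing Darts function:

```python
import numpy as np
from darts import TimeSeries

def weighted_mape(actual: TimeSeries, pred: TimeSeries, sample_weights: np.ndarray) -> float:
    """MAPE with per-observation weights; a weight of 0 ignores that timestamp entirely."""
    y_true = actual.pd_series().to_numpy()
    y_hat = pred.pd_series().to_numpy()
    ape = np.abs((y_true - y_hat) / y_true)  # assumes y_true contains no zeros
    return 100.0 * float(np.average(ape, weights=sample_weights))
```

Passing a weight of 0 for the filled 'offseason' timestamps and 1 everywhere else would then reproduce the "ignore the filled periods" behaviour discussed above without touching the predictions themselves.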
Hello - I recently discovered your library and it is very intriguing!
I frequently work with time series data that has regular breaks during which there are no observed measurements (e.g., during the winter). However, when I try the darts library on this data, I get NaN MAPE results, presumably because of these gaps, as the forecasting methods don't know what to do with the missing periods.
I've tried using
auto_fillna
to fill the gaps for these time periods, which has successfully gotten the package to work, but I'm concerned that the errors and calibrations are biased, since the models are now being fit on imputed data, which makes up ~50% of observations in many cases. So when I try to compare models, half the resulting MAPE score is a function of which model is best at handling the imputed data. Do you have a recommended workaround here? Any way to get the models to not score the time periods without actual observed data?
Thanks!