The evaluation metrics carry intrinsic uncertainties when they are computed on finite samples.
Given the samples we receive, there are only a few places where random draws are used in the code, namely when up- or down-sampling because we are (for various reasons) working with more or fewer than 1000 samples. Bootstrapping over random seeds could be an option here.
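As a minimal sketch of what such a seed bootstrap could look like, assuming a hypothetical resampling step and a toy metric (the real code's metric, resampling scheme, and target ensemble size of 1000 are stand-ins here, not the actual implementation):

```python
# Sketch: quantify seed-induced spread in a metric, assuming the only
# randomness is the up-/down-sampling to a fixed ensemble size.
import numpy as np

def resample_to_size(samples: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Up- or down-sample by drawing `size` members with replacement (assumed scheme)."""
    return rng.choice(samples, size=size, replace=True)

def metric_spread(samples: np.ndarray, obs: float, metric, size: int = 1000, n_seeds: int = 100):
    """Evaluate the metric under many seeds and report mean and standard deviation."""
    values = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        values.append(metric(resample_to_size(samples, size, rng), obs))
    return float(np.mean(values)), float(np.std(values))

if __name__ == "__main__":
    # Toy example: 800 heavy-tailed samples (fewer than the 1000 target)
    # and mean absolute error of the ensemble members as a placeholder metric.
    rng = np.random.default_rng(0)
    samples = rng.lognormal(mean=0.0, sigma=1.5, size=800)
    mae = lambda ens, obs: float(np.mean(np.abs(ens - obs)))
    mean_val, std_val = metric_spread(samples, obs=1.0, metric=mae)
    print(f"metric = {mean_val:.3f} +/- {std_val:.3f} (spread over seeds)")
```

The seed spread then gives a lower bound on how many decimal places of a reported metric are actually meaningful.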
However, we could also think of the samples as draws from a parent distribution. As discussed in the workshop, 1000 samples might not be sufficient to properly describe the zero-inflated and heavy-tailed distributions many of you are working with. It would be possible to infer various distributions that could have yielded the samples we are given, and then bootstrap new samples from a range of those candidate distributions. It could also be interesting to explore how much the metrics change as we increase the sample size (at least for a few cases, to test the robustness of our sample-size choice).
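A hedged sketch of this parametric-bootstrap idea follows, assuming a zero-inflated lognormal as one of several plausible parent distributions and a tail quantile as the stand-in metric; both choices are illustrative assumptions, not what our evaluation actually uses:

```python
# Sketch: fit a candidate parent distribution (zero-inflated lognormal),
# then redraw synthetic sample sets of increasing size and watch how a
# tail statistic moves. Repeat with other candidate distributions to get
# a range of plausible metric values.
import numpy as np
from scipy import stats

def fit_zero_inflated_lognormal(samples):
    """Fit zero probability and a lognormal body to the nonzero part."""
    nonzero = samples[samples > 0]
    p_zero = 1.0 - nonzero.size / samples.size
    shape, _loc, scale = stats.lognorm.fit(nonzero, floc=0)
    return p_zero, shape, scale

def draw(p_zero, shape, scale, size, rng):
    """Draw one synthetic sample set from the fitted parent distribution."""
    values = stats.lognorm.rvs(shape, scale=scale, size=size, random_state=rng)
    values[rng.random(size) < p_zero] = 0.0
    return values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = draw(0.4, 1.2, 2.0, size=1000, rng=rng)  # stand-in for real data
    p_zero, shape, scale = fit_zero_inflated_lognormal(observed)
    for n in (1000, 5000, 20000):
        reps = [np.quantile(draw(p_zero, shape, scale, n, rng), 0.99) for _ in range(200)]
        print(f"n={n:6d}  q99 = {np.mean(reps):.2f} +/- {np.std(reps):.2f}")
```

If the spread of the metric shrinks slowly with n, that is direct evidence that 1000 samples are too few for the tail-sensitive metrics.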
An important caveat here is that, given the relatively short test period, we are not able to properly test the tail properties anyway. It would be useful to get some idea of how far into the tail the test window(s) we are working with can reasonably be applied (I'm lacking the proper language here, but I hope you understand).
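One crude way to put a number on this, as a back-of-the-envelope sketch: with n effectively independent points in the test window, the empirical distribution cannot resolve quantiles much beyond about 1 - 1/(n + 1). The window lengths below are made-up illustrations, not our actual setup:

```python
# Sketch: plotting-position style bound on the highest quantile a test
# window of n effectively independent points can speak to. This ignores
# autocorrelation, which would reduce the effective n further.
def max_resolvable_quantile(n_points: int) -> float:
    return n_points / (n_points + 1)

for n in (90, 365, 3 * 365):
    q = max_resolvable_quantile(n)
    print(f"n={n:5d}  max quantile ~ {q:.4f}  (return period ~ {1 / (1 - q):.0f} steps)")
```

Anything further into the tail than this bound is extrapolation from the fitted distributions rather than something the test window can verify.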