Closed: ig248 closed this pull request 4 years ago.
@ali-tny Indeed - if I use it as before, the validation predict will only be able to compute lags against timesteps seen during fit. This is "safe", but it means all lag values will be NaN. Even worse, the same issue appears at training time when using short (or non-temporal) batches. This is really a hack in place of manually adding lag columns to the input data. Importantly, I am only using it to lag features, not to lag the target - though there might be uses for that too.
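To illustrate the problem described above: when a lag is computed purely within a short batch, there is no in-batch history to draw on, so the lagged column comes out entirely NaN. A minimal pandas sketch (illustrative data, not the timeserio API):

```python
import pandas as pd

# A short 3-row "batch", e.g. a validation slice of a temperature feature.
batch = pd.DataFrame(
    {"temperature": [14.2, 14.8, 15.1]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# A 3-step lag computed within this batch alone has no history to
# resolve against, so every lagged value is NaN.
lagged = batch["temperature"].shift(3)
print(lagged.isna().all())  # True
```

This is exactly why a featurizer fitted only on the batch at hand produces all-NaN lag features on short batches.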
Merging #18 into master will increase coverage by 0.02%. The diff coverage is 100.00%.
```
@@            Coverage Diff             @@
##           master      #18      +/-   ##
==========================================
+ Coverage   88.80%   88.82%   +0.02%
==========================================
  Files          34       34
  Lines        1662     1665       +3
==========================================
+ Hits         1476     1479       +3
  Misses        186      186
```
| Impacted Files | Coverage Δ |
|---|---|
| timeserio/preprocessing/datetime.py | 96.96% <100.00%> (+0.07%) :arrow_up: |
Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 20c0013...8138201.
FYI I added a test for the `refit=False` behaviour into your commit.
Before this change, a `LagFeaturizer` would always respect causality and avoid data leakage (good). This meant, however, that on-the-fly validation with "safe" features such as historic temperature lags could not be performed. After this change, the `LagFeaturizer` can be explicitly pre-fitted on all available data for the feature in question, and set to "frozen" using e.g. `set_params(refit=False)`. This advanced use case requires extra care, but the change is introduced without altering default behaviour.
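The pre-fit-and-freeze pattern described above can be sketched as follows. This is a minimal illustrative implementation, not timeserio's actual `LagFeaturizer` code; the class and column names here are hypothetical, and only the `set_params(refit=False)` idiom is taken from the change itself:

```python
import pandas as pd

class FrozenLagFeaturizer:
    """Illustrative lag featurizer that can be pre-fitted on full history
    and then frozen with refit=False, so short batches still get lags."""

    def __init__(self, column, lag, refit=True):
        self.column = column
        self.lag = lag
        self.refit = refit
        self.history_ = None

    def set_params(self, **params):
        # Mirrors the scikit-learn-style set_params used in the PR.
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, df):
        # With refit=False, keep the history captured by an earlier fit
        # instead of re-fitting on the (possibly short) batch.
        if self.refit or self.history_ is None:
            self.history_ = df[[self.column]].copy()
        return self

    def transform(self, df):
        # Resolve lags against the stored full history, so a short
        # validation batch still receives non-NaN lag values.
        lagged = self.history_[self.column].shift(self.lag)
        out = df.copy()
        out[f"{self.column}_lag{self.lag}"] = lagged.reindex(df.index)
        return out

# Pre-fit on all available data, then freeze before validation:
idx = pd.date_range("2020-01-01", periods=6, freq="D")
full = pd.DataFrame(
    {"temperature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx
)
featurizer = FrozenLagFeaturizer("temperature", lag=2).fit(full)
featurizer.set_params(refit=False)

# A short 2-row validation batch: lags resolve against the full history.
val_batch = full.iloc[4:]
result = featurizer.transform(val_batch)
print(result["temperature_lag2"].tolist())  # [3.0, 4.0]
```

Without the freeze (i.e. with `refit=True` and a fresh `fit` on the 2-row batch), both lag values would be NaN, which is the "safe" default behaviour the PR leaves unchanged.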