Open hyh957947142 opened 1 year ago
Hi @hyh957947142 and thanks for raising this.
I do agree that there are cases which could benefit from random sampling, especially when we are still interested in the entire historic time range of the data and want to reduce training time.
First thing to think about: do we want to add an additional parameter to fit(), @hrzn? I personally wouldn't mind for this one.
I could imagine something like this:
```python
ts_sampling_method: str = "latest"
```
which accepts either `"latest"` or `"uniform"`. We would always sample `n=max_samples_per_ts` samples. Let me know what you think.
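A minimal, framework-free sketch of what the two modes could mean (the `sample_indices` helper and its `method` parameter are hypothetical, not Darts API): given `n_available` candidate training windows in a series, `"latest"` keeps only the most recent ones, while `"uniform"` draws them over the whole history.

```python
import random

def sample_indices(n_available, max_samples, method="latest", rng=None):
    """Pick which of the n_available candidate training windows of one
    series to keep. "latest" mimics the current Darts behaviour (most
    recent windows only); "uniform" draws windows uniformly over the
    whole history."""
    max_samples = min(max_samples, n_available)
    if method == "latest":
        # keep the last max_samples window indices
        return list(range(n_available - max_samples, n_available))
    if method == "uniform":
        # draw max_samples distinct indices uniformly over all windows
        rng = rng or random.Random()
        return sorted(rng.sample(range(n_available), max_samples))
    raise ValueError(f"unknown sampling method: {method}")
```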
I agree, it's a good idea, and I like your proposal @dennisbader. Let's add this one to the backlog :) @hyh957947142 if you made it work locally in your case, would you be willing to open a PR?
PS: Note that you can already make it work without changing Darts code (although it's a little more laborious) by creating your own TrainingDataset. See for instance here for an example. You can then call fit_from_dataset(my_dataset) instead of fit().
Thank you @hrzn @dennisbader very much for taking my suggestion into consideration. My local implementation uses a tricky way to achieve uniform sampling, which is not suitable for integration into Darts.
```python
from typing import Optional, Sequence, Tuple, Union
import random

import numpy as np

from darts import TimeSeries
from darts.utils.data import DualCovariatesSequentialDataset
from darts.utils.data.shifted_dataset import GenericShiftedDataset
from darts.utils.data.training_dataset import MixedCovariatesTrainingDataset
from darts.utils.data.utils import CovariateType


class TrainDataSet(MixedCovariatesTrainingDataset):
    def __init__(
        self,
        target_series: Union[TimeSeries, Sequence[TimeSeries]],
        past_covariates: Optional[Union[TimeSeries, Sequence[TimeSeries]]] = None,
        future_covariates: Optional[Union[TimeSeries, Sequence[TimeSeries]]] = None,
        input_chunk_length: int = 12,
        output_chunk_length: int = 1,
        max_samples_per_ts: Optional[int] = None,
        use_static_covariates: bool = True,
    ):
        """Like MixedCovariatesSequentialDataset, but reports
        max_samples_per_ts as its length and serves uniformly random
        samples instead of only the most recent ones."""
        super().__init__()
        self.length = max_samples_per_ts  # (###### i changed here ##########)

        # This dataset is in charge of serving past covariates
        self.ds_past = GenericShiftedDataset(
            target_series=target_series,
            covariates=past_covariates,
            input_chunk_length=input_chunk_length,
            output_chunk_length=output_chunk_length,
            shift=input_chunk_length,
            shift_covariates=False,
            max_samples_per_ts=None,  # (###### i changed here ##########)
            covariate_type=CovariateType.PAST,
            use_static_covariates=use_static_covariates,
        )

        # This dataset is in charge of serving historical and future future covariates
        self.ds_dual = DualCovariatesSequentialDataset(
            target_series=target_series,
            covariates=future_covariates,
            input_chunk_length=input_chunk_length,
            output_chunk_length=output_chunk_length,
            max_samples_per_ts=None,  # (###### i changed here ##########)
            use_static_covariates=use_static_covariates,
        )

    def __len__(self):
        return self.length  # (###### i changed here ##########)

    def __getitem__(
        self, idx
    ) -> Tuple[
        np.ndarray,
        Optional[np.ndarray],
        Optional[np.ndarray],
        Optional[np.ndarray],
        Optional[np.ndarray],
        np.ndarray,
    ]:
        # ignore the incoming idx and draw a uniformly random sample instead
        idx = random.randint(0, len(self.ds_past) - 1)  # (###### i changed here ##########)
        past_target, past_covariate, static_covariate, future_target = self.ds_past[idx]
        _, historic_future_covariate, future_covariate, _, _ = self.ds_dual[idx]
        return (
            past_target,
            past_covariate,
            historic_future_covariate,
            future_covariate,
            static_covariate,
            future_target,
        )
```
This is how I implemented TFTModel's training dataset. I made a modification based on darts.utils.data.sequential_dataset.MixedCovariatesSequentialDataset, and marked each modified place with a comment like `(###### i changed here ##########)`.
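The trick generalizes beyond Darts. Here is a toy, pure-Python version (the `UniformResample` name is made up, not Darts API) showing why reporting a fixed length while ignoring the incoming index gives exactly `num_samples` uniform draws per epoch:

```python
import random

class UniformResample:
    """Toy stand-in for the trick above: report a fixed epoch length but
    serve uniformly random items from the wrapped dataset, ignoring the
    index a DataLoader would pass in."""

    def __init__(self, base, num_samples, seed=None):
        self.base = base
        self.num_samples = num_samples
        self.rng = random.Random(seed)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # idx is ignored on purpose: every access is a fresh uniform draw
        return self.base[self.rng.randrange(len(self.base))]

# a "series" with 100_000 candidate windows, but only 5 samples per epoch
ds = UniformResample(list(range(100_000)), num_samples=5, seed=0)
epoch = [ds[i] for i in range(len(ds))]
```

One caveat of this design: sampling happens inside `__getitem__`, so two epochs see different windows, and reproducibility depends on seeding the worker processes.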
Thanks @hyh957947142. Leveraging another dataset (e.g. another MixedCovariates dataset) which we query with a random idx seems like a good idea, at least as a starting point.
As the title says, max_samples_per_ts always samples the most recent windows. For a very long time series this means only the last small segment is ever sampled. I want the dataset to be sampled randomly, with a configurable number of samples, instead of always taking just the most recent ones. GluonTS's sampler, for example, lets you set batch_size and num_batches_per_epoch together with uniform sampling. It doesn't have to work exactly like GluonTS; I just don't want to sample every batch of a very long sequence with stride one, nor only the last small segment of the sequence.
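A rough sketch of the GluonTS-style behaviour described above (the parameter names `batch_size` and `num_batches_per_epoch` are borrowed from GluonTS; the function itself is hypothetical): the epoch size is decoupled from the series length, and window indices are drawn uniformly over the whole history.

```python
import random

def uniform_epoch(n_windows, batch_size, num_batches_per_epoch, seed=None):
    """Build one epoch of batches by drawing
    batch_size * num_batches_per_epoch window indices uniformly over all
    n_windows candidate windows, regardless of series length."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_windows) for _ in range(batch_size)]
        for _ in range(num_batches_per_epoch)
    ]

# a million-window series still yields a small, fixed-size epoch
batches = uniform_epoch(1_000_000, batch_size=32, num_batches_per_epoch=10, seed=0)
```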
Although I can now achieve what I want by modifying the source code, I still hope that Darts will implement this officially.
Finally, does Darts already implement this and I simply haven't found it?