(Proposing fixes) Adding sampler arg to the data loader causes bugs in get_idxs() function

ognjenantonijevic commented 1 year ago

Hi, first of all, thanks for the awesome package, and really helpful tutorials and docs!

The problem is the following: I am performing time series classification with imbalanced classes, so I wanted to test stratified batch sampling. I created two StratifiedSamplers from the ys of the training and validation sets and passed them as sampler=[sampler_train, sampler_valid] to the get_ts_dls function.

Problem 1) The output of https://github.com/timeseriesAI/tsai/blob/main/tsai/data/core.py#L565 is a 2D array with shape (1, len(y)), which causes errors when training the learner. The fix would be to append [0] to that line, dropping the extra leading dimension of the returned numpy array.
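To illustrate the shape issue in Problem 1, here is a minimal NumPy sketch. It does not reproduce the tsai line linked above; the variable names are hypothetical and the indices are made up. It only shows how wrapping a sequence of indices in an extra list gives a (1, n) array, and how [0] recovers the flat one.

```python
import numpy as np

# Hypothetical stand-in for the indices a sampler would yield.
sampler_indices = [3, 0, 4, 1, 2, 5]

# Wrapping the indices in an extra list produces a 2D array of shape (1, n).
idxs_2d = np.array([sampler_indices])

# Indexing with [0] drops the leading dimension and gives the flat (n,) array.
idxs_1d = idxs_2d[0]

print(idxs_2d.shape, idxs_1d.shape)
```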

Problem 2) That only partly fixes the issue. A second problem occurs when using the trained model for inference on new data with learn.get_X_preds(), which builds a new data loader from the new data > https://github.com/timeseriesAI/tsai/blob/main/tsai/inference.py#L18

This new data loader is then passed to the get_preds() function of fastai's Learner, which uses the data loader's get_idxs() method > https://github.com/fastai/fastai/blob/master/fastai/learner.py#L294

As a result, inference always returns the same number of results, equal to the len(y) used when constructing the StratifiedSampler > https://github.com/timeseriesAI/tsai/blob/main/tsai/data/core.py#L820

Fix: add **kwargs to the function definition at https://github.com/timeseriesAI/tsai/blob/main/tsai/data/core.py#L533 and to the function call at https://github.com/timeseriesAI/tsai/blob/main/tsai/data/core.py#L538

and add sampler=None to the function call at https://github.com/timeseriesAI/tsai/blob/main/tsai/inference.py#L17
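The behaviour in Problem 2 can be sketched with a toy example. This is not tsai or fastai code: FixedLengthSampler and this get_idxs function are hypothetical stand-ins, assumed only to mimic the situation described above, where a sampler built from the training labels is reused on new data.

```python
class FixedLengthSampler:
    """Toy sampler whose length is fixed by the labels it was built from."""

    def __init__(self, y):
        self.n = len(y)

    def __iter__(self):
        return iter(range(self.n))

    def __len__(self):
        return self.n


def get_idxs(dataset, sampler=None):
    """Toy index selection: a sampler, if given, overrides the dataset length."""
    if sampler is not None:
        return list(sampler)
    return list(range(len(dataset)))


train_y = [0, 0, 0, 1, 1, 0, 0, 1]   # 8 training labels
sampler = FixedLengthSampler(train_y)
new_data = ["a", "b", "c"]           # only 3 samples at inference time

# Reusing the training sampler yields 8 indices, regardless of the new data.
idxs_reused = get_idxs(new_data, sampler=sampler)

# Passing sampler=None, as proposed, yields one index per new sample.
idxs_clean = get_idxs(new_data, sampler=None)

print(len(idxs_reused), len(idxs_clean))
```

In this toy version the reused sampler also produces indices that are out of range for the new data, which is one more reason to drop it at inference time.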

ognjenantonijevic commented 1 year ago

Or I'll just create a PR in the next few days when I find the time.

oguiza commented 1 year ago

Hi @ognjenantonijevic, there are 3 approaches already available in tsai to handle target imbalance.

The first is to pass individual sample weights when building the dataloaders:
```
get_ts_dls(X, y=None, splits=None, sel_vars=None, sel_steps=None, tfms=None, inplace=True,
           path='.', bs=64, batch_tfms=None, num_workers=0, device=None, shuffle_train=True, drop_last=True, 
           weights=None, partial_n=None, sampler=None, sort=False, **kwargs)
```
This way you can assign any weight you want per class or per sample, but remember that len(y) must be equal to len(weights). These weights are passed to the dataloader. Samples with a higher weight will be picked more often by the dataloader when creating batches. The number of iterations stays the same; what changes is the probability that a given sample is picked.

The second option is to modify the splits to balance them. You can use this:
```
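train_split, valid_split = splits
# Indices into the training subset, resampled so that the classes are balanced
balanced_idxs = balance_idx(y[train_split])
# Map back to indices into the full dataset; the validation split is unchanged
new_train_split = train_split[balanced_idxs]
new_splits = (new_train_split, valid_split)
new_splits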
```
This approach ensures all samples are selected at least once per epoch, at the expense of increasing the iterations per epoch.

The third approach is to use a weighted loss function (like torch.nn.CrossEntropyLoss(weight=None)). There are several loss functions that allow you to pass weights. In this case, the weight is a class weight. You can get those weights by using:
```
loss_fn = torch.nn.CrossEntropyLoss(weight=dls.cws)
```
Then pass loss_fn to the learner.
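As a rough illustration of the difference between per-class weights (third approach) and per-sample weights (first approach), here is a minimal pure-Python sketch. It assumes a simple inverse-frequency scheme, which is not necessarily how dls.cws is computed in tsai; the labels and variable names are made up for the example.

```python
from collections import Counter

# Made-up imbalanced labels: 6 of class "a", 3 of "b", 1 of "c".
y = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

counts = Counter(y)
n_classes = len(counts)

# Inverse-frequency class weights (an assumed scheme, not tsai's formula):
# a perfectly balanced dataset would give every class a weight of 1.0.
class_weights = {label: len(y) / (n_classes * n) for label, n in counts.items()}

# Per-sample weights: each sample inherits the weight of its class,
# so len(sample_weights) == len(y) as required for the weights argument.
sample_weights = [class_weights[label] for label in y]

# With these weights, every class carries the same total weight.
mass_per_class = {
    label: sum(w for w, l in zip(sample_weights, y) if l == label)
    for label in counts
}
print(class_weights)
print(mass_per_class)
```

The per-class dictionary is the kind of input a weighted loss expects (one value per class), while the per-sample list is the kind of input a weighted sampler expects (one value per sample).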

Based on all this, I'm not sure that the sampler is needed to handle target imbalance.

oguiza commented 1 year ago

Hi @ognjenantonijevic, Could you please let me know if you are still planning to submit a PR? Or should we close this issue?

oguiza commented 1 year ago

Closing this issue due to lack of activity and progress. If necessary, please create a new one.

timeseriesAI / tsai, issue #626