Possible disconnect between X and y values

ottonemo commented 5 years ago

There is a possible disconnect between ordering of y and X when using a skorch net as transformer in a sklearn pipeline while enabling shuffling on the valid iterator which is a very easy mistake to make (by coincidence).

To explain the basic issue, consider the following setup:

class TransformingNet(NeuralNet):
    def transform(self, X):
        return forward(X)

net = TransformingNet(module, iterator_valid__shuffle=True)
knn = KNearestNeighbor(n_neighbor=3)

pipe = Pipeline(net, knn)
pipe.fit(X_train, y_train)
pipe.predict(X_valid, y_valid)

The ordering of the embedded X_valid by TransformingNet will not correlate with the ordering of y_valid since the data gets shuffled by iterator_valid. Normally there is no need to shuffle the validation data which is why I wouldn't think of this as a big issue. However this behavior might not be as obvious all the time. Consider the following scenario:

class MyFancySampler(Sampler):
    """Basically torch.utils.data.sampler.RandomSampler"""
    def __init__(self, dataset):
        self.dataset = dataset
    def __iter__(self):
        return (i for i in torch.randperm(len(self.dataset))

class MyFancyLoader(DataLoader):
    def __init__(self, ds, **kw):
        kw['sampler'] = MyFancySampler(ds)
        super().__init__(ds, **kw)

    def __iter__(self):
        # ... <intricate iteration code> ...

net = TransformingNet(
          module, 
          iterator_train=MyFancyLoader,
          iterator_valid=MyFancyLoader,
)

This code hides the sorting disconnect pretty well and I spent a couple of hours huntin for a bug that was pretty similar to this.

I think that this (among other things) warrants a separate iterator_eval which is completely disconnected from the training process.

BenjaminBossan commented 5 years ago

First of all, this is not restricted to using the net as a transformer, even for the classical use, this could come and bite you. Furthermore, this is an issue that can easily happen with vanilla sklearn or other such frameworks.

I wonder if a separate iterator_eval helps. What would prevent the user from using MyFancyLoader on iterator_eval that wouldn't prevent the user from using it on iterator_valid? If they have this special loader, chances are that predicting wouldn't work correctly without it.

Perhaps we can solve this better with better documentation. We could also issue a warning when we detect iterator_valid__shuffle=True. However, if the user wants to use a special loader that shuffles the data, I'm afraid we cannot do much about it.

ottonemo commented 5 years ago

You are correct, especially

If they have this special loader, chances are that predicting wouldn't work correctly without it.

is a good point. I am not sure there is a big issue with the train/valid loader but I would still argue that an eval loader would be helpful in certain situations.

I think my argument stems from the specific problem I was solving. Maybe it helps the discussion if I outline my problem. The task I was solving was a metric learning problem where, at training and validation time, triples of (anchor, twin, nontwin) are processed. The model lives in a sklearn pipeline and is used to transform/embed the data for a KNN classifier during inference. There is a duality here since during inference there are no triples (since there is no explicit y to find twins, obviously).

People(tm) came up with the following ways of modeling this problem:

Use the data loader / dataset structure: define a custom sampler that creates the triples from a custom dataset, use the standard Dataloader without the sampler in transform() - the dataset only needs to implement methods required by the sampler to retrieve triples and has an otherwise standard __getitem__, the duality is contained in the sampler instance
Use a custom dataset: build a dataset that creates triples in __getitem__ and acts as a standard skorch dataset when no y is present (duality resides in dataset only)

Basically, both approaches use the loader / dataset infrastructure to generate data that depends on y which is why it is not easily implemented as a sklearn pipeline element (and it wouldn't support streaming which is kind of relevant since the triple combinations may not fit into RAM).

If there was an eval_iterator one could change train_ and valid_iterator to something that is sampling the way the training process needs to be while retaining 'standard' behavior for cases where there is no transformation necessary to the input. I am not sure though if it wouldn't be best to fit these things separately and have a 'transfer' step where the trained model from pipeline 1 is moved to a NeuralNetTransformer in pipeline 2 with standard data loading.

BenjaminBossan commented 5 years ago

So I couldn't completely understand each argument, I probably would have to implement this myself or see some code. But let me still make a proposal: Let's add a script or notebook for training a siamese or triplet net with skorch while trying to re-use as much as possible from pytorch and using best practices. This article and the linked sources could be helpful. Then we may find out if and how we need to change skorch to make this use case possible.

The first thing that would come to my mind though is to put the targets inside of X so that they are available to a possible transformer. If the target is not given, the Dataloader or Dataset would just return the input as is (the sampler would be the most natural fit).

As said earlier, I would also stack the triplets, so that the module doesn't need to do any special treatment. Then one could maybe override get_loss to unstack the data before passing it to TripletMarginLoss.

Would this already be helpful or do I miss something?

yv commented 5 years ago

For context: I am the People[tm] who implemented the Siamese network (in a straightforward fashion that mostly has custom code for everything that's not re-used directly from Skorch, triggering @ottonemo to adapt from this, and then having to debug, his version that tries to re-use as much of Skorch as possible). So we already have two metric learning implementations following slightly different styles.

There are three contexts in which you need to batch data for inference:

inside fit(), in the training loop - best practice here is to shuffle the data
inside fit(), in evaluating on heldout data - shuffling doesn't help or hurt here
inside transform() or predict(), creating predictions on new results, shuffling is definitely not wanted here (in words: NEVER, this would break the contract for an sklearn component).

I'd argue that in (3.) you always want the most boring thing which is to batch up the instances and put them through the model inference. without invoking anything fancy like dataset augmentation or reordering. Looking at AllenNLP, this is how it's done there: "allennlp evaluate" (get metrics on the testing set) uses the DataLoader that's also used for heldout data during training, whereas "allennlp predict" (create a JSON file with prediction) batches up the instances and feeds them through the predictor. [One way of realizing this would be to have a separate iterator_eval as suggested by Benjamin]

The current Skorch dataset already contains both X and y, so there's no disconnect to be feared except when shuffling is used in (3.) and breaks sklearn's contract of leaving the ordering of the data intact - in that sense I'd change the title of the issue to "reordering in transform() or predict_proba() breaks sklearn contract"

As an example where we'd want to reorder the heldout dataset in training (i.e. 2.), consider sorting examples by sequence length so you don't have too much padding. Depending on the length of the heldout dataset, this can be a welcome optimization.

BenjaminBossan commented 5 years ago

Thank you for explaining the situation so well.

One philosophy of skorch is that we should try to have the training, validation, and the actual prediction procedure be as close as possible, the same as sklearn does. Having unnecessary code duplication (or rather almost duplication) is a frequent source of bugs I see in the wild, where the model does something slightly different when predicting than it does when training.

But as you have laid out, there are rare occasions where you'd want to have 3 different procedures. Another philosophy we have is to make the common case easy and the special case possible. Perhaps a compromise could be: Leave the default as it is. Introduce a new iterator_eval (or whatever name) that is None by default, in which case it just acts like iterator_valid. If not None, the provided loader is used instead. If we do this, we would need to make sure that those 3 paths are clearly delimited in the code.

skorch-dev / skorch

Possible disconnect between X and y values #489