ottonemo opened this issue 5 years ago
First of all, this is not restricted to using the net as a transformer; even in the classical use case this could come back to bite you. Moreover, this issue can easily happen with vanilla sklearn or other such frameworks as well.
I wonder if a separate `iterator_eval` helps. What would prevent the user from using `MyFancyLoader` on `iterator_eval` that wouldn't also prevent them from using it on `iterator_valid`? If they have this special loader, chances are that predicting wouldn't work correctly without it.

Perhaps we can solve this better with better documentation. We could also issue a warning when we detect `iterator_valid__shuffle=True`. However, if the user wants to use a special loader that shuffles the data, I'm afraid we cannot do much about it.
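Such a warning could be a small check at fit time. A minimal sketch, assuming the setting can be looked up on the net object (the helper name and the attribute lookup are illustrative, not actual skorch API):

```python
import warnings

def warn_if_valid_shuffles(net):
    # Hypothetical helper: skorch routes `iterator_valid__shuffle=True`
    # through to the valid DataLoader; here we just look the setting up
    # on the net and warn, since a shuffled valid/eval iterator silently
    # reorders the output of predict/transform relative to y.
    if getattr(net, "iterator_valid__shuffle", False):
        warnings.warn(
            "iterator_valid__shuffle=True reorders predictions relative "
            "to y; consider shuffle=False for the valid iterator.",
            UserWarning,
        )
```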
You are correct; especially

> If they have this special loader, chances are that predicting wouldn't work correctly without it.

is a good point. I am not sure there is a big issue with the train/valid loaders, but I would still argue that an eval loader would be helpful in certain situations.
I think my argument stems from the specific problem I was solving, so maybe it helps the discussion if I outline it. The task was a metric learning problem where, at training and validation time, triples of `(anchor, twin, nontwin)` are processed. The model lives in a sklearn pipeline and is used to transform/embed the data for a KNN classifier during inference. There is a duality here, since during inference there are no triples (there is no explicit `y` to find twins, obviously).
People(tm) came up with the following ways of modeling this problem:

1. Keep the duality in the sampler: use the `DataLoader` without the sampler in `transform()`. The dataset only needs to implement the methods required by the sampler to retrieve triples and has an otherwise standard `__getitem__`; the duality is contained in the sampler instance.
2. Keep the duality in the dataset: it builds triples in `__getitem__` and acts as a standard skorch dataset when no `y` is present.

Basically, both approaches use the loader/dataset infrastructure to generate data that depends on `y`, which is why this is not easily implemented as a sklearn pipeline element (and that wouldn't support streaming, which is relevant since the triple combinations may not fit into RAM).
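For concreteness, the second approach could look roughly like this: a pure-Python sketch (class and argument names are made up), where the triple logic is active only when `y` is present:

```python
import random

class TripletDataset:
    """Hypothetical sketch of approach 2: the duality lives in the dataset.

    With y given, __getitem__ returns (anchor, twin, nontwin) triples;
    without y (the transform/predict path), it behaves like a plain
    skorch-style dataset.
    """

    def __init__(self, X, y=None, seed=0):
        self.X, self.y = X, y
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        if self.y is None:  # inference: no y, hence no triples
            return self.X[i]
        anchor, label = self.X[i], self.y[i]
        twins = [j for j in range(len(self.X)) if self.y[j] == label and j != i]
        others = [j for j in range(len(self.X)) if self.y[j] != label]
        twin = self.X[self.rng.choice(twins)]
        nontwin = self.X[self.rng.choice(others)]
        return anchor, twin, nontwin
```

Note that the streaming concern stays: the triple combinations are generated on the fly per index, never materialized in full.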
If there was an `iterator_eval`, one could change `iterator_train` and `iterator_valid` to something that samples the way the training process needs while retaining 'standard' behavior for cases where no transformation of the input is necessary. I am not sure, though, whether it wouldn't be best to fit these things separately and have a 'transfer' step where the trained model from pipeline 1 is moved to a `NeuralNetTransformer` in pipeline 2 with standard data loading.
So I couldn't completely understand each argument; I probably would have to implement this myself or see some code. But let me still make a proposal: let's add a script or notebook for training a siamese or triplet net with skorch, trying to re-use as much as possible from pytorch and following best practices. This article and the linked sources could be helpful. Then we may find out if and how we need to change skorch to make this use case possible.
The first thing that comes to my mind, though, is to put the targets inside of `X` so that they are available to a possible transformer. If the target is not given, the `DataLoader` or `Dataset` would just return the input as is (the sampler would be the most natural fit).
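A minimal sketch of that idea (the dict keys and the helper name are made up): pack `y` into `X` during `fit` so that a downstream dataset or sampler can see it, and pass the input through untouched when no target is given:

```python
def pack_targets(X, y=None):
    # Hypothetical helper: with a target, each row becomes a dict that a
    # custom Dataset/sampler can use to build triples; without one
    # (the transform/predict path), the input is returned as is.
    if y is None:
        return X
    return [{"data": x, "target": t} for x, t in zip(X, y)]
```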
As said earlier, I would also stack the triplets, so that the module doesn't need to do any special treatment. Then one could maybe override `get_loss` to unstack the data before passing it to `TripletMarginLoss`.
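The unstacking itself is just splitting the batch into three equal chunks. A toy sketch with plain slicing standing in for `torch.chunk` (the helper name is made up), which an overridden `get_loss` could apply to `y_pred` before calling the criterion:

```python
def unstack_triplets(batch):
    # The module sees triplets stacked along the batch dimension:
    # [anchor block | twin block | nontwin block]. Splitting into three
    # equal chunks recovers the three inputs for TripletMarginLoss; on a
    # tensor, torch.chunk(y_pred, 3, dim=0) does the same thing.
    n = len(batch) // 3
    return batch[:n], batch[n:2 * n], batch[2 * n:]
```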
Would this already be helpful, or am I missing something?
For context: I am the People[tm] who implemented the Siamese network, in a straightforward fashion that mostly has custom code for everything not re-used directly from skorch. That triggered @ottonemo to adapt from it (and then to debug) his version, which tries to re-use as much of skorch as possible. So we already have two metric learning implementations following slightly different styles.
There are three contexts in which you need to batch data:

1. training data during training,
2. heldout (validation) data during training,
3. new data at inference time (`predict`/`transform`).
I'd argue that in (3.) you always want the most boring thing, which is to batch up the instances and put them through model inference, without invoking anything fancy like dataset augmentation or reordering. Looking at AllenNLP, this is how it's done there: `allennlp evaluate` (get metrics on the testing set) uses the `DataLoader` that's also used for heldout data during training, whereas `allennlp predict` (create a JSON file with predictions) batches up the instances and feeds them through the predictor. [One way of realizing this would be to have a separate `iterator_eval`, as suggested by Benjamin.]
The current skorch dataset already contains both `X` and `y`, so there's no disconnect to be feared, except when shuffling is used in (3.) and breaks sklearn's contract of leaving the ordering of the data intact. In that sense I'd change the title of the issue to "reordering in `transform()` or `predict_proba()` breaks sklearn contract".
As an example where we'd want to reorder the heldout dataset in training (i.e. 2.), consider sorting examples by sequence length so you don't have too much padding. Depending on the length of the heldout dataset, this can be a welcome optimization.
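That optimization amounts to ordering the heldout indices by length before batching. A small sketch (the function name is an assumption):

```python
def length_sorted_batches(seqs, batch_size):
    # Sort indices by sequence length so each batch pads to roughly the
    # same length; only the batch order changes, which is harmless when
    # computing validation metrics during training.
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```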
Thank you for explaining the situation so well.
One philosophy of skorch is that the training, validation, and actual prediction procedures should be as close to each other as possible, just as in sklearn. Unnecessary code duplication (or rather near-duplication) is a frequent source of bugs I see in the wild, where the model does something slightly different when predicting than it does when training.
But as you have laid out, there are rare occasions where you'd want three different procedures. Another philosophy we have is to make the common case easy and the special case possible. Perhaps a compromise could be: leave the default as it is and introduce a new `iterator_eval` (or whatever name) that is `None` by default, in which case it just acts like `iterator_valid`. If not `None`, the provided loader is used instead. If we do this, we would need to make sure that those 3 paths are clearly delimited in the code.
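The fallback could be a one-line lookup at the point where the net builds its prediction iterator. A sketch of the proposed semantics (`iterator_eval` is the proposed, not an existing, parameter):

```python
def get_eval_iterator(net, dataset):
    # Proposed semantics: if iterator_eval is unset (None), predict and
    # transform behave exactly as today and fall back to iterator_valid;
    # otherwise the dedicated eval loader takes over.
    iterator = getattr(net, "iterator_eval", None)
    if iterator is None:
        iterator = net.iterator_valid
    return iterator(dataset)
```

Keeping the default at `None` means existing code sees no behavior change at all.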
There is a possible disconnect between the ordering of `y` and `X` when using a skorch net as a transformer in a sklearn pipeline while enabling shuffling on the valid iterator, which is a very easy mistake to make (by coincidence). To explain the basic issue, consider the following setup:

The ordering of `X_valid` as embedded by `TransformingNet` will not correlate with the ordering of `y_valid`, since the data gets shuffled by `iterator_valid`. Normally there is no need to shuffle the validation data, which is why I wouldn't think of this as a big issue. However, this behavior might not be as obvious all the time. Consider the following scenario:

This code hides the ordering disconnect pretty well, and I spent a couple of hours hunting for a bug that was pretty similar to this.
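The failure mode can be boiled down to a toy stand-in for the transformer (no skorch needed; all names here are illustrative):

```python
import random

def transform_like_net(X, shuffle, seed=0):
    # Stand-in for TransformingNet.transform(): it emits the "embedded"
    # rows in the iterator's order. With shuffle=True (the effect of
    # iterator_valid__shuffle=True), row i of the output no longer
    # corresponds to y_valid[i], yet no error is raised anywhere.
    idx = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(idx)
    return [X[i] for i in idx]

X_valid = [10, 20, 30, 40]
assert transform_like_net(X_valid, shuffle=False) == X_valid  # order intact
shuffled = transform_like_net(X_valid, shuffle=True)
assert sorted(shuffled) == sorted(X_valid)  # same rows, alignment with y lost
```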
I think that this (among other things) warrants a separate `iterator_eval` which is completely disconnected from the training process.