skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch
BSD 3-Clause "New" or "Revised" License

skorch.fit can't handle lists of lists with variable length #605

Closed econti closed 3 years ago

econti commented 4 years ago

I'm having a hard time figuring out how to pass a list of lists (with variable length) to skorch's fit method.

Specifically, I have a feature that is a list of IDs (e.g. [[1, 12, 3], [6, 22], ...]), which is converted to a dense representation using an embedding table in my PyTorch module's forward method:

def forward(self, X_float, X_id_list):
    ...

When I call net.fit() on my data set (e.g. {"X_float": ..., "X_id_list": ...}), I get the following error caused by the list of lists:

ValueError: Dataset does not have consistent lengths.

I've also tried converting the list of lists to a pandas dataframe and numpy array (of objects) and neither works. How do you handle variable length lists of lists in skorch.fit?

cgarciae commented 4 years ago

Don't know about the specifics of this in skorch, but generally you need to add padding / perform slicing so every sample has the same length. The only exception to this is TensorFlow's Ragged Tensors, but even then you have to specify a default value to pad with when converting to regular tensors (PyTorch doesn't have Ragged Tensors yet).

BenjaminBossan commented 4 years ago

@econti Could you check whether PackedSequence solves your issue?

Otherwise, we have an example here that shows how to potentially deal with variable length sequences.
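For reference, here is a minimal sketch (with made-up example data) of padding and packing variable-length ID lists; the padded tensor can go through an embedding layer, and the packed sequence lets an RNN skip the padded positions:

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Two ID lists of different lengths, as in the question above.
seqs = [torch.tensor([1, 12, 3]), torch.tensor([6, 22])]
lengths = torch.tensor([len(s) for s in seqs])

# Pad to the longest sequence so everything fits into one tensor ...
padded = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (2, 3)

# ... and/or pack it so that an LSTM/GRU ignores the padded positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)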

econti commented 4 years ago

Thanks @BenjaminBossan, that did the trick for me. Leaving a code snippet here for anyone else who encounters a similar issue:

# data["X_id_list"] is a pandas dataframe that hold variable length lists of lists, e.g.
# [[1, 3], [0, 40, 16], ...]

X_id_list = {}

for series_name, series in data["X_id_list"].iteritems():
    pre_pad = [torch.tensor(i) for i in series]
    X_id_list[series_name] = pad_sequence(
        pre_pad, batch_first=True, padding_value=0
    )
BenjaminBossan commented 4 years ago

@econti Great that you found a solution and thanks for the snippet.

ToddMorrill commented 4 years ago

I'm facing a similar issue right now, and I suspect I'm doing the same thing you are: padding to the longest sequence length in the whole dataset, which requires significantly more computation than padding at the batch level. I suspect we need something like a collate_fn that operates at the batch level to solve this the right way.

BenjaminBossan commented 4 years ago

@ToddMorrill I don't know the exact details of your case, so maybe I'm missing something. In general though, collate_fn is designed to work on samples, not on batches. If you want to avoid any costly operation on each sample, you would have to provide your own DataLoader. You can pass it as iterator_train and iterator_valid to NeuralNet in skorch.

However, this is not the canonical way of dealing with sequences of different lengths. Maybe you can make use of PackedSequence or pad_sequence.

ToddMorrill commented 4 years ago

"A custom collate_fn can be used to customize collation, e.g., padding sequential data to max length of a batch." Source That's what I'm trying to do.

Thanks for pointing me toward NeuralNet. I was using NeuralNetClassifier and totally missed the opportunity to use a custom DataLoader. I'll give that a shot.

I'm not opposed to using pad_sequence, it's just that I got started with torchtext and it was already doing a fantastic job taking care of all my text preprocessing needs, including padding, so I didn't want to rewrite that functionality.

ToddMorrill commented 4 years ago

To be sure, I'm trying to reuse the following torchtext code with skorch.

import torchtext
from torchtext import data
from torchtext import datasets

# set up fields
TEXT = data.Field(lower=True, batch_first=True, )
LABEL = data.Field(sequential=False, unk_token=None)

# takes approx. 10 minutes to download data and embeddings (will be cached for re-use)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# will be used to initialize model embeddings layer
vocab = torchtext.vocab.GloVe(name='6B', dim=100)

# build the vocabulary
max_size = 25_000 # shorten for demonstrative purposes
TEXT.build_vocab(train, vectors=vocab, max_size=max_size)
LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_sizes=(32, 64), device='cpu')

So far, I haven't found a way to reuse train_iter with Skorch. train_iter is used in a for loop and yields batches of data padded to the longest length sequence in the batch. It also buckets batches by sequence length to reduce computation. Each batch has a .text and a .label attribute that contain the numericalized data and label representation, respectively.
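For context, this is roughly how train_iter from the snippet above is consumed in a plain PyTorch loop (shapes assume the batch_first=True Field defined above):

# Each batch is padded only to the longest sequence within that batch, and the
# iterator buckets similarly-sized examples together to minimize padding.
for batch in train_iter:
    X = batch.text    # LongTensor of shape (batch_size, seq_len)
    y = batch.label   # LongTensor of shape (batch_size,)
    # ... forward pass, loss, backward, optimizer step would go here
    break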

I welcome any suggestions on recycling this code.

ToddMorrill commented 4 years ago

My apologies for all the posts but I just wanted to share a quick update before signing off and ask a question.

I created a custom dataset and then implemented a custom collate_fn as follows:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence

import skorch
from skorch import NeuralNet

def pad_batch(batch):
    # Pad only to the longest sequence in this batch, not in the whole dataset.
    text, label = list(zip(*batch))
    padded_batch = pad_sequence(text, batch_first=True, padding_value=1)
    return padded_batch, torch.cat(label)

skorch_model = NeuralNet(
                CNN,
                device=device,
                max_epochs=2,
                lr=0.001,
                optimizer=optim.Adam,
                criterion=nn.NLLLoss,
                iterator_train__collate_fn=pad_batch,
                iterator_train__shuffle=True,
                iterator_valid__collate_fn=pad_batch,
                iterator_valid__shuffle=False,
                train_split=skorch.dataset.CVSplit(.2),  # NB: this withholds 20% of the training data for validation
                module__n_filters=100,
                module__filter_sizes=(2, 3, 4),
                module__dropout=0.2,
                module__pretrained_embeddings=TEXT.vocab.vectors,
                batch_size=32,
                verbose=2)

skorch_model.fit(train_dataset)

What's amazing about padding at the batch level is that run times went from 60 seconds per epoch to 20 seconds per epoch - a huge improvement. However, I was enjoying all of the functionality I had while using NeuralNetClassifier, namely all of the scoring functions. NeuralNetClassifier insists on having skorch_model.fit(X, y) and fails with skorch_model.fit(train_dataset). Do you have a way around this so that I can use NeuralNetClassifier with my custom dataset and custom DataLoader?

I'm still interested in recycling the torchtext functionality so if you have thoughts on that, I still welcome them!!

Thanks for all of your help! I'm loving skorch.

BenjaminBossan commented 4 years ago

Thanks for all of your help! I'm loving skorch.

That's great to hear, thanks.

Thanks for pointing me toward NeuralNet. I was using NeuralNetClassifier and totally missed the opportunity to use a custom DataLoader. I'll give that a shot.

Sorry for the confusion: you can do the same thing with NeuralNetClassifier. I just used NeuralNet as a stand-in for all the derived classes.

NeuralNetClassifier insists on having skorch_model.fit(X, y) and fails with skorch_model.fit(train_dataset)

It depends a bit. What does your target look like? Potentially, it could be possible to extract it and pass it as y. But that only really makes sense if you work on a (multiclass) classification problem -- is that the case for your dataset? If you want to do, say, seq2seq, I don't see how that can work with NeuralNetClassifier.

namely all of the scoring functions

Note that you can also use the scoring functions with NeuralNet; have a look at EpochScoring.
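For illustration, here is a minimal, self-contained sketch of attaching a scoring callback to a plain NeuralNet; TinyClassifier and valid_acc are made-up names, and EpochScoring is given a callable with the (net, dataset, y) signature it accepts:

import numpy as np
import torch
from skorch import NeuralNet
from skorch.callbacks import EpochScoring

# Hypothetical toy module, only there to make the example runnable.
class TinyClassifier(torch.nn.Module):
    def __init__(self, n_features=20, n_classes=2):
        super().__init__()
        self.lin = torch.nn.Linear(n_features, n_classes)

    def forward(self, X):
        return torch.log_softmax(self.lin(X), dim=-1)

def valid_acc(net, dataset, y=None):
    # EpochScoring passes the validation dataset here; compute accuracy from
    # the predicted log-probabilities and the targets stored in the dataset.
    y_pred = net.predict(dataset).argmax(axis=1)
    y_true = np.asarray([dataset[i][1] for i in range(len(dataset))])
    return (y_pred == y_true).mean()

net = NeuralNet(
    TinyClassifier,
    criterion=torch.nn.NLLLoss,
    max_epochs=3,
    callbacks=[EpochScoring(valid_acc, lower_is_better=False, name='valid_acc')],
)

X = np.random.randn(128, 20).astype('float32')
y = np.random.randint(0, 2, size=128).astype('int64')
net.fit(X, y)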

ToddMorrill commented 4 years ago

I'm making progress on my example text classification pipeline using NeuralNetClassifier. Have a look here. I managed to recycle the useful parts of torchtext (e.g. TEXT.process(batch), etc.) but did indeed have to use a custom collate_fn inside of DataLoader. Most importantly to me, run times have been reduced dramatically. I think there is potential to speed things up further if we can make use of a bucket iterator like the one in torchtext. I'll bet torch.nn.utils.rnn.pack_padded_sequence would be helpful here, as you pointed out @BenjaminBossan, but it just requires me to implement more functionality. The bottom line is I was hoping to make use of torchtext's functionality from start to finish. That does not appear to be possible with skorch at this stage. If there is anything I can do to help make this possible, please let me know.

BenjaminBossan commented 4 years ago

I believe it makes a lot of sense to make skorch work with popular libraries like torchtext and torchvision. When we released skorch, the former didn't exist yet, so now we might be in a place where not everything works together. However, there might still be a way. I would need to look more thoroughly at what torchtext provides and see what we can do, once I have a bit of time.

@ToddMorrill please keep us up-to-date if you find some better solution.

kqf commented 4 years ago

Hi guys, sorry for the noise if this isn't relevant anymore, but I wasn't able to find any usage of skorch + torchtext, and this is the only thread that comes up on Google.

@ToddMorrill

I think there is potential to speed things up further if we can make use of a bucket iterator like the one in torchtext.

I have good news for you :D skorch supports PyTorch datasets, and the same convention is followed by torchtext. In fact, all their datasets inherit from torch.utils.data.Dataset. In theory, this makes them compatible with skorch. To me, it's a beautiful example of great design and implementation: both teams followed the same conventions imposed by PyTorch and ended up with two independent libraries that are compatible with each other.
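To state that compatibility claim as code (using the torchtext.data namespace that was current at the time of this thread):

import torch
from torchtext.data import Dataset

# torchtext datasets are ordinary PyTorch datasets, which is exactly what
# skorch's fit() knows how to iterate over.
assert issubclass(Dataset, torch.utils.data.Dataset)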

Here is a short example I prepared (somewhat similar to the one provided by @ToddMorrill) of how to integrate torchtext into a skorch pipeline:

import torch
import skorch
import random
import numpy as np
import pandas as pd
from torchtext.data import BucketIterator, Example, Dataset, Field, LabelField
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

SEED = 137

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

def data(size=1000):
    return pd.DataFrame({
        "query": ["This is a duck", "This is a goose"] * size,
        "target": [0, 1] * size,
    })

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, fields, need_vocab=None):
        self.fields = fields
        self.need_vocab = need_vocab or {}

    def fit(self, X, y=None):
        dataset = self.transform(X, y)
        for field, min_freq in self.need_vocab.items():
            field.build_vocab(dataset, min_freq=min_freq)
        return self

    def transform(self, X, y=None):
        proc = [X[col].apply(f.preprocess) for col, f in self.fields]
        examples = [Example.fromlist(f, self.fields) for f in zip(*proc)]
        return Dataset(examples, self.fields)

def build_preprocessor():
    text_field = Field(lower=True)
    label_field = LabelField(is_target=True)
    fields = [
        ('query', text_field),
        ('target', label_field),
    ]
    return TextPreprocessor(fields, need_vocab={text_field: 0, label_field: 0})

class SimpleModule(torch.nn.Module):
    def __init__(self, vocab_size=100, emb_dim=16, lstm_hidden_dim=32):
        super().__init__()
        self._emb = torch.nn.Embedding(vocab_size, emb_dim)
        self._rnn = torch.nn.LSTM(emb_dim, lstm_hidden_dim)
        self._out = torch.nn.Linear(lstm_hidden_dim, 2)

    def forward(self, inputs):
        rnn_output = self._rnn(self._emb(inputs))[0]
        return torch.nn.functional.softmax(self._out(rnn_output[-1]), dim=-1)

class InputShapeSetter(skorch.callbacks.Callback):
    def on_train_begin(self, net, X, y):
        # NB: If your module relies on pretrained embeddings
        # net.set_params(module__embeddings=X.fields["query"].vocab.vectors)
        pass

def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        iterator_train=BucketIterator,
        iterator_valid=BucketIterator,
        train_split=Dataset.split,
        callbacks=[InputShapeSetter()],
    )
    full = make_pipeline(
        build_preprocessor(),
        model
    )
    return full

def main():
    df = data()
    assert type(df) == pd.DataFrame

    dataset = build_preprocessor().fit_transform(df)
    assert type(dataset) == Dataset

    # Putting it all together
    model = build_model().fit(
        df,  # pd.DataFrame, torchtext handles X and y
        0.7  # <<< ?? This sets split_ratio for Dataset.split
    )
    print(model.predict(df))
    assert model.score(df, df["target"]) > 0.5, "Fitting issues"

if __name__ == '__main__':
    main()

This code should work with the latest versions of the libraries. The only strange thing is that you have to pass split_ratio=0.7 through the .fit method. I guess this side effect is caused by this line in the skorch code. Perhaps there's a better solution for this.

@BenjaminBossan It looks like you are a member of the dev team. Probably #594 is somehow related to the topic with torchtext. If you raise an error on IterableDataset, then you will lose this torchtext support. I might be wrong.

Once again sorry for spamming.

BenjaminBossan commented 4 years ago

@kqf Thanks for posting the example, I'm taking a look at it. At the end of the day, I think it would be nice to add a notebook that showcases how to use torchtext. Ideally, it should use one of the torchtext datasets like IMDB and pretrained embeddings.

The only strange thing is that you have to pass split_ratio=0.7 through the .fit method

Yes, that works, but it's a bit of a hacky solution. This solution here should be clearer:

from functools import partial

def my_train_split(dataset, y, split_ratio):
    return dataset.split(split_ratio=split_ratio)

...

def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        iterator_train=BucketIterator,
        iterator_valid=BucketIterator,
        train_split=partial(my_train_split, split_ratio=0.7),
        callbacks=[InputShapeSetter()],
    )
    ...

model = build_model().fit(df)  # no need to pass split_ratio here
kqf commented 4 years ago

@BenjaminBossan

it should use one of the torchtext datasets like IMDB and pretrained embeddings.

It's totally doable, I didn't want to download the data/embeddings on my private laptop.

Yes, that works, but it's a bit of a hacky solution. This solution here should be clearer:

Yes, I agree, but that was one of my intentions: to demonstrate that skorch is compatible with torchtext without extra code and to show a strange skorch behaviour. I would expect that if I pass .fit(X, y=None), then y will not be passed to the split function. I think this should be handled on the skorch side and it deserves an issue of its own 🤷

What do you think?

BenjaminBossan commented 4 years ago

It's totally doable, I didn't want to download the data/embeddings on my private laptop.

Yes, what you posted is a really good starting point.

without extra code

I think those two lines are acceptable :)

I would expect that if I pass .fit(X, y=None), then y will not be passed to the split function.

I think that could make sense. Do you want to work on this change?

In the meantime, I tried to implement a torchtext example with skorch that's a bit closer to a real world problem someone could have. It uses skorch with torchtext and BERT (via huggingface). Here is the notebook:

https://nbviewer.jupyter.org/github/BenjaminBossan/playground/blob/master/skorch_torchtext_bert.ipynb

@kqf @ToddMorrill since you know torchtext much better than I do, could you check if what I did makes sense? E.g., I don't really understand what all this Field, TEXT, LABEL, and build_vocab stuff does. For reference, my notebook is basically a re-implementation of this notebook.

The main change that I had to introduce was to slightly change BucketIterator:

class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return batch.text and batch.label, corresponding to X and y
            yield batch.text, batch.label.long()

skorch basically really wants to always have an X and a y, because this is what sklearn expects. With the shown change, we get that. (I didn't quite get why batch.label is int32; surely there is a better way to change that.) Apart from this, I could re-use most of the code from the original notebook.

ping @ottonemo maybe this is also interesting for you.

kqf commented 4 years ago

I think that could make sense. Do you want to work on this change?

Yes, I'd love to help, but I will have time only on weekends. If it's ok -- I am in.

since you know torchtext much better than I do, could you check if what I did makes sense? E.g., I don't really understand what all this Field, TEXT, LABEL, and build_vocab stuff does. For reference, my notebook is basically a re-implementation of this notebook.

I am not an expert in torchtext either, but your code looks fine. TEXT and LABEL are instances of the Field class. Fields are "applied" to examples to extract the information needed; they define all the necessary transformations, and build_vocab is similar to the .fit method of a transformer (so you have to apply it to the train data only).

I like the way you are handling torchtext.data.Batch. It's really a good one.

skorch basically really wants to always have an X and a y, because this is what sklearn expects

I think what you are saying here is important. The default NeuralNet was designed to be a supervised model. Today, there are more and more unsupervised and semi-supervised DL applications, so maybe it would make sense to add an UnsupervisedNeuralNet or something like that. I think this would still be compatible with sklearn, since they have support for clustering and manifold learning.

kqf commented 4 years ago

@BenjaminBossan One more thing about examples with torchtext, and it is directly related to this issue. Today I was trying to use skorch together with torchtext for metric learning. For this problem, you have to pass two fields to the forward method, and y should remain empty. I will not provide the full example here, as it may be a bit lengthy, but it would probably be useful to have a notebook that shows how to achieve that?

In any event, if you have to pass multiple fields to the forward method, you have to make two modifications:

  1. Edit the bucket iterator (similarly to the BERT example):

import torch
from operator import attrgetter
from torchtext.data import BucketIterator

def batch2dict(batch):
    return {f: attrgetter(f)(batch) for f in batch.input_fields}

class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return dict() and empty tensor, corresponding to X and y
            yield batch2dict(batch), torch.empty(0)

  2. You have to use the `Field(batch_first=True)` option when creating the fields, otherwise `skorch` will complain about the inconsistent length of the dataset.

So, this should demonstrate how to use multiple fields with `skorch`, hope someone will find it useful. 
BenjaminBossan commented 4 years ago

Yes, I'd love to help, but I will have time only on weekends. If it's ok -- I am in.

No problem at all. If you need help along the way, just ask.

The default NeuralNet was designed to be a supervised model. Today, there are more and more unsupervised and semi-supervised DL applications, so maybe it would make sense to add an UnsupervisedNeuralNet or something like that. I think this would still be compatible with sklearn, since they have support for clustering and manifold learning.

NeuralNetClassifier, NeuralNetBinaryClassifier, and NeuralNetRegressor are explicitly modeled to be for supervised learning. NeuralNet is more open-ended and should be used for anything unsupervised. As with sklearn's unsupervised models, we support calling fit(X) without passing y there.
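As a rough illustration of fit(X) without y, here is a toy autoencoder sketch; the AutoEncoder module and the get_loss override are made up for this example, and the trick is simply to ignore the placeholder y and reconstruct X in the loss:

import numpy as np
import torch
from skorch import NeuralNet
from skorch.utils import to_tensor

# Hypothetical toy autoencoder, only to show the unsupervised fit(X) pattern.
class AutoEncoder(torch.nn.Module):
    def __init__(self, n_features=20, n_hidden=5):
        super().__init__()
        self.encode = torch.nn.Linear(n_features, n_hidden)
        self.decode = torch.nn.Linear(n_hidden, n_features)

    def forward(self, X):
        return self.decode(torch.relu(self.encode(X)))

class AutoEncoderNet(NeuralNet):
    def get_loss(self, y_pred, y_true, X=None, training=False):
        # y_true is just a placeholder when y=None; compare against X instead.
        X = to_tensor(X, device=self.device)
        return self.criterion_(y_pred, X)

net = AutoEncoderNet(AutoEncoder, criterion=torch.nn.MSELoss, max_epochs=2)
net.fit(np.random.randn(256, 20).astype('float32'))  # no y passed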

So, this should demonstrate how to use multiple fields with skorch, hope someone will find it useful.

Thanks for providing the example.

Today I was trying to use skorch together with torchtext for metric learning. For this problem, you have to pass two fields to the forward method, and y should remain empty.

I'm curious what exactly you are doing there. I implemented some metric learning approaches in the past, typically using something like a Siamese net. You could use the target to indicate which samples belong together. I moved the main logic for the metric learning to the criterion, so that the module was just returning the representations. But that might not fit your use case. And if you want to add goodies like triplet mining, it can become complicated fast (see discussion here).
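A rough sketch of that pattern (the module returns embeddings, the criterion does the in-batch mining); this is not a skorch or PyTorch built-in, just an illustrative batch-hard triplet loss where y carries integer labels marking which samples belong together:

import torch

class BatchHardTripletLoss(torch.nn.Module):
    def __init__(self, margin=0.2):
        super().__init__()
        self.margin = margin

    def forward(self, embeddings, labels):
        # embeddings: (n, d) batch of representations, labels: (n,) integers.
        dist = torch.cdist(embeddings, embeddings)          # pairwise distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-label mask
        eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
        # Hardest positive: the farthest sample sharing the label (self excluded).
        hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
        # Hardest negative: the closest sample with a different label.
        hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
        return torch.relu(hardest_pos - hardest_neg + self.margin).mean()

With skorch, something like this would be passed as criterion=BatchHardTripletLoss, while the module itself only returns the representations.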

kqf commented 4 years ago

I'm curious what exactly you are doing there.

If you ask about the application, it's a chatbot (there is a database with replies, so the model needs to find the most relevant one when supplied with the user query). And it's very similar to the example you mentioned. In my case, it's somewhat easier as I do hard and semi-hard negatives mining within a batch. I decided to separate the logic: I have a separate encoder-towers module and a loss module that mines hard-negatives and calculates the triplet loss. I didn't know about skorch.toy, thanks.

You could use the target

What do you mean by target? Is it Field(is_target=True)?

BenjaminBossan commented 4 years ago

You could use the target

I just meant that the y could contain, for instance, the clusters that your samples belong to, so that you could use it for mining. But it seems you already found a solution that works for you :+1:

ToddMorrill commented 4 years ago
class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return batch.text and batch.label, corresponding to X and y
            yield batch.text, batch.label.long()

This is working really well. My epoch times were pretty much cut in half with this modification. Thank you for your example @BenjaminBossan.

I just tried to plug this into a grid search like the following and got an error. I'm including the traceback for reference. I can try to look into the error but I'm not too familiar with sklearn's internals. Is there a way forward here?

search = RandomizedSearchCV(skorch_model, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
search.fit(X=dev_dataset, y=None)

Traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-fa3b744e6e6b> in <module>
      1 search = RandomizedSearchCV(skorch_model, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
----> 2 search.fit(X=dev_dataset, y=None)

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    648             refit_metric = 'score'
    649 
--> 650         X, y, groups = indexable(X, y, groups)
    651         fit_params = _check_fit_params(X, fit_params)
    652 

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    246     """
    247     result = [_make_indexable(X) for X in iterables]
--> 248     check_consistent_length(*result)
    249     return result
    250 

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    206     """
    207 
--> 208     lengths = [_num_samples(X) for X in arrays if X is not None]
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
    206     """
    207 
--> 208     lengths = [_num_samples(X) for X in arrays if X is not None]
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    148 
    149     if hasattr(x, 'shape') and x.shape is not None:
--> 150         if len(x.shape) == 0:
    151             raise TypeError("Singleton array %r cannot be considered"
    152                             " a valid collection." % x)

TypeError: object of type 'generator' has no len()
BenjaminBossan commented 4 years ago

@ToddMorrill Thanks for reporting.

It's not quite easy for me to deduce what's going on. Could you either provide me a minimal code sample to reproduce the error or check the following things for me (by using a debugger):

  1. could you please check if the code runs when you wrap your dataset using skorch's SliceDataset?
  2. what is the type of x and x.shape in the last step?
  3. In this line:
--> 650         X, y, groups = indexable(X, y, groups)

what are the types of X and y?

  4. Here: search.fit(X=dev_dataset, y=None), what is the type of dev_dataset?
  5. When fitting without RandomizedSearchCV, everything runs fine?
ToddMorrill commented 4 years ago

I was able to reproduce it with your example by adding the following lines to the bottom of the script.

params = {'module__hidden_dim': [128, 256],
          'module__n_layers': [1, 2],
          'module__bidirectional': [False, True],
          'module__dropout': [0.2, 0.25]}

from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(net, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)

# we can set y=None because the labels are contained inside the dataset
search.fit(ds_train, y=None)

could you please check if the code runs when you wrap your dataset using skorch's SliceDataset?

X_sl = SliceDataset(ds_train)
search.fit(X_sl, y=None)

Running this results in the following output. No errors but it didn't train.

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.2s finished

what is the type of x and x.shape in the last step?

From the debugger:

type(x) == torchtext.datasets.imdb.IMDB
type(x.shape) == generator

I believe x is just ds_train, i.e. type(x) == torchtext.datasets.imdb.IMDB. x.shape (i.e. ds_train.shape) results in a generator. I was able to reproduce the error with the following code.

len(ds_train.shape)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-e50229cbe695> in <module>
----> 1 len(ds_train.shape)

TypeError: object of type 'generator' has no len()

In this line:

--> 650         X, y, groups = indexable(X, y, groups)

what are the types of X and y?

From the debugger:

type(X) == torchtext.datasets.imdb.IMDB
type(y) == None

Here: search.fit(X=dev_dataset, y=None), what is the type of dev_dataset?

type(dev_dataset) == torchtext.data.dataset.Dataset

When fitting without RandomizedSearchCV, everything runs fine?

Yes, it's fantastic!

BenjaminBossan commented 4 years ago

Thanks for investigating @ToddMorrill

I tracked down the weird generator error and this is the cause:

https://github.com/pytorch/text/blob/c57369cb1049b4ecb075f6f766494ed3842269d1/torchtext/data/dataset.py#L151-L154

In my opinion, this is a bug in the torchtext library, since accessing any unknown attribute on a dataset returns an empty generator. If the attribute is not known, they should definitely raise an AttributeError (as prescribed by the Python docs). However, for that to happen, __getattr__ should not be a generator.

Basically, any code that checks for an unknown attribute on such a dataset (e.g. via hasattr) will do the wrong thing. This is especially grave with sklearn, since sklearn will at one point check hasattr(X, 'loc') or hasattr(X, 'iloc') to determine if the input is a pandas DataFrame; obviously this will cause a lot of trouble.
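The effect can be reproduced in a few lines without torchtext (Broken is just a made-up stand-in for a class whose __getattr__ contains a yield):

class Broken:
    def __getattr__(self, attr):
        # Because of the yield, calling __getattr__ returns a generator object
        # immediately; the body never runs, so AttributeError is never raised.
        yield attr

obj = Broken()
print(obj.loc)               # a generator object instead of an AttributeError
print(hasattr(obj, 'loc'))   # True -- sklearn would now mistake obj for a DataFrame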

I tried to override their __getattr__ like this:

    def __getattr__(self, attr):
        if attr in self.fields:
            return [getattr(x, attr) for x in self.examples]
        else:
            raise AttributeError("no attribute", attr)

However, then I run into the next problem, namely these lines:

https://github.com/pytorch/text/blob/c57369cb1049b4ecb075f6f766494ed3842269d1/torchtext/data/field.py#L288-L289

They basically rely on the faulty __getattr__ behavior there. At that point, I gave up; who knows how many parts of their code are still affected by this.

So overall, I'm sorry to say that you might just not be able to combine RandomizedSearchCV with the torchtext example without some serious hacking. At least I see no easy fix. But for me, this is a problem on the torchtext side and I wouldn't want to implement any fixes on the skorch side. Perhaps you can compel the torchtext devs to fix the issue but it could be hard to do that.

ToddMorrill commented 4 years ago

Thanks for that explanation @BenjaminBossan. I filed a bug with torchtext. Let's see if they pick it up.

ToddMorrill commented 4 years ago

Quick update on this. torchtext is rolling out some new design patterns that more closely mirror torch.utils.data.

This describes their plans a bit more. I'm hoping in the long run this will make torchtext more seamlessly compatible with skorch and sklearn.

BenjaminBossan commented 4 years ago

Thanks for reporting back. I read it but since I'm not familiar with torchtext, I can't really judge the changes. The general idea seems to be good. Whether it makes it easier to integrate with skorch will have to be seen.

@ToddMorrill do you have any experience with using the facilities provided by huggingface instead of torchtext? I wonder if those cooperate better with skorch. I think it could also be interesting to provide sklearn transformers to wrap their tokenizers, which would make it possible to integrate them into an sklearn pipeline.

ToddMorrill commented 4 years ago

I haven't had a chance to use huggingface's tools, but it's on my current project's roadmap. I'll share if I get anything running.

ToddMorrill commented 4 years ago

Hey @BenjaminBossan, quick question. Circling back to my comment above - would it be possible to use RandomizedSearchCV when y=None? I'm working on a little project that uses torch.utils.data.Dataset and torch.utils.data.DataLoader. Everything works fine with vanilla training (i.e. skorch_model.fit(train_dataset, y=None)) but when I try the same setup with search.fit(train_dataset, y=None) I got TypeError: fit() missing 1 required positional argument: 'y'. I can see that y=None is possible for unsupervised learning but naturally, my goal is supervised learning.

BenjaminBossan commented 4 years ago

@ToddMorrill could you try if one of these three proposals works for you?

  1. Pass a dummy value as y with the correct shape (might not work, depending on the metric)
  2. Extract your y value from your dataset (e.g. y = torch.cat([dataset[i][1] for i in range(len(dataset))]).numpy())
  3. Pass y=SliceDataset(dataset, idx=1), assuming that index 1 is your target (details); see the sketch below
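A minimal sketch of proposal 3 (the TensorDataset is just a stand-in for the real dataset; idx picks which element of each (X, y) pair the wrapper exposes):

import torch
from torch.utils.data import TensorDataset
from skorch.helper import SliceDataset

# Stand-in dataset whose items are (X, y) pairs, like most supervised datasets.
ds = TensorDataset(torch.randn(100, 20), torch.randint(0, 2, (100,)))

X_sl = SliceDataset(ds, idx=0)  # behaves like an indexable X for sklearn
y_sl = SliceDataset(ds, idx=1)  # behaves like an indexable y for sklearn
print(len(X_sl), y_sl[0])

# search.fit(X_sl, y_sl)  # RandomizedSearchCV can now index and split both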

ToddMorrill commented 4 years ago

Good thoughts!

I tried all 3 techniques, and you can see the example that I'm working on for the dask team here. There's a section in the notebook titled "Grid search with Skorch" where you'll see all 3 attempts, each of which resulted in ValueError: Dataset does not have consistent lengths.

BenjaminBossan commented 4 years ago

Could you please paste the full stack trace for the error? I assume it's the same for all 3 cases?

ToddMorrill commented 4 years ago

Indeed, the error and stack trace were the same for all 3 cases. Here it is.

Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/classifier.py", line 142, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 854, in fit
    self.partial_fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 813, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 717, in fit_loop
    X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1198, in get_split_datasets
    dataset = self.get_dataset(X, y)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
    return dataset(X, y, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/skorch/dataset.py", line 165, in __init__
    len_X = get_len(X)
  File "/opt/conda/lib/python3.7/site-packages/skorch/dataset.py", line 76, in get_len
    raise ValueError("Dataset does not have consistent lengths.")
ValueError: Dataset does not have consistent lengths.

  FitFailedWarning)
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.6472       0.6690        0.5837  1.7154
      2        0.4745       0.8010        0.4465  1.5456
BenjaminBossan commented 4 years ago

This is interesting, it looks like it works a few times and then suddenly it breaks.

Could you please initialize the net with skorch_model = NeuralNetClassifier(..., dataset=TorchDataset) and see if that works? This parameter determines what dataset is used for the skorch internal split and, as is, the skorch.dataset.Dataset is used, which is not what you want.

After trying that, regardless of whether it helps, please do the following: skorch_model = NeuralNetClassifier(..., train_split=False). This turns off the skorch internal train/valid split and should also prevent the error. Typically, when you perform a hyper-parameter search, you don't need the skorch internal split, since sklearn will already take care of splitting the data for you.

ToddMorrill commented 4 years ago

FWIW, the default value for the refit parameter in RandomizedSearchCV is True, so I think the one success you're seeing might be the result of that. After setting refit=False that one success disappears.

This turns off the skorch internal train/valid split and should also prevent the error. Typically, when you perform a hyper-parameter search, you don't need the skorch internal split, since sklearn will already take care of splitting the data for you.

Makes sense, thanks for the insight.

Running with skorch_model = NeuralNetClassifier(..., dataset=TorchDataset) yields this error for all 3 approaches outlined above both with skorch_model = NeuralNetClassifier(..., train_split=skorch.dataset.CVSplit(.2)) and with skorch_model = NeuralNetClassifier(..., train_split=False).

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/classifier.py", line 142, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 854, in fit
    self.partial_fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 813, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 717, in fit_loop
    X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1198, in get_split_datasets
    dataset = self.get_dataset(X, y)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
    return dataset(X, y, **kwargs)
TypeError: __init__() takes 2 positional arguments but 3 were given

  FitFailedWarning)

Do you think this is because my custom TorchDataset class is only expecting 1 argument, namely train_dataset, and not y?

BenjaminBossan commented 4 years ago

That's a bit strange:

File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset return dataset(X, y, **kwargs)

This code path should never be reached because this line comes before it:

https://github.com/skorch-dev/skorch/blob/6fe94fd042ce1200c7baa62f8ad4a1f7d7fa2bf8/skorch/net.py#L1154-L1155

Could you maybe turn on the debugger and check the value of X at that point?

BenjaminBossan commented 4 years ago

@ToddMorrill any updates?

BenjaminBossan commented 3 years ago

Since there haven't been any updates for quite a while, I assume this has been resolved. Feel free to re-open if not.