Closed econti closed 3 years ago
I don't know the specifics of this in skorch, but generally you need to add padding or perform slicing so that every sample has the same length. The only exception to this is TensorFlow's ragged tensors, but even then you have to specify a default value to pad with when converting to regular tensors (PyTorch doesn't have ragged tensors yet).
@econti Could you check whether PackedSequence solves your issue?
Otherwise, we have an example here that shows how to potentially deal with variable length sequences.
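For readers unfamiliar with the two utilities mentioned above, here is a minimal sketch of how padding and packing fit together. The toy sequences are invented for illustration; only the functions from torch.nn.utils.rnn are real API, and this is not taken from the linked example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three toy sequences of different lengths (illustrative values only).
seqs = [torch.tensor([1, 3]), torch.tensor([0, 40, 16]), torch.tensor([7])]
lengths = torch.tensor([len(s) for s in seqs])

# Pad to the longest sequence so the batch becomes a regular (3, 3) tensor.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

# Pack the padded batch so that an RNN can skip the padded positions.
# enforce_sorted=False allows sequences in arbitrary length order.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# Unpacking restores the padded layout and the original lengths.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
```

The packed object only stores the six real tokens, which is what lets recurrent layers avoid computing on padding.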
Thanks @BenjaminBossan, that did the trick for me. Leaving a code snippet here for anyone else who encounters a similar issue:
# data["X_id_list"] is a pandas DataFrame that holds variable-length lists of lists, e.g.
# [[1, 3], [0, 40, 16], ...]
import torch
from torch.nn.utils.rnn import pad_sequence

X_id_list = {}
# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
for series_name, series in data["X_id_list"].items():
    pre_pad = [torch.tensor(i) for i in series]
    X_id_list[series_name] = pad_sequence(
        pre_pad, batch_first=True, padding_value=0
    )
@econti Great that you found a solution and thanks for the snippet.
I'm facing a similar issue right now and I suspect I'm doing the same thing that you're doing, which is padding to the longest sequence length in the dataset, which results in significantly more computation than would result from padding at the batch level. I suspect we need something like a collate_fn that operates at the batch level to solve this the right way.
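To make the cost difference concrete, here is a small pure-Python sketch that counts how many positions (real tokens plus padding) end up in the model input under each strategy. The helper name `pad` and the toy data are assumptions made for illustration only.

```python
def pad(seqs, pad_value=0):
    """Right-pad every sequence in `seqs` to the longest one in `seqs`."""
    width = max(len(s) for s in seqs)
    return [s + [pad_value] * (width - len(s)) for s in seqs]

dataset = [[1, 3], [0, 40, 16], [5], [7, 8, 9, 10, 11, 12]]

# Strategy 1: pad the whole dataset once, to the longest sequence overall.
padded_globally = pad(dataset)
global_positions = sum(len(row) for row in padded_globally)

# Strategy 2: split into batches of two first, then pad each batch on its own.
batches = [dataset[i:i + 2] for i in range(0, len(dataset), 2)]
padded_per_batch = [pad(batch) for batch in batches]
batch_positions = sum(len(row) for batch in padded_per_batch for row in batch)

print(global_positions, batch_positions)  # 24 18
```

Only the batch that contains the longest sequence pays for its length; every other batch stays short.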
@ToddMorrill I don't know the exact details of your case, so maybe I'm missing something. In general though, collate_fn is designed to work on samples and not on batches. If you want to avoid any costly operation on each sample, you would have to provide your own DataLoader. You can pass it as iterator_train and iterator_valid to NeuralNet in skorch.
However, this is not the canonical way of dealing with sequences of different lengths. Maybe you can make use of PackedSequence or pad_sequence.
"A custom collate_fn can be used to customize collation, e.g., padding sequential data to max length of a batch." Source That's what I'm trying to do.
Thanks for pointing me toward NeuralNet. I was using NeuralNetClassifier and totally missed the opportunity to use a custom DataLoader. I'll give that a shot.
I'm not opposed to using pad_sequence, it's just that I got started with torchtext and it was already doing a fantastic job taking care of all my text preprocessing needs, including padding, so I didn't want to rewrite that functionality.
To be sure, I'm trying to reuse the following torchtext code with skorch.
import torchtext
from torchtext import data
from torchtext import datasets
# set up fields
TEXT = data.Field(lower=True, batch_first=True, )
LABEL = data.Field(sequential=False, unk_token=None)
# takes approx. 10 minutes to download data and embeddings (will be cached for re-use)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# will be used to initialize model embeddings layer
vocab = torchtext.vocab.GloVe(name='6B', dim=100)
# build the vocabulary
max_size = 25_000 # shorten for demonstrative purposes
TEXT.build_vocab(train, vectors=vocab, max_size=max_size)
LABEL.build_vocab(train)
# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_sizes=(32, 64), device='cpu')
So far, I haven't found a way to reuse train_iter with skorch. train_iter is used in a for loop and yields batches of data padded to the longest sequence in the batch. It also buckets batches by sequence length to reduce computation. Each batch has a .text and a .label attribute that contain the numericalized data and label representation, respectively.
I welcome any suggestions on recycling this code.
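The bucketing idea described above can be sketched without torchtext. This is a simplified illustration, not torchtext's actual BucketIterator logic: the function names, the seed parameter, and the toy lengths are all assumptions.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group sample indices so that each batch holds sequences of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle the batches (not their contents) so training order still varies.
    random.Random(seed).shuffle(batches)
    return batches

def padded_positions(lengths, batches):
    """Positions fed to the model when each batch is padded to its own maximum."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)

lengths = [2, 50, 3, 4, 48, 5, 3, 47]
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]          # batches in dataset order
bucketed = bucket_batches(lengths, batch_size=4)

naive_cost = padded_positions(lengths, naive)
bucketed_cost = padded_positions(lengths, bucketed)
print(naive_cost, bucketed_cost)  # 392 216
```

With dataset-order batches, a single long sequence forces its whole batch to be long; grouping by length confines the long sequences to one batch.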
My apologies for all the posts but I just wanted to share a quick update before signing off and ask a question.
I created a custom dataset and then implemented a custom collate_fn as follows:
def pad_batch(batch):
    # batch is a list of (text, label) samples; pad texts to the longest one in this batch
    text, label = list(zip(*batch))
    padded_batch = pad_sequence(text, batch_first=True, padding_value=1)
    return padded_batch, torch.cat(label)

skorch_model = NeuralNet(
    CNN,
    device=device,
    max_epochs=2,
    lr=0.001,
    optimizer=optim.Adam,
    criterion=nn.NLLLoss,
    iterator_train__collate_fn=pad_batch,
    iterator_train__shuffle=True,
    iterator_valid__collate_fn=pad_batch,
    iterator_valid__shuffle=False,
    train_split=skorch.dataset.CVSplit(.2),  # NB: this withholds 20% of the training data for validation
    module__n_filters=100,
    module__filter_sizes=(2, 3, 4),
    module__dropout=0.2,
    module__pretrained_embeddings=TEXT.vocab.vectors,
    batch_size=32,
    verbose=2)
skorch_model.fit(train_dataset)
What's amazing about padding at the batch level is that run times went from 60 seconds per epoch to 20 seconds per epoch, a huge improvement. However, I liked all of the functionality I had while using NeuralNetClassifier, namely all of the scoring functions. NeuralNetClassifier insists on having skorch_model.fit(X, y) and fails with skorch_model.fit(train_dataset). Do you have a way around this so that I can use NeuralNetClassifier with my custom dataset and custom dataloader?
I'm still interested in recycling the torchtext functionality so if you have thoughts on that, I still welcome them!!
Thanks for all of your help! I'm loving skorch.
Thanks for all of your help! I'm loving skorch.
That's great to hear, thanks.
Thanks for pointing me toward NeuralNet. I was using NeuralNetClassifier and totally missed the opportunity to use a custom DataLoader. I'll give that a shot.
Sorry that I confused you. You can do the same thing with NeuralNetClassifier; I just used NeuralNet as a stand-in for all the derived classes.
NeuralNetClassifier insists on having skorch_model.fit(X, y) and fails with skorch_model.fit(train_dataset)
It depends a bit. What does your target look like? Potentially, it could be possible to extract it and pass it as y. But that only really makes sense if you work on a (multiclass) classification problem -- is that the case for your dataset? If you want to do, say, seq2seq, I don't see how that can work with NeuralNetClassifier.
namely all of the scoring functions
Note that you can also use the scoring functions with NeuralNet; have a look at EpochScoring.
I'm making progress on my example text classification pipeline using NeuralNetClassifier. Have a look here. I managed to recycle the useful parts of torchtext (e.g. TEXT.process(batch), etc.) but did indeed have to use a custom collate_fn inside of DataLoader. Most importantly to me, run times have been reduced dramatically. I think there is potential to speed things up further if we can make use of a bucket iterator like the one in torchtext. I'll bet torch.nn.utils.rnn.pack_padded_sequence would be helpful here, as you pointed out @BenjaminBossan, but it just requires me to implement more functionality. The bottom line is I was hoping to make use of torchtext's functionality from start to finish. That does not appear to be possible with skorch at this stage. If there is anything I can do to help make this possible, please let me know.
I believe it makes a lot of sense to make skorch work with popular libraries like torchtext and torchvision. When we released skorch, the former didn't exist yet, so now we might be in a place where not everything works together. However, there might still be a way. I would need to look more thoroughly at what torchtext provides and see what we can do, once I have a bit of time.
@ToddMorrill please keep us up-to-date if you find some better solution.
Hi guys,
Sorry for the noise if this is no longer relevant. But I wasn't able to find any usage of skorch + torchtext, and this is the only thread that comes up on Google.
@ToddMorrill
I think there is potential to speed things up further if we can make use of a bucket iterator like the one in torchtext.
I have good news for you :D skorch supports pytorch datasets, and the same convention is followed by torchtext. In fact, all their datasets inherit from torch.utils.data.Dataset. In theory, this makes them compatible with skorch.
As for me, it's a beautiful example of great design and implementation. Both teams followed the same conventions imposed by pytorch and ended up with two independent libraries that are compatible with each other.
Here I prepared a short example (somewhat similar to the one provided by @ToddMorrill) of how to integrate torchtext into a skorch pipeline:
import torch
import skorch
import random
import numpy as np
import pandas as pd
from torchtext.data import BucketIterator, Example, Dataset, Field, LabelField
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

SEED = 137
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


def data(size=1000):
    return pd.DataFrame({
        "query": ["This is a duck", "This is a goose"] * size,
        "target": [0, 1] * size,
    })


class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, fields, need_vocab=None):
        self.fields = fields
        self.need_vocab = need_vocab or {}

    def fit(self, X, y=None):
        dataset = self.transform(X, y)
        for field, min_freq in self.need_vocab.items():
            field.build_vocab(dataset, min_freq=min_freq)
        return self

    def transform(self, X, y=None):
        proc = [X[col].apply(f.preprocess) for col, f in self.fields]
        examples = [Example.fromlist(f, self.fields) for f in zip(*proc)]
        return Dataset(examples, self.fields)


def build_preprocessor():
    text_field = Field(lower=True)
    label_field = LabelField(is_target=True)
    fields = [
        ('query', text_field),
        ('target', label_field),
    ]
    return TextPreprocessor(fields, need_vocab={text_field: 0, label_field: 0})


class SimpleModule(torch.nn.Module):
    def __init__(self, vocab_size=100, emb_dim=16, lstm_hidden_dim=32):
        super().__init__()
        self._emb = torch.nn.Embedding(vocab_size, emb_dim)
        self._rnn = torch.nn.LSTM(emb_dim, lstm_hidden_dim)
        self._out = torch.nn.Linear(lstm_hidden_dim, 2)

    def forward(self, inputs):
        rnn_output = self._rnn(self._emb(inputs))[0]
        # dim is given explicitly; calling softmax without it is deprecated
        return torch.nn.functional.softmax(self._out(rnn_output[-1]), dim=-1)


class InputShapeSetter(skorch.callbacks.Callback):
    def on_train_begin(self, net, X, y):
        # NB: If your module relies on pretrained embeddings
        # net.set_params(module__embeddings=X.fields["query"].vocab.vectors)
        pass


def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        iterator_train=BucketIterator,
        iterator_valid=BucketIterator,
        train_split=Dataset.split,
        callbacks=[InputShapeSetter()],
    )
    full = make_pipeline(
        build_preprocessor(),
        model
    )
    return full


def main():
    df = data()
    assert type(df) == pd.DataFrame
    dataset = build_preprocessor().fit_transform(df)
    assert type(dataset) == Dataset
    # Putting it all together
    model = build_model().fit(
        df,  # pd.DataFrame, torchtext handles X and y
        0.7  # <<< ?? This sets split_ratio for Dataset.split
    )
    print(model.predict(df))
    assert model.score(df, df["target"]) > 0.5, "Fitting issues"


if __name__ == '__main__':
    main()
This code should work with the latest versions of the libraries. The only strange thing is that you have to pass split_ratio=0.7 through the .fit method. I guess this side effect is caused by this line in the skorch code. Perhaps there's a better solution for this.
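The side effect can be reproduced with a toy stand-in. The sketch below assumes, as the referenced line suggests, that the net calls the splitter positionally as train_split(dataset, y); ToyDataset and call_train_split are hypothetical names invented for this illustration and are not part of skorch or torchtext.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that exposes split(split_ratio)."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def split(self, split_ratio=0.5):
        cut = round(len(self.items) * split_ratio)
        return ToyDataset(self.items[:cut]), ToyDataset(self.items[cut:])


def call_train_split(train_split, dataset, y):
    # Assumed calling convention: the splitter receives (dataset, y) positionally.
    return train_split(dataset, y)


ds = ToyDataset(range(10))

# With the unbound method as splitter, y=0.7 lands in the split_ratio slot.
train, valid = call_train_split(ToyDataset.split, ds, 0.7)

# With y=None, the same call tries to use None as a ratio and fails.
try:
    call_train_split(ToyDataset.split, ds, None)
    failed_with_none = False
except TypeError:
    failed_with_none = True
```

So the value passed as y is simply reinterpreted by the unbound split method as its first argument after self.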
@BenjaminBossan It looks like you are a member of the dev team. Probably 594 is somehow related to the torchtext topic. If you raise an error on IterableDataset, then you will lose this torchtext support. I might be wrong.
Once again sorry for spamming.
@kqf Thanks for posting the example, I'm taking a look at it. At the end of the day, I think it would be nice to add a notebook that showcases how to use torchtext. Ideally, it should use one of the torchtext datasets like IMDB and pretrained embeddings.
The only strange thing is that you have to pass split_ratio=0.7 through .fit method
Yes, that works, but it's a bit of a hacky solution. This solution here should be clearer:
from functools import partial

def my_train_split(dataset, y, split_ratio):
    return dataset.split(split_ratio=split_ratio)

...
def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        iterator_train=BucketIterator,
        iterator_valid=BucketIterator,
        train_split=partial(my_train_split, split_ratio=0.7),
        callbacks=[InputShapeSetter()],
    )
    ...

model = build_model().fit(df)  # no need to pass split_ratio here
@BenjaminBossan
it should use one of the torchtext datasets like IMDB and pretrained embeddings.
It's totally doable, I didn't want to download the data/embeddings on my private laptop.
Yes, that works, but it's a bit of a hacky solution. This solution here should be clearer:
Yes, I agree, but that was one of my intentions: to demonstrate that skorch is compatible with torchtext without extra code and to show a strange skorch behaviour. I would expect that if I pass .fit(X, y=None), then y will not be passed to the split function. I think it should be handled on the skorch side, and it deserves an issue of its own 🤷
What do you think?
It's totally doable, I didn't want to download the data/embeddings on my private laptop.
Yes, what you posted is a really good starting point.
without extra code
I think those two lines are acceptable :)
I would expect if I pass .fit(X, y=None) then y will not be passed to the split function.
I think that could make sense. Do you want to work on this change?
In the meantime, I tried to implement a torchtext example with skorch that's a bit closer to a real world problem someone could have. It uses skorch with torchtext and BERT (via huggingface). Here is the notebook:
@kqf @ToddMorrill since you know torchtext much better than I do, could you check if what I did makes sense? E.g., I don't really understand what all this Field, TEXT, LABEL, and build_vocab stuff does. For reference, my notebook is basically a re-implementation of this notebook.
The main change that I had to introduce was to slightly change BucketIterator:
class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return batch.text and batch.label, corresponding to X and y
            yield batch.text, batch.label.long()
skorch basically really wants to always have an X and a y, because this is what sklearn expects. With the shown change, we get that. (I didn't quite get why batch.label is int32; surely there is a better way to change that.) Apart from this, I could re-use most of the code from the original notebook.
ping @ottonemo maybe this is also interesting for you.
I think that could make sense. Do you want to work on this change?
Yes, I'd love to help, but I will have time only on weekends. If it's ok -- I am in.
since you know torchtext much better than I do, could you check if what I did makes sense? E.g., I don't really understand what all this Field, TEXT, LABEL, and build_vocab stuff does. For reference, my notebook is basically a re-implementation of this notebook.
I am not an expert in torchtext either, but your code looks fine. TEXT and LABEL are instances of the Field class. Fields are "applied" to examples to extract the information needed. The fields define all necessary transformations, and build_vocab is similar to the .fit method for transformers (so you have to apply it to the train data only).
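To illustrate the analogy between build_vocab and .fit, here is a deliberately simplified vocabulary in plain Python. ToyVocab is a hypothetical class written for this sketch and is not torchtext's implementation; the special tokens and their indices are assumptions.

```python
from collections import Counter


class ToyVocab:
    """Hypothetical, simplified vocabulary; not torchtext's implementation."""

    def __init__(self, min_freq=1):
        self.min_freq = min_freq
        self.stoi = {"<unk>": 0, "<pad>": 1}

    def build(self, tokenized_texts):
        # "Fit" step: learn the token-to-index mapping from training data only.
        counts = Counter(tok for text in tokenized_texts for tok in text)
        for tok, n in sorted(counts.items()):
            if n >= self.min_freq and tok not in self.stoi:
                self.stoi[tok] = len(self.stoi)
        return self

    def numericalize(self, tokens):
        # "Transform" step: unseen tokens map to the <unk> index.
        return [self.stoi.get(tok, 0) for tok in tokens]


train = [["this", "is", "a", "duck"], ["this", "is", "a", "goose"]]
test = [["this", "is", "a", "swan"]]

vocab = ToyVocab().build(train)
train_ids = vocab.numericalize(train[0])
test_ids = vocab.numericalize(test[0])
print(train_ids, test_ids)  # [6, 5, 2, 3] [6, 5, 2, 0]
```

Because the vocabulary never saw "swan", the test sentence ends in the unknown index, which is the behavior you want when the mapping is learned from the training split alone.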
I like the way you are handling torchtext.data.Batch. It's really a good one.
skorch basically really wants to always have an X and a y, because this is what sklearn expects
I think what you are saying here is important. The default NeuralNet was designed to be a supervised model. Today, there are more and more unsupervised and semi-supervised DL applications, so maybe it would make sense to add UnsupervisedNeuralNet or something like this. I think this would still be compatible with sklearn, as they have support for clustering and manifold learning.
@BenjaminBossan One more thing about examples with torchtext, and it is directly related to the issue. Today I was trying to use skorch together with torchtext for metric learning. For this problem, you have to pass two fields to the forward method, and y should remain empty. I will not provide the full example here, as it may be a bit lengthy, but it would probably be useful to have a notebook that shows how to achieve that?
In any event, if you have to pass multiple fields to the forward method, you have to make two modifications:
1. Make the iterator yield a dict of the input fields as X:
from operator import attrgetter
def batch2dict(batch):
    return {f: attrgetter(f)(batch) for f in batch.input_fields}
class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # we return dict() and empty tensor, corresponding to X and y
            yield batch2dict(batch), torch.empty(0)
2. You have to use the `Field(batch_first=True)` option when creating the fields, otherwise `skorch` will complain about the inconsistent length of the dataset.
So, this should demonstrate how to use multiple fields with `skorch`; I hope someone will find it useful.
Yes, I'd love to help, but I will have time only on weekends. If it's ok -- I am in.
No problem at all. If you need help along the way, just ask.
The default NeuralNet was designed to be a supervised model. Today, there are more and more unsupervised and semi-supervised DL applications, so maybe it will make some sense add UnsupervisedNeuralNet or something like this. I think this still will be compatible with sklearn as they have support for clustering and Manifold learning.
NeuralNetClassifier, NeuralNetBinaryClassifier, and NeuralNetRegressor are explicitly modeled to be for supervised learning. NeuralNet is more open-ended and should be used for anything unsupervised. As with sklearn's unsupervised models, we support calling fit(X) without passing y there.
So, this should demonstrate how to use multiple fields with skorch, hope someone will find it useful.
Thanks for providing the example.
Today I was trying to use skorch together with torchtext for metric learning. For this problem, you have to pass two fields to the forward method, and y should remain empty.
I'm curious what exactly you are doing there. I implemented some metric learning approaches in the past, typically using something like a Siamese net. You could use the target to indicate which samples belong together. I moved the main logic for the metric learning to the criterion, so that the module was just returning the representations. But that might not fit your use case. And if you want to add goodies like triplet mining, it can become complicated fast (see discussion here).
I'm curious what exactly you are doing there.
If you ask about the application, it's a chatbot (there is a database with replies, so the model needs to find the most relevant one when supplied with the user query). And it's very similar to the example you mentioned. In my case, it's somewhat easier as I do hard and semi-hard negative mining within a batch. I decided to separate the logic: I have a separate encoder-towers module and a loss module that mines hard negatives and calculates the triplet loss. I didn't know about skorch.toy, thanks.
You could use the target
What do you mean by target? Is it Field(is_target=True)?
You could use the target
I just meant that the y could contain, for instance, the clusters that your samples belong to, so that you could use it for mining. But it seems you already found a solution that works for you :+1:
class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return batch.text and batch.label, corresponding to X and y
            yield batch.text, batch.label.long()
This is working really well. My epoch times were pretty much cut in half with this modification. Thank you for your example @BenjaminBossan.
I just tried to plug this into a grid search like the following and got an error. I'm including the traceback for reference. I can try to look into the error but I'm not too familiar with sklearn's internals. Is there a way forward here?
search = RandomizedSearchCV(skorch_model, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
search.fit(X=dev_dataset, y=None)
Traceback
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-fa3b744e6e6b> in <module>
1 search = RandomizedSearchCV(skorch_model, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
----> 2 search.fit(X=dev_dataset, y=None)
~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
648 refit_metric = 'score'
649
--> 650 X, y, groups = indexable(X, y, groups)
651 fit_params = _check_fit_params(X, fit_params)
652
~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
246 """
247 result = [_make_indexable(X) for X in iterables]
--> 248 check_consistent_length(*result)
249 return result
250
~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
206 """
207
--> 208 lengths = [_num_samples(X) for X in arrays if X is not None]
209 uniques = np.unique(lengths)
210 if len(uniques) > 1:
~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
206 """
207
--> 208 lengths = [_num_samples(X) for X in arrays if X is not None]
209 uniques = np.unique(lengths)
210 if len(uniques) > 1:
~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
148
149 if hasattr(x, 'shape') and x.shape is not None:
--> 150 if len(x.shape) == 0:
151 raise TypeError("Singleton array %r cannot be considered"
152 " a valid collection." % x)
TypeError: object of type 'generator' has no len()
@ToddMorrill Thanks for reporting.
It's not easy for me to deduce what's going on. Could you either provide me with a minimal code sample to reproduce the error or check the following things for me (by using a debugger):
- Does the code run when you wrap your dataset using skorch's SliceDataset?
- What is the type of x and x.shape in the last step?
- In this line: --> 650 X, y, groups = indexable(X, y, groups) what are the types of X and y?
- Here: search.fit(X=dev_dataset, y=None), what is the type of dev_dataset?
- When fitting without RandomizedSearchCV, everything runs fine?
I was able to reproduce it with your example by adding the following lines to the bottom of the script.
params = {'module__hidden_dim': [128, 256],
'module__n_layers': [1, 2],
'module__bidirectional': [False, True],
'module__dropout': [0.2, 0.25]}
from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(net, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
# we can set y=None because the labels are contained inside the dataset
search.fit(ds_train, y=None)
could you please check if the code runs when you wrap your dataset using skorch's SliceDataset?
X_sl = SliceDataset(ds_train)
search.fit(X_sl, y=None)
Running this results in the following output. No errors but it didn't train.
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total= 0.0s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total= 0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total= 0.0s
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 1.2s finished
what is the type of x and x.shape in the last step?
From the debugger:
type(x) == torchtext.datasets.imdb.IMDB
type(x.shape) == generator
I believe x is just ds_train, i.e. type(ds_train) == torchtext.datasets.imdb.IMDB. x.shape (i.e. ds_train.shape) results in a generator. I was able to reproduce the error with the following code.
len(ds_train.shape)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-e50229cbe695> in <module>
----> 1 len(ds_train.shape)
TypeError: object of type 'generator' has no len()
In this line:
--> 650 X, y, groups = indexable(X, y, groups)
what are the types of X and y?
From the debugger:
type(X) == torchtext.datasets.imdb.IMDB
type(y) == None
Here: search.fit(X=dev_dataset, y=None), what is the type of dev_dataset?
type(dev_dataset) == torchtext.data.dataset.Dataset
When fitting without RandomizedSearchCV, everything runs fine?
Yes, it's fantastic!
Thanks for investigating @ToddMorrill
I tracked down the weird generator error and this is the cause:
In my opinion, this is a bug in the torchtext library since, for any unknown attribute, accessing it on a dataset will return an empty generator. If the attribute is not known, they should definitely raise an AttributeError (as prescribed by the Python docs). However, for that to happen, __getattr__ should not be a generator.
Basically any code that does one of the following
hasattr(dataset, attr)
foo = getattr(dataset, attr, None); if foo ...
try: dataset.foo ... except AttributeError: ...
with an unknown attribute will do the wrong thing. This is especially grave with sklearn, since sklearn will at one point check hasattr(X, 'loc') or hasattr(X, 'iloc') to determine if the input is a pandas DataFrame; obviously this will cause a lot of trouble.
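The effect is easy to reproduce in isolation. The following sketch uses two hypothetical classes (LeakyDataset and StrictDataset, invented here and not copied from torchtext) to show why a __getattr__ that contains `yield` makes hasattr succeed for every name.

```python
class LeakyDataset:
    """Mimics the problematic pattern: __getattr__ is a generator function."""

    def __init__(self, fields, examples):
        self.fields = fields
        self.examples = examples

    def __getattr__(self, attr):
        # Because of `yield`, calling this always returns a generator object,
        # even when `attr` is unknown; the body only runs on iteration.
        if attr in self.fields:
            for x in self.examples:
                yield x[attr]


class StrictDataset:
    """Same data, but unknown attributes raise AttributeError as expected."""

    def __init__(self, fields, examples):
        self.fields = fields
        self.examples = examples

    def __getattr__(self, attr):
        if attr in self.fields:
            return [x[attr] for x in self.examples]
        raise AttributeError(attr)


examples = [{"text": "a duck"}, {"text": "a goose"}]
leaky = LeakyDataset(["text"], examples)
strict = StrictDataset(["text"], examples)

leaky_has_shape = hasattr(leaky, "shape")    # True: an empty generator is returned
strict_has_shape = hasattr(strict, "shape")  # False: AttributeError is raised
```

With the leaky variant, `len(leaky.shape)` fails with the same "object of type 'generator' has no len()" error seen in the traceback.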
I tried to override their __getattr__ like this:
def __getattr__(self, attr):
    if attr in self.fields:
        return [getattr(x, attr) for x in self.examples]
    else:
        raise AttributeError("no attribute", attr)
However, then I run into the next problem, namely these lines:
They basically rely on the faulty __getattr__ behavior there. At that point, I gave up; who knows how many parts of their code are still affected by this.
So overall, I'm sorry to say that you might just not be able to combine RandomizedSearchCV with the torchtext example without some serious hacking. At least I see no easy fix. But for me, this is a problem on the torchtext side and I wouldn't want to implement any fixes on the skorch side. Perhaps you can compel the torchtext devs to fix the issue but it could be hard to do that.
Thanks for that explanation @BenjaminBossan. I filed a bug with torchtext. Let's see if they pick it up.
Quick update on this. torchtext is rolling out some new design patterns that more closely mirror torch.utils.data.
This describes their plans a bit more. I'm hoping in the long run this will make torchtext more seamlessly compatible with skorch and sklearn.
Thanks for reporting back. I read it but since I'm not familiar with torchtext, I can't really judge the changes. The general idea seems to be good. Whether it makes it easier to integrate with skorch will have to be seen.
@ToddMorrill do you have any experience with using the facilities provided by huggingface instead of torchtext? I wonder if those cooperate better with skorch. I think it could also be interesting to provide sklearn transformers that wrap their tokenizers, which would allow integrating them into an sklearn pipeline.
I haven't had a chance to use huggingface's tools, but it's on my current project's roadmap. I'll share if I get anything running.
Hey @BenjaminBossan, quick question. Circling back to my comment above - would it be possible to use RandomizedSearchCV when y=None? I'm working on a little project that uses torch.utils.data.Dataset and torch.utils.data.DataLoader. Everything works fine with vanilla training (i.e. skorch_model.fit(train_dataset, y=None)) but when I try the same setup with search.fit(train_dataset, y=None) I get TypeError: fit() missing 1 required positional argument: 'y'. I can see that y=None is possible for unsupervised learning but naturally, my goal is supervised learning.
@ToddMorrill could you try if one of these three proposals works for you?
1) Pass a dummy value as y with the correct shape (might not work, depending on the metric)
2) Extract your y value from your dataset (e.g. y = torch.cat([dataset[i][1] for i in range(len(dataset))]).numpy())
3) Pass y=SliceDataset(dataset, idx=1), assuming that index 1 is your target (details)
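As a rough illustration of the second proposal, the sketch below extracts the targets from a toy map-style dataset. ToyDataset is a hypothetical class made up for this example; note that it uses torch.stack because its labels are 0-dim tensors, whereas torch.cat, as written in the proposal, needs labels with at least one dimension.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset returning (features, label) pairs."""

    def __init__(self):
        self.X = [torch.tensor([1, 2]), torch.tensor([3]), torch.tensor([4, 5, 6])]
        self.y = [torch.tensor(0), torch.tensor(1), torch.tensor(0)]

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]


dataset = ToyDataset()

# Collect the label of every sample and turn the result into a NumPy array,
# which is the form sklearn's splitting and scoring utilities expect for y.
y = torch.stack([dataset[i][1] for i in range(len(dataset))]).numpy()
```

This only shows how to build y; whether it resolves a particular search.fit error depends on how X is passed alongside it.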
Good thoughts!
I tried all 3 techniques and you can see the example that I'm working on for the dask team here. There's a section in this notebook titled "Grid search with Skorch" where you'll see all 3 attempts that all resulted in ValueError: Dataset does not have consistent lengths.
Could you please paste the full stack trace for the error? I assume it's the same for all 3 cases?
Indeed, the error and stack trace were the same for all 3 cases. Here it is.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/classifier.py", line 142, in fit
return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 854, in fit
self.partial_fit(X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 813, in partial_fit
self.fit_loop(X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 717, in fit_loop
X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1198, in get_split_datasets
dataset = self.get_dataset(X, y)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
return dataset(X, y, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/skorch/dataset.py", line 165, in __init__
len_X = get_len(X)
File "/opt/conda/lib/python3.7/site-packages/skorch/dataset.py", line 76, in get_len
raise ValueError("Dataset does not have consistent lengths.")
ValueError: Dataset does not have consistent lengths.
FitFailedWarning)
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.6472 0.6690 0.5837 1.7154
2 0.4745 0.8010 0.4465 1.5456
This is interesting, it looks like it works a few times and then suddenly it breaks.
Could you please initialize the net with skorch_model = NeuralNetClassifier(..., dataset=TorchDataset) and see if that works? This parameter determines which dataset class is used for the skorch internal split and, as is, skorch.dataset.Dataset is used, which is not what you want.
After trying that, regardless of whether it helps, please do the following: skorch_model = NeuralNetClassifier(..., train_split=False). This turns off the skorch internal train/valid split and should also prevent the error. Typically, when you perform a hyper-parameter search, you don't need the skorch internal split, since sklearn will already take care of splitting the data for you.
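For anyone reading along, the stack trace above ends in a length check inside skorch.dataset.Dataset. The following is a simplified pure-Python sketch of how such a check could behave, assuming it compares the lengths of every input it finds in a nested container. The names collect_lengths and checked_len are hypothetical and this is not skorch's actual code; it only illustrates why a list of variable-length lists would trip the check.

```python
def collect_lengths(data):
    """Gather len() of every input found in a nested container.

    Hypothetical helper: dicts, lists and tuples are treated as containers
    of separate inputs. If an element has no len() (e.g. an int), the
    enclosing list itself is treated as one input.
    """
    if isinstance(data, dict):
        lengths = []
        for value in data.values():
            lengths.extend(collect_lengths(value))
        return lengths
    if isinstance(data, (list, tuple)):
        try:
            lengths = []
            for item in data:
                lengths.extend(collect_lengths(item))
            return lengths
        except TypeError:
            # Elements are scalars, so measure the list itself.
            return [len(data)]
    return [len(data)]


def checked_len(data):
    """Return the common length of all inputs, or raise if they differ."""
    lengths = set(collect_lengths(data))
    if len(lengths) != 1:
        raise ValueError("Dataset does not have consistent lengths.")
    return lengths.pop()


# Equal-length inner lists pass the check.
print(checked_len([[1, 2, 3], [4, 5, 6]]))

# Variable-length inner lists fail it.
try:
    checked_len([[1, 3], [0, 40, 16]])
except ValueError as exc:
    print("ValueError:", exc)
```

Under this assumption, each inner list is read as a separate input rather than as a sample, so the differing lengths are reported as an inconsistent dataset.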
FWIW, the default value for the refit parameter in RandomizedSearchCV is True, so I think the one success you're seeing might be the result of that. After setting refit=False that one success disappears.
This turns off the skorch internal train/valid split and should also prevent the error. Typically, when you perform a hyper-parameter search, you don't need the skorch internal split, since sklearn will already take care of splitting the data for you.
Makes sense, thanks for the insight.
Running with skorch_model = NeuralNetClassifier(..., dataset=TorchDataset) yields this error for all 3 approaches outlined above, both with skorch_model = NeuralNetClassifier(..., train_split=skorch.dataset.CVSplit(.2)) and with skorch_model = NeuralNetClassifier(..., train_split=False).
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/classifier.py", line 142, in fit
return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 854, in fit
self.partial_fit(X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 813, in partial_fit
self.fit_loop(X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 717, in fit_loop
X, y, **fit_params)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1198, in get_split_datasets
dataset = self.get_dataset(X, y)
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
return dataset(X, y, **kwargs)
TypeError: __init__() takes 2 positional arguments but 3 were given
FitFailedWarning)
Do you think this is because my custom TorchText class is only expecting 1 argument, namely train_dataset, and not y?
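The TypeError message is at least consistent with that reading: the traceback shows the dataset class being called as dataset(X, y, **kwargs), and a constructor that accepts only one argument besides self fails on such a call. A minimal sketch, using hypothetical stand-in classes (OneArgDataset, TwoArgDataset) rather than the actual class from the notebook, and plain Python objects instead of torch.utils.data.Dataset:

```python
class OneArgDataset:
    """Hypothetical dataset whose constructor takes only the data."""

    def __init__(self, train_dataset):
        self.train_dataset = train_dataset


class TwoArgDataset:
    """Hypothetical dataset that also accepts an optional target y."""

    def __init__(self, X, y=None):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        # A real DataLoader may need a placeholder target instead of None.
        target = self.y[i] if self.y is not None else None
        return self.X[i], target


X = [[1, 3], [0, 40, 16]]
y = [0, 1]

# Calling the one-argument constructor with (X, y) raises a TypeError.
try:
    OneArgDataset(X, y)
except TypeError as exc:
    print("TypeError:", exc)

# The two-argument version accepts the same call.
dataset = TwoArgDataset(X, y)
print(len(dataset), dataset[1])
```

Whether this is the right fix here depends on why that code path is reached at all, which the sketch does not address.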
That's a bit strange:
File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
return dataset(X, y, **kwargs)
This code path should never be reached because this line comes before it:
Could you maybe turn on the debugger and check the value of X at that point?
@ToddMorrill any updates?
Since there haven't been any updates for quite a while, I assume this has been resolved. Feel free to re-open if not.
I'm having a hard time figuring out how to pass a list of lists (with variable length) to skorch's fit method. Specifically, I have a feature that is a list of IDs (e.g. [[1, 12, 3], [6, 22]...]) which are converted to a dense representation using an embedding table in my PyTorch module's forward method:
When I call net.fit() on my data set (e.g. {"X_float": ..., "X_id_list": ...}) I get the following error caused by the list of lists:
I've also tried converting the list of lists to a pandas dataframe and numpy array (of objects) and neither works. How do you handle variable length lists of lists in skorch.fit?