mila-iqia / fuel

A data pipeline framework for machine learning
MIT License

SVHN which_format=1 crashes #241

Open cooijmanstim opened 9 years ago

cooijmanstim commented 9 years ago

Running this:

from fuel import schemes, streams, transformers, datasets
dataset = datasets.SVHN(which_format=1,
                        which_sets=["test"],
                        subset=slice(3))
stream = streams.DataStream.default_stream(
    dataset=dataset,
    iteration_scheme=schemes.SequentialScheme(3, 3))
batch = stream.get_epoch_iterator(as_dict=True).next()

Results in this error:

Traceback (most recent call last):
  File "svhn_crash.py", line 8, in <module>
    batch = stream.get_epoch_iterator(as_dict=True).next()
  File "/u/cooijmat/.conda/envs/dev/lib/python2.7/site-packages/six.py", line 535, in next
    return type(self).__next__(self)
  File "/u/cooijmat/dev/fuel/fuel/iterator.py", line 32, in __next__
    data = self.data_stream.get_data()
  File "/u/cooijmat/dev/fuel/fuel/transformers/__init__.py", line 151, in get_data
    return self.transform_batch(data)
  File "/u/cooijmat/dev/fuel/fuel/transformers/__init__.py", line 183, in transform_batch
    return self.transform_any(batch)
  File "/u/cooijmat/dev/fuel/fuel/transformers/__init__.py", line 305, in transform_any
    data=data, method=self.transform_any_source)
  File "/u/cooijmat/dev/fuel/fuel/transformers/__init__.py", line 250, in _apply_sourcewise_transformation
    data[i] = method(data[i], source_name)
  File "/u/cooijmat/dev/fuel/fuel/transformers/__init__.py", line 414, in transform_any_source
    return numpy.asarray(source_data, dtype=self.dtype)
  File "/u/cooijmat/.conda/envs/dev/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

The problem is that fuel.datasets.SVHN produces ragged batches that are numpy arrays with dtype "object", and its default transformer tries to cast the arrays to floatX. Not sure what the proper solution is.
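
The failure can probably be reproduced without Fuel at all. The sketch below is an assumption about what the ragged batch looks like (a 1-D object array holding images of different sizes); the shapes and the helper name `try_cast` are made up for illustration and are not taken from the SVHN data or from Fuel.

```python
import numpy

# Hypothetical ragged batch: two "images" of different sizes stored
# in a 1-D array of dtype object (shapes chosen only for illustration).
ragged = numpy.empty(2, dtype=object)
ragged[0] = numpy.zeros((3, 4, 5), dtype="uint8")
ragged[1] = numpy.zeros((3, 6, 7), dtype="uint8")


def try_cast(batch, dtype="float32"):
    """Return the cast array, or the exception NumPy raised instead."""
    try:
        # Same call as the last Fuel frame in the traceback.
        return numpy.asarray(batch, dtype=dtype)
    except (ValueError, TypeError) as error:
        return error


result = try_cast(ragged)
print(type(result).__name__, result)
```

If this behaves like the traceback above, `result` is an exception rather than an array, because each element of the object array is itself an array and cannot be stored in a single float slot.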

vdumoulin commented 9 years ago

Sorry about the delay! It seems numpy can't cast an array of object dtype to a numeric dtype when its elements are themselves arrays.

I think Cast could benefit from being more careful about how it operates: lists and numpy arrays of object dtype should be cast elementwise instead of being shoehorned into regular numpy arrays.

I'll try to come up with a solution for that.
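
For reference, the elementwise cast suggested above could look roughly like the sketch below. This is not Fuel's actual fix: `cast_elementwise` is a hypothetical helper, and it assumes that ragged batches arrive either as a list or as a 1-D object array whose entries are the individual examples.

```python
import numpy


def cast_elementwise(source_data, dtype):
    """Cast a batch to `dtype`, handling ragged containers per element.

    Hypothetical helper, not part of Fuel. Assumes object arrays are 1-D,
    with one example per entry.
    """
    if isinstance(source_data, list):
        # Keep the list container and cast each example separately.
        return [numpy.asarray(element, dtype=dtype) for element in source_data]
    if isinstance(source_data, numpy.ndarray) and source_data.dtype == object:
        # Rebuild an object array so the batch stays ragged.
        result = numpy.empty(len(source_data), dtype=object)
        for i, element in enumerate(source_data):
            result[i] = numpy.asarray(element, dtype=dtype)
        return result
    # Regular (non-ragged) data: a plain cast is enough.
    return numpy.asarray(source_data, dtype=dtype)


# Usage with an illustrative ragged batch of two differently sized arrays.
batch = numpy.empty(2, dtype=object)
batch[0] = numpy.zeros((2, 3), dtype="uint8")
batch[1] = numpy.zeros((4, 5), dtype="uint8")
cast_batch = cast_elementwise(batch, "float32")
```

One trade-off of this approach is that every list is treated as potentially ragged, so a list of equally shaped examples is no longer stacked into a single regular array.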