libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Unexpected pipeline behavior with NDArrayField #271

Open rmchurch opened 1 year ago

rmchurch commented 1 year ago

I don't think this is a bug, but as a new user it surprised me. Perhaps it's documented, but if not, it probably should be. I define a dataset like so:

import numpy as np
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField

# Per-sample shape shared by 'data' and 'target'
fshape = (1, 31, 32)
writer = DatasetWriter(write_path, {
    'data': NDArrayField(shape=fshape, dtype=np.dtype('float64')),
    'target': NDArrayField(shape=fshape, dtype=np.dtype('float64')),
    'vol': NDArrayField(shape=(1, fshape[-1]), dtype=np.dtype('float64')),
    'temp': NDArrayField(shape=(1,), dtype=np.dtype('float64')),
}, num_workers=64)

After writing the .beton file, I first tried creating a loader that uses the same pipeline list for every field:

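from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder
from ffcv.transforms import ToTensor

# A single pipeline list (one NDArrayDecoder instance) reused for all fields
float_pipeline = [NDArrayDecoder(), ToTensor()]

# Pipeline for each data field
pipelines = {
    'data': float_pipeline,
    'target': float_pipeline,
    'vol': float_pipeline,
    'temp': float_pipeline
}

loader = Loader(ffcv_file, batch_size=64, num_workers=8,
                order=OrderOption.RANDOM, pipelines=pipelines)
data, target, vol, temp = next(iter(loader))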

However, every returned variable has the shape of the smallest array, in this case temp. For example, data has shape (Nbatch, 1) where it should be (Nbatch, 1, 31, 32).
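
One plausible explanation, not confirmed anywhere in this thread, is that the decoder object keeps per-field state, so reusing one instance across fields lets the last field bound to it overwrite the others. The sketch below illustrates that general pitfall in plain Python. `ToyField`, `ToyDecoder`, and `accept_field` as used here are hypothetical stand-ins written for illustration and are not FFCV's actual classes or internals.

```python
# Hypothetical stand-ins, NOT FFCV's real classes: they only model a decoder
# that remembers the field it was most recently bound to.
class ToyField:
    def __init__(self, shape):
        self.shape = shape


class ToyDecoder:
    def __init__(self):
        self.field = None

    def accept_field(self, field):
        # Binding overwrites any earlier binding on the same instance.
        self.field = field

    def output_shape(self):
        return self.field.shape


fields = {
    'data': ToyField((1, 31, 32)),
    'target': ToyField((1, 31, 32)),
    'vol': ToyField((1, 32)),
    'temp': ToyField((1,)),
}

# Case 1: one decoder instance reused for every field.
shared = ToyDecoder()
shared_pipelines = {name: shared for name in fields}
for name, field in fields.items():
    shared_pipelines[name].accept_field(field)
shared_shapes = {name: dec.output_shape()
                 for name, dec in shared_pipelines.items()}

# Case 2: a fresh decoder instance per field.
separate_pipelines = {name: ToyDecoder() for name in fields}
for name, field in fields.items():
    separate_pipelines[name].accept_field(field)
separate_shapes = {name: dec.output_shape()
                   for name, dec in separate_pipelines.items()}

print(shared_shapes)    # every entry reports the shape of the last field bound
print(separate_shapes)  # every entry reports its own field's shape
```

In this toy model the shared instance ends up reporting the shape of 'temp' for all four fields, which matches the symptom described above. Whether FFCV behaves this way for the same reason would need to be checked against its source.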

When I create a separate pipeline for each variable that has a different shape, the shapes come out correctly:

float_pipeline = [NDArrayDecoder(), ToTensor()]
vol_pipeline = [NDArrayDecoder(), ToTensor()]
T_pipeline = [NDArrayDecoder(), ToTensor()]

# Pipeline for each data field
pipelines = {
    'data': float_pipeline,
    'target': float_pipeline,
    'vol': vol_pipeline,
    'temp': T_pipeline
}
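
A small helper can make the workaround harder to get wrong by building a fresh pipeline list for every field, so nothing is shared even between fields of the same shape. This is only a sketch: `build_pipelines` and `make_placeholder_pipeline` are hypothetical names introduced here, and the placeholder factory stands in for `[NDArrayDecoder(), ToTensor()]` so that the example runs without FFCV installed.

```python
from typing import Callable, Dict, Iterable, List


def build_pipelines(field_names: Iterable[str],
                    make_pipeline: Callable[[], List[object]]
                    ) -> Dict[str, List[object]]:
    """Call make_pipeline once per field so no list or operation is shared."""
    return {name: make_pipeline() for name in field_names}


# With FFCV installed, the factory would presumably look like this:
#   from ffcv.fields.decoders import NDArrayDecoder
#   from ffcv.transforms import ToTensor
#   make_pipeline = lambda: [NDArrayDecoder(), ToTensor()]
# A placeholder factory is used below so the sketch runs on its own.
def make_placeholder_pipeline() -> List[object]:
    return [object(), object()]


pipelines = build_pipelines(['data', 'target', 'vol', 'temp'],
                            make_placeholder_pipeline)
```

The resulting dictionary could then be passed as the `pipelines` argument of `Loader`, exactly as in the snippets above.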