libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

ValueError: could not broadcast input array from shape #103

Closed carmocca closed 2 years ago

carmocca commented 2 years ago
import os

import numpy as np
from ffcv.fields import NDArrayField
from ffcv.writer import DatasetWriter

class LinearRegressionDataset:
    def __init__(self, N, d):
        self.X = np.random.randn(N, d)
        self.Y = np.random.randn(N)

    def __getitem__(self, idx):
        # FIXME: returning a tuple is forced here
        return self.X[idx].astype('float32')

    def __len__(self):
        return len(self.X)

def run():
    cwd = os.getcwd()
    dataset_path = os.path.join(cwd, "random")

    # https://docs.ffcv.io/writing_datasets.html#writing-a-dataset-to-ffcv-format
    N, d = 100, 6
    dataset = LinearRegressionDataset(N, d)
    fields = {"covariate": NDArrayField(shape=(d,), dtype=np.dtype("float32"))}
    writer = DatasetWriter(dataset_path, fields)
    writer.from_indexed_dataset(dataset)

if __name__ == "__main__":
    run()

Running this script raises:

  File "/home/carlos/miniconda3/envs/ffcv/lib/python3.8/site-packages/ffcv/fields/ndarray.py", line 94, in encode
    data_region[:] = field.reshape(-1).view('<u1')
ValueError: could not broadcast input array from shape (4,) into shape (24,)

The cause appears to be that the output of __getitem__ must be a tuple when using NDArrayField. The following change resolves the issue:

-        return self.X[idx].astype('float32')
+        return (self.X[idx].astype('float32'),)
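
When the dataset class itself cannot be edited, the same fix could be applied from the outside. The sketch below is only an illustration: SingleFieldView is a hypothetical helper name, not part of ffcv, and a plain NumPy array stands in for the dataset.

```python
import numpy as np


class SingleFieldView:
    """Hypothetical adapter: exposes each sample of `base` as a 1-tuple."""

    def __init__(self, base):
        self.base = base

    def __getitem__(self, idx):
        # One tuple entry per field; here there is exactly one field.
        return (self.base[idx],)

    def __len__(self):
        return len(self.base)


# Stand-in for a dataset that yields bare float32 vectors of shape (d,).
N, d = 100, 6
vectors = np.random.randn(N, d).astype("float32")

wrapped = SingleFieldView(vectors)
sample = wrapped[0]
print(len(sample), sample[0].shape)
```

The wrapped object, rather than the original dataset, would then be the one handed to writer.from_indexed_dataset.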

Is this a bug or a hard requirement?

GuillaumeLeclerc commented 2 years ago

__getitem__ needs to return an indexable object (tuple, ndarray, list, ...) whose length equals the number of fields specified in the DatasetWriter, so that each field can be associated with the corresponding data. Here you have 1 field, so the length should be 1, but

self.X[idx].astype('float32')

has a length of d, so it won't work: the first dimension (d) is interpreted as the field dimension. The field then receives a single float32 element (4 bytes) instead of the whole vector (6 x 4 = 24 bytes), which is the shape mismatch reported in the error.
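
The byte counts in the error can be reproduced with NumPy alone. This is a minimal sketch that imitates the assignment shown in the traceback; the 24-byte buffer and the single indexing step are assumptions about what the writer does, not ffcv's actual code.

```python
import numpy as np

d = 6
sample = np.random.randn(d).astype("float32")  # what __getitem__ returned

# Assumed buffer for one float32 vector of shape (d,): d * 4 = 24 bytes.
data_region = np.zeros(d * 4, dtype="<u1")

# With a single field, the sample is indexed once, so the field sees
# sample[0]: one float32 scalar rather than the whole vector.
field = sample[0]
raw = field.reshape(-1).view("<u1")  # 4 bytes

message = ""
try:
    data_region[:] = raw  # same statement as in the traceback
except ValueError as err:
    message = str(err)
print(message)

# Wrapping the vector in a 1-tuple makes index 0 the full vector (24 bytes).
wrapped = (sample,)
data_region[:] = wrapped[0].reshape(-1).view("<u1")
```

With the tuple, the element at index 0 is the full vector, so the byte view has the same length as the reserved region and the assignment succeeds.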