uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.78k stars 285 forks

Error while using pytorch dataloader with petastorm #523

Closed nikitamehrotra12 closed 4 years ago

nikitamehrotra12 commented 4 years ago

Hi,

Can we use the Petastorm PyTorch DataLoader to load a float array? I tried doing so but got the following error:

[screenshot of the error]

selitvin commented 4 years ago

Can you please provide instructions to reproduce this case? Or are you using the vanilla example from the petastorm repo?

nikitamehrotra12 commented 4 years ago

Sure. Here is the code:

from pyspark.sql.types import IntegerType
import numpy as np
from petastorm.codecs import NdarrayCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField

GCJSchema = Unischema('GCJSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.float64, (None, None), NdarrayCodec(), False),
    UnischemaField('image2', np.float64, (None, None), NdarrayCodec(), False),
    UnischemaField('edge1', np.float64, (2, None), NdarrayCodec(), False),
    UnischemaField('edge2', np.float64, (2, None), NdarrayCodec(), False),
    UnischemaField('y', np.float64, (), ScalarCodec(IntegerType()), False)
])

def row_generator(x):
    # This function reads from a text file and returns arrays; for simplicity,
    # random values are returned here instead.
    return {'id': x,
            'image1': np.random.randn(128, 256, 3),
            'image2': np.random.randn(205, 205),
            'edge1': np.random.randn(2, 32),
            'edge2': np.random.randn(2, 100),
            'y': np.array([1])}

and for loading the above-created dataset I used the following code:

from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

def pytorch_GCJData(dataset_url='file:///home/nikitam/Desktop/GeniePath/GCJW2V1_Train'):
    with DataLoader(make_batch_reader(dataset_url)) as train_loader:
        sample = next(iter(train_loader))
        print("id batch: {0}".format(sample['id']))

selitvin commented 4 years ago

What happens is that you wrote a "Petastorm" dataset (as opposed to a vanilla Parquet table; the difference is explained here). You should be using make_reader. I tried running your example: once I replaced make_batch_reader with make_reader, your code succeeded.

from petastorm import make_reader
from petastorm.pytorch import DataLoader

def pytorch_GCJData(dataset_url='file:///tmp/hello_world_dataset'):
    with DataLoader(make_reader(dataset_url)) as train_loader:
        sample = next(iter(train_loader))
        print("id batch: {0}".format(sample['id']))

nikitamehrotra12 commented 4 years ago

Thanks, it worked.

Also, is there any way we can store a jagged array in a Petastorm dataset?

selitvin commented 4 years ago

I assume you mean you want to store a jagged array in a field of a particular row (not have a batch with arrays of different lengths). If so, Petastorm does not support that at the moment. However, it shouldn't be too hard to implement support for arbitrary object serialization, so you should be able to store a list of numpy arrays or another Python structure.
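The serialization workaround described above can be sketched without any Petastorm-specific API: pack a list of variable-length (jagged) rows into two fixed-schema numpy arrays, a flat value buffer plus a per-row lengths array, both of which could then live in ordinary NdarrayCodec fields. This is only a sketch of one possible scheme; pack_jagged and unpack_jagged are hypothetical helper names, not part of Petastorm.

```python
import numpy as np

def pack_jagged(rows):
    # Concatenate variable-length 1-D rows into one flat array,
    # remembering each row's length so the split can be reversed.
    lengths = np.array([len(r) for r in rows], dtype=np.int32)
    values = np.concatenate(rows) if rows else np.array([], dtype=np.float64)
    return values, lengths

def unpack_jagged(values, lengths):
    # Split the flat array back into the original variable-length rows.
    return np.split(values, np.cumsum(lengths)[:-1])

rows = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0, 6.0])]
values, lengths = pack_jagged(rows)
restored = unpack_jagged(values, lengths)
```

Both packed arrays have fixed, schema-friendly shapes ((None,) in Unischema terms), so a jagged structure round-trips through two regular ndarray fields.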

nikitamehrotra12 commented 4 years ago

Ok, thanks 👍