Closed · nikitamehrotra12 closed this issue 4 years ago
Can you please provide instructions to reproduce this case? Or are you using the vanilla example from the petastorm repo?
Sure. Here is the code:

```python
import numpy as np
from petastorm.codecs import NdarrayCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField
from pyspark.sql.types import IntegerType

GCJSchema = Unischema('GCJSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.float64, (None, None), NdarrayCodec(), False),
    UnischemaField('image2', np.float64, (None, None), NdarrayCodec(), False),
    UnischemaField('edge1', np.float64, (2, None), NdarrayCodec(), False),
    UnischemaField('edge2', np.float64, (2, None), NdarrayCodec(), False),
    UnischemaField('y', np.float64, (), ScalarCodec(IntegerType()), False),
])

def row_generator(x):
    # This function reads from a text file and returns arrays, but here,
    # for simplicity, I just return some random values.
    return {
        'id': x,
        'image1': np.random.randn(128, 256, 3),
        'image2': np.random.randn(205, 205),
        'edge1': np.random.randn(2, 32),
        'edge2': np.random.randn(2, 100),
        'y': np.array([1]),
    }
```
And for loading the dataset created above, I used the following code:
```python
def pytorch_GCJData(dataset_url='file:///home/nikitam/Desktop/GeniePath/GCJW2V1_Train'):
    with DataLoader(make_batch_reader(dataset_url)) as train_loader:
        sample = next(iter(train_loader))
        print("id batch: {0}".format(sample['id']))
```
What happens is that you write a "Petastorm" dataset (as opposed to a vanilla Parquet table; the difference is explained here). You should be using `make_reader`. I tried running your example, and once I replaced `make_batch_reader` with `make_reader`, your code succeeded:
```python
def pytorch_GCJData(dataset_url='file:///tmp/hello_world_dataset'):
    with DataLoader(make_reader(dataset_url)) as train_loader:
        sample = next(iter(train_loader))
        print("id batch: {0}".format(sample['id']))
```
Thanks, it worked.
Also, is there any way we can store a jagged array in a Petastorm dataset?
I assume you mean you want to store a jagged array in a field of a particular row (not have a batch with arrays of different lengths). If so, Petastorm does not support that at the moment. However, it shouldn't be too hard to implement support for arbitrary object serialization, so you would be able to store a list of numpy arrays or another Python structure.
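In the meantime, one workaround (a sketch independent of Petastorm's API, not a built-in feature) is to serialize the jagged structure yourself into a single bytes blob and store that blob in one binary field of the row; the `pack_jagged`/`unpack_jagged` helper names below are hypothetical:

```python
import pickle
import numpy as np

def pack_jagged(arrays):
    # Serialize a list of variable-length numpy arrays into one bytes blob;
    # the blob can then be stored in a single binary field of a row.
    return pickle.dumps(list(arrays), protocol=pickle.HIGHEST_PROTOCOL)

def unpack_jagged(blob):
    # Restore the original list of numpy arrays from the serialized blob.
    return pickle.loads(blob)

jagged = [np.arange(3), np.arange(7), np.arange(1)]
blob = pack_jagged(jagged)
restored = unpack_jagged(blob)
assert all(np.array_equal(a, b) for a, b in zip(jagged, restored))
```

Note that this trades away columnar access and codec-level compression for the serialized field, so it is best suited to fields you always read as a whole.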
Ok, thanks 👍
Hi,
Can we use the Petastorm PyTorch DataLoader to load a float array? I tried doing the same but got the following error: