praateekmahajan opened this issue 5 years ago
Curious: would you add more elements to the examples/hello_world/external_dataset example to cover the missing features, or would you prefer a completely separate example?
I believe a separate example would be easier, but maybe that's beside the point. (The reason is that you can tell whether a row is correct: in my example, the first element of a row is a number, the second is its square, and the third is its cube.)
Even in the example in examples/hello_world/external_dataset, it's unclear how a batch size of 1 generates:

```python
{'id': tensor([[5, 6, 7, 8, 9]]),
 'value1': tensor([[ 65, 110, -99, -169, 9]]),
 'value2': tensor([[ 57, 79, 21, 246, -23]])}
```
However, it should in theory generate:

```python
{'id': tensor([[5]]),
 'value1': tensor([[ 65]]),
 'value2': tensor([[ 57]])}
```
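(For reference, the call producing the output above is roughly the following, reconstructed from the external_dataset example; the dataset URL is a placeholder for wherever the example writes its parquet files.)

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Reconstructed call with batch_size=1; the URL is a placeholder.
with DataLoader(make_batch_reader('file:///tmp/external_dataset'), batch_size=1) as loader:
    sample = next(iter(loader))
    print(sample)  # prints the 5-row batch shown above, not a single row
```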
It looks like petastorm defines a batch differently from PyTorch. In PyTorch, the batch size is how many rows you want to sample from the dataset, while in petastorm it appears to mean how many partition files you want to load.
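(For contrast, this is a minimal sketch of what batch_size means to a plain PyTorch DataLoader, with no petastorm involved; the toy tensors are made up.)

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Ten rows of a toy 3-column dataset.
xs = torch.arange(30).reshape(10, 3)
loader = DataLoader(TensorDataset(xs), batch_size=1)

# In PyTorch, batch_size counts rows, so every batch here has shape (1, 3).
for (batch,) in loader:
    print(batch.shape)  # torch.Size([1, 3])
```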
Lastly, it would be good to see a row represented as one tensor rather than a dict of three tensors. For example, it would be nice to see how to achieve this behaviour:

```python
batch1 = [[1, 1, 1], [3, 9, 27], [8, 64, 512]]
```

instead of

```python
batch1 = {
    'x1': [[1, 3, 8]],
    'x2': [[1, 9, 64]],
    'x3': [[1, 27, 512]]
}
```
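(Not in the original report: one possible way to get there, assuming each column comes back as a tensor of shape (1, N) as in the toy batch above; the columns are squeezed and stacked into an (N, 3) tensor.)

```python
import torch

# Toy dict-of-columns batch, shaped like the example above: each value is (1, N).
batch1 = {
    'x1': torch.tensor([[1, 3, 8]]),
    'x2': torch.tensor([[1, 9, 64]]),
    'x3': torch.tensor([[1, 27, 512]]),
}

# Drop the leading batch dim and stack the columns side by side -> shape (N, 3).
rows = torch.stack([batch1['x1'].squeeze(0),
                    batch1['x2'].squeeze(0),
                    batch1['x3'].squeeze(0)], dim=1)
print(rows)  # tensor([[  1,   1,   1], [  3,   9,  27], [  8,  64, 512]])
```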
Problem

While the current mnist example is helpful, it doesn't hint much at how to handle a simple tabular dataset.

Example
A simple dataset with 3 columns, namely x1, x2, and y, should be easy to load using petastorm. I started creating an example which can be used:
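(The snippet that originally went here isn't preserved in this copy of the issue; below is a minimal sketch along those lines, assuming a local Spark session and a placeholder output path /tmp/toy_dataset.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[2]').appName('toy_dataset').getOrCreate()

# 3-column toy dataset: x1 = n, x2 = n squared, y = n cubed.
rows = [(n, n ** 2, n ** 3) for n in range(10)]
df = spark.createDataFrame(rows, ['x1', 'x2', 'y'])

# Write it out as plain parquet (the path is a placeholder).
df.write.mode('overwrite').parquet('file:///tmp/toy_dataset')
```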
Once we have saved our DF as parquet, we want to pass it to PyTorch/TensorFlow using petastorm...
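(The original snippet is also missing here; a minimal sketch of the PyTorch side, following the pattern of the hello_world external_dataset example and reusing the placeholder path from above.)

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Read the plain parquet written above (URL is the placeholder path).
with DataLoader(make_batch_reader('file:///tmp/toy_dataset'), batch_size=4) as loader:
    for batch in loader:
        # batch is a dict of column name -> tensor, e.g. batch['x1'], batch['y']
        print(batch['x1'].shape)
```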
While loading the data, it should be trivial to get random batches, e.g.:
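(The snippet that followed is likewise missing; here is one sketch of what getting random batches might look like, assuming make_batch_reader exposes a shuffle_row_groups flag, which, as I understand it, randomizes the order of row groups rather than sampling individual rows. Whether that is enough is part of what the example should clarify.)

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Assumption: shuffle_row_groups randomizes the order in which row groups are read.
reader = make_batch_reader('file:///tmp/toy_dataset', shuffle_row_groups=True)
with DataLoader(reader, batch_size=3) as loader:
    batch = next(iter(loader))
    print(batch['x1'], batch['y'])  # ideally three randomly sampled rows
```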