uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

A more generalised example in docs #389

Open praateekmahajan opened 5 years ago

praateekmahajan commented 5 years ago

Problem

While the current MNIST example is helpful, it doesn't show much about:

  1. how we can export our DataFrames to Parquet
  2. how to load that DataFrame in petastorm.

Example

A simple dataset which has 3 columns, namely x1, x2, y should be easy to load using petastorm.

I started creating an example that can be used:

import pandas as pd

parquet_path = "some_parquet_location"
num_partitions = 10

# DataFrame with three columns: a number, its square, and its cube
df = spark.createDataFrame(
    pd.DataFrame([(i, i**2, i**3) for i in range(100)],
                 columns=["x1", "x2", "y"])
)
# save that DF as Parquet so it can be consumed by a DL library
df.repartition(num_partitions).write \
  .mode("overwrite") \
  .option("compression", "none") \
  .parquet(parquet_path)

Once we have saved our DF as Parquet, we want to pass it to PyTorch/TensorFlow using petastorm:

from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

train_loader = DataLoader(
    make_batch_reader(dataset_url=parquet_path, num_epochs=num_epochs),
    batch_size=3)
for i, data in enumerate(train_loader):
    print(data)
    break

While loading the data, it should be trivial to get random batches, e.g.:

batch1 = [[1, 1, 1], [3, 9, 27], [8, 64, 512]]
batch2 = [[2, 4, 8], [5, 25, 125], [6, 36, 216]]
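
What "random batches" means here can be sketched in plain Python, without petastorm at all. This is only an illustration of the expected behaviour (the sampling below is done with `random.sample`, not by petastorm's reader); the nice property of this dataset is that every row can be checked against the square/cube invariant at a glance:

```python
import random

# The dataset from the snippet above: each row is [i, i**2, i**3].
rows = [[i, i**2, i**3] for i in range(100)]

random.seed(0)  # seeded only so the sketch is reproducible
batch = random.sample(rows, 3)

# Every sampled row should satisfy the square/cube invariant,
# which makes eyeballing correctness easy.
for x1, x2, y in batch:
    assert x2 == x1 ** 2 and y == x1 ** 3
print(batch)
```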
selitvin commented 5 years ago

Curious: would you add more elements to the examples/hello_world/external_dataset example to cover the missing features, or would you prefer to see a completely separate example?

praateekmahajan commented 5 years ago

I believe a separate example would be easier, but maybe that's beside the point. (The reason is that you can tell whether a row is correct: in the example above, the first element of a row is the number, the second is its square, and the third is its cube.)

Even in the example in examples/hello_world/external_dataset, it's unclear how a batch size of 1 generates:

{'id': tensor([[5, 6, 7, 8, 9]]),
 'value1': tensor([[  65,  110,  -99, -169,    9]]),
 'value2': tensor([[ 57,  79,  21, 246, -23]])}

However, in theory it should generate:

{'id': tensor([[5]]),
 'value1': tensor([[  65]]),
 'value2': tensor([[ 57]])}

It looks like petastorm defines a batch differently from PyTorch, where a batch is the number of rows you want to sample from a dataset. In petastorm, it appears to mean the number of partition files you want to load.
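
If the reader yields variable-sized chunks (one per partition/row group), one way to get PyTorch-style fixed-size batches is to re-slice the stream yourself. `rebatch` below is a hypothetical helper, not part of petastorm's API; it works on plain lists standing in for the tensors a real pipeline would carry:

```python
def rebatch(chunks, batch_size):
    """Re-slice a stream of variable-sized row chunks into fixed-size batches.

    `chunks` stands in for whatever the reader yields (here, lists of rows);
    a real pipeline would carry tensors instead of lists.
    """
    buffer = []
    for chunk in chunks:
        for row in chunk:
            buffer.append(row)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
    if buffer:  # emit the final, possibly short, batch
        yield buffer

# Two uneven "row-group" chunks re-sliced into batches of 3:
chunks = [[[1, 1, 1], [2, 4, 8]], [[3, 9, 27], [4, 16, 64], [5, 25, 125]]]
print(list(rebatch(chunks, 3)))
# [[[1, 1, 1], [2, 4, 8], [3, 9, 27]], [[4, 16, 64], [5, 25, 125]]]
```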

Lastly, it would be good to see a row returned as one tensor rather than as a dict of three tensors.

For example, it would be nice to see how to achieve this behaviour:

batch1 = [[1, 1, 1], [3, 9, 27], [8, 64, 512]]

instead of

batch1 = {
    'x1': [[1, 3, 8]],
    'x2': [[1, 9, 64]],
    'y':  [[1, 27, 512]]
}