uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

A more generalised example in docs #389

Open praateekmahajan opened 5 years ago

praateekmahajan commented 5 years ago

Problem

While the current MNIST example is helpful, it doesn't show much about:

  1. how we can export our DataFrames to Parquet
  2. how to load that DataFrame in petastorm.

Example

A simple dataset which has 3 columns, namely x1, x2, y should be easy to load using petastorm.

I started creating an example that can be used:

import pandas as pd

parquet_path = "some_parquet_location"
num_partitions = 10

# DataFrame with three columns: a number, its square, and its cube
df = spark.createDataFrame(
    pd.DataFrame([(i, i**2, i**3) for i in range(100)],
                 columns=["x1", "x2", "y"])
)
# save that DF as Parquet so it can be consumed by a DL library
df.repartition(num_partitions).write \
  .mode("overwrite") \
  .option("compression", "none") \
  .parquet(parquet_path)

Once we have saved our DF as Parquet, we want to pass it to PyTorch/TensorFlow using petastorm:

from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

train_loader = DataLoader(
    make_batch_reader(dataset_url=parquet_path, num_epochs=num_epochs),
    batch_size=3)
for i, data in enumerate(train_loader):
    print(data)
    break

While loading the data, it should be trivial to get random batches, e.g.:

batch1 = [[1, 1, 1], [3, 9, 27], [8, 64, 512]]
batch2 = [[2, 4, 8], [5, 25, 125], [6, 36, 216]]
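
What "random batches" means here can be sketched in plain Python, without petastorm at all. This is only an illustration of the expected behaviour (the sampling below is done with `random.sample`, not by petastorm's reader); the nice property of this dataset is that every row can be checked against the square/cube invariant at a glance:

```python
import random

# The dataset from the snippet above: each row is [i, i**2, i**3].
rows = [[i, i**2, i**3] for i in range(100)]

random.seed(0)  # seeded only so the sketch is reproducible
batch = random.sample(rows, 3)

# Every sampled row should satisfy the square/cube invariant,
# which makes eyeballing correctness easy.
for x1, x2, y in batch:
    assert x2 == x1 ** 2 and y == x1 ** 3
print(batch)
```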
selitvin commented 5 years ago

Curious: would you add more elements to the examples/hello_world/external_dataset example to cover the missing features, or would you prefer to see a completely separate example?

praateekmahajan commented 5 years ago

I believe a separate example would be easier, but maybe that's beside the point. (The reason is that you can tell whether a row is correct: in the example above, the first element of a row is the number, the second is its square, and the third is its cube.)

Even in the example in examples/hello_world/external_dataset, it's unclear how a batch size of 1 generates:

{'id': tensor([[5, 6, 7, 8, 9]]),
 'value1': tensor([[  65,  110,  -99, -169,    9]]),
 'value2': tensor([[ 57,  79,  21, 246, -23]])}

However, in theory it should generate:

{'id': tensor([[5]]),
 'value1': tensor([[  65]]),
 'value2': tensor([[ 57]])}

It looks like petastorm defines a batch differently from PyTorch, where a batch is the number of rows you want to sample from a dataset. In petastorm, it appears to mean the number of partition files you want to load.
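
If the reader yields variable-sized chunks (one per partition/row group), one way to get PyTorch-style fixed-size batches is to re-slice the stream yourself. `rebatch` below is a hypothetical helper, not part of petastorm's API; it works on plain lists standing in for the tensors a real pipeline would carry:

```python
def rebatch(chunks, batch_size):
    """Re-slice a stream of variable-sized row chunks into fixed-size batches.

    `chunks` stands in for whatever the reader yields (here, lists of rows);
    a real pipeline would carry tensors instead of lists.
    """
    buffer = []
    for chunk in chunks:
        for row in chunk:
            buffer.append(row)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
    if buffer:  # emit the final, possibly short, batch
        yield buffer

# Two uneven "row-group" chunks re-sliced into batches of 3:
chunks = [[[1, 1, 1], [2, 4, 8]], [[3, 9, 27], [4, 16, 64], [5, 25, 125]]]
print(list(rebatch(chunks, 3)))
# [[[1, 1, 1], [2, 4, 8], [3, 9, 27]], [[4, 16, 64], [5, 25, 125]]]
```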

Lastly, it would be good to see a row returned as one tensor rather than as a dict of three tensors.

For example, it would be nice to see how to achieve this behaviour:

batch1 = [[1, 1, 1], [3, 9, 27], [8, 64, 512]]

instead of

batch1 = {
    'x1': [[1, 3, 8]],
    'x2': [[1, 9, 64]],
    'y':  [[1, 27, 512]]
}