uber / petastorm

Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Add pandas dataframe API for the reader #648

Open arita37 opened 3 years ago

arita37 commented 3 years ago

Reason:
Having chunks of a pandas dataframe can be very efficient when feeding models like Gradient Boosting. There is a plain Python row reader, but I am wondering whether chunk-based pandas output could be done more efficiently on the Petastorm side.

selitvin commented 3 years ago

It's hard to compete with training from data that fits entirely into memory. Using petastorm's make_batch_reader could be an option that works for you if the whole dataset does not fit into memory and you need to stream it in. Of course, the devil is in the details, and a lot depends on the speed of your training iteration, the nature of your data, and the network and compute you are using. If you have time, try setting up a prototype and share your questions/results.

arita37 commented 3 years ago

Can you output pandas dataframe chunks from the batch reader?


selitvin commented 3 years ago

You should be able to get a pandas dataframe from the output of make_batch_reader. Here is a modified examples/hello_world/external_dataset/python_hello_world.py that shows how to do it:

from petastorm import make_batch_reader
import pandas as pd


def python_hello_world(dataset_url='file:///tmp/external_dataset'):
    # Read data from the non-Petastorm Parquet store via pure Python
    with make_batch_reader(dataset_url, schema_fields=["id", "value1", "value2"]) as reader:
        for schema_view in reader:
            # make_batch_reader() returns batches of rows instead of individual rows;
            # convert each batch into a pandas dataframe
            df = pd.DataFrame(schema_view._asdict())
            print(df)


if __name__ == '__main__':
    python_hello_world()

Is this what you are looking for?
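
To tie this back to the gradient-boosting use case from the opening post, here is a minimal sketch (not from this thread) of streaming Parquet chunks into an incrementally trained xgboost model. It reuses the dataset URL and column names from the hello-world example above; treat the feature/label split and the hyperparameters as placeholders.

import pandas as pd
import xgboost as xgb
from petastorm import make_batch_reader

params = {'objective': 'reg:squarederror'}
booster = None
with make_batch_reader('file:///tmp/external_dataset') as reader:
    for schema_view in reader:
        df = pd.DataFrame(schema_view._asdict())
        # placeholder feature/label split, purely to keep the sketch self-contained
        dtrain = xgb.DMatrix(df[['value1', 'value2']], label=df['id'])
        # xgb_model=booster continues training from the booster fit on previous chunks
        booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)

Each iteration trains ten more rounds on the current chunk only, so the result differs from a single fit over the full dataset; whether that trade-off is acceptable depends on your data and model, per the caveats above.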

arita37 commented 3 years ago

Yes, sounds perfect.

Just wondering, why the limitation to scalar Parquet fields?


selitvin commented 3 years ago

Petastorm also supports lists, as long as all lists are of the same size, so the data can be collated into tensors efficiently.
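
For instance, a minimal sketch (the /tmp/list_dataset path is hypothetical) that writes equal-length list columns with pyarrow and reads them back through make_batch_reader:

import os
import pyarrow as pa
import pyarrow.parquet as pq
from petastorm import make_batch_reader

os.makedirs('/tmp/list_dataset', exist_ok=True)
# every 'features' list has the same length, so the column can be
# collated into a tensor-friendly 2D array on read
table = pa.table({
    'id': [0, 1, 2],
    'features': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
})
pq.write_table(table, '/tmp/list_dataset/part-0.parquet')

with make_batch_reader('file:///tmp/list_dataset') as reader:
    for batch in reader:
        print(batch.id, batch.features)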

arita37 commented 3 years ago

Thanks very much. Is it using a pyarrow dataset in the backend to read the Parquet files?

selitvin commented 3 years ago

Yes

Byronnar commented 3 years ago

I have a question about the batch: how can I change the size of schema_view? My data is 200000 rows, but the first schema_view I get has 10169 rows, and I do not know why it is 10169. Thank you. @selitvin

selitvin commented 3 years ago

The number of rows you get from make_batch_reader is the number of rows in the row-group being read. There is no pure-Python rebatching mechanism, but you can use pytorch's BatchedDataLoader(reader) or TF's make_petastorm_dataset(reader).unbatch().batch(your_batch_size) to produce batches of the desired size.
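
For example, a minimal sketch of the TF route, reusing the dataset URL from the earlier example (the batch size of 256 is arbitrary):

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader('file:///tmp/external_dataset') as reader:
    # unbatch() flattens the row-group-sized batches into individual rows;
    # batch() then regroups them into fixed-size batches of 256
    dataset = make_petastorm_dataset(reader).unbatch().batch(256)
    for batch in dataset:
        print(batch)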