uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Petastorm PyTorch dataloader slower than JSON #519

Open kaiwenw opened 4 years ago

kaiwenw commented 4 years ago

I'm converting my JSON readers to use Petastorm to be more scalable. However, I found that the Petastorm DataLoader takes around 20x as long to iterate through a dataset as its pandas JSON counterpart (with epochs=1). Around 40% of the total epoch time comes from the 0th iteration of the dataloader, and every iteration after that is about 1.3x slower than the pandas data iterator.

I found this to be true for both datasets of 3000 and 100,000 rows.

Here is my benchmark python file. benchmark.txt
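For reference, a minimal hedged sketch of the kind of comparison being described (this is not the attached benchmark; the file paths, JSON-lines layout, and batch size are assumptions):

```python
import time

import pandas as pd
from petastorm import make_reader
from petastorm.pytorch import DataLoader

JSON_PATH = "data.jsonl"             # assumed JSON-lines file
PARQUET_URL = "file:///tmp/dataset"  # assumed Petastorm dataset URL

# Baseline: load the JSON with pandas and iterate over the rows.
start = time.time()
df = pd.read_json(JSON_PATH, lines=True)
for _, row in df.iterrows():
    pass
print(f"json total took {time.time() - start} secs")

# Petastorm: iterate over the same data via make_reader + DataLoader.
start = time.time()
with DataLoader(make_reader(PARQUET_URL, num_epochs=1), batch_size=32) as loader:
    for batch in loader:
        pass
print(f"petastorm total took {time.time() - start} secs")
```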

Has anyone else encountered this issue and found a workaround? It seems similar to issue #443, but that one is still unresolved. @selitvin

selitvin commented 4 years ago

make_reader is an API that was created for cases where your row size is huge; e.g. in our case, we were storing multiple images in a row.

make_batch_reader is a more appropriate API for the kind of data you are experimenting with. I took the liberty of updating your benchmark to use that API.
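A minimal sketch of what the switch might look like (the dataset URL and batch size are assumptions, not the updated benchmark itself). make_batch_reader reads batches of rows at a time rather than reconstructing one heavyweight row per round trip:

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Same DataLoader, but fed by make_batch_reader instead of make_reader.
with DataLoader(make_batch_reader("file:///tmp/dataset", num_epochs=1),
                batch_size=32) as loader:
    for batch in loader:  # each field is a tensor of length batch_size
        pass
```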

...and yet, Petastorm turns out to be faster, at least on my machine :)

json total took 2.240082025527954 secs
petastorm total took 0.49458885192871094 secs

Petastorm implements a worker pool that enables you to achieve high throughput over a network with higher latency. In this case, when you read from a local filesystem, this advantage might not be observable, especially when you read from one relatively small file, as the OS will likely prefetch a lot of data for you (or perhaps just load it all into memory before pandas even tries to access it).
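The worker pool is configurable on the reader; a hedged sketch of the relevant knobs (the dataset URL is an assumption):

```python
from petastorm import make_batch_reader

reader = make_batch_reader(
    "hdfs:///path/to/dataset",   # assumed remote dataset URL
    reader_pool_type="thread",   # thread pool hides network latency; "process" sidesteps the GIL
    workers_count=10,            # number of parallel reader workers
    num_epochs=1,
)
```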

Note that Parquet is superior to JSON when it comes to columnar data access. That being said, there is a price to pay in performance, so if your scenario does not necessarily need this (or other Parquet features), you might be better off using a different data format.

kaiwenw commented 4 years ago

Thanks @selitvin!

I tried using make_batch_reader instead, but slightly differently from your script, since I wanted the outputs to all have the given batch_size: see here. On my machine, JSON took 2.4s and Petastorm/Parquet took 3s, which is much better than before but still a bit confusing:

1) What overhead of Parquet would make it 25% slower than JSON? Isn't Parquet suited for fast reading?

2) The Petastorm DataLoader uses 10 workers, whereas pandas presumably uses 1. I understand that true parallelism isn't possible when threads are used (because of the GIL), but shouldn't having more concurrent workers still yield faster runtimes for an I/O-intensive task like reading a file (since a thread waiting on I/O can context-switch to another thread and the latency can be overlapped)?

Thanks!
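One hedged way to probe the second question is to sweep workers_count and see whether more reader threads change throughput on a local file (the dataset URL and batch size are assumptions):

```python
import time

from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

for workers in (1, 4, 10):
    start = time.time()
    with DataLoader(make_batch_reader("file:///tmp/dataset",
                                      workers_count=workers,
                                      num_epochs=1),
                    batch_size=32) as loader:
        for _ in loader:
            pass
    print(f"workers_count={workers}: {time.time() - start:.2f} secs")
```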

selitvin commented 4 years ago

The code stacks you go through are very different. To do a fair comparison, you would need to compare, say, reading data from a 10TB JSON file vs a Parquet store on a distributed, high-latency network filesystem, e.g. HDFS or S3. With a local JSON file, your OS or your disk controller is likely prefetching the entire JSON file into memory before you start parsing it.

The reading structure of Parquet / Petastorm is geared towards the latter case.
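A hedged sketch of that latter case: the same reader API pointed at a remote, high-latency store, which is where the worker pool pays off (the URL is an assumption, and HDFS/S3 credentials must already be configured in the environment):

```python
from petastorm import make_batch_reader

# Same API as before; only the URL scheme changes for a remote filesystem.
with make_batch_reader("s3://my-bucket/path/to/dataset", num_epochs=1) as reader:
    for batch in reader:
        pass
```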