rukmanigopalan / adlsguidancedoc


Data Formats: IO Patterns #3

Open davedoesdemos opened 4 years ago

davedoesdemos commented 4 years ago

"Read heavy" and "write heavy" are not explained in the doc, and the terms are too broad-brush. Write-heavy scenarios can favour Parquet if the writes are sequential but very heavy (this is due to the compression of the file). If the writes are heavy but "random", as with IoT, then Avro will be better. CSV files are generally better for heavy read access, while Parquet is great for loading into a database solution such as Databricks or SQL DW. For wide processing, CSV is often a better choice due to the simplicity of reads.
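To make the sequential-versus-"random" write distinction concrete, here is a toy sketch in plain Python. It does not use the real Avro or Parquet libraries; `RowAppendWriter` and `ColumnBatchWriter` are hypothetical names, JSON stands in for the on-disk encoding, and the batch size of 4 is an arbitrary assumption. It only illustrates that a row-oriented writer can emit each record as it arrives, while a column-oriented writer has to buffer a batch before it can lay the values out by column.

```python
import io
import json


class RowAppendWriter:
    """Row-oriented toy writer: each record is serialized and written as it arrives."""

    def __init__(self, sink):
        self.sink = sink
        self.writes = 0  # number of write calls issued to the sink

    def append(self, record):
        self.sink.write(json.dumps(record) + "\n")
        self.writes += 1


class ColumnBatchWriter:
    """Column-oriented toy writer: records are buffered, then written column by column."""

    def __init__(self, sink, batch_size):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []
        self.writes = 0  # number of write calls issued to the sink

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Pivot the buffered records into one list of values per column.
        columns = {key: [r[key] for r in self.buffer] for key in self.buffer[0]}
        self.sink.write(json.dumps(columns) + "\n")
        self.writes += 1
        self.buffer.clear()


# Ten small IoT-style events arriving one at a time (made-up data).
events = [{"device": f"d{i % 3}", "temp": 20 + i} for i in range(10)]

row_writer = RowAppendWriter(io.StringIO())
col_writer = ColumnBatchWriter(io.StringIO(), batch_size=4)
for event in events:
    row_writer.append(event)
    col_writer.append(event)
col_writer.flush()  # write out the final, partially filled batch
```

In this toy model the row writer issues one write per event and never holds data back, whereas the column writer issues fewer, larger writes but keeps up to a full batch in memory until it flushes. Real formats add compression, schemas and block or row-group structure that this sketch ignores.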

rukmanigopalan commented 4 years ago

Great feedback, I will work on this. Please ack if you are interested in contributing; that is very welcome as well.

hurtn commented 4 years ago

These are all great points. I was actually looking to see whether there is a widely accepted standard benchmark, to be a little more scientific about these. I found the following, which includes some good tests but concludes that "Apache Parquet and Apache Kudu achieve the best compromise between fast data ingestion, fast random data look-up and scalable data analytics." https://indico.cern.ch/event/505613/contributions/2230964/attachments/1346598/2039266/poster-200.pdf Dave, the only point I would query above is whether CSV is better than Parquet for read-heavy workloads, given that it is row-based and uncompressed?

davedoesdemos commented 4 years ago

That's my point, Nick: we need to define the nature of "read heavy" or it's meaningless. There are scenarios where CSV will be considerably faster, and in every scenario it is less memory- and compute-intensive due to the nature of Parquet. The statement that Parquet is the "best compromise" is meaningless without the context of the use case. I would argue that cases where data will be row-processed will definitely benefit from CSV over Parquet, which is better in mass-ingest scenarios such as PolyBase. We also need to remember that Parquet is a very inefficient format by most measures.
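As a rough way to pin down what "read heavy" could mean, the sketch below models two read patterns against a row layout and a column layout. It is a deliberately simplified cost model, not a benchmark: `Cost`, `column_scan_cost` and `row_fetch_cost` are hypothetical names, the position of the wanted record is assumed to be known, and compression, encoding, row groups and predicate pushdown are all ignored.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    values_read: int      # individual field values that must be scanned
    regions_touched: int  # contiguous storage regions that must be visited


def column_scan_cost(n_rows, n_cols, layout):
    """Cost of aggregating ONE column across all rows."""
    if layout == "row":
        # Records are stored whole, so every field of every row is read.
        return Cost(values_read=n_rows * n_cols, regions_touched=1)
    if layout == "column":
        # Only the wanted column's values are read.
        return Cost(values_read=n_rows, regions_touched=1)
    raise ValueError(f"unknown layout: {layout}")


def row_fetch_cost(n_cols, layout):
    """Cost of fetching ONE complete record whose position is already known."""
    if layout == "row":
        # The record sits in a single contiguous region.
        return Cost(values_read=n_cols, regions_touched=1)
    if layout == "column":
        # One value has to be picked out of each column's region.
        return Cost(values_read=n_cols, regions_touched=n_cols)
    raise ValueError(f"unknown layout: {layout}")


# Example table shape (arbitrary numbers chosen for illustration).
scan_row = column_scan_cost(1000, 20, "row")
scan_col = column_scan_cost(1000, 20, "column")
fetch_row = row_fetch_cost(20, "row")
fetch_col = row_fetch_cost(20, "column")
```

Under these assumptions the single-column scan reads 20 times fewer values in the column layout, while the whole-record fetch touches 20 regions instead of one. Both are "read heavy", which is why the term needs a stated access pattern before a format can be recommended.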