luminousmen / luminousmen.com

https://luminousmen.com/post/big-data-file-formats #1

Closed utterances-bot closed 3 weeks ago

utterances-bot commented 4 years ago

Big Data file formats

The evaluation of the major data formats and storage engines for the Big Data ecosystem has shown the pros and cons of each of them across various metrics. In this post I'll try to compare the CSV, JSON, Parquet and Avro formats using Apache Spark.

https://luminousmen.com/post/big-data-file-formats

luminousmen commented 4 years ago

Testing comments

thaodt commented 4 years ago

hi @luminousmen, may I know which tool you used to draw the summary image in your post?

luminousmen commented 4 years ago

hey @thaodt, sure. I use Procreate for all the illustrations on this site.

I-C-Wiener commented 3 years ago

How about compressed JSON?

vladimirvilinski commented 3 years ago

Where is ORC? All the biggest (TB-scale) data lakes of my customers use ORC. In my experience (many projects for different S&P 500 customers), ORC+BZip2 is about 3-5 times smaller than Parquet+Snappy, with comparable read times.

Compression for CSV is also an interesting topic... It depends on the data in the CSV and the compression algorithm.
I have a use case with CSV+BZip2 files (sorted, mostly textual content) that are 30% smaller than ORC+BZip2 and about 13 times smaller than Parquet+Snappy (same data).
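
For anyone who wants to try this kind of comparison on their own data, here is a minimal sketch using only the Python standard library. It is not the setup described above: the rows, column names and the `compressed_sizes` helper are made up for illustration, and the ratios it prints will depend entirely on the data you feed it. It only shows how row ordering and the choice of codec can be measured side by side.

```python
import bz2
import csv
import gzip
import io
import random


def make_csv(rows):
    """Serialize rows to CSV bytes (hypothetical three-column layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "city", "comment"])
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def compressed_sizes(data):
    """Return the byte size of the raw, gzip- and bz2-compressed payload."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Synthetic, mostly textual data; real data will behave differently.
random.seed(0)
cities = ["Berlin", "Lisbon", "Madrid", "Oslo", "Prague"]
rows = [
    (i, random.choice(cities), "status ok " * random.randint(1, 5))
    for i in range(10_000)
]

unsorted_sizes = compressed_sizes(make_csv(rows))
# Sort by the textual columns so similar values end up next to each other.
sorted_sizes = compressed_sizes(
    make_csv(sorted(rows, key=lambda r: (r[1], r[2])))
)

for label, sizes in [("unsorted", unsorted_sizes), ("sorted", sorted_sizes)]:
    print(label, sizes)
```

Swapping `make_csv` for a JSON Lines serializer would give the same kind of comparison for compressed JSON.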

luminousmen commented 3 years ago

@vladimirvilinski Cool! I personally have never used ORC in production, so your feedback is very important to me. I've never seen it used outside of Hive. Where do you use it?

fvroldan commented 3 years ago

Hi @luminousmen!

Awesome post!

I would like to see the tables with your performance testing results, but it looks like the image is not loading...

Would you be so kind as to share that information?

Thanks!

luminousmen commented 3 years ago

@fvroldan, hey, thanks! I don't see any issues, actually. I believe the problem is on your side (probably network). Here is an example of the plots, try to access it: https://i.imgur.com/CUUeMsz.png