heyodai / cs5540-papers

Repo to hold group notes on CS 5540 papers

Conduct experiments to compare storage costs and compute costs #8

Open heyodai opened 1 year ago

heyodai commented 1 year ago

Description: Conduct experiments to compare the storage and compute costs of FPHLM under the current data storage format (CSV) and under the best-performing alternative format (Parquet), as identified in Ticket #7.

Tasks:

sbrunton1 commented 1 year ago

The paper 2020-AComparisonofHDFSFileFormatsAvroParquetandORC does an excellent job of setting up mock data to answer the research questions below. The authors import this data into HDFS with Hive and run multiple tests against it to answer each question.

RQ1: Which file format consumes less storage space of HDFS?
RQ2: Which data structure format (Avro, Parquet, or ORC) supports high performance with regards to aggregated/scanned queries?
RQ3: Which data format (Avro or Parquet or ORC) is more compact?

Would it be worth using their test cases and approach to help us establish our methodology and decide how to measure success on our datasets?
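As a starting point for that methodology, here is a rough sketch of a harness that records the two things this ticket cares about: bytes on disk (storage cost) and time for a full-scan aggregate query (compute cost). It uses only the Python standard library so it runs anywhere, which means Parquet itself is not exercised: gzip-compressed CSV stands in as a placeholder for the second format, and a real Parquet writer and reader would need to be plugged in from a third-party library. The schema, row generator, and function names are all made up for illustration and are not taken from FPHLM or from the paper.

```python
import csv
import gzip
import os
import tempfile
import time


def make_rows(n):
    """Generate deterministic mock rows (hypothetical schema, not FPHLM data)."""
    return [(i, f"county_{i % 50}", round(1000.0 + (i % 97) * 1.5, 2)) for i in range(n)]


def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def write_csv_gz(path, rows):
    # Placeholder for the alternative format; swap in a Parquet writer here.
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)


def sum_csv(path):
    # Aggregate query: sum the third column, scanning every row.
    with open(path, newline="") as f:
        return sum(float(r[2]) for r in csv.reader(f))


def sum_csv_gz(path):
    # Same aggregate query against the placeholder format.
    with gzip.open(path, "rt", newline="") as f:
        return sum(float(r[2]) for r in csv.reader(f))


def benchmark(name, writer, query, rows, workdir):
    """Write rows in one format, then record file size and query time."""
    path = os.path.join(workdir, name)
    writer(path, rows)
    size = os.path.getsize(path)
    start = time.perf_counter()
    result = query(path)
    elapsed = time.perf_counter() - start
    return {"format": name, "bytes": size, "seconds": elapsed, "result": result}


def run(n=10_000):
    rows = make_rows(n)
    with tempfile.TemporaryDirectory() as workdir:
        return [
            benchmark("data.csv", write_csv, sum_csv, rows, workdir),
            benchmark("data.csv.gz", write_csv_gz, sum_csv_gz, rows, workdir),
        ]


if __name__ == "__main__":
    for row in run():
        print(row)
```

Each format is described by one writer and one query function, so adding Parquet, ORC, or Avro should only mean adding another pair. For real measurements we would want to repeat each query several times and use our actual datasets, since a single timing on mock data is noisy.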

heyodai commented 1 year ago

Paper, for reference: 2020-AComparisonofHDFSFileFormatsAvroParquetandORC-annotated.pdf

heyodai commented 1 year ago

@sbrunton1 - I've read the paper and I agree with you. These are good questions and the paper provides compelling evidence that ORC is the better choice in most cases.

I will say that I think RQ1 and RQ3 are basically the same question, unless I'm missing something.