trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Why is the amount of data expanding so much? ScanFilterAndProjectOperator #1755

Closed xingnailu closed 5 years ago

xingnailu commented 5 years ago

Hello, I have a query like: "select c1, c2, c3, c4, if(c5 in (x), 'hh', c5) from table1 where c5 in ('x', 'x2') group by c1, c2, c3, c4, if(c5 in (x), 'hh', c5)"

The above query is used as the left side of a left join. When I view the web UI, I see that the input of the ScanFilterAndProjectOperator is "3.26GB / 2.19B rows", but its output is "79.62GB / 1.28B rows".

I want to know why this operator expands the data so much. Thanks.

Praveen2112 commented 5 years ago

@xingnailu The input data size that ScanFilterAndProjectOperator shows is the amount of data read from HDFS. If you use a compressed columnar storage format like ORC or Parquet, it will be smaller than the actual (uncompressed) data size.
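A minimal sketch of why the numbers can diverge so much: columnar data with repeated values (like a low-cardinality string column) compresses very well, so the compressed bytes read from disk can be a small fraction of the decompressed size that downstream operators work with. This uses Python's `zlib` purely for illustration; Parquet's actual encodings (dictionary, RLE, plus a codec such as Snappy or gzip) differ, and the exact ratio is hypothetical.

```python
import zlib

# A repetitive "column" of values, standing in for a low-cardinality
# string column stored columnar on disk (~1.5 MB uncompressed).
column = ("x1,x2,x1,x1,x2," * 100_000).encode()
compressed = zlib.compress(column)

ratio = len(column) / len(compressed)
print(f"uncompressed: {len(column)} bytes")
print(f"compressed:   {len(compressed)} bytes")
print(f"ratio:        {ratio:.0f}x")
```

The "input" side of the operator stats corresponds to the compressed bytes; once decoded into the engine's in-memory representation, the same rows occupy far more space, which is what the "output" size reflects.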

xingnailu commented 5 years ago

@Praveen2112 Thanks for the reply. The input format is indeed "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat". But I have another puzzle: the ScanFilterAndProjectOperator input shows 3.2GB of data for only 2.2B rows. Does projection pushdown work on the Parquet file?

Praveen2112 commented 5 years ago

Projection pushdown works as expected, provided we don't have nested datatypes in our Parquet file. What is the overall size of the file? How many columns are there? BTW, we have an official Slack channel where you can post your questions.
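A conceptual sketch of what projection pushdown means here (this is not Trino's implementation; the table, column names, and `scan` helper are all hypothetical): because Parquet stores each column's data contiguously, a reader that only needs some columns can skip the others entirely, so bytes read scale with the projected columns rather than the full row width.

```python
# Hypothetical table stored column-wise, the way Parquet lays out data.
table = {
    "c1": [1, 2, 3],
    "c2": ["a", "b", "c"],
    "c3": [10.0, 20.0, 30.0],
    "c4": ["big", "unused", "payload"],
    "c5": ["x", "x2", "y"],
}

def scan(columns, predicate=None):
    """Read only the projected columns; never touch the rest."""
    projected = {name: table[name] for name in columns}  # projection pushdown
    rows = list(zip(*(projected[c] for c in columns)))
    if predicate:
        rows = [r for r in rows if predicate(dict(zip(columns, r)))]
    return rows

# Only c1 and c5 are decoded; c2..c4 are skipped entirely.
result = scan(["c1", "c5"], predicate=lambda r: r["c5"] in ("x", "x2"))
print(result)  # [(1, 'x'), (2, 'x2')]
```

With nested datatypes this column-skipping can be limited, which is why the caveat above matters: the reader may have to decode more than the projection strictly needs.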

xingnailu commented 5 years ago

Thanks for the notice. I have joined.