xingnailu closed this issue 5 years ago
@xingnailu The input data size that ScanFilterAndProjectOperator shows is the amount of data read from HDFS. If you use a compressed columnar storage format like ORC or Parquet, it will be less than the actual (logical) data size.
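A minimal sketch of why the on-disk size can be far smaller than the logical size: columnar formats like Parquet encode and compress each column, and low-cardinality columns shrink dramatically. The snippet below uses zlib as a rough stand-in for Parquet's dictionary/RLE encoding plus compression (the real format and the column values here are illustrative assumptions, not taken from this issue):

```python
import zlib

# Hypothetical low-cardinality string column: 2 distinct values
# repeated across 1M rows, similar to a filter column in a query.
rows = (["x"] * 7 + ["x2"] * 3) * 100_000
raw = "\n".join(rows).encode()        # naive uncompressed encoding

# zlib here is only a stand-in for Parquet's encoding + compression.
compressed = zlib.compress(raw, level=6)

print(f"logical size:    {len(raw):,} bytes")
print(f"compressed size: {len(compressed):,} bytes")
print(f"ratio: {len(raw) / len(compressed):.0f}x")
```

The repetitive column compresses by orders of magnitude, which is why the operator's reported input bytes (compressed data read) can look tiny relative to the row count.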
@Praveen2112 Thanks for the reply.
The input_format is indeed "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat".
But I have another puzzle: the ScanFilterAndProjectOperator input shows 3.2GB of data but only 2.2B rows. Does projection pushdown work on Parquet files?
Projection pushdown works as expected, provided there is no nested datatype in the Parquet file. What is the overall size of the file? How many columns are there? BTW, we have an official Slack channel where you can post your questions.
Thanks for the pointer. I have joined.
Hello, I have a query like: "select c1, c2, c3, c4, if(c5 in (x), 'hh', c5) from table1 where c5 in ('x', 'x2') group by c1, c2, c3, c4, if(c5 in (x), 'hh', c5)"
The above query is used as the left side of a left join. When I view the web UI, I see the input is "3.26GB / 2.19B rows" but the output is "79.62GB / 1.28B rows" at the ScanFilterAndProjectOperator.
I want to know why this operator expands the data so much. Thanks.
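A back-of-envelope check with the web UI numbers quoted above suggests the "expansion" is in bytes per row, not row count (rows actually decrease). If, as noted earlier in this thread, the input figure is compressed bytes read from HDFS while the output figure is the uncompressed in-memory size of the produced pages (an assumption consistent with the earlier comment, not confirmed here), the gap is unsurprising:

```python
# Arithmetic on the operator stats reported in this thread.
# Assumes "GB" in the web UI means gibibytes and "B rows" means billions.
GB = 1024 ** 3

input_bytes = 3.26 * GB      # input:  3.26GB
input_rows = 2.19e9          #         2.19B rows
output_bytes = 79.62 * GB    # output: 79.62GB
output_rows = 1.28e9         #         1.28B rows

in_bpr = input_bytes / input_rows     # bytes per row read (compressed)
out_bpr = output_bytes / output_rows  # bytes per row produced (in memory)

print(f"input:  {in_bpr:.1f} bytes/row")
print(f"output: {out_bpr:.1f} bytes/row")
print(f"per-row expansion: {out_bpr / in_bpr:.0f}x")
```

About 1.6 bytes per input row is only plausible for heavily encoded columnar data, so decoding those columns into plain in-memory values would account for a roughly 40x per-row size increase on its own.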