prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.97k stars 5.35k forks source link

Test velox parquet using presto unit test #22496

Open qqibrow opened 5 months ago

qqibrow commented 5 months ago

We’ve developed a test solution to test velox velox native parquet reader using presto unit tests. we want to discuss whether we can add that in CI in presto project. There is already a discussion thread in velox https://github.com/facebookincubator/velox/discussions/9392. According to Masha, prestodb seems a better place for this test.

Expected Behavior or Use Case

Test coverage in velox parquet is not enough. Testing velox parquet using presto unit test is the most practical approach for now.

Presto Component, Service, or Connector

presto-parquet unit test

Possible Implementation

Check high level implementation and draft PR here https://github.com/facebookincubator/velox/discussions/9392

Example Screenshots (if appropriate):

Context

For more details please check https://github.com/facebookincubator/velox/discussions/9392.

tdcmeehan commented 5 months ago

There's already precedent for adding end to end integration tests for Prestissimo. I think it makes sense for there to be end to end integration tests for this in Presto, just as we have end to end integration tests that test specific functions (even though they have micro coverage in Velox as well).

qqibrow commented 5 months ago

@tdcmeehan thanks! could you share issues or PR example where to put such test? thanks!

tdcmeehan commented 5 months ago

Here are some examples of end to end tests which were added recently:

IEEE functions: https://github.com/prestodb/presto/pull/22469 CTAS tess: https://github.com/prestodb/presto/pull/20936

CC: @majetideepak @aditi-pandit @yingsu00

yingsu00 commented 3 months ago

@qqibrow Hi Lu, the Presto native tests are at presto-native-execution/src/test/ directory as @tdcmeehan pointed out. Thanks @tdcmeehan ! Some other examples are like TestPrestoNativeAggregations, TestPrestoNativeWindowQueries, etc.
But these tests need to build a native query runner and run the SQL queries from end to end. For our purpose to test the Velox Parquet reader, we don't want to run the SQL queries from end to end, but just reuse Presto existing unit tests.

I think it's ok to put the tests in com/facebook/presto/hive/parquet as in your original PR [Testing] [Don't review] Velox parquet test use presto as the tests are in com/facebook/presto/hive/parquet/ParquetTester.java, or put it also in presto-native-execution/src/test/ . I don't have very strong opinion on whether to choose one over the other. @aditi-pandit What's your recommendation?

But I also noticed the Parquet code and tests are spreaded in both presto-hive and presto-parquet. I don't know why this is the case, but we can do some refactoring to merge them into once place later. @zhenxiao Do you know why? Do you have any recommendation where we should put the native Parquet reader tests?

aditi-pandit commented 3 months ago

@yingsu00 : Adding the tests in com/facebook/presto/hive/parquet as in the PR https://github.com/prestodb/presto/pull/21444/files#top looks good. I just had a quick look at private static void assertNativeReaderFileContents(..) function. That approach should be fine.

zhenxiao commented 3 months ago

thank you @yingsu00 the placement of Parquet tests is a legacy issue. Initially, the old parquet reader(which has been totally removed) is part of presto-hive, so all testcases are in presto-hive. After we developed a new parquet reader(probably in 2016 or 2017?), we move parquet related code to presto-parquet, and added parquet tests there. But there are still some legacy dependency, make it challenging to separate parquet from hive cleanly. It is a good idea to merge all of them into one place, say, presto-parquet. I think the presto parquet testcases are extensive and have good coverage. Re-use them for velox would be awesome