Open findepi opened 4 years ago
It might be that we simply run more tests on HDP3.
e.g. TestHiveTransactionalTable
.
Does it take so much more time?
In suite-2 executed on non-HDP3 environment, TestHiveTransactionalTable.testReadFullAcid
and TestHiveTransactionalTable.testReadInsertOnly
are skipped.
Due to these tests being in storage_formats
test group as well as hive_transactional
group, they are being executed in the second, third and fourth run-launcher
invocation in suite-2.
Each of these invocations runs each of the two tests in 8 different configurations of bucketing type and partitioning.
This adds about 10 minutes to each run-launcher
invocation.
In suite-3 the cause is the same - transactional tests are skipped in non-HDP3 envs.
However, this suite only executes the storage_formats
test group once, so the difference isn't as significant.
I suggest removing these tests from the storage_formats
test group.
Alternatively we can add hive_transactional
to the excluded groups in these invocations, but in this solution the same problem might still happen in other test suites.
Another, but smaller difference I noticed was increased HDP3 environment setup time - prestodev/hdp3.1-hive:28
docker image needed 4m53s to start vs. 2m20s for prestodev/hdp2.6-hive:28
This difference adds up with each run-launcher
invocation.
thanks @MiguelWeezardo for the analysis.
Let's
TestHiveTransactionalTable
tests from storage_formats
.TestHiveTransactionalTable
that will be storage_formats
and will exercise a full ACID partitioned table with default bucketing (just one configuration)
-- this is necessary to maintain test coverage for ACID tables with kerberized Metastore and/or HDFS,
with impersonation off/on.TestHiveTransactionalTable
.
While we definitely must be testing partitioned and non-partitioned and partitioned tables,
as well as all bucketing type options, we do not ned to test all combinations thereof.
It should be sufficient to have:
{ full acid | insert only } x { partitioned | not partitioned } x { no bucketing | default bucketing }
{ full acid } x { not partitioned } x { bucketing v1 | bucketing v2 }
@losipiuk @wendigo please verify this proposal
@findepi keep in mind that Caching tests transactions as part of storage_formats
group. We need to continue testing caching with transactions.
@findepi keep in mind that Caching tests transactions as part of
storage_formats
group. We need to continue testing caching with transactions.
We can explicitly add HIVE_TRANSACTIONAL
for caching related test run
thanks @MiguelWeezardo for the analysis.
Let's
- remove all
TestHiveTransactionalTable
tests fromstorage_formats
.- add a new test method in
TestHiveTransactionalTable
that will bestorage_formats
and will exercise a full ACID partitioned table with default bucketing (just one configuration) -- this is necessary to maintain test coverage for ACID tables with kerberized Metastore and/or HDFS, with impersonation off/on.let's reduce or remove matrix usage from
TestHiveTransactionalTable
. While we definitely must be testing partitioned and non-partitioned and partitioned tables, as well as all bucketing type options, we do not ned to test all combinations thereof. It should be sufficient to have:
{ full acid | insert only } x { partitioned | not partitioned } x { no bucketing | default bucketing }
{ full acid } x { not partitioned } x { bucketing v1 | bucketing v2 }
@losipiuk @wendigo please verify this proposal
I would go with:
{ full acid } x { partitioned } x { bucketing v1 | bucketing v2 }
- as this is a bit more nasty usecase.
The other way to approach it would be to drop storage_formats
from runs on:
singlenode-kerberos-hdfs-no-impersonation
singlenode-hdfs-impersonation
singlenode-kerberos-hdfs-impersonation
And instead tag only one of tests in TestHiveTransactionTable
(as well as all relevant tests which were in storage_formats in different classes) with hdfs_impersonation
and hdfs_no_impersonation
.
Ideally we would only use hdfs_impersonation
/hdfs_no_impersonation
for above runs. That way we would be sure that we do not get any tests duplication and that only relevant tests are run. Though it would require more care when assigning groups for new tests.
The other way to approach it would be to drop
storage_formats
from runs on:
singlenode-kerberos-hdfs-no-impersonation
singlenode-hdfs-impersonation
singlenode-kerberos-hdfs-impersonation
And instead tag only one of tests in
TestHiveTransactionTable
(as well as all relevant tests which were in storage_formats in different classes) withhdfs_impersonation
andhdfs_no_impersonation
. Ideally we would only usehdfs_impersonation
/hdfs_no_impersonation
for above runs. That way we would be sure that we do not get any tests duplication and that only relevant tests are run. Though it would require more care when assigning groups for new tests.
That's good idea, but I want this to be considered a follow-up.
The reason why kerberized envs have storage_formats
run is because storage formats
exercise different code paths, and this is the way to identify all the missing doAs's.
@losipiuk please create an issue to track this.
I would go with:
{ full acid } x { partitioned } x { bucketing v1 | bucketing v2 }
- as this is a bit more nasty usecase.
@losipiuk you mean this instead of all i proposed? or instead of part of it?
We can explicitly add HIVE_TRANSACTIONAL for caching related test run
Good idea. However, I think it should be enough to add hive_caching
to the
one test method I singled out that should be kept in storage_formats
:
add a new test method in TestHiveTransactionalTable that will be storage_formats
I can take the action here.
I would go with:
{ full acid } x { partitioned } x { bucketing v1 | bucketing v2 }
- as this is a bit more nasty usecase.@losipiuk you mean this instead of all i proposed? or instead of part of it?
i see now. This was instead of { full acid } x { partitioned } x { bucketing v1 | bucketing v2 }
The reason why i think non-partitioned is enough, is that default bucketing is generally "v2" (hive 3+). So we actually test the partitioned v2 case.
I want to test remaining bucketing options in a non-partitioned case, becasue i expect this be faster, without decrease of coverage.
I've coded my proposal -- https://github.com/prestosql/presto/pull/4332
We could do more, however:
testReadFullAcid
test executes on Hive: INSERT, INSERT, INSERT, MINOR compaction, DELETE, UPDATE, MAJOR compaction
.
Even after the above PR, we run this 6 times (in various configurations) as part of hive_transactional
group and 1 time more in storage_formats
group. It feels redundant.testReadInsertOnly
Not sure if related but just got a timeout related failure here https://github.com/trinodb/trino/pull/16205/checks?check_run_id=12768456819
suite-1
it's faster on hdp3
suite-2
suite-3