Open wyukawa opened 5 years ago
This may be related to https://github.com/prestosql/presto/issues/619. It's possible the query is now touching partitions that are "broken" that were previously being pruned.
I'm not clear on the problem. I see two potential issues
WHERE
clause, and are not being filtered.webhdfs
protocol no longer works. If this is the case, it should be easy to repo by creating a non-partitioned external table pointing at a WebHDFS location.Can you confirm which of these is the problem?
These partitions are not broken because directory exists.
$ hadoop fs -ls webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=...
... webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=.../000003_0
... webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=.../000017_0
The webhdfs protocol works because there is no error when we create a non-partitioned external table pointing at a WebHDFS location like this.
create table test(string s)
LOAD DATA LOCAL INPATH 'test.txt' OVERWRITE INTO TABLE default.test
CREATE EXTERNAL TABLE default.test (
str string
)
LOCATION 'webhdfs://.../apps/hive/warehouse/test'
presto:default> select * from default.test limit 10;
str
-----
a
b
c
(3 rows)
Hmm, this is strange. Can you post the full stack trace for the failure? I'm wondering if there is a sporadic WebHDFS error. We did update the Hadoop library in 307 from 2.7.7
to 3.2.0
.
Does it always fail on the same partition or file system location?
Does it always fail on the same partition or file system location?
Yes.
Here is the full stack trace for the failure when I use a presto-cli.
ddl
CREATE EXTERNAL TABLE hoge.piyo (
...
)
PARTITIONED BY (service STRING, yyyymmdd STRING)
STORED AS RCFILE
LOCATION 'webhdfs://.../apps/hive/warehouse/piyo'
cli error example
$ java -jar presto-cli-307-executable.jar --server ... --catalog hive --schema default --debug
presto:default> SELECT * FROM hive.hoge."piyo" WHERE service='fuga' AND yyyymmdd='20170101' LIMIT 100;
Query 20190412_055342_00020_kunyf, FAILED, 1 node
http://.../ui/query.html?20190412_055342_00020_kunyf
Splits: 17 total, 0 done (0.00%)
CPU Time: 0.0s total, 0 rows/s, 0B/s, 0% active
Per Node: 0.0 parallelism, 0 rows/s, 0B/s
Parallelism: 0.0
Peak Memory: 0B
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20190412_055342_00020_kunyf failed: Partition location does not exist: webhdfs://.../apps/hive/warehouse/piyo/service=fuga/yyyymmdd=20170101
io.prestosql.spi.PrestoException: Partition location does not exist: webhdfs://.../apps/hive/warehouse/piyo/service=fuga/yyyymmdd=20170101
at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.processException(HiveFileIterator.java:163)
at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:133)
at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:118)
at io.prestosql.plugin.hive.util.HiveFileIterator.getLocatedFileStatusRemoteIterator(HiveFileIterator.java:107)
at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:100)
at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:37)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
at java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
at java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:261)
at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:93)
at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:187)
at io.prestosql.plugin.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
at io.prestosql.plugin.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Could reproduce it in my laptop using version 308 and prestodb/hdp2.6-hive:11 docker image. Exactly, partition with webhdfs throws Partition location does not exist
even if it exists. Also, it happens with both managed and external table. The failed place in hadoop library is FileSystem.listLocatedStatus
.
Internal call:
Created hadoop jira ticket https://issues.apache.org/jira/browse/HADOOP-16258 to ask the above diff.
hive> create table test_part (c1 int) partitioned by (dt int);
hive> insert into test_part partition(dt=1) values (1);
hive> create external table test_part_ext (c1 int) partitioned by (dt int) location 'webhdfs://hadoop-master:50070/user/hive/warehouse/test_part';
hive> alter table test_part_ext add partition (dt=1);
hive> select * from test_part_ext;
OK
1 1
presto> select * from hive.default.test_part_ext;
Query 20190416_012054_00004_zvjje failed: Partition location does not exist: webhdfs://hadoop-master:50070/user/hive/warehouse/test_part/dt=1
presto> select * from hive.default."test_part_ext$partitions";
dt
----
1
By the way, I needed to add some jar (javax.ws.rs-api and jersey-common) to use webhdfs. I got this error https://github.com/prestodb/presto/issues/6697 without the jars.
I have also faced similar problem. I have a external table over S3 storage. Initially the table was partitioned as (year month day hour) but recently i recreated table with partition as (year month day) only . Now my query are failing with error
io.prestosql.spi.PrestoException: Partition location does not exist: year=2020/month=02/day=25/hour=09
at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.processException(HiveFileIterator.java:183)
at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.
I do not understand from where its fetching this old partition information and why its not updated.
I'm facing this partition location does not exists
when performing join query to write new data in idempotent way from raw table to transformed table. something like this:
INSERT INTO catalog.database_1.transformed_table
SELECT *
FROM catalog.database_1.raw_table
LEFT JOIN catalog.database_1.transformed_table ON raw_table.id = transformed_table.id
WHERE
raw_table.y = '2023' AND raw_table.m = '04' AND raw_table.d = '27' AND
transformed_table.id IS NULL
I'm using Presto installed in AWS EMR Cluster. Hopefully someone could help me through this.
I am having the same issue but it only happens when I am using DBeaver to create a table. The source table isn't partitioned at all.
create table anx.spclf_workdb.memsc as select * from anx.jwantig_workdb.memtable;
io.trino.spi.TrinoException: Partition location does not exist: abfss://data@vzne2starburststage01.dfs.core.windows.net/annex/spclf_workdb.db/memsc
at io.trino.plugin.hive.BackgroundHiveSplitLoader.checkPartitionLocationExists(BackgroundHiveSplitLoader.java:650)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.createInternalHiveSplitIterator(BackgroundHiveSplitLoader.java:641)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:493)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:368)
at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:282)
at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
at io.trino.$gen.Trino_435_e_3____20240516_212345_2.run(Unknown Source)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)```
Hi, We have some hive external tables which attach remote hdfs.
For example,
ddl
query
We encountered the following error in prestosql 307 but this error didn't occur in 306