trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.48k stars 3.01k forks source link

Partition location does not exist in hive external table #620

Open wyukawa opened 5 years ago

wyukawa commented 5 years ago

Hi, We have some hive external tables which attach remote hdfs.

For example,

ddl

CREATE EXTERNAL TABLE hoge.piyo (
...
)
PARTITIONED BY (service STRING, yyyymmdd STRING)
STORED AS RCFILE
LOCATION 'webhdfs://...'

query

SELECT * FROM hive.hoge."piyo" WHERE service='...' AND yyyymmdd='...' LIMIT 100

We encountered the following error in prestosql 307 but this error didn't occur in 306

Caused by: io.prestosql.spi.PrestoException: Partition location does not exist: webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=...
        at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.processException(HiveFileIterator.java:163)
        at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:133)
        at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:118)
        at io.prestosql.plugin.hive.util.HiveFileIterator.getLocatedFileStatusRemoteIterator(HiveFileIterator.java:107)
        at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:100)
        at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:37)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
        at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
        at java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
        at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
        at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
        at java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
        at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
        at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:261)
        at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:93)
        at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:187)
        at io.prestosql.plugin.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
        at io.prestosql.plugin.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
        at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
        at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
martint commented 5 years ago

This may be related to https://github.com/prestosql/presto/issues/619. It's possible the query is now touching partitions that are "broken" that were previously being pruned.

electrum commented 5 years ago

I'm not clear on the problem. I see two potential issues

Can you confirm which of these is the problem?

wyukawa commented 5 years ago

These partitions are not broken because directory exists.

$ hadoop fs -ls webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=...
... webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=.../000003_0
... webhdfs://.../apps/hive/warehouse/.../service=.../yyyymmdd=.../000017_0

The webhdfs protocol works because there is no error when we create a non-partitioned external table pointing at a WebHDFS location like this.

create table test(string s)
LOAD DATA LOCAL INPATH 'test.txt'  OVERWRITE INTO TABLE default.test
CREATE EXTERNAL TABLE default.test (
str string
)
LOCATION 'webhdfs://.../apps/hive/warehouse/test'
presto:default> select * from default.test limit 10;
 str
-----
 a
 b
 c
(3 rows)
electrum commented 5 years ago

Hmm, this is strange. Can you post the full stack trace for the failure? I'm wondering if there is a sporadic WebHDFS error. We did update the Hadoop library in 307 from 2.7.7 to 3.2.0.

Does it always fail on the same partition or file system location?

wyukawa commented 5 years ago

Does it always fail on the same partition or file system location?

Yes.

Here is the full stack trace for the failure when I use a presto-cli.

ddl

CREATE EXTERNAL TABLE hoge.piyo (
...
)
PARTITIONED BY (service STRING, yyyymmdd STRING)
STORED AS RCFILE
LOCATION 'webhdfs://.../apps/hive/warehouse/piyo'

cli error example

$ java -jar presto-cli-307-executable.jar --server ... --catalog hive --schema default --debug
presto:default> SELECT * FROM hive.hoge."piyo" WHERE service='fuga' AND yyyymmdd='20170101' LIMIT 100;

Query 20190412_055342_00020_kunyf, FAILED, 1 node
http://.../ui/query.html?20190412_055342_00020_kunyf
Splits: 17 total, 0 done (0.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 0% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
Peak Memory: 0B
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20190412_055342_00020_kunyf failed: Partition location does not exist: webhdfs://.../apps/hive/warehouse/piyo/service=fuga/yyyymmdd=20170101
io.prestosql.spi.PrestoException: Partition location does not exist: webhdfs://.../apps/hive/warehouse/piyo/service=fuga/yyyymmdd=20170101
    at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.processException(HiveFileIterator.java:163)
    at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:133)
    at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:118)
    at io.prestosql.plugin.hive.util.HiveFileIterator.getLocatedFileStatusRemoteIterator(HiveFileIterator.java:107)
    at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:100)
    at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:37)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
    at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
    at java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
    at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
    at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
    at java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
    at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:261)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:93)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:187)
    at io.prestosql.plugin.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
    at io.prestosql.plugin.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
    at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
    at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
ebyhr commented 5 years ago

Could reproduce it in my laptop using version 308 and prestodb/hdp2.6-hive:11 docker image. Exactly, partition with webhdfs throws Partition location does not exist even if it exists. Also, it happens with both managed and external table. The failed place in hadoop library is FileSystem.listLocatedStatus.

Internal call:

Created hadoop jira ticket https://issues.apache.org/jira/browse/HADOOP-16258 to ask the above diff.

hive> create table test_part (c1 int) partitioned by (dt int);
hive> insert into test_part partition(dt=1) values (1);
hive> create external table test_part_ext (c1 int) partitioned by (dt int) location 'webhdfs://hadoop-master:50070/user/hive/warehouse/test_part';
hive> alter table test_part_ext add partition (dt=1);
hive> select * from test_part_ext;
OK
1   1

presto> select * from hive.default.test_part_ext;

Query 20190416_012054_00004_zvjje failed: Partition location does not exist: webhdfs://hadoop-master:50070/user/hive/warehouse/test_part/dt=1

presto> select * from hive.default."test_part_ext$partitions";
 dt 
----
  1 

By the way, I needed to add some jar (javax.ws.rs-api and jersey-common) to use webhdfs. I got this error https://github.com/prestodb/presto/issues/6697 without the jars.

jainanuj07 commented 3 years ago

I have also faced similar problem. I have a external table over S3 storage. Initially the table was partitioned as (year month day hour) but recently i recreated table with partition as (year month day) only . Now my query are failing with error

io.prestosql.spi.PrestoException: Partition location does not exist: year=2020/month=02/day=25/hour=09 at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.processException(HiveFileIterator.java:183) at io.prestosql.plugin.hive.util.HiveFileIterator$FileStatusIterator.(HiveFileIterator.java:153) at io.prestosql.plugin.hive.util.HiveFileIterator.getLocatedFileStatusRemoteIterator(HiveFileIterator.java:117) at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:107) at io.prestosql.plugin.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:38) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136) at java.base/java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811) at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294) at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206) at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:169) at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300) at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681) at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:328) at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:254) at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38) at io.prestosql.$gen.Presto_346____20201123_160452_2.run(Unknown Source) at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)

I do not understand from where its fetching this old partition information and why its not updated.

dendihandian commented 1 year ago

I'm facing this partition location does not exists when performing join query to write new data in idempotent way from raw table to transformed table. something like this:

INSERT INTO catalog.database_1.transformed_table
SELECT *
FROM catalog.database_1.raw_table
LEFT JOIN catalog.database_1.transformed_table ON raw_table.id = transformed_table.id
WHERE 
   raw_table.y = '2023' AND raw_table.m = '04' AND raw_table.d = '27' AND
   transformed_table.id IS NULL

I'm using Presto installed in AWS EMR Cluster. Hopefully someone could help me through this.

nguyenann13 commented 6 months ago

I am having the same issue but it only happens when I am using DBeaver to create a table. The source table isn't partitioned at all.

create table anx.spclf_workdb.memsc as select * from anx.jwantig_workdb.memtable;


io.trino.spi.TrinoException: Partition location does not exist: abfss://data@vzne2starburststage01.dfs.core.windows.net/annex/spclf_workdb.db/memsc
at io.trino.plugin.hive.BackgroundHiveSplitLoader.checkPartitionLocationExists(BackgroundHiveSplitLoader.java:650)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.createInternalHiveSplitIterator(BackgroundHiveSplitLoader.java:641)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:493)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:368)
at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:282)
at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
at io.trino.$gen.Trino_435_e_3____20240516_212345_2.run(Unknown Source)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)```