trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.25k stars 2.95k forks source link

ThriftHiveMetastoreClient not close hive client correctly could potentially cause memory leak for hivemetastore. #9346

Open boneanxs opened 3 years ago

boneanxs commented 3 years ago

We have a abstract layer, ThriftHiveMetastoreClient to wrap hive metastore client, but it seems not close hive metastore client correctly. In HiveMetaStoreClient, it first call shutdown to clean rawStore and some other threadlocal configures. can see here: https://github.com/apache/hive/blob/817ff27e96ecfce6c70c5850830e55a6e6f37da6/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L824 but we only call transport.close() to close port

    @Override
    public void close()
    {
        transport.close();
    }

can see: https://github.com/trinodb/trino/blob/fd878a6a71cb485acae9be4dbbb20c29147427a9/plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/thrift/ThriftHiveMetastoreClient.java#L107

Are there any other concerns we close client like this, or might just a mistake?

findepi commented 3 years ago

If i read correctly, the question is whether we should additionally have

client.shutdown();

in io.trino.plugin.hive.metastore.thrift.ThriftHiveMetastoreClient#close.

Do you know what this would can change in practice for the metastore side? The connection is going to be closed anyway.

boneanxs commented 3 years ago

Yeah, I think we should call client.shutdown() explicitly in presto side. After client side call shutdown, HMS will shutdown rawStore(which remove PersistenceManager from PersistenceManagerFactory's cache), remove threadlocal variables.

You can find relate codes here: HMSHandler.java#L991, HMSHandler.java#L181

If we don't call shutdown, after HMSHandler is released, as PersistenceManager is not removed from cache, it would never be garbage collected, which could cause memory leak from HMS side(If the hive version is lower than 2.1). You can find a similar issue HIVE-14187(Client side not call MetaStoreClient#close, which calls shutdown to remove all cache)

findepi commented 3 years ago

Thanks for the pointers.

Seems we can call client.shutdown() and we should ignore (probably logging) any errors resulting from it. Is it possible that the call blocks though, or takes long? Especially for HMS >= 2.1, which doesn't need the call?

boneanxs commented 3 years ago

Thank you for raising your concerns.

After check the HMS codes, I don't think it could block the call, also, spark will always call shutdown when it disconnects with HMS.

I think there could be another issue we should take into consideration. We upgrade HMS version to 3.1 in our production environment, and found presto become slower when build connection with HMS. After traced some methods, it showed that HMS need to initialize rawStore when new connections coming(as lower version of HMS will not auto close rawStore, neither from presto side, so the rawStore could be cached into threadLocal and be reused).

Anyway, I think we should call client.shutdown from presto side first, and for the performance issue above, we can construct a connection pool with HMS from presto side, or raise a ticket for Hive, let them fix it.

findepi commented 3 years ago

I think there could be another issue we should take into consideration. We upgrade HMS version to 3.1 in our production environment, and found presto become slower when build connection with HMS. After traced some methods, it showed that HMS need to initialize rawStore when new connections coming(as lower version of HMS will not auto close rawStore, neither from presto side, so the rawStore could be cached into threadLocal and be reused).

cc @kokosing @electrum

boneanxs commented 2 years ago

@kokosing @electrum gentle ping...