open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

Support hive HTTP transport mode #7793

Closed paf91 closed 8 months ago

paf91 commented 1 year ago

Affected module: Ingestion Framework

Describe the bug: When hive.server2.transport.mode is set to http instead of binary, the connection to Hive fails with a Thrift error because the client expects the binary transport. Setting the transport mode to http is necessary to work behind an Apache Knox proxy. More info: https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/securing-hive/topics/hive_secure_knox.html

To Reproduce

Screenshots or steps to reproduce: Connect to the Airflow scheduler pod using kubectl exec -it <pod> -- bash, then run python.

from sqlalchemy import *
from sqlalchemy.engine import create_engine

engine = create_engine('hive+https://server:8443/;ssl=true;transportMode=http;httpPath=gateway/cdp-proxy-api/hive', connect_args={'ssl_cert': 'none', 'check_hostname': False})

or

engine = create_engine('hive+https://server:8443/;ssl=true;transportMode=http;httpPath=gateway/cdp-proxy-api/hive', connect_args={'ssl_cert': '<cert>', 'check_hostname': False})

engine.connect()
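For reference, the semicolon-separated settings in the URL above can be pulled apart to show which endpoint an HTTP-mode client would need to target. This is only a minimal sketch using the standard library. The helper name hive_http_endpoint is hypothetical and not part of PyHive or SQLAlchemy, and it assumes nothing beyond the key=value layout seen in the URL above.

```python
from urllib.parse import urlsplit


def hive_http_endpoint(url: str) -> dict:
    """Split a JDBC-style Hive URL into its HTTP endpoint and session settings.

    Hypothetical helper for illustration; not part of PyHive or SQLAlchemy.
    """
    # Everything after the first ';' is a list of key=value settings.
    base, _, raw_settings = url.partition(";")
    settings = dict(
        item.split("=", 1) for item in raw_settings.split(";") if "=" in item
    )

    parts = urlsplit(base)
    # 'hive+https' -> transport scheme 'https'
    scheme = parts.scheme.split("+")[-1]
    http_path = settings.get("httpPath", "").lstrip("/")

    return {
        # HiveServer2 uses the binary transport unless told otherwise.
        "transport_mode": settings.get("transportMode", "binary"),
        "endpoint": f"{scheme}://{parts.hostname}:{parts.port}/{http_path}",
        "settings": settings,
    }


info = hive_http_endpoint(
    "hive+https://server:8443/;ssl=true;transportMode=http;"
    "httpPath=gateway/cdp-proxy-api/hive"
)
print(info["transport_mode"])
print(info["endpoint"])
```

A client speaking the binary Thrift protocol to that endpoint would not get a Thrift-framed reply back, which is consistent with the failure described in this report.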

Expected behavior: engine.connect() runs without error.

Version:


gmedici commented 1 year ago

We had a similar problem trying to connect to Hive over HTTPS with Kerberos authentication. After many attempts we fell back to using SSL with a custom patch (described in #10108).

As a suggestion, please consider using impyla instead of pyhive; it should cover both SSL and HTTP/S, with or without Kerberos authentication.

Needless to say, we are very interested in a definitive and stable solution to these problems. Keep up the good work!

pmbrull commented 1 year ago

@gmedici we added https://github.com/cloudera/impyla as a possible scheme. Could you try it out? Thanks
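For anyone trying this outside OpenMetadata first, here is a minimal sketch of what a direct impyla connection over HTTP(S) might look like. The keyword names (use_http_transport, http_path, use_ssl, auth_mechanism, user, password) are parameters of impala.dbapi.connect. The host, port, and path are placeholders taken from the URL in the original report, the helper names are hypothetical, and LDAP-style credentials through Knox are an assumption that may not match a given cluster.

```python
def knox_connect_kwargs(host, port, http_path, user, password):
    """Build keyword arguments for impala.dbapi.connect (hypothetical helper)."""
    return {
        "host": host,
        "port": port,
        "use_http_transport": True,  # Thrift over HTTP instead of binary
        "http_path": http_path,
        "use_ssl": True,
        "auth_mechanism": "LDAP",  # assumption; a Kerberos setup would differ
        "user": user,
        "password": password,
    }


def open_connection(kwargs):
    # Imported lazily so the sketch can be checked without impyla installed.
    from impala.dbapi import connect

    return connect(**kwargs)


kwargs = knox_connect_kwargs(
    "server", 8443, "gateway/cdp-proxy-api/hive", "<user>", "<password>"
)
# Requires impyla and a reachable server:
# conn = open_connection(kwargs)
# cursor = conn.cursor()
# cursor.execute("SHOW DATABASES")
# print(cursor.fetchall())
```

If this connects while the pyhive URL does not, that would point at the driver rather than at the Knox configuration.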

magpest commented 1 year ago

hi,

How can I test the impyla driver instead of pyhive? Is it the 'impala' scheme?

When I try 'hive+http', it looks like it still uses pyhive by default:

  File "/home/at/user/.conda/envs/omcli/lib/python3.10/site-packages/pyhive/hive.py", line 104, in connect
    return Connection(*args, **kwargs)
  File "/home/at/user/.conda/envs/omcli/lib/python3.10/site-packages/pyhive/hive.py", line 249, in __init__
    response = self._client.OpenSession(open_session_req)
  File "/home/at/user/.conda/envs/omcli/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 187, in OpenSession
    return self.recv_OpenSession()
  File "/home/at/user/.conda/envs/omcli/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 199, in recv_OpenSession
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "/home/at/user/.conda/envs/omcli/lib/python3.10/site-packages/thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin
    name = self.trans.readAll(sz)
  File "/home/at/user/.conda/envs/omcli/lib/python3.10/site-packages/thrift/transport/TTransport.py", line 68, in readAll
    raise EOFError()
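One way to check which package backs a given scheme is to list the installed SQLAlchemy dialect plugins. SQLAlchemy resolves third-party dialects through the sqlalchemy.dialects entry-point group, where a URL scheme such as hive+http is looked up under the name hive.http. The sketch below uses only the standard library. The function names are hypothetical, and the output depends entirely on what is installed in the environment.

```python
from importlib.metadata import entry_points


def entry_point_name(scheme: str) -> str:
    """'hive+http' -> 'hive.http', the name looked up for plugin dialects."""
    return scheme.replace("+", ".")


def registered_dialects(prefixes=("hive", "impala")) -> dict:
    """Map matching entry-point names to the module:class implementing them."""
    try:
        eps = entry_points(group="sqlalchemy.dialects")  # Python 3.10+
    except TypeError:
        # Older interpreters return a dict keyed by group name.
        eps = entry_points().get("sqlalchemy.dialects", [])
    return {
        ep.name: ep.value
        for ep in eps
        if ep.name.split(".")[0] in prefixes
    }


print(entry_point_name("hive+http"))
for name, target in sorted(registered_dialects().items()):
    print(f"{name} -> {target}")
```

If hive.http is listed with a target inside a pyhive module, that would match the traceback above, and any scheme whose target lives in an impala module would be the one served by impyla.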