open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0
5.55k stars 1.05k forks source link

Hive Table has no partition field #14479

Open win-shy opened 10 months ago

win-shy commented 10 months ago

When I add a Hive Service pull the schemas of hive table. But The table Schema has no partition field of hive. for example:

CREATE EXTERNAL TABLE `dw.XXX`(
  `id` string
PARTITIONED BY ( `year` int, `month` int, `day` int)

The fields of year 、month and day are all not being displayed on the Schema Page.

I wonder if this is a bug or is it just designed so. Thanks

darshanik commented 3 months ago

@win-shy @SumanMaharana @harshach

Per my observation. The Hive Ingestion Dialect for metastore (assuming that your metastore is in PostgreSQL) the existing method "_get_table_columns" is not querying partitions.

image

Link to above method just FYI: https://github.com/open-metadata/OpenMetadata/blob/1.4.5-release/ingestion/src/metadata/ingestion/source/database/hive/metastore_dialects/postgres/dialect.py

And the query is existing in latest releases along with the older one 1.3.1 (which I'm using)

If the query in the method is modified to something like following, the method can retrieve the partitions as well as the table fields in single shot:

select db_name, table_name, column_name from ( SELECT d."NAME" as db_name, t."TBL_NAME" as table_name, c."COLUMN_NAME" as column_name, c."INTEGER_IDX" as idx FROM "DBS" d JOIN "TBLS" t on t."DB_ID" = d."DB_ID" JOIN "SDS" s on t."SD_ID" = s."SD_ID" JOIN "COLUMNS_V2" c on c."CD_ID" = s."CD_ID" WHERE d."NAME" = '{schema}' and t."TBL_NAME" = '{table}' union SELECT d."NAME" as db_name, t."TBL_NAME" as table_name, k."PKEY_NAME" as column_name, 50000 + k."INTEGER_IDX" FROM "DBS" d JOIN "TBLS" t on t."DB_ID" = d."DB_ID" join "PARTITION_KEYS" k on t."TBL_ID" = k."TBL_ID" where d."NAME" = '{schema}' and t."TBL_NAME" = '{table}' ) x order by db_name, table_name, idx

Current Problem:

The reason for missing partition fields in the tables metadata is they are not existing as table columns in hive metastore.

Because it is the design and working nature of Hive which can identify partition fields and need not be defined as table fields.

As the partition is nothing but the directory location for Hive in the distributed file system to quickly lookup the specific data from a specific range instead of loading the whole data there by saving a lot of time and computation.

So when Openmetadata ingests tables from hive-metastore, the query it launches does not take into account the partition fields but only the table fields

image

The Actual Schema of the table:

image

Solution:

Query in get table column method may be modified to yield all fields with partitions such as below:

image