nqvuong1998 opened 4 months ago
cc @ayush-shah
Same issue here; it seems complex data types are not handled in OpenMetadata. The Hive/Impala connector doesn't process complex data types at all, while the Trino connector handles them for the schema but not for sample data or the data profiler.
probably related to https://github.com/open-metadata/OpenMetadata/issues/15627
Hi @TeddyCr @ayush-shah @harshach, any update on this issue?
Hello @nqvuong1998, we will discuss internally and see which release it can be a part of. Until then, it would be great if you could provide us with the DDL of the table. Also, since this is open source, we encourage people to contribute; let us know if you'd like to contribute and we will help wherever needed. Thanks 🙏
@nqvuong1998 can you share the OpenMetadata version you are on, any logs you have, and the table DDL? We could not reproduce this on our end; JSON/STRUCT fields in sample data are ingested as expected.
Hi @TeddyCr ,
We updated OpenMetadata to 1.5.6 (the latest version).
DDL: SHOW CREATE TABLE pmc.curated_pmc_promotion_transaction_prod_event_v1_sid72;
CREATE TABLE pmc.curated_pmc_promotion_transaction_prod_event_v1_sid72 (
  key STRING,
  payload STRUCT<
    promotiontransactionid: STRING,
    validto: STRING,
    vouchercode: STRING,
    vouchername: STRING,
    flowapplied: STRING,
    status: STRING,
    reftransactionid: STRING,
    initialoriginalamount: DECIMAL(19,2),
    discountamount: DECIMAL(19,2),
    initialfinalamount: DECIMAL(19,2),
    initialactualamount: DECIMAL(19,2),
    cuid: STRING,
    contractnumber: STRING,
    supplementid: STRING,
    creationdate: STRING,
    adjustmenthistory: ARRAY<STRUCT<
      id: STRING,
      refundrequestid: STRING,
      refundamount: DECIMAL(19,2),
      status: STRING,
      refundresulttime: STRING
    >>,
    prcode: STRING,
    campaigncode: STRING,
    paymentmethod: STRING
  >,
  kafka_topic STRING,
  kafka_partition INT,
  kafka_offset BIGINT,
  kafka_timestamp TIMESTAMP,
  kafka_timestamp_type INT,
  ingested_by STRING,
  ingestion_time TIMESTAMP,
  hour INT,
  hash STRING
)
PARTITIONED BY (
  date BIGINT
)
WITH SERDEPROPERTIES (
  'partitionOverwriteMode' = 'dynamic',
  'path' = 'hdfs://nameservice1/user/hive/warehouse/pmc.db/curated_pmc_promotion_transaction_prod_event_v1_sid72',
  'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/user/hive/warehouse/pmc.db/curated_pmc_promotion_transaction_prod_event_v1_sid72'
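For reference, the nested payload is readable when we query it directly and cast the ROW to JSON, so the data itself is retrievable even though the profiler fails on it. A minimal sketch with the trino Python client (host, port, and user below are placeholders, not our real setup):

import trino

# Placeholder connection details; adjust to your Trino deployment.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    user="om_profiler",
    catalog="bdp",
    schema="pmc",
    http_scheme="https",
)
cur = conn.cursor()
# CAST(row AS JSON) is standard Trino: it flattens the nested
# STRUCT/ARRAY column into a single JSON string per row.
cur.execute(
    "SELECT key, CAST(payload AS JSON) AS payload_json "
    "FROM curated_pmc_promotion_transaction_prod_event_v1_sid72 LIMIT 5"
)
for row in cur.fetchall():
    print(row)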
Logs:
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | From | Entity Name | Message | Stack Trace |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +==============+=========================================================================+==============================================+===============+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | OpenMetadata | trino_bdp.bdp.pmc.curated_pmc_promotion_transaction_prod_event_v1_sid43 | Error trying to ingest sample data for table | |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | OpenMetadata | trino_bdp.bdp.pmc.curated_pmc_promotion_transaction_prod_event_v1_sid47 | Error trying to ingest sample data for table | |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | OpenMetadata | trino_bdp.bdp.pmc.curated_pmc_promotion_transaction_prod_event_v1_sid72 | Error trying to ingest sample data for table | |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] [2024-10-13 02:52:30] INFO {metadata.Utils:logger:178} - Success %: 75.0
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] [2024-10-13 02:52:30] INFO {metadata.Utils:logger:178} - Workflow finished in time: 4.0m 19.15s
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] Traceback (most recent call last):
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] File "/usr/local/bin/metadata", line 8, in <module>
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] sys.exit(metadata())
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] ^^^^^^^^^^
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] File "/usr/local/lib/python3.11/site-packages/metadata/cmd.py", line 156, in metadata
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] RUN_PATH_METHODS[metadata_workflow](path)
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] File "/usr/local/lib/python3.11/site-packages/metadata/cli/profile.py", line 51, in run_profiler
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] workflow.raise_from_status()
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] File "/usr/local/lib/python3.11/site-packages/metadata/workflow/workflow_status_mixin.py", line 134, in raise_from_status
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] raise err
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] File "/usr/local/lib/python3.11/site-packages/metadata/workflow/workflow_status_mixin.py", line 131, in raise_from_status
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] self.raise_from_status_internal(raise_warnings)
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] File "/usr/local/lib/python3.11/site-packages/metadata/workflow/ingestion.py", line 163, in raise_from_status_internal
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] raise WorkflowExecutionError(
[2024-10-13, 09:52:34 +07] {pod_manager.py:490} INFO - [base] metadata.config.common.WorkflowExecutionError: OpenMetadata reported errors: OpenMetadata Summary: [3 Records, [0 Updated Records, 0 Warnings, 3 Errors, 0 Filtered]
Expectation: OpenMetadata should be able to show sample data for complex types in raw JSON format. A dummy example:
{
  "key": "abc123",
  "payload": {
    "promotiontransactionid": "promo_001",
    "validto": "2024-12-31",
    "vouchercode": "VOUCHER2024",
    "vouchername": "Holiday Discount",
    "flowapplied": "Purchase",
    "status": "Active",
    "reftransactionid": "ref_12345",
    "initialoriginalamount": 100.00,
    "discountamount": 20.00,
    "initialfinalamount": 80.00,
    "initialactualamount": 80.00,
    "cuid": "cuid_67890",
    "contractnumber": "CNTR2024",
    "supplementid": "SUPP123",
    "creationdate": "2024-10-15T10:00:00Z",
    "adjustmenthistory": [
      {
        "id": "adj_001",
        "refundrequestid": "rr_001",
        "refundamount": 10.00,
        "status": "Refunded",
        "refundresulttime": "2024-10-16T11:00:00Z"
      },
      {
        "id": "adj_002",
        "refundrequestid": "rr_002",
        "refundamount": 5.00,
        "status": "Pending",
        "refundresulttime": "2024-10-17T12:00:00Z"
      }
    ],
    "prcode": "PRCODE2024",
    "campaigncode": "CMP2024",
    "paymentmethod": "Credit Card"
  },
  "kafka_topic": "promotion_events",
  "kafka_partition": 1,
  "kafka_offset": 123456,
  "kafka_timestamp": "2024-10-15T10:05:00Z",
  "kafka_timestamp_type": 0,
  "ingested_by": "user_001",
  "ingestion_time": "2024-10-15T10:06:00Z",
  "hour": 10,
  "hash": "abcdef1234567890",
  "date": 20241015
}
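A minimal sketch of how such a row could be serialized into that raw JSON, assuming complex values come back as plain Python dicts and lists; Decimal and datetime values need a fallback encoder since json.dumps cannot handle them natively:

import json
from datetime import datetime
from decimal import Decimal

def to_jsonable(value):
    """Fallback encoder for types json.dumps cannot serialize natively."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)

# Illustrative row fragment only; real rows would carry the full payload.
row = {"key": "abc123", "payload": {"discountamount": Decimal("20.00")}}
print(json.dumps(row, default=to_jsonable, indent=2))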
Can you share the full log files (running with DEBUG logging would be helpful)? Feel free to DM them to me in our Slack channel. I see 3 errors in there -- I'd be interested to see what they are.
Hi @TeddyCr @ayush-shah ,
Trino Profiler config:
source:
  type: trino
  serviceName: trino_bdp
  serviceConnection:
    config:
      type: Trino
      hostPort: $TRINO_HOST_PORT
      username: $TRINO_USERNAME
      authType:
        # For basic auth
        password: $TRINO_PASSWORD
      catalog: bdp
      connectionArguments:
        verify: /data/ca.pem
  sourceConfig:
    config:
      type: Profiler
      generateSampleData: true
      sampleDataCount: 70
      computeMetrics: false
      profileSampleType: PERCENTAGE
      profileSample: 100
      processPiiSensitive: false
      confidence: 80
      threadCount: 5
      timeoutSeconds: 43200
      includeViews: false
      schemaFilterPattern:
        includes:
          - ^pmc$
processor:
  type: orm-profiler
  config: {}
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: DEBUG
  openMetadataServerConfig:
    hostPort: $OM_HOST_PORT
    authProvider: openmetadata
    securityConfig:
      jwtToken: $OM_JWT_TOKEN
    ## Store the service Connection information
    storeServiceConnection: false
    ## If SSL, fill the following
    verifySSL: validate
    sslConfig:
      caCertificate: /data/ca.pem
Full log: trino_profiler_log.txt
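For context, this YAML is executed roughly as below, following the documented ProfilerWorkflow pattern; the file name trino_profiler.yaml is a placeholder. The raise_from_status() call in the traceback above corresponds to the second-to-last step:

import yaml
from metadata.workflow.profiler import ProfilerWorkflow

def run_profiler() -> None:
    # Load the workflow YAML shown above (placeholder file name).
    with open("trino_profiler.yaml") as f:
        workflow_config = yaml.safe_load(f)
    workflow = ProfilerWorkflow.create(workflow_config)
    workflow.execute()
    # Raises WorkflowExecutionError when any table fails, as in our logs.
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()

if __name__ == "__main__":
    run_profiler()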
Is your feature request related to a problem? Please describe.
When ingesting sample data from Hive tables via Trino, we encounter the error "Error trying to ingest sample data for table" on tables that have complex data types.
Describe the solution you'd like
There are 2 solutions: