opensearch-project / opensearch-catalog

The OpenSearch Catalog is designed to make it easier for developers and the community to contribute, search, and install artifacts such as plugins, visualization dashboards, and ingestion-to-visualization content packs (data pipeline configurations, normalization, ingestion, dashboards).

[BUG] Protocol column in VPC flow log parquet file is INT32, but Spark tried to read it as BIGINT #167

Open · YANG-DB opened this issue 3 days ago

YANG-DB commented 3 days ago

What is the bug? The protocol column in the VPC flow log parquet file is INT32, but Spark tried to read it as BIGINT, which caused the streaming job to fail.

org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://kmf-zero-etl-demo/AWSLogs/aws-account-id=****/aws-service=vpcflowlogs/aws-region=us-east-2/year=2024/month=05/day=25/hour=05/****_vpcflowlogs_us-east-2_fl-*****.log.parquet. Column: [protocol], Expected: bigint, Found: INT32
    at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:724)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:397)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:227)
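A minimal workaround sketch until the table definition is fixed, assuming downstream consumers still expect BIGINT: let Spark read the parquet with its native schema (INT32 arrives as IntegerType) and widen the column explicitly, instead of letting a BIGINT table definition force the read type. The path, app name, and object name here are illustrative placeholders, not the integration's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ProtocolCastWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("vpc-flow-log-protocol-cast")
      .getOrCreate()

    // Illustrative path; the real S3 prefix is redacted in the trace above.
    val path = "s3://my-bucket/AWSLogs/"

    // Read with the file's own schema, then widen protocol from INT to BIGINT
    // so code written against the current (BIGINT) definition keeps working.
    val flowLogs = spark.read.parquet(path)
      .withColumn("protocol", col("protocol").cast("bigint"))

    flowLogs.printSchema()
    spark.stop()
  }
}
```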

What is the expected behavior? The VPC SQL table definition should match the original VPC flow log specification:

The protocol column is INT32 in the VPC flow logs documentation, but the Athena CREATE TABLE example uses BIGINT, which I believe our integration is based on. A sketch of what a corrected definition might look like is shown below.
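For illustration only, a hypothetical corrected Spark SQL definition with protocol declared as INT, so the expected type matches the INT32 physical type in the parquet files. The table name, location, and the trimmed column list are placeholders, not the integration's actual statement (a real table over the AWSLogs layout would also need partition columns).

```scala
// Hypothetical corrected DDL, run through the same Spark session as above.
// Only a few representative VPC flow log fields are shown; protocol is the
// one that matters here: INT instead of BIGINT, matching the parquet INT32.
spark.sql("""
  CREATE TABLE IF NOT EXISTS vpc_flow_logs (
    version  INT,
    srcaddr  STRING,
    dstaddr  STRING,
    srcport  INT,
    dstport  INT,
    protocol INT,
    packets  BIGINT,
    bytes    BIGINT,
    action   STRING
  )
  USING parquet
  LOCATION 's3://my-bucket/AWSLogs/'
""")
```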
