spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark

Error when storing JSON data in SUPER column #156

Open vnktsh opened 5 months ago

vnktsh commented 5 months ago

Hi, does this package support Redshift's SUPER data type?

I'm inserting valid JSON data into a SUPER column and getting the following error:

java.sql.SQLException:
Error (code 1224) while loading data into Redshift: "Format with multiple values without array or object"
Table name: "PUBLIC"."telemetry_data"
Column name: telemetry
Column type: super(16384000)
Raw line: "{"created":"2024-03-16T18:22:56.258Z","deviceSerial":"5500TX-0000-0036","deviceId":"000799aa5b66","bundle":[{"category":"ExtTlsTelemetry","created":"2024-03-16T10:30:00.445Z","data":{"sslTierStats":{"rxSSLMbps":0,"rxSSLMaxMbps":0,"rxSSLPktsPerSec":0,"rxSSLMaxPktsPerSec":0,"txSSLMbps":0,"txSSLMaxMbps":0,"txSSLPktsPerSec":0,"txSSLMaxPktsPerSec":0},"npSslI...
Raw field value: "{"created":"2024-03-16T18:22:56.258Z","deviceSerial":"55036","deviceId":"00066","bundle":[{"category":"ExtTlsTelemetry","created":"2024-03-16T10:30:00.445Z","data":{"sslTierStats":{"rxSSLMbps":0,"rxSSLMaxMbps":0,"rxSSLPktsPerSec":0,"rxSSLMaxPktsPerSec":0,"txSSLMbps":0,"txSSLMaxMbps":0,"txSSLPktsPerSec":0,"txSSLMaxPktsPerSec":0},"npSslInspTrafficStats":{"clientBytesIn":0,"clientPacketsIn":0,"serverBytesIn":0,"serverPacketsIn":0,"clientBytesToD":0,"clientPacketsToD":0,"clientBytesToI":0,"clientPacketsToI":0,"serverBytesToD":0,"serverPacketsToD":0,"serverBytesToI":0,"serverPacketsToI":0},"npSslInspStats":{"sslConnections":0,"sslRuleConnections":0,"sslNoRuleConnections":0,"inspectedSessions":0,"shuntedSessions":0,"blockedMaxSslConnections":0,"allowedMaxSslConnections":0,"maxSessions":0,"percentMaxSessions":0,"blockedCriticalBigHeapUse":0,"allowedCriticalBigHeapUse":0},"sslConcurrentConnections":0,"sslProxyConfig":{"nUniqueCerts":0,"nUniqueCidrs":0,"serverAddresses":[],"certificateInfo":{}}}}]}"

  at io.github.spark_redshift_community.spark.redshift.RedshiftWriter.$anonfun$doRedshiftLoad$2(RedshiftWriter.scala:200)

This is the main error: Error (code 1224) while loading data into Redshift: "Format with multiple values without array or object"

This is the data I'm trying to insert (a valid JSON string):

{"created":"2024-03-16T18:22:56.258Z","deviceSerial":"55036","deviceId":"00066","bundle":[{"category":"ExtTlsTelemetry","created":"2024-03-16T10:30:00.445Z","data":{"sslTierStats":{"rxSSLMbps":0,"rxSSLMaxMbps":0,"rxSSLPktsPerSec":0,"rxSSLMaxPktsPerSec":0,"txSSLMbps":0,"txSSLMaxMbps":0,"txSSLPktsPerSec":0,"txSSLMaxPktsPerSec":0},"npSslInspTrafficStats":{"clientBytesIn":0,"clientPacketsIn":0,"serverBytesIn":0,"serverPacketsIn":0,"clientBytesToD":0,"clientPacketsToD":0,"clientBytesToI":0,"clientPacketsToI":0,"serverBytesToD":0,"serverPacketsToD":0,"serverBytesToI":0,"serverPacketsToI":0},"npSslInspStats":{"sslConnections":0,"sslRuleConnections":0,"sslNoRuleConnections":0,"inspectedSessions":0,"shuntedSessions":0,"blockedMaxSslConnections":0,"allowedMaxSslConnections":0,"maxSessions":0,"percentMaxSessions":0,"blockedCriticalBigHeapUse":0,"allowedCriticalBigHeapUse":0},"sslConcurrentConnections":0,"sslProxyConfig":{"nUniqueCerts":0,"nUniqueCidrs":0,"serverAddresses":[],"certificateInfo":{}}}}]}

This is my corresponding DDL for table PUBLIC.telemetry_data:

create table public.telemetry_data
(
    telemetry     super encode zstd
);
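
For context, the write path looks roughly like this (a minimal sketch, not my exact job; the connection URL, tempdir, and IAM role are placeholders):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("telemetry-load").getOrCreate()
    import spark.implicits._

    // Abbreviated version of the JSON document shown above.
    val jsonString = """{"created":"2024-03-16T18:22:56.258Z","deviceSerial":"55036","deviceId":"00066","bundle":[]}"""

    // Single string column holding the JSON document.
    val df = Seq(jsonString).toDF("telemetry")

    df.write
      .format("io.github.spark_redshift_community.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
      .option("dbtable", "public.telemetry_data")
      .option("tempdir", "s3a://<bucket>/tmp/")
      .option("aws_iam_role", "arn:aws:iam::<account>:role/<role>")
      .mode(SaveMode.Append)
      .save()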

Environment details

Package version:

        <dependency>
            <groupId>io.github.spark-redshift-community</groupId>
            <artifactId>spark-redshift_2.12</artifactId>
            <version>6.2.0-spark_3.5</version>
        </dependency>
melin commented 4 months ago

I also encountered a similar error; is there a solution? STL_LOAD_ERRORS.csv

melin commented 4 months ago

With similar SQL, the Glue ETL (Spark) job ran successfully, but it failed after migrating to serverless. Glue registers schemas dynamically; how does SUPER automatically register its schema?

cc @bsharifi

vannguyende commented 1 month ago

I have the same issue. Is there any solution?

bsharifi commented 1 month ago

@vnktsh You can check the public docs for examples on how to use the SUPER data type with the connector:
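
For instance, here is a minimal sketch of what I mean (the schema is abbreviated and the connection options are placeholders): instead of writing the document as a plain string, parse it into a typed column with from_json, since the connector maps Spark StructType/ArrayType/MapType columns to a Redshift SUPER column on write.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    // Abbreviated schema for the telemetry document; extend it with the
    // remaining nested fields.
    val telemetrySchema = StructType(Seq(
      StructField("created", StringType),
      StructField("deviceSerial", StringType),
      StructField("deviceId", StringType),
      StructField("bundle", ArrayType(StructType(Seq(
        StructField("category", StringType),
        StructField("created", StringType)
      ))))
    ))

    // df is the single-column DataFrame of JSON strings from the report above.
    val typed = df.select(from_json(col("telemetry"), telemetrySchema).as("telemetry"))

    typed.write
      .format("io.github.spark_redshift_community.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
      .option("dbtable", "public.telemetry_data")
      .option("tempdir", "s3a://<bucket>/tmp/")
      .option("aws_iam_role", "arn:aws:iam::<account>:role/<role>")
      .mode(SaveMode.Append)
      .save()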