@Keerthi9711 Apologies for the delayed response. When the COPY fails, can you please check whether the manifest.json file is missing from the temporary S3 bucket? Are there any other errors or exceptions in the driver node logs? You can also check the logs on the other nodes in the cluster for exceptions or errors.
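One way to run the manifest check suggested above is a small script rather than browsing the bucket by hand. This is only a sketch: `folders_missing_manifest` and `list_keys` are hypothetical helper names, it assumes each write attempt puts its .avro files and manifest.json in the same folder of the temp location, and `list_keys` assumes boto3 is installed with credentials that can list the bucket.

```python
import posixpath
from collections import defaultdict


def folders_missing_manifest(keys):
    """Return folders that hold .avro files but no manifest.json.

    Assumption: the .avro parts and manifest.json of one write attempt
    share the same parent folder.
    """
    by_folder = defaultdict(set)
    for key in keys:
        by_folder[posixpath.dirname(key)].add(posixpath.basename(key))
    return sorted(
        folder
        for folder, names in by_folder.items()
        if any(name.endswith(".avro") for name in names)
        and "manifest.json" not in names
    )


def list_keys(bucket, prefix=""):
    """Yield every object key under the given bucket/prefix."""
    import boto3  # assumption: boto3 is available and credentials are configured

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

For example, `folders_missing_manifest(list_keys("<your-temp-bucket>"))` would list the folders worth looking at, with the bucket name replaced by the one configured as the Redshift temp location.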
Environment setup: AWS EMR Serverless 6.9.0. PySpark ETL job with multiple streaming queries. Each streaming query writes to an Iceberg table and a Redshift table in micro-batches, with a trigger interval of 60 seconds.
Issue: In the Redshift temp S3 location, I see the folders with .avro files, but the manifest.json files are not being created, so the data is not being copied to Redshift.
In the EMR driver logs (stderr), I see the following sequence only a few times:
24/05/01 17:41:58 INFO RedshiftWriter: Loading new Redshift data to: <<<<<>>>>>>>>>
24/05/01 17:41:58 INFO RedshiftWriter: CREATE TABLE IF NOT EXISTS <<<<>>>>>>>>
24/05/01 17:41:58 INFO RedshiftWriter: COPY <> FROM 's3://redshift-consumption-alldomains-cache/d7c1f59e-a637-472d-b7c0-254f72e97539/manifest.json' CREDENTIALS 'aws_iam_role=arn:aws:iam::12345678:role/AmazonRedshiftAllCommandsFullAccess' FORMAT AS AVRO 'auto' manifest
But most of the time my logs show only:
24/05/01 17:46:23 INFO RedshiftWriter: Loading new Redshift data to: <<<<<>>>>>>>>>
24/05/01 17:46:23 INFO RedshiftWriter: CREATE TABLE IF NOT EXISTS <<<<>>>>>>>>
In these cases the COPY command does not follow, so I'm assuming the manifest.json file is not being created.
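To see how often this happens, the driver log can be scanned for "Loading new Redshift data" lines that are never followed by a COPY line. The sketch below is a rough aid, not part of the connector: `find_writes_without_copy` is a hypothetical helper, and it assumes the RedshiftWriter lines of one write attempt appear in order. With several streaming queries logging at once the lines may interleave, so the results are only an approximation.

```python
import re

# Start of a write attempt, and the COPY that should follow it
LOAD_RE = re.compile(r"INFO RedshiftWriter: Loading new Redshift data to:")
COPY_RE = re.compile(r"INFO RedshiftWriter: COPY ")
# Timestamp at the start of a driver log line, e.g. "24/05/01 17:46:23"
TS_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def find_writes_without_copy(log_lines):
    """Return timestamps of load lines with no COPY before the next load line."""
    missing = []
    pending = None  # timestamp of a load line still waiting for its COPY
    for line in log_lines:
        if LOAD_RE.search(line):
            if pending is not None:
                # previous attempt never reached COPY
                missing.append(pending)
            match = TS_RE.match(line)
            pending = match.group(1) if match else line.strip()
        elif COPY_RE.search(line) and pending is not None:
            pending = None
    if pending is not None:
        missing.append(pending)
    return missing
```

Running it over the stderr file, for example `find_writes_without_copy(open("stderr"))` with the path adjusted to wherever the log was downloaded, gives the timestamps of the micro-batches to compare against the folders in the temp S3 location.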
CODE SNIPPET: