spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

Case-sensitive column identifiers #107

Open giacomochiarella opened 2 years ago

giacomochiarella commented 2 years ago

Hi everyone! I'm new to Redshift and I'm using pyspark to load tables. The spark-redshift jar works very well except for one thing: I'm loading tables that unfortunately have some columns with the same name but different case, like username and userName. Unfortunately I have no control over this, as they are external sources. I've tried this in Redshift itself and it can handle the case sensitivity: after executing SET enable_case_sensitive_identifier TO true; I was able to create the following table:

CREATE TABLE blabla.a (
  unapplied numeric(38, 18) NULL,
  "unApplied" numeric(38, 18) NULL
);

The RedshiftWriter does not seem to have any check for this here. It would be very useful if we could enable/disable this feature via an option.
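For reference, a minimal Scala sketch of the round trip being described (the JDBC URL, S3 tempdir, and table names are placeholders); the read side is fine, but on the write side the writer lower-cases identifiers, so unapplied and "unApplied" collapse to the same name:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("case-sensitive-columns").getOrCreate()

// Placeholders: JDBC URL, S3 tempdir, and table names are examples only.
val jdbcUrl = "jdbc:redshift://example.host:5439/dev?user=user&password=pass"

// Read a table whose schema contains both `unapplied` and `unApplied`.
val df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "blabla.a")
  .option("tempdir", "s3a://my-bucket/tmp/")
  .load()

// Writing it back is where case sensitivity is lost: the writer lower-cases
// column names, so `unapplied` and `unApplied` end up identical.
df.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "blabla.a_copy")
  .option("tempdir", "s3a://my-bucket/tmp/")
  .mode(SaveMode.Overwrite)
  .save()
```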

Do you know any workaround?

jsleight commented 2 years ago

From the link you provided, it seems like spark-redshift would need a PR to optionally skip the check you linked. If you're up for submitting a PR, I'm happy to help review and ship it 😄. The best approach is probably to add a new writer option to enable case-sensitive identifiers.
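A rough sketch of what that could look like on the Parameters.scala side; the option name, default, and accessor below are hypothetical and only illustrate the shape of the change, not the library's actual code:

```scala
// Hypothetical sketch only; this is not the library's actual Parameters.scala.
object ParametersSketch {

  // A new writer option with a conservative default, mirroring how other
  // boolean options are declared in the defaults map.
  val DEFAULT_PARAMETERS: Map[String, String] = Map(
    "enable_case_sensitive_identifier" -> "false" // hypothetical option name
  )

  case class MergedParameters(parameters: Map[String, String]) {
    // Accessor the writer could consult before lower-casing column names.
    def caseSensitiveIdentifiers: Boolean =
      parameters.getOrElse("enable_case_sensitive_identifier", "false").toBoolean
  }
}
```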

As a workaround, I think you could load the table into Redshift with different column names and then use AWS tooling (e.g., boto3, the aws cli, etc.) to alter the Redshift table afterwards (aws examples).
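For the rename step, any client that can run SQL against the cluster works; as one example here is a sketch in Scala over plain JDBC (rather than boto3 or the aws cli), assuming the Redshift JDBC driver is on the classpath and using placeholder connection details and column names:

```scala
import java.sql.DriverManager

// Placeholders: point these at your cluster and schema.
val url = "jdbc:redshift://example.host:5439/dev"
val conn = DriverManager.getConnection(url, "user", "password")
try {
  val stmt = conn.createStatement()
  // Allow mixed-case identifiers for this session.
  stmt.execute("SET enable_case_sensitive_identifier TO true;")
  // Suppose the second column was loaded under a throwaway name (`unapplied_2`);
  // rename it to the real, case-sensitive name afterwards.
  stmt.execute("""ALTER TABLE blabla.a RENAME COLUMN unapplied_2 TO "unApplied";""")
} finally {
  conn.close()
}
```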

giacobbino commented 2 years ago

Good to hear. What I did so far is remove the toLowerCase here to make things work. I can do the change; I have only one question: is it enough to add the parameter to Parameter.scala and pass it to the RedshiftWriter.unloadData function? Or do I need to do anything else to get the parameter passed from the Spark session into the MergedParameters input map?
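The intent is easiest to see as a conditional around the existing lower-casing; the sketch below is illustrative only and does not reproduce RedshiftWriter's actual code or signatures:

```scala
// Illustrative sketch of the behavioural change, not RedshiftWriter's real code.
object WriterSketch {

  // Quote an identifier and, unless case sensitivity is requested,
  // lower-case it the way the writer does today.
  def normalizeColumnName(name: String, caseSensitive: Boolean): String = {
    val normalized = if (caseSensitive) name else name.toLowerCase
    "\"" + normalized + "\""
  }

  def columnList(names: Seq[String], caseSensitive: Boolean): String =
    names.map(normalizeColumnName(_, caseSensitive)).mkString(", ")
}

// WriterSketch.columnList(Seq("unapplied", "unApplied"), caseSensitive = true)
//   -> "unapplied", "unApplied"   (both columns survive)
// WriterSketch.columnList(Seq("unapplied", "unApplied"), caseSensitive = false)
//   -> "unapplied", "unapplied"   (the collision described in this issue)
```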

jsleight commented 2 years ago

> Is it enough to add the parameter to Parameter.scala and pass it to RedshiftWriter.unloadData function?

I think that is enough. IIRC the params get passed in to the writer options in DefaultSource via subclassing one of the org.apache.spark.sql.sources interfaces.
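For context, the Spark hook in question is org.apache.spark.sql.sources.CreatableRelationProvider: whatever keys are set via .option(...) on the writer arrive in its parameters map, which is what gets merged into MergedParameters, so no extra plumbing from the Spark session should be needed. A simplified sketch (not the library's actual DefaultSource; the option name is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

// Simplified sketch: every .option(key, value) set on the DataFrameWriter
// shows up in `parameters` here, ready to be merged into MergedParameters.
class SketchSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // Hypothetical option name, read straight from the writer options.
    val caseSensitive =
      parameters.getOrElse("enable_case_sensitive_identifier", "false").toBoolean
    println(s"case-sensitive identifiers requested: $caseSensitive")
    // A real implementation would invoke the writer here with MergedParameters(parameters).
    val ctx = sqlContext
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = data.schema
    }
  }
}
```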

giacobbino commented 2 years ago

PR opened: #108