spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark

Exception More than Int.MaxValue elements #147

Open YuelongCai opened 10 months ago

YuelongCai commented 10 months ago

https://github.com/spark-redshift-community/spark-redshift/blob/20e7ccbb17b06be26dad9c5e894f92dee0845fa5/src/main/scala/io/github/spark_redshift_community/spark/redshift/RedshiftRelation.scala#L125

Caused by: java.lang.IllegalArgumentException: More than Int.MaxValue elements.
  at scala.collection.immutable.NumericRange$.check$1(NumericRange.scala:318)
  at scala.collection.immutable.NumericRange$.count(NumericRange.scala:328)
  at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:53)
  at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:52)
  at scala.collection.immutable.NumericRange.length(NumericRange.scala:55)
  at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:143)

YuelongCai commented 10 months ago

val countQuery = s"SELECT count(*) FROM $tableNameOrSubquery $whereClause"

If this count query returns a result greater than Int.MaxValue, using that number to construct an RDD[Row] causes the exception above. See the Spark source code at https://github.com/apache/spark/blob/a073bf38c7d8802e2ab12c54299e1541a48a394e/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L143: the call to nr.length is expected to return an Int, but the number of elements exceeds Int.MaxValue.
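As a minimal illustration of the underlying Spark limitation (the count value and session setup below are hypothetical, not the connector's actual code), the exception can be reproduced with a self-contained snippet, e.g. pasted into spark-shell:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical repro: `count` stands in for the value returned by the count(*) query.
val spark = SparkSession.builder().master("local[*]").appName("int-maxvalue-repro").getOrCreate()
val count: Long = Int.MaxValue.toLong + 1L  // more elements than an Int can index

// ParallelCollectionRDD.slice calls NumericRange.length, which must fit in an Int,
// so running an action on this RDD throws
// java.lang.IllegalArgumentException: More than Int.MaxValue elements.
val rdd = spark.sparkContext.parallelize(1L to count, 200)
rdd.count()
```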

YuelongCai commented 10 months ago

A possible solution:

If the total count N is too large, split the data into multiple RDDs, each with fewer than Int.MaxValue elements, and union them all to produce a single RDD of size N.

For example, sc.parallelize(1L to 1000000000L, 200).union(sc.parallelize(1L to 1000000000L, 200)).union(sc.parallelize(1L to 1000000000L, 200)).union(sc.parallelize(1L to 1000000000L, 200)) avoids the issue.
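A rough sketch of that chunking idea (the helper name parallelizeLongRange and the chunk size are hypothetical, for illustration only, not the connector's actual code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: build one logical RDD of `totalCount` indices out of
// chunks that each stay below Int.MaxValue elements, then union the chunks.
def parallelizeLongRange(sc: SparkContext, totalCount: Long, slicesPerChunk: Int = 200): RDD[Long] = {
  val chunkSize = Int.MaxValue.toLong - 1L            // keep every NumericRange under the Int limit
  val chunks = (0L until totalCount by chunkSize).map { start =>
    val end = math.min(start + chunkSize, totalCount) // exclusive upper bound for this chunk
    sc.parallelize(start until end, slicesPerChunk)
  }
  sc.union(chunks)                                    // still totalCount elements overall
}
```

Each index could then be mapped to whatever row shape the scan path needs (for example an empty Row), while every underlying range stays within the Int limit.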

bsharifi commented 9 months ago

Thank you @YuelongCai for reporting this limitation and for your suggestions on how to work around it. While we will consider addressing this in a future release, please feel free to submit a PR and we will be happy to review and merge it.