googleapis / java-spanner-jdbc

Apache License 2.0

Bigquery 2.9.16: Using this connector in Spark is resulting in all values in the spark dataframe being the column names #1245

Open dannnnthemannnn opened 1 year ago

dannnnthemannnn commented 1 year ago


Environment details

  1. OS type and version: Mac 12.0
  2. Java version: 20.0.1
  3. version(s):

Steps to reproduce

  1. Hook this connector up in a Spark job with the following code:

    def querySpanner[T](sqlQuery: String)(implicit spark: SparkSession, enc: org.apache.spark.sql.Encoder[T]): Dataset[T] = {
      // The s-interpolator is needed here; with a plain string literal,
      // $jsonKeyFilePath would never be substituted.
      val url = s"jdbc:cloudspanner:/projects/your-project-id/instances/your-instance-id/databases/your-database-id?credentials=$jsonKeyFilePath"

      // Read data using Spark
      val df = spark.read
        .format("jdbc")
        .option("url", url)
        .option("dbtable", "myTable")
        .option("driver", "com.google.cloud.spanner.jdbc.JdbcDriver")
        .load()

      // Convert DataFrame to Dataset
      df.as[T]
    }
  2. See that it is returning the values as if every row is the column name:

    +----+---+
    |name| id|
    +----+---+
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    (… 16 more identical rows …)
    +----+---+
    only showing top 20 rows

My query is: "SELECT name, id FROM OrgInfoV2"

Any additional information below

It seems similar to this issue: https://stackoverflow.com/questions/66983401/spark-mariadb-jdbc-sql-query-returns-column-names-instead-of-column-values or this one: https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name

where the root cause also appeared to be in the driver.


Thanks!

olavloite commented 1 year ago

@dannnnthemannnn

I'm pretty sure that this is the same as for example https://github.com/sparklyr/sparklyr/issues/3196

The problem is that Spark seems to generate a query that looks like this:

select "name", "id"
from OrgInfoV2

Double quotes delimit string literals in Cloud Spanner (and BigQuery), but Spark treats them as a valid way to quote column names, in case any column name contains spaces or is equal to a reserved keyword.

A possible workaround is to explicitly use the MySQL dialect for the connection, since MySQL uses the same style of identifier quoting (backticks) that Cloud Spanner accepts. See https://github.com/apache/spark/blob/071feabbd4325504332679dfa620bc5ee4359370/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala#L108
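As a sketch of that workaround, Spark allows registering a custom `JdbcDialect` whose `canHandle` matches the `jdbc:cloudspanner:` URL prefix and whose `quoteIdentifier` uses backticks instead of Spark's default double quotes. The object name `SpannerDialect` below is illustrative, not part of any library:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative dialect: quote identifiers with backticks (which Cloud
// Spanner's GoogleSQL accepts for column names) instead of Spark's
// default double quotes, which Spanner parses as string literals.
object SpannerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cloudspanner:")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

// Register the dialect before calling spark.read:
JdbcDialects.registerDialect(SpannerDialect)
```

With the dialect registered, the generated query should become `` select `name`, `id` from OrgInfoV2 ``, which Spanner resolves as column references rather than string literals.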