googleapis / java-spanner-jdbc

Apache License 2.0

Bigquery 2.9.16: Using this connector in Spark is resulting in all values in the spark dataframe being the column names #1245

Open dannnnthemannnn opened 1 year ago

dannnnthemannnn commented 1 year ago


Environment details

  1. OS type and version: Mac 12.0
  2. Java version: 20.0.1
  3. version(s):

Steps to reproduce

  1. Hook this connector up in a Spark job with the following code:

    def querySpanner[T](sqlQuery: String)(implicit spark: SparkSession, enc: org.apache.spark.sql.Encoder[T]): Dataset[T] = {
      // The s-interpolator is needed here; with a plain string literal,
      // $jsonKeyFilePath would never be substituted.
      val url = s"jdbc:cloudspanner:/projects/your-project-id/instances/your-instance-id/databases/your-database-id?credentials=$jsonKeyFilePath"

      // Read data using Spark
      val df = spark.read
        .format("jdbc")
        .option("url", url)
        .option("dbtable", "myTable")
        .option("driver", "com.google.cloud.spanner.jdbc.JdbcDriver")
        .load()

      // Convert DataFrame to Dataset
      df.as[T]
    }
  2. See that it is returning the values as if every row is the column name:

    +----+---+
    |name| id|
    +----+---+
    |name| id|
    |name| id|
    |name| id|
    |name| id|
    (… 16 more identical rows …)
    +----+---+
    only showing top 20 rows

My query is: "SELECT name, id FROM OrgInfoV2"

Any additional information below

It seems similar to this issue: https://stackoverflow.com/questions/66983401/spark-mariadb-jdbc-sql-query-returns-column-names-instead-of-column-values or this one: https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name

where the root cause also appeared to be in the driver.


Thanks!

olavloite commented 1 year ago

@dannnnthemannnn

I'm pretty sure that this is the same as for example https://github.com/sparklyr/sparklyr/issues/3196

The problem is that Spark seems to generate a query that looks like this:

select "name", "id"
from OrgInfoV2

Double quotes delimit string literals in Cloud Spanner (and BigQuery), but Spark treats them as a valid way to quote column names, in case any column name contains spaces or is equal to a reserved keyword.

A possible workaround is to explicitly use the MySQL dialect for the connection, since MySQL uses the same style of identifier quoting (backticks) that Cloud Spanner accepts. See https://github.com/apache/spark/blob/071feabbd4325504332679dfa620bc5ee4359370/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala#L108
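As a sketch of that workaround, Spark allows registering a custom `JdbcDialect` whose `canHandle` matches the `jdbc:cloudspanner:` URL prefix and whose `quoteIdentifier` uses backticks instead of Spark's default double quotes. The object name `SpannerDialect` below is illustrative, not part of any library:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative dialect: quote identifiers with backticks (which Cloud
// Spanner's GoogleSQL accepts for column names) instead of Spark's
// default double quotes, which Spanner parses as string literals.
object SpannerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cloudspanner:")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

// Register the dialect before calling spark.read:
JdbcDialects.registerDialect(SpannerDialect)
```

With the dialect registered, the generated query should become `` select `name`, `id` from OrgInfoV2 ``, which Spanner resolves as column references rather than string literals.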