spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Temporary table creation fails when location contains "-" #72

Open yo4taka opened 5 years ago

yo4taka commented 5 years ago

I would like to fix the logic that generates datasetId in BigQueryClient#stagingTable:

    val datasetId = prefix + location.toLowerCase

Would the following change be acceptable? (It is written in Java notation.)

    String datasetId = prefix + location.toLowerCase (). ReplaceAll ("[^ a-z0-9 ] +", "")

To explain the situation: I am considering using this connector to transfer data from BigQuery to an application running on Dataproc.

The data location we use is "asia-northeast1", a string that contains "-".

As a result, the staging dataset ID is built with that "-" in it, and creating the temporary table fails, as the following log shows.

{"loglevel": "INFO", "time": "2019-06-13 11:21:46.591", "appname": "job-executor", "function": "com.spotify.spark.bigquery.BigQueryClient.stagingDataset:148", "message": "Creating staging dataset repx-dev-jp-fiot-mgr:spark_bigquery_staging_asia-northeast1"}
java.util.concurrent.ExecutionException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code": 400,
  "errors": [{
    "domain": "global",
    "message": "Invalid dataset ID \"spark_bigquery_staging_asia-northeast1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
    "reason": "invalid"
  }],
  "message": "Invalid dataset ID \"spark_bigquery_staging_asia-northeast1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
  "status": "INVALID_ARGUMENT"
}
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
    at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
    at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:142)
    at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2373)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2337)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2295)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2208)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4053)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4057)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986)
    at com.spotify.spark.bigquery.BigQueryClient.query(BigQueryClient.scala:105)
    at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)

yo4taka commented 5 years ago

There was a mistake in the fix I proposed. The correct version is below.

String datasetId = prefix + location.toLowerCase().replaceAll("[^a-z0-9]+", "");
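
As a rough check of the proposal, the sketch below applies the same replaceAll to the location from the log and tests the result against the rule quoted in the error message (alphanumeric plus underscores, at most 1024 characters). The class name StagingDatasetIdSketch, both method names, and the validation pattern are hypothetical helpers written for this illustration, not part of spark-bigquery; the prefix value is taken from the log output, and Locale.ROOT is an added assumption so that lowercasing does not depend on the JVM's default locale.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration only; not code from spark-bigquery.
final class StagingDatasetIdSketch {

    // Rule as quoted in the 400 response: alphanumeric plus underscores, at most 1024 characters.
    private static final Pattern VALID_DATASET_ID = Pattern.compile("[A-Za-z0-9_]{1,1024}");

    // Builds the dataset ID the way the proposed fix does: lowercase the location,
    // then drop every run of characters that is not a lowercase letter or digit.
    static String stagingDatasetId(String prefix, String location) {
        return prefix + location.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "");
    }

    // Returns true when the whole ID satisfies the rule from the error message.
    static boolean isValidDatasetId(String datasetId) {
        return VALID_DATASET_ID.matcher(datasetId).matches();
    }

    public static void main(String[] args) {
        String prefix = "spark_bigquery_staging_"; // prefix as seen in the log
        String location = "asia-northeast1";

        String before = prefix + location.toLowerCase(Locale.ROOT); // current behaviour
        String after = stagingDatasetId(prefix, location);          // proposed behaviour

        System.out.println(before + " valid=" + isValidDatasetId(before));
        System.out.println(after + " valid=" + isValidDatasetId(after));
    }
}
```

With this change "asia-northeast1" becomes "asianortheast1". One trade-off to be aware of: because the characters are removed rather than replaced, two locations that differ only in punctuation would map to the same staging dataset ID.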