samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0
70 stars 28 forks source link

writing to bigquery from dataproc #81

Closed mikerlt closed 4 years ago

mikerlt commented 4 years ago

when trying to write to bigquery like so from pyspark in dataproc: `bigquery = spark._sc._jvm.com.samelamin.spark.bigquery

Prepare the bigquery context

bq = bigquery.BigQuerySQLContext(spark._wrapped._jsqlContext) bq.setBigQueryProjectId(BQ_PROJECT_ID) bq.setGSProjectId(BQ_PROJECT_ID) bq.setBigQueryGcsBucket(STAGING_BUCKET) bq.setBigQueryDatasetLocation(DATASET_LOCATION)

Extract and Transform a dataframe

df = session.read.csv(...)

Load into a table or table partition

bqDF = bigquery.BigQueryDataFrame(df_master._jdf) bqDF.saveAsBigQueryTable( "{0}:{1}.{2}".format(BQ_PROJECT_ID, DATASET_ID, TABLE_NAME), False, # Day paritioned when created 0, # Partition expired when created bigquery.getattr("package$WriteDisposition$").getattr("MODULE$").WRITE_EMPTY(), bigquery.getattr("package$CreateDisposition$").getattr("MODULE$").CREATE_IF_NEEDED(), )` I am getting the following exception:

`Ivy Default Cache set to: /root/.ivy2/cache The jars for the packages stored in: /root/.ivy2/jars :: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml com.github.samelamin#spark-bigquery_2.11 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-c234c14e-60d0-482e-a149-6a46d2a4c43f;1.0 confs: [default] found com.github.samelamin#spark-bigquery_2.11;0.2.6 in central found com.databricks#spark-avro_2.11;4.0.0 in central found org.slf4j#slf4j-api;1.7.5 in central found org.apache.avro#avro;1.7.6 in central found org.codehaus.jackson#jackson-core-asl;1.9.13 in central found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central found com.thoughtworks.paranamer#paranamer;2.3 in central found org.xerial.snappy#snappy-java;1.0.5 in central found org.apache.commons#commons-compress;1.4.1 in central found org.tukaani#xz;1.0 in central found com.google.cloud.bigdataoss#bigquery-connector;0.13.4-hadoop2 in central found com.google.api-client#google-api-client-java6;1.24.1 in central found com.google.api-client#google-api-client;1.24.1 in central found com.google.oauth-client#google-oauth-client;1.24.1 in central found com.google.http-client#google-http-client;1.24.1 in central found com.google.code.findbugs#jsr305;3.0.2 in central found org.apache.httpcomponents#httpclient;4.0.1 in central found org.apache.httpcomponents#httpcore;4.0.1 in central found commons-logging#commons-logging;1.1.1 in central found commons-codec#commons-codec;1.6 in central found com.google.http-client#google-http-client-jackson2;1.24.1 in central found com.fasterxml.jackson.core#jackson-core;2.9.2 in central found com.google.guava#guava;26.0-jre in central found org.checkerframework#checker-qual;2.5.2 in central found com.google.errorprone#error_prone_annotations;2.1.3 in central found com.google.j2objc#j2objc-annotations;1.1 in central found org.codehaus.mojo#animal-sniffer-annotations;1.14 in central found com.google.oauth-client#google-oauth-client-java6;1.24.1 in central found com.google.api-client#google-api-client-jackson2;1.24.1 in central found com.google.apis#google-api-services-storage;v1-rev135-1.24.1 in central found com.google.apis#google-api-services-bigquery;v2-rev398-1.24.1 in central found com.google.cloud.bigdataoss#util;1.9.4 in central found com.google.auto.value#auto-value-annotations;1.6.2 in central found com.google.cloud.bigdataoss#util-hadoop;1.9.4-hadoop2 in central found com.google.cloud.bigdataoss#gcs-connector;1.9.4-hadoop2 in central found com.google.cloud.bigdataoss#gcsio;1.9.4 in central found joda-time#joda-time;2.9.3 in central :: resolution report :: resolve 1060ms :: artifacts dl 27ms :: modules in use: com.databricks#spark-avro_2.11;4.0.0 from central in [default] com.fasterxml.jackson.core#jackson-core;2.9.2 from central in [default] com.github.samelamin#spark-bigquery_2.11;0.2.6 from central in [default] com.google.api-client#google-api-client;1.24.1 from central in [default] com.google.api-client#google-api-client-jackson2;1.24.1 from central in [default] com.google.api-client#google-api-client-java6;1.24.1 from central in [default] com.google.apis#google-api-services-bigquery;v2-rev398-1.24.1 from central in [default] com.google.apis#google-api-services-storage;v1-rev135-1.24.1 from central in [default] com.google.auto.value#auto-value-annotations;1.6.2 from central in [default] com.google.cloud.bigdataoss#bigquery-connector;0.13.4-hadoop2 from central in [default] com.google.cloud.bigdataoss#gcs-connector;1.9.4-hadoop2 from central in [default] com.google.cloud.bigdataoss#gcsio;1.9.4 from central in [default] com.google.cloud.bigdataoss#util;1.9.4 from central in [default] com.google.cloud.bigdataoss#util-hadoop;1.9.4-hadoop2 from central in [default] com.google.code.findbugs#jsr305;3.0.2 from central in [default] com.google.errorprone#error_prone_annotations;2.1.3 from central in [default] com.google.guava#guava;26.0-jre from central in [default] com.google.http-client#google-http-client;1.24.1 from central in [default] com.google.http-client#google-http-client-jackson2;1.24.1 from central in [default] com.google.j2objc#j2objc-annotations;1.1 from central in [default] com.google.oauth-client#google-oauth-client;1.24.1 from central in [default] com.google.oauth-client#google-oauth-client-java6;1.24.1 from central in [default] com.thoughtworks.paranamer#paranamer;2.3 from central in [default] commons-codec#commons-codec;1.6 from central in [default] commons-logging#commons-logging;1.1.1 from central in [default] joda-time#joda-time;2.9.3 from central in [default] org.apache.avro#avro;1.7.6 from central in [default] org.apache.commons#commons-compress;1.4.1 from central in [default] org.apache.httpcomponents#httpclient;4.0.1 from central in [default] org.apache.httpcomponents#httpcore;4.0.1 from central in [default] org.checkerframework#checker-qual;2.5.2 from central in [default] org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default] org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default] org.codehaus.mojo#animal-sniffer-annotations;1.14 from central in [default] org.slf4j#slf4j-api;1.7.5 from central in [default] org.tukaani#xz;1.0 from central in [default] org.xerial.snappy#snappy-java;1.0.5 from central in [default] :: evicted modules: org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]

|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   38  |   0   |   0   |   1   ||   37  |   0   |
---------------------------------------------------------------------

:: retrieving :: org.apache.spark#spark-submit-parent-c234c14e-60d0-482e-a149-6a46d2a4c43f confs: [default] 0 artifacts copied, 37 already retrieved (0kB/19ms) 19/09/17 16:14:30 INFO org.spark_project.jetty.util.log: Logging initialized @4647ms 19/09/17 16:14:30 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown 19/09/17 16:14:31 INFO org.spark_project.jetty.server.Server: Started @4753ms 19/09/17 16:14:31 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 19/09/17 16:14:31 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2242b679{HTTP/1.1,[http/1.1]}{0.0.0.0:4041} 19/09/17 16:14:31 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration. 19/09/17 16:14:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at sidm-cluster-02-m/10.128.0.2:8032 19/09/17 16:14:31 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at sidm-cluster-02-m/10.128.0.2:10200 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.github.samelamin_spark-bigquery_2.11-0.2.6.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/joda-time_joda-time-2.9.3.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-java6-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-jackson2-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.guava_guava-26.0-jre.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.oauth-client_google-oauth-client-java6-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_util-1.9.4.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.http-client_google-http-client-1.24.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.httpcomponents_httpclient-4.0.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.httpcomponents_httpcore-4.0.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/commons-codec_commons-codec-1.6.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.9.2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.checkerframework_checker-qual-2.5.2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.errorprone_error_prone_annotations-2.1.3.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.j2objc_j2objc-annotations-1.1.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.mojo_animal-sniffer-annotations-1.14.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.auto.value_auto-value-annotations-1.6.2.jar added multiple times to distributed cache. 19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_gcsio-1.9.4.jar added multiple times to distributed cache. 19/09/17 16:14:35 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1568730559207_0014 19/09/17 16:16:10 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf. Traceback (most recent call last): File "/tmp/5573a94d3db144558111c14b4dba7f38/ingest_bq.py", line 49, in bigquery.getattr("package$CreateDisposition$").getattr("MODULE$").CREATE_IF_NEEDED(), File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o68.saveAsBigQueryTable. : java.lang.NoSuchMethodError: com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(Ljava/lang/String;)Lcom/google/api/services/bigquery/model/TableReference; at com.samelamin.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:40) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748)

19/09/17 16:16:11 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@2242b679{HTTP/1.1,[http/1.1]}{0.0.0.0:4041} Job output is complete`

samelamin commented 4 years ago

Sounds like you are missing some libraries, try ensuring the bigdata-interop libraries are added to the class path

Using scala libs from python is always a pain im afraid