samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

Don't clear the entire ENV when running EnvHacker #19

Closed vijaykramesh closed 7 years ago

vijaykramesh commented 7 years ago

This fixes https://github.com/samelamin/spark-bigquery/issues/16

There were two issues:

First, EnvHacker was calling map.clear(), which wiped the SPARK_YARN_MODE ENV var that Spark uses to detect that it is running in YARN mode.

Second, because the ENV vars were converted to strings (instead of the internal Variable type), subsequent calls to sys.env.get would break, e.g. in the Spark UI.

This PR fixes both, following the basic idea in this Stack Overflow answer. I've verified that the library now works in both YARN mode and non-YARN mode (the former running on Qubole, the latter running Spark locally via Docker).
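
To illustrate the first issue, here is a minimal sketch of the difference between clearing the map and merging into it. It is not the actual EnvHacker code: it uses a plain mutable map as a stand-in for the JVM's internal environment map, and the names EnvMergeSketch, freshEnv, setEnvClearing, setEnvMerging, and EXAMPLE_VAR are hypothetical.

```scala
import scala.collection.mutable

object EnvMergeSketch {
  // Hypothetical stand-in for the JVM's internal environment map.
  // The real fix works on the map obtained via reflection; this is a plain map.
  def freshEnv(): mutable.Map[String, String] =
    mutable.Map("SPARK_YARN_MODE" -> "true", "PATH" -> "/usr/bin")

  // Destructive variant: wipes every existing variable before adding new ones.
  def setEnvClearing(env: mutable.Map[String, String], extra: Map[String, String]): Unit = {
    env.clear()
    env ++= extra
  }

  // Additive variant: only adds or overwrites the supplied keys.
  def setEnvMerging(env: mutable.Map[String, String], extra: Map[String, String]): Unit =
    env ++= extra

  def main(args: Array[String]): Unit = {
    val extra = Map("EXAMPLE_VAR" -> "1") // hypothetical variable name

    val cleared = freshEnv()
    setEnvClearing(cleared, extra)

    val merged = freshEnv()
    setEnvMerging(merged, extra)

    println(cleared.contains("SPARK_YARN_MODE")) // false: the variable was lost
    println(merged.contains("SPARK_YARN_MODE"))  // true: the variable survived
  }
}
```

The merging variant leaves variables it was not asked to change untouched, which is why SPARK_YARN_MODE survives in this sketch.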

Tests also pass locally for me:

vramesh@DB-MBP-SMG8WL:~/oss/spark-bigquery (vijay/fix_yarn)$ sbt test
[info] Loading project definition from /Users/vramesh/oss/spark-bigquery/project
[info] Set current project to spark-bigquery (in build file:/Users/vramesh/oss/spark-bigquery/)
[warn] Credentials file /Users/vramesh/.ivy2/.sbtcredentials does not exist
[info] Compiling 1 Scala source to /Users/vramesh/oss/spark-bigquery/target/scala-2.11/classes...
17/05/09 11:49:08 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/05/09 11:49:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/09 11:49:10 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
[info] BigQueryClientSpecs:
[info] Scenario: When writing to BQ
[info] Scenario: When reading from BQ
17/05/09 11:49:13 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/05/09 11:49:13 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
[info] BigQuerySchemaSpecs:
[info] Feature: Schema Converters. Dataframe To BQ Schema
[info]   Scenario: When converting a simple dataframe
[info]     Given A dataframe
[info]     When Passing the schema to the converter
[info]     Then We should receive a BQ Table Schema
[info]   Scenario: When converting a complex dataframe with nested data
[info]     Given A dataframe
[info]     When Passing the schema to the converter
[info]     Then We should receive a BQ Table Schema
[info]   Scenario: When converting from more obscure types
[info]     Given A dataframe
[info]     When Passing the schema to the converter
[info]     Then We should receive a BQ Table Schema
[info] ScalaCheck
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] ScalaTest
[info] Run completed in 5 seconds, 647 milliseconds.
[info] Total number of tests run: 5
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 5, Failed 0, Errors 0, Passed 5
[success] Total time: 11 s, completed May 9, 2017 11:49:14 AM

samelamin commented 7 years ago

Thanks a lot @vijaykramesh, incidentally that is exactly where I got the EnvHacker code from.

samelamin commented 7 years ago

I will deploy this later this evening

vijaykramesh commented 7 years ago

I somewhat guessed that's where you pulled the EnvHacker stuff from after looking through that thread on Stack Overflow.

Also, in other small world/big data news, I've been following the Airflow testing discussions that you've been leading. Thanks for trying to corral people around that!

Thanks again!

samelamin commented 7 years ago

You are absolutely welcome. We def need some sort of industry standard, and this is our humble attempt at creating one. I added you to the email thread, so please get involved!