I am not sure whether this is the right forum for this question or whether it belongs on StackOverflow.
I am trying to read Snowplow TSV events into a Spark DataFrame, but the resulting DataFrame has only a _corrupt_record column. I think this happens because every string in my JSON RDD is wrapped in Right(...) instead of being plain JSON.
Please correct me if I am wrong.
scala> import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer
scala> val input = sc.textFile("events.tsv")
input: org.apache.spark.rdd.RDD[String] = events.tsv MapPartitionsRDD[1] at textFile at <console>:25
scala> val jsons = input.map (line => EventTransformer.transform(line)).filter(_.isRight).map(line => line.toString)
jsons: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26
scala> jsons.first()
res0: com.snowplowanalytics.snowplow.analytics.scalasdk.json.ValidatedEvent = Right({"contexts_com_google_analytics_cookies_1":[{"_ga":"GA1.2.929926098.1540998420"}],"contexts_com_snowplowanalytics_snowplow_web_page_1":[{"id":"6f904a0e-3408-47df-8d88-672a0adcc4aa"}],"contexts_org_w3_performance_timing_1":[{"navigationStart":1543333678447,"unloadEventStart":1543333680428,"unloadEventEnd":1543333680428,"redirectStart":0,"redirectEnd":0,"fetchStart":1543333678449,"domainLookupStart":1543333678468,"domainLookupEnd":1543333678845,"connectStart":1543333678845,"connectEnd":1543333679031,"secureConnectionStart":1543333678910,"requestStart":1543333679032,"responseStart":1543333680418,"responseEnd":1543333680420,"domLoading":1543333680427,"domInteractive":1543333681271,"domContentLoadedEventStart...
scala> val df = spark.read.json(jsons)
warning: there was one deprecation warning; re-run with -deprecation for details
2018-11-28 16:35:08 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
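A minimal sketch of what I suspect is going on, written so it runs without Spark or the SDK. It assumes that EventTransformer.transform returns a standard Scala Either whose Right side holds the JSON string, which is what the REPL output above suggests. fakeTransform and RightUnwrapSketch are hypothetical stand-ins of my own, not part of the Snowplow SDK. Calling toString on the Either keeps the Right(...) wrapper, while pattern matching on Right keeps only the payload.

```scala
object RightUnwrapSketch {
  // Hypothetical stand-in for EventTransformer.transform (assumption:
  // the real method returns an Either with the JSON string on the Right).
  def fakeTransform(line: String): Either[List[String], String] =
    if (line.nonEmpty) Right(s"""{"app_id":"$line"}""")
    else Left(List("empty line"))

  val lines: Seq[String] = Seq("web", "", "mobile")

  // Same shape as the pipeline above: toString on the Either itself,
  // so every element still carries the "Right(...)" wrapper.
  val wrapped: Seq[String] =
    lines.map(fakeTransform).filter(_.isRight).map(_.toString)

  // Pattern match on Right to drop failures and keep only the JSON payload.
  val unwrapped: Seq[String] =
    lines.map(fakeTransform).collect { case Right(json) => json }

  def main(args: Array[String]): Unit = {
    println(wrapped.head)   // wrapped form, not valid JSON
    println(unwrapped.head) // plain JSON string
  }
}
```

If that assumption holds, the equivalent change in the shell session would be input.map(line => EventTransformer.transform(line)).collect { case Right(json) => json }, using the RDD collect overload that takes a partial function, so that spark.read.json receives plain JSON strings.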