Closed nelson2005 closed 5 years ago
Putting this on hold while I test more
@nelson2005 I just noticed this issue too. xy.sas7bdat.zip
right date is 2018-03-03
but got 2018-03-02
Okay, this does appear to be really happening as described in the top post. To get the correct dates I've added a Spark UDF like
import scala.util.Try
spark.udf.register(Tags.date2timestamp_p1day, (date: java.sql.Date) => Try(java.sql.Timestamp.valueOf(date.toLocalDate.atStartOfDay().plusDays(1))).toOption)
After double-checking, I am afraid it is not a bug, but rather a timezone issue on your server.
@witwall can you elaborate? Both the SAS server where the sas7bdat was created and the spark server are in the same timezone
If both are in the same timezone, you get the correct answer without further configuration. If they are in different timezones, you need to set Spark's timezone to get the correct answer.
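The mechanism behind this can be illustrated outside Spark: SAS dates end up converted through an epoch timestamp, and the JVM renders that instant in its default timezone, so a server behind UTC sees the previous calendar date. A minimal Python sketch of the effect (the UTC-5 offset is just an example, not the reporter's actual timezone):

```python
from datetime import datetime, timedelta, timezone

# Midnight 2018-03-03 UTC expressed as an epoch timestamp -- roughly how a
# SAS date looks once converted to a timestamp internally.
epoch = datetime(2018, 3, 3, tzinfo=timezone.utc).timestamp()

# Rendered in UTC, the calendar date is preserved.
utc_date = datetime.fromtimestamp(epoch, tz=timezone.utc).date()

# Rendered in a timezone behind UTC (here UTC-5), the same instant falls on
# the previous day -- the "off by one day" symptom from this thread.
est_date = datetime.fromtimestamp(epoch, tz=timezone(timedelta(hours=-5))).date()

print(utc_date)  # 2018-03-03
print(est_date)  # 2018-03-02
```

This is why forcing -Duser.timezone=UTC on the JVM makes the dates come out right.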
@nelson2005 With hint from witwall, I've figured out the following working configs:
spark-shell --master=local[*] --driver-java-options="-Duser.timezone=UTC"
I believe spark-submit would work this way too (didn't test):
.set('spark.driver.extraJavaOptions','-Duser.timezone=UTC').set('spark.executor.extraJavaOptions','-Duser.timezone=UTC')
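For PySpark, those .set calls would sit on a SparkConf roughly like this (an untested sketch; note that spark.driver.extraJavaOptions only takes effect if it is set before the driver JVM starts, so in client mode it may need to go on the spark-submit command line instead):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Force the JVM default timezone to UTC on driver and executors so that
# epoch-based dates from sas7bdat files are not shifted by a day.
conf = (
    SparkConf()
    .set("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
    .set("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```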
Hope this can help you.
I also tested loading data with timezone=UTC set and saving it in Parquet format, then loading the Parquet file without setting the timezone. The dates didn't change back. Hope this helps anyone worried that setting timezone=UTC might mess up timestamps from other data sources.
@PhoenixDai thanks, that seems to work.
I think this should be added to the README because it's very easy to get tripped up by this and not realize.
It is easy to get tripped up by this (and it might be worth a mention in the README), but I think it's fundamentally a Spark ecosystem issue, nothing specific to this project. People (including me) get tripped up by this all the time, for example here, because Hive applies the server timezone to timestamps.
Interesting. I'm a relative newcomer to Spark, so this is the first time I'd gotten tripped up by it.
There is a very active timestamp incompatibilities discussion on Spark dev list.
Here's a good summary, details and suggestions by the Spark development community https://goo.gl/VV88c5
In short, the following is being proposed:
This proposal is in accordance with the SQL standard and many major DB engines.
If the stars align properly, this will be part of the Spark 3.0 release.
Hi,
Can someone please help me? I'm facing a similar issue. I tried setting the timezone to UTC, but it only works in local mode, not in Spark YARN mode.
Can someone please provide a snippet showing how to get the correct date by setting the timezone in PySpark YARN mode?
Thanks, Subhankar
Do these options work for you?
--driver-java-options="-Droot.logger=ERROR,console -Duser.timezone=UTC" --conf "spark.yarn.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"
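If the command-line flags are inconvenient, an alternative (untested sketch, assuming YARN mode) is to set the same options on the session builder. The executor option works from the builder; the driver option only takes effect in cluster mode, because in client mode the driver JVM is already running by the time the config is read:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    # Executor JVMs are launched after this config is set, so this works.
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
    # Driver option applies in cluster mode only; use --driver-java-options
    # on the command line for client mode.
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
    .getOrCreate()
)
```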
I am having a similar issue with date conversion, where the date is off by 2 days for dates with year > 4000. I think that issue is fixed in the latest parso version, 2.14, but the latest saurfang version uses parso 2.11. Does anyone know how to make parso 2.14 work with the saurfang 3.0 or 2.10 versions?
I'm running into this issue
where the dates all have one day subtracted. I'm using sas7bdat 2.0.0-s_2.11
I didn't see any other mention of this issue in the GitHub issues list. Is anyone else running into this?
For reference, see https://eilianyu.wordpress.com/2016/10/27/be-aware-of-hidden-data-errors-using-spark-sas7bdat-pacakge-to-ingest-sas-datasets-to-spark/