saurfang / spark-sas7bdat

Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
http://spark-packages.org/package/saurfang/spark-sas7bdat
Apache License 2.0
89 stars 40 forks source link

Incorrect date/timestamp conversions #37

Closed nelson2005 closed 5 years ago

nelson2005 commented 6 years ago

I'm running into this issue

where the dates all have one day subtracted. I'm using sas7bdat 2.0.0-s_2.11

I didn't see any other mention of this issue in the github issues list. Is anyone else running into this?

For se https://eilianyu.wordpress.com/2016/10/27/be-aware-of-hidden-data-errors-using-spark-sas7bdat-pacakge-to-ingest-sas-datasets-to-spark/

nelson2005 commented 6 years ago

Putting this on hold while I test more

witwall commented 6 years ago

@nelson2005 I just noticed this issue too. xy.sas7bdat.zip

right date is 2018-03-03

but got 2018-03-02

nelson2005 commented 6 years ago

Okay, this does appear to be really happening as described in the top post. To get the correct dates I've added a spark UDF like

spark.udf.register(Tags.date2timestamp_p1day, (date: java.sql.Date) => Try(java.sql.Timestamp.valueOf(date.toLocalDate.atStartOfDay().plusDays(1))).toOption

witwall commented 6 years ago

after double check, I am afraid it is not a bug, but because of the timezone issue of you server.

nelson2005 commented 6 years ago

@witwall can you elaborate? Both the SAS server where the sas7bdat was created and the spark server are in the same timezone

witwall commented 6 years ago

if in the same timezone, get correct answer without more configure. if different timezone, need setup timezone of spark to get the correct answer.

PhoenixDai commented 5 years ago

@nelson2005 With hint from witwall, I've figured out the following working configs:

Hope this can help you.

PhoenixDai commented 5 years ago

I also tested with loading data with timezone=UTC set and save the data as parquet format. Then load the parquet file without setting timezone. The date didn't change back. Hope this would help those who worry about setting timezone=UTC may mess up timestamp in other data sources.

nelson2005 commented 5 years ago

@PhoenixDai thanks, that seems to work.

kylebarron commented 5 years ago

I think this should be added to the README because it's very easy to get tripped up by this and not realize.

nelson2005 commented 5 years ago

It is easy to get tripped up on this (and might be worth a mention README), but I think it's fundamentally a Spark ecosystem issue, nothing specific to this project. People (including me) get tripped up all the time with this, for example here because Hive applies the server timezone to timestamps.

kylebarron commented 5 years ago

Interesting. I'm a relative newcomer to Spark, so this is the first time I'd gotten tripped up by it.

Tagar commented 5 years ago

There is a very active timestamp incompatibilities discussion on Spark dev list.

Here's a good summary, details and suggestions by the Spark development community https://goo.gl/VV88c5

In short, the following is being proposed:

This proposal is in accordance with the SQL standard and many major DB engines.

If stars will be aligned properly, this will be part of Spark 3.0 release..

ss23697 commented 4 years ago

Hi,

Can someone please help me, Im facing a similar issue. I tried setting the timezone to UTC but it is not working in spark yarn mode. It is only working in local mode.

Can someone please provide a snippet for how to get the correct date by setting timezone in pyspark yarn mode.

Thanks Subhankar

nelson2005 commented 4 years ago

Do these options work for you?

--driver-java-options="-Droot.logger=ERROR,console -Duser.timezone=UTC" --conf "spark.yarn.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"

mrugesh1989 commented 2 years ago

I am having similar issue for date conversion where date is off by 2 days for the date year > 4000. I think that issue is tackled in parso latest version 2.14 but the Saurfang latest version is using parso 2.11. Does anyone know how to make parso 2.14 version work with saurfang 3.0 version or 2.10 version?