twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0

Python Clocks Not returning proper intervals. #58

Open ydesai-exos opened 5 years ago

ydesai-exos commented 5 years ago

@icexelloss

The clocks.uniform function in Flint's Python API is returning incorrect intervals.

The time intervals are far larger than the frequency I pass to the function.

For example:

from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1s", offset="0ns")
clock.show()

returns

time:timestamp
+-------------------+
|               time|
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:16:40|
|1970-01-01 00:33:20|
|1970-01-01 00:50:00|
|1970-01-01 01:06:40|
|1970-01-01 01:23:20|
|1970-01-01 01:40:00|
|1970-01-01 01:56:40|
|1970-01-01 02:13:20|
|1970-01-01 02:30:00|
|1970-01-01 02:46:40|
|1970-01-01 03:03:20|
|1970-01-01 03:20:00|
|1970-01-01 03:36:40|
|1970-01-01 03:53:20|
|1970-01-01 04:10:00|
|1970-01-01 04:26:40|
|1970-01-01 04:43:20|
|1970-01-01 05:00:00|
|1970-01-01 05:16:40|
+-------------------+
only showing top 20 rows

It should return 1-second intervals but instead returns intervals of 16 minutes 40 seconds (1000 seconds).

Similarly, a frequency of 1 day returns intervals of about 2 years and 9 months (1000 days).

from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1d", offset="0ns")
clock.show()
+-------------------+
|               time|
+-------------------+
|1970-01-01 00:00:00|
|1972-09-27 00:00:00|
|1975-06-24 00:00:00|
|1978-03-20 00:00:00|
|1980-12-14 00:00:00|
|1983-09-10 00:00:00|
|1986-06-06 00:00:00|
|1989-03-02 00:00:00|
|1991-11-27 00:00:00|
|1994-08-23 00:00:00|
|1997-05-19 00:00:00|
|2000-02-13 00:00:00|
|2002-11-09 00:00:00|
|2005-08-05 00:00:00|
|2008-05-01 00:00:00|
|2011-01-26 00:00:00|
|2013-10-22 00:00:00|
|2016-07-18 00:00:00|
|2019-04-14 00:00:00|
|2022-01-08 00:00:00|
+-------------------+
only showing top 20 rows
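
The two outputs above can be checked with a little arithmetic. The following is a minimal sketch that uses only the first two rows of each table as printed in this thread; it does not call Flint, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Requested step and the first two rows shown for frequency="1s".
requested_1s = timedelta(seconds=1)
observed_1s = datetime(1970, 1, 1, 0, 16, 40) - datetime(1970, 1, 1, 0, 0, 0)

# Requested step and the first two rows shown for frequency="1d".
requested_1d = timedelta(days=1)
observed_1d = datetime(1972, 9, 27) - datetime(1970, 1, 1)

# Dividing one timedelta by another yields a plain float ratio.
ratio_1s = observed_1s / requested_1s
ratio_1d = observed_1d / requested_1d

print(observed_1s, ratio_1s)  # 0:16:40 1000.0
print(observed_1d, ratio_1d)  # 1000 days, 0:00:00 1000.0
```

In both cases the observed step is exactly 1000 times the requested one, which points to a unit mismatch by a factor of 1000 rather than to an arbitrary error.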

Also, when I supply custom begin and end times, the returned years are far out of range.

from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1d", offset="0ns", begin_date_time="2014-04-23", end_date_time="2015-04-23")
clock.show()
time:timestamp
+--------------------+
|                time|
+--------------------+
|46277-07-20 00:00...|
|46280-04-15 00:00...|
|46283-01-10 00:00...|
|46285-10-06 00:00...|
|46288-07-02 00:00...|
|46291-03-29 00:00...|
|46293-12-23 00:00...|
|46296-09-18 00:00...|
|46299-06-15 00:00...|
|46302-03-12 00:00...|
|46304-12-06 00:00...|
|46307-09-02 00:00...|
|46310-05-29 00:00...|
|46313-02-22 00:00...|
|46315-11-19 00:00...|
|46318-08-15 00:00...|
|46321-05-11 00:00...|
|46324-02-05 00:00...|
|46326-11-01 00:00...|
|46329-07-28 00:00...|
+--------------------+
only showing top 20 rows
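
The out-of-range years look consistent with the begin time being inflated by the same factor of 1000 as the step. This is a rough sketch under two assumptions that the thread does not confirm: that begin_date_time="2014-04-23" is interpreted as midnight UTC, and that the epoch offset is scaled by exactly 1000. The year is estimated with the mean Gregorian year length because Python's datetime cannot represent years beyond 9999.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.2425  # mean Gregorian year, used only for an estimate

# Assumption: the begin date is taken as midnight UTC.
begin = datetime(2014, 4, 23, tzinfo=timezone.utc)
epoch_seconds = int(begin.timestamp())

# Assumption: the epoch offset is inflated by the same factor of 1000.
scaled_days = epoch_seconds * 1000 / SECONDS_PER_DAY
approx_year = 1970 + scaled_days / DAYS_PER_YEAR

print(epoch_seconds)     # 1398211200
print(int(approx_year))  # 46277
```

The estimate lands in year 46277, matching the first row of the output above.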

5mdd commented 5 years ago

I have exactly the same problem. It seems that the clock mixes up seconds and milliseconds. A workaround ("hack") is to use ms instead of s in the code. In your case, for a 1s interval:

from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1ms", offset="0ns")
clock.show()

or for a one-day interval:

from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="86400ms", offset="0ns")
clock.show()
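
The conversion used in this workaround can be wrapped in a small helper. This is only a sketch: workaround_frequency is a hypothetical name, not part of Flint, it assumes the inflation factor is exactly 1000, and it handles only the "s" and "d" units that appear in this thread.

```python
import re

# Milliseconds per unit, limited to the two units used in this thread.
_MS_PER_UNIT = {"s": 1_000, "d": 86_400_000}


def workaround_frequency(intended: str, factor: int = 1000) -> str:
    """Return the frequency string to pass so that a clock inflated by
    `factor` ends up with the intended step. Hypothetical helper."""
    match = re.fullmatch(r"(\d+)(s|d)", intended)
    if match is None:
        raise ValueError(f"unsupported frequency: {intended!r}")
    count, unit = int(match.group(1)), match.group(2)
    intended_ms = count * _MS_PER_UNIT[unit]
    if intended_ms % factor != 0:
        raise ValueError(f"{intended!r} is not divisible by {factor} in ms")
    return f"{intended_ms // factor}ms"


print(workaround_frequency("1s"))  # 1ms
print(workaround_frequency("1d"))  # 86400ms
```

The returned string would then be passed as the frequency argument in place of the intended value. Once the underlying bug is fixed, the helper must be removed, otherwise the intervals become 1000 times too small.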

LeoDashTM commented 5 years ago

Thanks for the workaround @5mdd - I'll try it out. But of course, TwoSigma just needs to fix this! @icexelloss ?

5mdd commented 5 years ago

I forgot to mention that I am using Databricks Runtime 5.2 ML (Spark 2.4.0/Scala 2.11) with the Databricks Flint jar: flint_0_6_0_databricks.jar