twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0
993 stars, 184 forks

"timeColumn" option not respected in a "read.dataframe" call #56

Open LeoDashTM opened 5 years ago

LeoDashTM commented 5 years ago

@icexelloss hi there!

I'm glad the issues are being (pro)actively monitored and attended to; I wasn't expecting that.

Here is one issue I'm facing. It's not a big one, but it is inconvenient:

print( sc.version )
print( tm )

n = df.filter( df['Container'] == 'dbc94d4e3af6' ).select( tm, 'MemPercentG', 'CpuPercentG' )
n.show( truncate = False )
n.printSchema()

from ts.flint import FlintContext, clocks
from ts.flint import utils

fc = FlintContext( sqlContext )

r = fc.read \
    .option('isSorted', False) \
    .option('timeUnit', 's') \
    .option('timeColumn', tm) \
    .dataframe( n )

The output is:

2.3.1
TimeStamp
+-------------------+---------------+------------+
|TimeStamp          |MemPercentG    |CpuPercentG |
+-------------------+---------------+------------+
|2018-08-01 05:55:35|0.0030517578125|0.002331024 |
|2018-08-01 05:58:05|0.0030517578125|0.0031538776|
|2018-08-01 05:59:05|0.0030517578125|0.0030176123|
+-------------------+---------------+------------+

root
 |-- TimeStamp: timestamp (nullable = true)
 |-- MemPercentG: double (nullable = true)
 |-- CpuPercentG: float (nullable = true)

IllegalArgumentException: 'Field "time" does not exist.\nAvailable fields: TimeStamp, MemPercentG, CpuPercentG'
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<command-911439891027714> in <module>()
     14 fc = FlintContext( sqlContext )
     15 
---> 16 r = fc.read     .option('isSorted', False)     .option('timeUnit', 's')     .option('timeColumn', tm)     .dataframe( n )
     17 
     18 

/databricks/python/lib/python3.5/site-packages/ts/flint/readwriter.py in dataframe(self, df, begin, end, timezone, is_sorted, time_column, unit)
    362             time_column=time_column,
    363             is_sorted=is_sorted,
--> 364             unit=self._parameters.timeUnitString())
    365 
    366     def parquet(self, *paths):

/databricks/python/lib/python3.5/site-packages/ts/flint/dataframe.py in _from_df(df, time_column, is_sorted, unit)
    248                                    time_column=time_column,
    249                                    is_sorted=is_sorted,
--> 250                                    unit=unit)
    251 
    252     @staticmethod

/databricks/python/lib/python3.5/site-packages/ts/flint/dataframe.py in __init__(self, df, sql_ctx, time_column, is_sorted, unit, tsrdd_part_info)
    133         # throw exception
    134         if time_column in df.columns:
--> 135             self._jdf = self._jpkg.TimeSeriesRDD.canonizeTime(self._jdf, self._junit)
    136 
    137         if tsrdd_part_info:

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'Field "time" does not exist.\nAvailable fields: TimeStamp, MemPercentG, CpuPercentG'

Is this reproducible for you?

Please advise whether I'm calling it incorrectly or whether it's a bug.

I installed the Flint libraries (both the Scala and the Python ones) on Databricks via its UI, from the respective online repos, which might be dated. I can try installing the latest builds from the freshest source code if you think that will help.

Thanks.

icexelloss commented 5 years ago

I suspect that is a bug. Please rename the time column to "time" for the time being (pun intended).
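
A minimal sketch of that workaround, assuming the same `fc` (FlintContext) and a Spark DataFrame like `n` from the report above. The helper name `read_with_time_column` is hypothetical and not part of Flint; it only combines PySpark's `DataFrame.withColumnRenamed` with the reader calls already shown in the issue, and drops the `timeColumn` option because the column is already named "time". Whether Flint then handles a timestamp-typed column with `timeUnit` set to 's' has not been verified here.

```python
def read_with_time_column(fc, df, time_column, unit="s", is_sorted=False):
    """Hypothetical helper (not part of Flint).

    Renames `time_column` to "time" before handing the DataFrame to the
    Flint reader, so the reader never has to honor the 'timeColumn' option.
    """
    if time_column != "time":
        # Refuse to clobber an existing "time" column.
        if "time" in df.columns:
            raise ValueError('DataFrame already has a column named "time"')
        # withColumnRenamed returns a new DataFrame; the input is unchanged.
        df = df.withColumnRenamed(time_column, "time")

    # Same reader calls as in the report, minus the 'timeColumn' option.
    return (
        fc.read
        .option("isSorted", is_sorted)
        .option("timeUnit", unit)
        .dataframe(df)
    )

# Usage with the names from the report (fc, n and tm must already exist):
#   r = read_with_time_column(fc, n, tm)
```

The rename happens on the Spark side only, so the original DataFrame keeps its `TimeStamp` column; rename it back on the result if downstream code expects the old name.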