twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0

[Python] dataframe summarize and summarizeCycles possible malfunction #13

Open laldonza opened 7 years ago

laldonza commented 7 years ago

I have a dataset whose time index is daily, and I want to sum all the values for the same day (same index). I can do this kind of aggregation with the summarize method, but it doesn't work when I specify 'time' (my Flint time index) as the key.

Using: flintdf.summarize(summarizers.sum('values'), key='time') or flintdf.summarizeCycles(summarizers.sum('values'), key='time')

gives me this error:

Py4JJavaError: An error occurred while calling o449.summarize. : com.twosigma.flint.timeseries.row.DuplicateColumnsException: Found duplicate columns List(time) in schema...

I also tried summarizeCycles without a key, but it seems to sum the time column too and gives me a single total over absolutely everything:

flintdf.summarizeCycles(summarizers.count())

returns something like this:

|time               |count|
|9223372036854775807| 3030|

I think this could be another malfunction, because there are many different timestamps in my dataset and the documentation says:

"Computes aggregate statistics of rows that share a timestamp."
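To make the expected behaviour concrete, here is a minimal plain-Python sketch of "aggregate statistics of rows that share a timestamp". It does not use Flint at all; the toy rows and the helper name `summarize_by_time` are hypothetical and only illustrate the result one would expect: one output row per distinct timestamp, not a single row for the whole dataset.

```python
from collections import OrderedDict

# Hypothetical toy rows (time, value); several rows share a timestamp.
rows = [
    ("2017-05-01", 1.0),
    ("2017-05-01", 2.0),
    ("2017-05-02", 3.0),
    ("2017-05-03", 4.0),
    ("2017-05-03", 5.0),
]


def summarize_by_time(rows):
    """Return an ordered mapping: time -> (row count, sum of values)."""
    result = OrderedDict()
    for time, value in rows:
        count, total = result.get(time, (0, 0.0))
        result[time] = (count + 1, total + value)
    return result


per_day = summarize_by_time(rows)
for time, (count, total) in per_day.items():
    # One line per distinct timestamp, in first-seen order.
    print(time, count, total)
```

With three distinct timestamps in the input, this produces three output rows, which is the shape the quoted documentation suggests.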

The last thing I tried was using another date field, which lets me call summarize and summarizeCycles with date as the key. However, summarize appears to delete the time index, setting all of its values to 0, and the rows of the resulting dataframe are unsorted. summarizeCycles with a key returns the same dataframe, except that it keeps only the first row for each timestamp (rows with a repeated index are dropped) and uses the same index for every row, that is, the sum of all the times, in my case 9223372036854775807.
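One detail that may help narrow this down: the time value reported here, 9223372036854775807, is exactly 2^63 - 1, the largest value a signed 64-bit integer (a Java or Scala Long) can hold. The short check below only confirms that arithmetic. Whether the value comes from an overflowing sum, a saturated value, or a sentinel inside Flint is not established in this thread.

```python
# Largest value representable by a signed 64-bit integer (Java/Scala Long).
INT64_MAX = 2 ** 63 - 1

# Time value reported by summarizeCycles in this issue.
reported_time = 9223372036854775807

is_int64_max = reported_time == INT64_MAX
print(is_int64_max)
```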

Python version: 3.5

icexelloss commented 7 years ago

"I have a dataset which time index is daily, and I want to sum all the values for the same day"

Can you try something like this?

from ts.flint import clocks
from ts.flint import summarizers

flintdf.summarizeIntervals(clocks.uniform(sqlContext, frequency='1day'), summarizers.count())
icexelloss commented 7 years ago

Also, does

flintdf.summarizeCycles(summarizers.count())

return only one row with the count of all the data?

laldonza commented 7 years ago

Yes, both return only one row, and both give the same result.

Could you reproduce the error or is it my fault?

icexelloss commented 7 years ago

I cannot see your data, so it's hard to tell.

Can you provide a reproducible example with code and a data file?

laldonza commented 7 years ago

OK, I will do it this weekend or on Monday.

laldonza commented 7 years ago

I can't provide the original data, but here is a toy example. I have uploaded it as plain text because GitHub doesn't support CSV files or Python scripts as attachments: datacsv.txt flintTestpy.txt

I think the problem is with the type of the time index, because the example you use has a long type. The Flint dataframe accepts datetime fine, but maybe the summarize method needs the long type.

laldonza commented 7 years ago

I tried converting the time column from datetime to long in between, but it still doesn't work properly. I change the type like this:

df['time']=df['time'].astype(np.long)
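For reference, a minimal stdlib-only sketch of turning a datetime into an integer number of nanoseconds since the Unix epoch. It assumes, without confirmation from this thread, that a long-typed time column is interpreted as epoch nanoseconds. The helper name `to_epoch_nanos` is hypothetical and is not part of Flint, and the example date is arbitrary.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_nanos(dt):
    """Convert a timezone-aware datetime to integer nanoseconds since the epoch.

    Integer arithmetic is used throughout to avoid float rounding.
    """
    delta = dt - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * 10 ** 9 + delta.microseconds * 1000


# Arbitrary example: midnight UTC on one day of a daily time index.
nanos = to_epoch_nanos(datetime(2017, 5, 12, tzinfo=timezone.utc))
print(nanos)
```

Printing a few converted values next to the originals is a quick way to check whether a conversion step produced the unit and magnitude that was intended.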