stringertheory / traces

A Python library for unevenly-spaced time series analysis
http://traces.readthedocs.io
MIT License

Event Based Time Series #229

Open nsteins opened 4 years ago

nsteins commented 4 years ago

I'm proposing a new Traces class, EventSeries, for handling data that is a series of timestamps marking the occurrence of discrete events. An example is this collection of 311 requests in Chicago, where each record is a request with one timestamp for when it was opened and another for when it was closed. This fits Traces because it is another kind of unevenly-spaced time series, and it can use traces.TimeSeries for certain calculations.

An example of how the API might look:

import pandas as pd

# EventSeries is the proposed class; it does not exist in traces yet
df = pd.read_csv('311_Service_Requests.csv', nrows=10000)
creation = EventSeries(df['CREATED_DATE'].dropna())
completion = EventSeries(df['CLOSED_DATE'].dropna())

An EventSeries could tell you the number of events that occurred between two arbitrary timestamps:

>>> creation.events_between(pd.Timestamp('2018-01-01'),pd.Timestamp('2019-02-01'))
6681
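As a rough sketch of how such a range count could work, assuming the timestamps are kept sorted: the class name EventSeriesSketch and the inclusive bounds are assumptions made for illustration, not part of traces or of the proposed implementation.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


class EventSeriesSketch:
    """Hypothetical stand-in for the proposed EventSeries; not part of traces."""

    def __init__(self, timestamps):
        # Keep the timestamps sorted so range counts can use binary search.
        self._times = sorted(timestamps)

    def events_between(self, start, end):
        # Count events with start <= t <= end (inclusive bounds are an assumption).
        return bisect_right(self._times, end) - bisect_left(self._times, start)


creation = EventSeriesSketch([
    datetime(2018, 1, 5),
    datetime(2018, 3, 1),
    datetime(2019, 1, 15),
    datetime(2019, 6, 1),
])
count = creation.events_between(datetime(2018, 1, 1), datetime(2019, 2, 1))
print(count)  # 3
```

With sorted timestamps, each query costs two binary searches rather than a scan over every event.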

EventSeries would also have a cumulative sum function, which returns a TimeSeries of the cumulative number of events that have occurred since the first record:

>>> ts = creation.cumsum()
>>> ts.plot()

image
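One way the cumulative count could be computed is sketched below in plain Python. The function name cumulative_counts and the choice to collapse events that share a timestamp into a single step are assumptions for illustration.

```python
from datetime import datetime


def cumulative_counts(timestamps):
    """Return sorted (time, count) pairs; count is the number of events at or before time."""
    points = []
    for t in sorted(timestamps):
        if points and points[-1][0] == t:
            # Events sharing a timestamp collapse into a single, larger step.
            points[-1] = (t, points[-1][1] + 1)
        else:
            previous = points[-1][1] if points else 0
            points.append((t, previous + 1))
    return points


events = [
    datetime(2019, 1, 3),
    datetime(2019, 1, 1),
    datetime(2019, 1, 3),
    datetime(2019, 1, 7),
]
steps = cumulative_counts(events)
for t, count in steps:
    print(t.date(), count)
```

Each resulting pair could then be assigned into a traces.TimeSeries created with default=0, using the same ts[t] = count item assignment that appears elsewhere in this thread.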

For events that have an "open" and a "close" timestamp, EventSeries can calculate the number of active (open) cases over time:

>>> diff = EventSeries.count_active(creation, completion)
>>> diff.plot()

image
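The active-case count can be thought of as a running sum of +1 for every open and -1 for every close. A minimal sketch under that assumption follows; this count_active is a hypothetical plain-Python function, not the proposed method itself.

```python
from datetime import datetime


def count_active(opened, closed):
    """Return sorted (time, active) pairs: cases opened so far minus cases closed so far."""
    deltas = {}
    for t in opened:
        deltas[t] = deltas.get(t, 0) + 1
    for t in closed:
        deltas[t] = deltas.get(t, 0) - 1

    active = 0
    result = []
    for t in sorted(deltas):
        # Apply the net change at this timestamp to the running total.
        active += deltas[t]
        result.append((t, active))
    return result


opened = [datetime(2019, 7, 1), datetime(2019, 7, 2), datetime(2019, 7, 5)]
closed = [datetime(2019, 7, 3), datetime(2019, 7, 5)]
active = count_active(opened, closed)
for t, n in active:
    print(t.date(), n)
```

Accumulating the deltas in a dictionary first means an open and a close at the same timestamp cancel out instead of overwriting each other.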

Finally, EventSeries can calculate the inter-event arrival times and create visualizations for analysis

>>> after = creation.time_lag(how='after')
>>> creation.plot_time_lag(how='after')

image
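A sketch of the inter-event calculation follows. The meaning of how is not spelled out above, so this assumes 'after' pairs each event with the wait until the next event and 'before' pairs each event with the time since the previous one; both the function and that interpretation are illustrative guesses.

```python
from datetime import datetime, timedelta


def time_lag(timestamps, how="after"):
    """Return (event_time, gap) pairs for consecutive events."""
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if how == "after":
        # Each event except the last is paired with the wait until the next event.
        return list(zip(times[:-1], gaps))
    if how == "before":
        # Each event except the first is paired with the time since the previous event.
        return list(zip(times[1:], gaps))
    raise ValueError("how must be 'after' or 'before'")


events = [
    datetime(2019, 7, 1, 10, 0),
    datetime(2019, 7, 1, 10, 5),
    datetime(2019, 7, 1, 10, 20),
]
after = time_lag(events, how="after")
before = time_lag(events, how="before")
print([gap for _, gap in after])
```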

I am already working on implementing this, but I would appreciate feedback and suggestions on the API or features. I'm particularly interested in whether this can be extended to support the use case outlined in this issue: https://github.com/datascopeanalytics/traces/issues/227

johnhaire89 commented 4 years ago

This looks very useful, although I wonder if EventSeries could just be a special case of TimeSeries. Using your example, each service request might be represented as a TimeSeries with two points.

service_call_event = traces.TimeSeries(default=0)
service_call_event[pd.Timestamp('2019-07-17 11:56:40')] = 1
service_call_event[pd.Timestamp('2019-07-30 13:14:54')] = 0

Suppose you have all the service calls in a list named service_call_list, where each event is a TimeSeries with two points. Then your cumsum function might be the same as a merge operation:

active_events = traces.TimeSeries.merge(service_call_list, operation=sum)

All that said, I guess that this way of processing the data would be far less efficient than your method.

I have a device that flashes according to a timetable. It reports a "commencement" event when it starts flashing and a "cessation" event when it stops. I'm looking into a way to represent the state on a timeline by creating a TimeSeries for that state and adding a value of 1 for each commencement and a value of 0 for each cessation. I'm also trying to represent the device's timetable as a time series for the desired state, with a value of 1 for when it should start flashing and 0 for when it should stop flashing. With this method I can use an XOR operation to generate a plottable time series of all the times that the desired state didn't equal the actual state.

I like your time_lag function because I want to work out the total amount of time that my actual flashing state didn't match the desired state. However, now that I have a TimeSeries where y=1 for any time that the actual state didn't match the desired state, maybe that calculation can be performed by an existing operation as well. @devs, does Histogram.total() calculate the area under the curve?
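Independently of what the library offers, the total mismatch time can be sketched in plain Python by treating both states as 0/1 step functions and summing the intervals where they differ. The helper names value_at and mismatch_duration, and the sample timetable, are made up for illustration.

```python
from datetime import datetime, timedelta


def value_at(points, t, default=0):
    """Value at time t of a step function given as sorted (time, value) pairs."""
    value = default
    for time, v in points:
        if time > t:
            break
        value = v
    return value


def mismatch_duration(actual, desired, start, end):
    """Total time in [start, end) during which the two step functions differ."""
    inner = {t for t, _ in actual + desired if start < t < end}
    cuts = sorted(inner | {start, end})
    total = timedelta(0)
    for left, right in zip(cuts, cuts[1:]):
        # Both functions are constant on [left, right), so compare them at left.
        if value_at(actual, left) != value_at(desired, left):
            total += right - left
    return total


day = datetime(2019, 7, 1)
desired = [(day.replace(hour=8), 1), (day.replace(hour=9), 0)]
actual = [(day.replace(hour=8, minute=10), 1), (day.replace(hour=9, minute=5), 0)]
total = mismatch_duration(actual, desired, day.replace(hour=7), day.replace(hour=10))
print(total)  # 0:15:00
```

Here the device starts 10 minutes late and stops 5 minutes late, so the two states disagree for 15 minutes in total.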

nsteins commented 4 years ago

You are correct that you could represent this as a TimeSeries, and in fact, that was my first approach to modeling this kind of data. It's just slow because traces.TimeSeries.merge iterates through the entire SortedDict on every insertion.

johnhaire89 commented 4 years ago

Ah. Understood.

I feel like event_series is just a list of events, rather than something that fits into the library.

A faster way to build a timeseries could be

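ts = traces.TimeSeries(default=0)
# Note: events that share a timestamp overwrite each other here
for created in df['CREATED_DATE'].dropna():
    ts[created] = 1    # +1 when a request is opened
for closed in df['CLOSED_DATE'].dropna():
    ts[closed] = -1    # -1 when a request is closed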

A cumulative sum function could be an awesome addition to the API:

cumsum_trace = traces.TimeSeries(default=0)
cumsum = 0
for k, v in ts.items():
    cumsum += v                # running total of the +1/-1 values
    cumsum_trace[k] = cumsum

As for feature requests, it could be cool if there was a function get_events(self, start_signal, end_signal) that returned a list of "events". Given (key, value) pairs in a time series, each event will have a start (key when value == start_signal) and an end (key when value == end_signal).

nsteins commented 4 years ago

I think that EventSeries fits in with Traces because it tries to follow a similar design and API to TimeSeries. There are obviously many ways to accomplish this, but I often found myself frustrated trying to accomplish this with pure pandas, and unable to do a lot of the things I wanted to with TimeSeries.

The main difference is that TimeSeries is designed around a model of an irregularly sampled continuous signal. I'm not sure what physical quantity a cumulative sum function would correspond to for a general TimeSeries.

Could you explain the get_events(self, start_signal, end_signal) request a bit more?

johnhaire89 commented 4 years ago

I think it could be nice to have a function that transforms a time series into a list of periods (each with a start and end time, or a start time and duration) based on the values. You can then answer questions like "provide a list of periods where a light was switched on" or, using the shopping cart example from the docs, "provide a list of periods where the user had apples in their cart". start_signal and end_signal could be functions so that it works on non-numeric traces.
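A minimal sketch of that idea follows, written as a standalone function over time-ordered (time, value) pairs rather than as a TimeSeries method. The signature, and the choice to report a period that never ends as (start, None), are assumptions.

```python
def get_events(items, start_signal, end_signal):
    """Return (start, end) periods from time-ordered (time, value) pairs.

    start_signal and end_signal are predicates applied to each value.
    """
    periods = []
    start = None
    for t, value in items:
        if start is None and start_signal(value):
            # A new period opens at the first matching start value.
            start = t
        elif start is not None and end_signal(value):
            periods.append((start, t))
            start = None
    if start is not None:
        # The last period is still open when the data ends.
        periods.append((start, None))
    return periods


light = [(1, "on"), (4, "off"), (6, "on"), (7, "on"), (9, "off"), (12, "on")]
periods = get_events(
    light,
    start_signal=lambda v: v == "on",
    end_signal=lambda v: v == "off",
)
print(periods)  # [(1, 4), (6, 9), (12, None)]
```

Because the signals are predicates, the same function would work for the shopping cart case, for example with a start signal that checks whether "apples" is in the cart contents.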

ThomDietrich commented 3 years ago

Hey @nsteins, coming here from #227. Are you working on this? The feedback was short, but I think this would be a great addition to the library, as an EventSeries equally falls into the task traces tries to solve: handling time series. Having these two main classes makes EventSeries quite logical. @stringertheory came to the same conclusion in #227.

Is there any timeline for this, or are there questions you still want to discuss? I guess that would be most easily managed in a preliminary PR.