nytlabs / tick

"The Tick seems to have no memory of its existence before being The Tick, and indeed not much memory of anything."

How does Tick determine what is a "useful" time series? #2

Open durple opened 10 years ago

durple commented 10 years ago

So we could go one of the three ways here:

  1. Create a time series for everything: if a stream has attributes A, B, C, we create a time series for A, B, C, AB, AC, BC, ABC.
    • Pros: We have a time series for everything imaginable.
    • Cons: We have a time series for everything imaginable, including the useless ones, leading to a wasteland of time series data, e.g. a time series of unique identifiers that appear only once, or of identifiers that are constantly changing.
  2. Have a user pick and choose by some mechanism. So if I have users, locations and event_id, I can pick users over time and users over time broken down by locations.
    • Pros: User gets just the data he/she wants and it can be made available.
    • Cons: It sort of defeats the purpose of having tick; there is no experimentation in this problem.
  3. Lastly, have tick determine what is "useful" using some measure of the volume, dimensions of a stream and cardinality of the attributes themselves. The method of determination can be tweaked in various ways and experimented with. It may or may not always yield the right time series but can possibly be optimized over time to give better results.
    • Pros: less waste. No user selection
    • Cons: We don't know what the hell we are talking about here.
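To get a feel for how fast option 1 grows, here is a minimal Python sketch that enumerates every non-empty attribute combination. The function name `attribute_combinations` is made up for illustration and is not part of tick.

```python
from itertools import combinations


def attribute_combinations(attributes):
    """Return every non-empty combination of attributes, smallest first."""
    combos = []
    for size in range(1, len(attributes) + 1):
        combos.extend(combinations(attributes, size))
    return combos


combos = attribute_combinations(["A", "B", "C"])
for combo in combos:
    print("".join(combo))
```

For n attributes this yields 2**n - 1 combinations, so a stream with 10 attributes would already need 1023 families of time series, before counting the distinct values within each one.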

mikedewar commented 10 years ago

think this has different answers depending on the dimensionality of the series..

mikedewar commented 10 years ago

in 1D (like views on a page) you could make a case that volume is a good indicator of useful, or maybe variance? in >1D I bet covariance would be a good starting point.

A useless time series then is one that is always zero, or more generally always the same.
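A rough sketch of the 1D idea in Python, assuming a series is just a list of per-bucket counts. The names `variance_score` and `looks_useless` and the `min_variance` knob are hypothetical, not anything tick defines.

```python
from statistics import pvariance


def variance_score(counts):
    """Population variance of per-bucket counts; 0 means the series never changes."""
    return pvariance(counts)


def looks_useless(counts, min_variance=0.0):
    # min_variance is a hypothetical tuning knob: with the default of 0.0,
    # only series that are exactly constant are flagged.
    return variance_score(counts) <= min_variance


print(looks_useless([0, 0, 0, 0]))  # always zero
print(looks_useless([7, 7, 7]))     # always the same
print(looks_useless([1, 5, 2, 9]))  # actually moves around
```

Raising `min_variance` would also drop series that barely move, which is one of the places the "experimentation" could happen.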

mikedewar commented 10 years ago

hey also I bet there is a proper answer to this in terms of information content. Like a useful timeseries is one that is hard to compress, has high entropy etc. There's a lovely green book on my desk by MacKay that has opinions...
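One way to poke at the information-content idea is sketched below in Python, using only the standard library. It assumes a series is a list of small integer counts (0 to 255, so they fit in one byte each); `shannon_entropy` and `compression_ratio` are illustrative names. A constant series has zero entropy and compresses to almost nothing, which lines up with the "always the same" case.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(values):
    """Entropy, in bits, of the empirical distribution of the values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_ratio(values):
    """Compressed size divided by raw size; values must be ints in 0..255."""
    raw = bytes(values)
    return len(zlib.compress(raw)) / len(raw)


constant = [3] * 1000
rng = random.Random(0)  # fixed seed so the example is repeatable
noisy = [rng.randrange(256) for _ in range(1000)]

print(shannon_entropy(constant), compression_ratio(constant))
print(shannon_entropy(noisy), compression_ratio(noisy))
```

Note that pure noise scores highest on both measures, so entropy alone probably cannot be the whole answer: the series of unique identifiers from option 1 would look very "informative" by this yardstick.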

nikhan commented 10 years ago

I am more curious as to how this makes sense for a user

Whatever determination you use, it means that there will be a result for some queries and no result for others. And neither of those results would necessarily mean "because there was no data"

which is confusing to me.

durple commented 10 years ago

That is a very fat book!

durple commented 10 years ago

Whatever determination you use, it means that there will be a result for some queries and no result for others.

This is quite implementation specific, I think. We could implement tick such that a user knows what time series are being made available once it starts listening to the stream.

But you are right, if I have A, B & C and Tick determined that A, A & B are the only useful time series but the user was interested in B & C. I don't know how to handle that. It almost becomes a back to the drawing board problem to solve.

mikedewar commented 10 years ago

What about thinking of it more as a compression problem? If there is no information in the series given the other time series then you should be able to recreate the series using other series at query time...

Alternatively, a "no information" response to a query is an interesting thing for a db to respond with...
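A hedged sketch of what such a response could look like, in Python. Everything here (the `Status` values, `QueryResult`, and the `query` function with its `store` and `tracked` arguments) is invented for illustration; the point is only that "never tracked" and "tracked but empty" come back as different answers.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OK = "ok"
    NO_DATA = "no_data"          # series is tracked, but nothing was observed
    NOT_TRACKED = "not_tracked"  # series was judged not useful and never stored


@dataclass
class QueryResult:
    status: Status
    points: list = field(default_factory=list)


def query(store, tracked, keys):
    """Look up a series by its attribute keys, reporting why a result is empty."""
    keys = tuple(sorted(keys))
    if keys not in tracked:
        return QueryResult(Status.NOT_TRACKED)
    points = store.get(keys, [])
    return QueryResult(Status.OK if points else Status.NO_DATA, points)


# Hypothetical state: A and A&B are tracked, but only A has data so far.
tracked = {("A",), ("A", "B")}
store = {("A",): [(1, 3)]}

print(query(store, tracked, ["A"]).status)
print(query(store, tracked, ["B", "A"]).status)
print(query(store, tracked, ["B", "C"]).status)
```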

M

(Sent by email on Nov 4, 2014 at 1:21 PM, in reply to Deep Kapadia's comment above.)

durple commented 10 years ago

Still wrapping my head around thinking of it as a compression problem...just grabbed the green book

Alternatively, a "no information" response to a query is an interesting thing for a db to respond with...

But is it useful if I am looking for something very specific?

nikhan commented 10 years ago

Alternatively, a "no information" response to a query is an interesting thing for a db to respond with...

only if it can be explained simply

nikhan commented 10 years ago

If you have timeseries for each key, couldn't you create what A&B would be? why do you need a time series for groups?

nikhan commented 10 years ago

Oh right, intersection vs exclusive. oh well

nikhan commented 10 years ago

Can I have table "key" with row "co occurrence" by time?

durple commented 10 years ago

If you have timeseries for each key, couldn't you create what A&B would be? why do you need a time series for groups?

No. Consider for example the following stream:

{user: Deep, location: NYC, ts: 1}
{user: Deep, location: NJ, ts: 1}
{user: Nik, location: NYC, ts: 1}
{user: Nik, location: SFO, ts: 1}
{user: Mike, location: NYC, ts: 1}
{user: Deep, location: NYC, ts: 1}

Time series:

user
Deep -> (ts:1, count:3)
Nik -> (ts:1, count:2)
Mike -> (ts:1, count:1)

location
NYC -> (ts:1, count:4)
NJ -> (ts:1, count:1)
SFO -> (ts:1, count:1)

And if my question is "give me all the times Deep was in NYC", I can't answer it from the above time series. I can, however, answer it from:

user,location
Deep,NYC->(ts:1,count:2)
Deep,NJ->(ts:1,count:1)
Nik,NYC->(ts:1,count:1)
Nik,SFO->(ts:1,count:1)
Mike,NYC->(ts:1,count:1)
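The same example as a small Python sketch, in case anyone wants to check the counts without doing it by hand. `count_series` is a made-up helper, and `other_stream` is a second imaginary stream constructed to have identical per-user and per-location counts but a different answer for Deep in NYC.

```python
from collections import Counter

stream = [
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NJ", "ts": 1},
    {"user": "Nik", "location": "NYC", "ts": 1},
    {"user": "Nik", "location": "SFO", "ts": 1},
    {"user": "Mike", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NYC", "ts": 1},
]


def count_series(events, keys):
    """Count events per (key values..., ts) bucket."""
    return Counter(tuple(e[k] for k in keys) + (e["ts"],) for e in events)


by_user = count_series(stream, ["user"])
by_location = count_series(stream, ["location"])
by_user_location = count_series(stream, ["user", "location"])

# A different stream with the same single-key series as above.
other_stream = [
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Nik", "location": "NJ", "ts": 1},
    {"user": "Nik", "location": "SFO", "ts": 1},
    {"user": "Mike", "location": "NYC", "ts": 1},
]

same_users = count_series(other_stream, ["user"]) == by_user
same_locations = count_series(other_stream, ["location"]) == by_location
other_joint = count_series(other_stream, ["user", "location"])

print(same_users, same_locations)  # both single-key series match
print(by_user_location[("Deep", "NYC", 1)], other_joint[("Deep", "NYC", 1)])
```

Since two different streams produce the same `user` and `location` series but disagree on `user,location`, the grouped series cannot be reconstructed from the single-key ones.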

durple commented 10 years ago

Oh right, intersection vs exclusive. oh well

Great! I spent 5 minutes building time series by hand from a stream of imaginary JSON.

nikhan commented 10 years ago

sorry :anguished:

durple commented 10 years ago

Can I have table "key" with row "co occurrence" by time?

Not sure if I understand. Isn't that the same as having more than one column as a primary key? If so, it becomes the same as what I mentioned in the example

nikhan commented 10 years ago

what is wrong with that?

mikedewar commented 10 years ago

amen re: explaining no data