htm-community / river-view

Public Temporal Streaming Data Service Framework
http://data.numenta.org/
MIT License

[Q] Size of the River data for reasonable HTM learning #149

Open breznak opened 8 years ago

breznak commented 8 years ago

I'd like to ask about the purpose of this project: are users expected to run the service locally and collect the data themselves, or should data.numenta.org be a source of "temporal datasets"? If the latter, I see a problem with the sizes of the collected data you are keeping around. Are these limited on your end? For example, the Airnow streams contain only about 60 data points, nowhere near an amount suitable for any reasonable learning and HTM predictions.
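For reference, one quick way to check how many rows a given stream actually holds is to query the hosted instance directly. The sketch below is not from the repo; the `<river>/<stream>/data.json` route, the `limit` parameter, and the `headers`/`data` response layout are assumptions about the public API, and the stream id is a placeholder.

```python
# Rough sketch (not River View code): count how many rows a stream currently
# holds. The URL pattern "<base>/<river>/<stream>/data.json", the "limit"
# query parameter, and the "data" key in the response are assumptions about
# the public API -- adjust to whatever your instance actually serves.
import json
import urllib.request

BASE = "http://data.numenta.org"   # hosted showcase instance
RIVER = "airnow"                   # river discussed in this issue
STREAM = "some-stream-id"          # hypothetical stream id

url = "{}/{}/{}/data.json?limit=100000".format(BASE, RIVER, STREAM)
with urllib.request.urlopen(url) as response:
    payload = json.load(response)

rows = payload.get("data", [])
print("{} data points in {}/{}".format(len(rows), RIVER, STREAM))
```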

rhyolight commented 8 years ago

I wanted to make sure people could host their own River View instance with whatever data they wanted. The instance we have running at data.numenta.org is kind of a showcase and proof of concept. Many of the rivers (like airnow) are not producing enough data over time. One of the problems with finding data is knowing how often the streams update. Sometimes it is hard to tell beforehand. In these cases I've created the river anyway just to see how much data gets populated over time. In the case of airnow, this river turned out to be too sparse. However, there are some airnow streams that produce much more data, like for Concord, CA. So is it worth throwing away the whole river?

All the data in River View is transient. It will all expire and disappear after a certain timeframe. Each River defines this timeframe in its configuration. For example, the btcc config:

# When should your collected data expire? This means that River View will store
# a time-boxed window of data. Data outside of this timebox will be flushed.
expires: 6 months

So data does not stick around forever. The default expiration period is, I believe, 6 months.
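To see why a time-boxed window can still leave a stream too sparse for HTM, a rough back-of-the-envelope calculation relates the default 6-month window to a stream's update interval. The intervals below are illustrative guesses, not measured values for any real river.

```python
# Back-of-the-envelope sketch (not part of River View): how many data points a
# time-boxed expiration window yields for a given update interval. A 6-month
# window is roughly 4320 hours; the update intervals are illustrative only.
HOURS_IN_SIX_MONTHS = 6 * 30 * 24  # ~4320 hours in the default window

for name, update_interval_hours in [
    ("updates every 10 minutes", 1 / 6),
    ("updates hourly", 1),
    ("updates daily", 24),
    ("updates roughly weekly (sparse, airnow-like)", 24 * 7),
]:
    points = int(HOURS_IN_SIX_MONTHS / update_interval_hours)
    print("{:45s} -> ~{} points retained".format(name, points))
```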

breznak commented 8 years ago

Thanks Matt,

In the case of airnow, this river turned out to be too sparse. However, there are some airnow streams that produce much more data, like for Concord, CA. So is it worth throwing away the whole river?

Definitely not. a) I didn't know the other streams were more active. b) It's still useful; I like the service as inspiration for what kinds of interesting data exist. c) Maybe you could collect some historical data and prepend it to the data you are collecting live.

So data does not stick around forever. The default expiration period is, I believe, 6 months.

Given that you don't know how frequently the streams update, it might be better to limit the maximum number of data points instead (say, 100k?).
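A minimal sketch of what that count-based cap could look like, assuming an in-memory buffer rather than whatever River View's actual data store is; the `MAX_POINTS` value and the `store` helper are hypothetical, not anything from the project.

```python
# Hypothetical sketch of the count-based retention suggested above, as an
# alternative to the time-based "expires" setting. Nothing here is River View
# code; it just illustrates "keep the newest MAX_POINTS rows" with a deque.
from collections import deque

MAX_POINTS = 100_000  # the "say 100k?" cap from the comment above

# A bounded deque silently drops the oldest row once the cap is reached,
# so retention depends on volume, not on how frequently the stream updates.
retained = deque(maxlen=MAX_POINTS)

def store(timestamp, values):
    """Append one collected row; the oldest rows fall off automatically."""
    retained.append((timestamp, values))

# Example: a sparse stream keeps everything it ever collected (up to the cap),
# while a busy stream always holds its most recent 100k rows.
for i in range(250_000):
    store(i, {"value": i % 7})
print(len(retained))  # -> 100000
```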