breznak opened 8 years ago
:+1: :100:
@rhyolight looking forward to getting back to this project as soon as I finish up some other responsibilities.
So, with the 2 standing PRs, the speed bottleneck should be somewhat resolved and internal support for streaming is in place.
What is left is a mechanism to monitor updates to the data file
(e.g. periodically check the size) and update (only) with the newly added chunk of data (ideally not polling, but an on-request/on-update mechanism; operating systems handle this kind of notification well). I've checked with upstream and it's a known problem with no ideal solution: https://github.com/mholt/PapaParse/issues/49#issuecomment-163164936
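To make the polling idea concrete, here is a minimal Node-style sketch (purely illustrative; `watchAppends`, the interval, and the filesystem access are my assumptions, not existing code — the app itself may need the browser File API or an HTTP check instead). It periodically compares the file size and reads only the newly appended bytes:

```js
// Illustrative polling watcher (not the project's actual code):
// check the file size on an interval and read only the appended bytes.
const { promises: fsp, createReadStream } = require("fs");

async function watchAppends(path, onChunk, intervalMs = 1000) {
  let offset = (await fsp.stat(path)).size; // start at the current end of file

  setInterval(async () => {
    const { size } = await fsp.stat(path);
    if (size <= offset) return; // nothing new (or the file was truncated)

    // Read only the new region [offset, size); 'end' is inclusive.
    const stream = createReadStream(path, {
      start: offset,
      end: size - 1,
      encoding: "utf8",
    });
    let text = "";
    for await (const chunk of stream) text += chunk;
    offset = size;
    onChunk(text); // hand the new rows to the CSV parser / plot
  }, intervalMs);
}
```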
These are the ideas that we have collected:

- **sliding window**: for real (infinite) streaming data, implement what @rhyolight did in RiverView - read the last WINDOW_SIZE rows, append/drop to the existing data, crop, plot. A fixed-size FIFO (deque) could implement the sliding window; see the sketch after this list. (Is this what you did?)
- **interactive appending**: for large data, where we can (and want to) see the whole file in the end, but the creation (e.g. running an HTM model) takes a long time and we want to see the results in progress. We could a) just re-read the whole file, or b) remember the last rowId, seek to that position, read the next chunk, and append it to our values.
- **sampled**: we'd set POINTS_PER_GRAPH=10,000; data is read, subsampled, and rendered. On data update/zoom we re-read the interesting section, subsample again, and so on.

Any idea or preference about these/other options?
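For the sliding window option, a rough sketch of the fixed-size buffer (illustrative only; `WINDOW_SIZE` and the class name are assumptions, not the existing implementation):

```js
// Illustrative sliding-window buffer: keeps only the most recent WINDOW_SIZE rows.
const WINDOW_SIZE = 10000; // assumed value for illustration

class SlidingWindow {
  constructor(size = WINDOW_SIZE) {
    this.size = size;
    this.rows = [];
  }

  // Append newly parsed rows and drop the oldest ones beyond the window.
  append(newRows) {
    this.rows = this.rows.concat(newRows);
    if (this.rows.length > this.size) {
      this.rows = this.rows.slice(this.rows.length - this.size);
    }
  }

  // The rows currently in view, ready to hand to the plotting code.
  current() {
    return this.rows;
  }
}
```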
@breznak @jefffohl @rhyolight This is rad, thanks for the hard work.
Thanks @brev !
TY @brev! It would be nice to get your feedback and possible use-cases, if you like :)
Upcoming fix from Jeff for #56 further ensures the speed is OK.
@jefffohl with fixes in #64 and #66 I'd like to continue working on this functionality.
@breznak - I was imagining that we could periodically check the file to see if it has been modified, not actually read the file. If the file has been modified, then read.
Note also that for windowing, there are two things to be aware of:

- The number of rows in the window is not related to the file size; right now it is set to 10,000 rows.
- The file size limit is what is used to determine whether windowing will be used or not.

The reason that the file size is not explicitly related to the number of rows in the buffer is that we need to decide whether to window or not before we know how many rows there are.
> I was imagining that we could periodically check the file to see if it has been modified, not actually read the file. If the file has been modified, then read.
Yes, I think that's the idea. Will this work for remote files as well? (Although it doesn't have to be supported from the start.) I think I saw some code to get a file size; that would be what we want, I guess?
> The number of rows in the window is not related to the file size. Right now, it is set to 10,000 rows.
I know; that's OK, I think.
> The file size limit is what is used to determine if windowing will be used or not. We can add a feature that allows this to be manually set.
I think it can stay that way; the monitoring will just switch to windowing if needed.
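As a trivial sketch of that decision (the limit value here is made up, not the project's default):

```js
// Illustrative only: window whenever the file exceeds some size limit.
const FILE_SIZE_LIMIT = 50 * 1024 * 1024; // assumed 50 MB, not the real default

function shouldWindow(fileSizeBytes) {
  return fileSizeBytes > FILE_SIZE_LIMIT;
}
```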
Most servers should send back a "Last-Modified" header, so we could check that for remote servers. We can also just check the size (which we are already doing), and if that has changed, assume that new data has been added.
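A quick sketch of that remote check (assuming `fetch` is available; `remoteFileChanged` is a hypothetical helper, not existing code):

```js
// Illustrative remote-change check: HEAD request, compare Last-Modified
// and Content-Length with the values seen on the previous check.
async function remoteFileChanged(url, prev) {
  const res = await fetch(url, { method: "HEAD" });
  const state = {
    lastModified: res.headers.get("last-modified"),
    size: res.headers.get("content-length"),
  };
  const changed =
    state.lastModified !== prev.lastModified || state.size !== prev.size;
  return { changed, state };
}
```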
We've discussed this in the initial issue; there seem to be two approaches:
I find the latter better for two reasons: first, it does not tie us to NuPIC only and works for any updated CSV file; second, it would not require (complex) changes to the NuPIC ModelRunner framework.
UI changes to enable this could be:
Blocked by:
#16, #61