breznak opened 8 years ago
:+1: :100:
@rhyolight looking forward to getting back to this project as soon as I finish up some other responsibilities.
So, with the 2 standing PRs, the speed bottleneck should be somewhat resolved and internal support for streaming is in place.
What is left is a mechanism to monitor updates to the data file
(e.g. periodically check the size) and update (only) with the newly added chunk of data (ideally not polling, but an on-request/on-update mechanism; operating systems handle this kind of notification well). I've checked with upstream and it's a known problem with no ideal solution: https://github.com/mholt/PapaParse/issues/49#issuecomment-163164936
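To make the polling idea concrete, here is a minimal Node-style sketch (purely illustrative; `watchAppends`, the interval, and the filesystem access are my assumptions, not existing code — the app itself may need the browser File API or an HTTP check instead). It periodically compares the file size and reads only the newly appended bytes:

```js
// Illustrative polling watcher (not the project's actual code):
// check the file size on an interval and read only the appended bytes.
const { promises: fsp, createReadStream } = require("fs");

async function watchAppends(path, onChunk, intervalMs = 1000) {
  let offset = (await fsp.stat(path)).size; // start at the current end of file

  setInterval(async () => {
    const { size } = await fsp.stat(path);
    if (size <= offset) return; // nothing new (or the file was truncated)

    // Read only the new region [offset, size); 'end' is inclusive.
    const stream = createReadStream(path, {
      start: offset,
      end: size - 1,
      encoding: "utf8",
    });
    let text = "";
    for await (const chunk of stream) text += chunk;
    offset = size;
    onChunk(text); // hand the new rows to the CSV parser / plot
  }, intervalMs);
}
```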
These are the ideas that we have collected:

- **sliding window**: for real (infinite) streaming data, implement what @rhyolight did in RiverView - read the last WINDOW_SIZE rows, append/drop to the existing data, crop, plot. A fixed-size FIFO (deque) could implement the sliding window; see the sketch after this list. (Is this what you did?)
- **interactive appending**: for large data, where we can (and want to) see the whole file in the end, but the creation (e.g. running an HTM model) takes a long time and we want to see the results in progress. We could a) just re-read the whole file, or b) remember the last rowId, seek to that position, read the next chunk, and append it to our values.
- **sampled**: we'd set POINTS_PER_GRAPH=10,000; data is read, subsampled, and rendered. On data update/zoom we re-read the interesting section, subsample again, and so on.

Any idea or preference about these/other options?
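For the sliding window option, a rough sketch of the fixed-size buffer (illustrative only; `WINDOW_SIZE` and the class name are assumptions, not the existing implementation):

```js
// Illustrative sliding-window buffer: keeps only the most recent WINDOW_SIZE rows.
const WINDOW_SIZE = 10000; // assumed value for illustration

class SlidingWindow {
  constructor(size = WINDOW_SIZE) {
    this.size = size;
    this.rows = [];
  }

  // Append newly parsed rows and drop the oldest ones beyond the window.
  append(newRows) {
    this.rows = this.rows.concat(newRows);
    if (this.rows.length > this.size) {
      this.rows = this.rows.slice(this.rows.length - this.size);
    }
  }

  // The rows currently in view, ready to hand to the plotting code.
  current() {
    return this.rows;
  }
}
```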
@breznak @jefffohl @rhyolight This is rad, thanks for the hard work.
Thanks @brev !
TY @brev! It would be nice to get your feedback and possible use-cases, if you like :)
Upcoming fix from Jeff for #56 further ensures the speed is OK.
@jefffohl with fixes in #64 and #66 I'd like to continue working on this functionality.
@breznak - I was imagining that we could periodically check the file to see if it has been modified, not actually read the file. If the file has been modified, then read.
Note also that for windowing, there are two things to be aware of:

- The number of rows in the window is not related to the file size; right now it is set to 10,000 rows.
- The file size limit is what is used to determine whether windowing will be used or not.

The reason that the file size is not explicitly related to the number of rows in the buffer is that we need to decide whether to window or not before we know how many rows there are.
> I was imagining that we could periodically check the file to see if it has been modified, not actually read the file. If the file has been modified, then read.
Yes, I think that's the idea. Will this work for remote files as well? (Although it doesn't have to be supported from the start.) I think I saw some code to get a file size; that would be what we want, I guess?
> The number of rows in the window is not related to the file size. Right now, it is set to 10,000 rows.
I know; that's OK, I think.
> The file size limit is what is used to determine if windowing will be used or not. We can add a feature that allows this to be manually set.
I think it can stay that way; the monitoring will just switch to windowing if needed.
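As a trivial sketch of that decision (the limit value here is made up, not the project's default):

```js
// Illustrative only: window whenever the file exceeds some size limit.
const FILE_SIZE_LIMIT = 50 * 1024 * 1024; // assumed 50 MB, not the real default

function shouldWindow(fileSizeBytes) {
  return fileSizeBytes > FILE_SIZE_LIMIT;
}
```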
Most servers should send back a "Last-Modified" header, so we could check that for remote servers. We can also just check the size (which we are already doing), and if that has changed, assume that new data has been added.
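A quick sketch of that remote check (assuming `fetch` is available; `remoteFileChanged` is a hypothetical helper, not existing code):

```js
// Illustrative remote-change check: HEAD request, compare Last-Modified
// and Content-Length with the values seen on the previous check.
async function remoteFileChanged(url, prev) {
  const res = await fetch(url, { method: "HEAD" });
  const state = {
    lastModified: res.headers.get("last-modified"),
    size: res.headers.get("content-length"),
  };
  const changed =
    state.lastModified !== prev.lastModified || state.size !== prev.size;
  return { changed, state };
}
```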
We've discussed this in the initial issue; there seem to be two approaches:
I find the latter better for two reasons: first, it does not tie us to NuPIC only and works for any updated CSV file; second, it would not require (complex) changes to the NuPIC ModelRunner framework.
UI changes to enable this could be:
Blocked by:
#16, #61