As a developer, I need a proxy server, so that I can handle streaming data

jefffohl commented 8 years ago

Motivation: a proxy server can be used to stream remote files from servers that do not support streaming, and to implement seek (skip N bytes) in files or streams from JS.

This will be used in: #17

Tasks:

[ ] Handle local files
[ ] Handle remote files
[ ] Implement server-side CSV parsing?
[ ] When app is hosted on remote server, how do we handle file upload?

breznak commented 8 years ago

@jefffohl before you begin implementing this feature with a proxy server from npm, would you consider a more difficult but also more far-reaching change: improving Papa? The author said he won't implement it himself, but is not against the solution: https://github.com/mholt/PapaParse/issues/49#issuecomment-163286369

jefffohl commented 8 years ago

@breznak - I would, but I don't think it is possible within the browser. The FileReader only takes a snapshot of the file at the time of the request, and it will not allow the browser to read the file again without a user interaction - most likely for security reasons.

For some reason, Chrome's handle on the file does allow you to read the file size as the file is updated, but the data in the file cannot be retrieved (this is what confused me for a day). See the HTML5 spec: https://www.w3.org/TR/FileAPI/#file

I believe this is the reason that the PapaParse developer is not willing to work on it - it is simply not possible.

jefffohl commented 8 years ago

Actually, it appears that it might be possible to reload the file if we see that it has changed, but we would have to reload the entire file - which defeats our original purpose, because then we would have to reload all of the data, and on large files that would be very inefficient. What we need is the ability to read just the portion of the file that has changed.

breznak commented 8 years ago

Thank you for the spec! I still don't see how your workaround should work (or why Papa's shouldn't):

f is a reference to a File object:
- byte content is a snapshot of the file at creation of the reference
- may or may not provide the modified and size parameters (are these snapshotted, i.e. frozen, too, or do they change dynamically?)
if we can get the parameters, monitor them periodically (else raise an exception; some support is still better than nothing)
if the file changed, efficiently reread the new update
- in the spec I don't see a seek (= skip to Nth byte) or read-with-offset method; this is a problem I don't know how you plan to work around
- if such a method exists, update and loop.

jefffohl commented 8 years ago

The workaround - of reloading the entire file whenever the file size changes - can be seen here: http://stackoverflow.com/questions/22548683/reloading-a-file-using-html-input

breznak commented 8 years ago

I'm not sure I understood the SO solutions correctly, but these ideas might work:

1. If Papa is that fast ( https://jsperf.com/javascript-csv-parsers/4 ), just reread the whole file. This could be implemented in Papa's chunk call (with a "monitor=true" argument) to return only the diff (to the graph for rendering). It is suboptimal, but would still be a huge simplification for many problems.
2. Assuming we can destroy the source file, we can read, delete the part already read, wait, and loop, saving the time spent rereading known bytes. This could also be implemented in Papa.
3. How about Baby? https://github.com/Rich-Harris/BabyParse Does Node.js provide more "local app" privileges, allowing it to access a file more directly?

jefffohl commented 8 years ago

Node runs on the server, so it has all the permissions that you want to give it. So, yes, it can access any file on the system.

jefffohl commented 8 years ago

So - yes, we could use Baby Parse and parse files on the server side instead of in the browser.

breznak commented 8 years ago

4. (We should follow @rhyolight's advice and...) delegate this to another project, with a different (lower) level of integration, that can easily take care of the file updates and provide us only with a diff file, which we would reread quickly and append to our data. E.g. a script (some multiplatform code?) like

while true; do cp myFile myFile.old; sleep 5; diff myFile myFile.old > update; done

jefffohl commented 8 years ago

delegate to what other project?

breznak commented 8 years ago

Well, we could just require that the input file for monitoring contains only the diffs (not the whole updated file)? Or provide a simple utility (in Java, ...) to do the diffs at intervals for us, as above. Or truncate the file (not sure browser JS can do that?)

breznak commented 8 years ago

Can we make the (client, browser) app a "server-like app" that has a REST API? https://stackoverflow.com/questions/921942/javascript-rest-client-library So we could have an update(data) method callable through REST PUT? This was the idea in #42

Sorry, this was just a brainstorm/shitload :stuck_out_tongue: of ideas, not sure which are doable or suitable for us..?

jefffohl commented 8 years ago

I am imagining the server will have some REST-like features, but it will probably be just GET.

Why would you need PUT, if we are just reading CSV files?

jefffohl commented 8 years ago

The server I am imagining will be pretty basic. It will handle the following functions:

Serving static files (index.html, JS, CSS, etc.)
Handling files on the local file system, and allowing us to stream them.
Acting as a proxy server for retrieving remote files from other servers.
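
For the proxy case, the "seek (skip N bytes)" idea from the motivation could be emulated by discarding the first N bytes of whatever the upstream server sends. The following is only a rough sketch under that assumption: `skipBytes` is a hypothetical helper name, and the in-memory chunks stand in for the data a proxy would receive from the remote response.

```javascript
// Hypothetical helper: drop the first `n` bytes from a sequence of chunks,
// as a proxy might do to emulate "seek" on a server without range support.
function* skipBytes(chunks, n) {
  let remaining = n;
  for (const chunk of chunks) {
    if (remaining >= chunk.length) {
      // The whole chunk lies before the seek position: discard it.
      remaining -= chunk.length;
      continue;
    }
    // Part (or all) of this chunk lies past the seek position.
    yield remaining > 0 ? chunk.subarray(remaining) : chunk;
    remaining = 0;
  }
}

// Example: two chunks of a CSV stream, skipping the 11 bytes already seen.
const chunks = [Buffer.from("time,va"), Buffer.from("lue\n1,2\n")];
const tail = Buffer.concat([...skipBytes(chunks, 11)]).toString();
```

This still downloads the skipped bytes from the remote server; it only saves the browser from receiving and re-parsing them.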

jefffohl commented 8 years ago

All that said, if we can define an abstract purpose for the server outside of the needs of this particular app, we could make it a separate project/repo.

breznak commented 8 years ago

Yes, I think that's good functionality for the server. Let me double-check that I understand the advantages: it allows streaming remote files (even if the other server does not support that feature), plus streaming local files (does it solve the problem discussed here of avoiding re-reading the whole file in monitoring mode?)

My REST idea is probably a separate feature, allowing updates to be "sent" by REST calls (this allows integration with many web services that are RESTful, like RiverView), in addition to updates by writing to a file.

jefffohl commented 8 years ago

Yes, your understanding is correct. And yes, it will solve the problem of avoiding the need to re-read the entire file each time it is updated. We will be able to have a server-side file handle that will allow us to read the file.
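
As a rough illustration of what a server-side file handle makes possible, the sketch below reads only the bytes appended since the last poll, using Node's synchronous `fs` calls. `readAppended` is a hypothetical helper name, and the sketch assumes the file only ever grows.

```javascript
const fs = require("fs");

// Hypothetical helper: return only the bytes appended to `filePath`
// since `offset`, plus the offset to use on the next poll.
function readAppended(filePath, offset) {
  const fd = fs.openSync(filePath, "r");
  try {
    const size = fs.fstatSync(fd).size;
    if (size <= offset) {
      // Nothing new. A shrunken file (e.g. after truncation) is not handled here.
      return { data: Buffer.alloc(0), nextOffset: offset };
    }
    const data = Buffer.alloc(size - offset);
    // Read starting at byte position `offset`, skipping everything already seen.
    const bytesRead = fs.readSync(fd, data, 0, data.length, offset);
    return { data: data.subarray(0, bytesRead), nextOffset: offset + bytesRead };
  } finally {
    fs.closeSync(fd);
  }
}
```

A poller would keep `nextOffset` between calls and hand only `data` to the CSV parser, so the cost of each update depends on the size of the new rows rather than the size of the whole file.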

breznak commented 8 years ago

:+1: :cool:

jefffohl commented 8 years ago

One thing that this brings up is that the experience will differ depending on whether the app is hosted locally or remotely (e.g. on a public web server). If the app is hosted locally, we can access and stream local files with continuous updates. If the app is hosted remotely, the only interface we will have to uploaded files (from the user's computer) will be through the FileReader interface, which, as we know, has the limitation of not allowing us to update the data continuously.

I am hoping that I can make a somewhat elegant user experience that will automatically detect if the file is available on the same file system that the server is running on, and accept continuous updating. If the file is not available, the server will assume that the file is being sent remotely, and simply read a snapshot of the file.

jefffohl commented 8 years ago

Something that I forgot about is that the FileReader interface won't give you any information about the file other than its size and name. It won't tell us the local path to the file, so the server won't be able to find the file.

The alternative is for the user to know the relative path to the local file, and enter that in as a string (the same way that they might enter a URL). The server could see that it is a local path, and then retrieve the file. Again, in this situation, if the app were hosted remotely, a local path would not work. And, now that I think about it, this would be a big security hole if the app were hosted on a public web server, because it would allow users to type in any local path, which would then tell the server to retrieve that file from the local (server's) file system, which is of course a very bad idea.

So, now I need to re-think all of this. Sorry, I should have gone through this logic earlier.

jefffohl commented 8 years ago

So, I've been thinking this over, and I don't see a solution that would involve using the "Browse..." button to allow the user to locate a local file and load it into the app for online streaming. JavaScript in the browser is, by design, sandboxed for security reasons.

We could remove the "Browse..." button and require that all users supply a path to the file they would like to stream - either a local filepath, or a URL to a file hosted on a remote web server. If anyone ever wanted to host this app on a public server, they would need to disable the ability to supply a local filepath, and make the app only accept full URLs. This could be set in the server config.

@breznak what are your thoughts?

breznak commented 8 years ago

@jefffohl True about the sandboxed limitation of JS. So does this mean the publicly hosted app will not be usable, and users will have to provide a path to the file rather than selecting it with the file browser?

And we are doing it for the "monitoring" support, right?

If so, I'd suggest: A) just require that the provided input file contains only a diff with the updates, rather than all the points (with appended updates). Then we can reread the whole file each time (no proxy server). Or B) keep the current functionality and have a config that allows the proxy server & imposes the limitations you mention.

jefffohl commented 8 years ago

@breznak - Yes, this is all being done for the monitoring support. What we do depends on what the typical use case is. If this app is most typically run locally on the same machine that is producing the file to read, then using a proxy server will probably be the best option, as it will allow for monitoring both remote and local files.

We can set up the server so that in order for it to monitor local files, the server needs to be started with a special flag. This way the user has to explicitly decide to allow that option, and will therefore hopefully understand that it should not be done on a public web server.
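
A minimal sketch of such a gate, assuming a config object filled in from that start-up flag. The names `resolveSource`, `allowLocalFiles` and `rootDir` are hypothetical, and confining local paths to one root directory is an extra precaution that goes beyond what has been discussed here.

```javascript
const path = require("path");

// Hypothetical gate: classify a user-supplied source string and refuse local
// paths unless the server was explicitly started with local access enabled.
// `allowLocalFiles` and `rootDir` are assumed config names, not existing options.
function resolveSource(input, config) {
  if (/^https?:\/\//i.test(input)) {
    // new URL() throws on malformed URLs, which rejects them early.
    return { kind: "remote", url: new URL(input).href };
  }
  if (!config.allowLocalFiles) {
    throw new Error("Local file paths are disabled on this server");
  }
  // Resolve against an allowed root so "../" cannot escape it.
  const root = path.resolve(config.rootDir);
  const resolved = path.resolve(root, input);
  const rel = path.relative(root, resolved);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error("Path is outside the allowed directory");
  }
  return { kind: "local", path: resolved };
}
```

With `allowLocalFiles` left off by default, a publicly hosted instance would only ever accept full URLs.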

So - we can offer three ways of accessing a file:

A URL to a publicly available file. Can be monitored for updates by using a proxy server. Available whether the app is hosted locally or on a public web server.
A path to a local file. Can be monitored for updates by using the proxy server. This feature should not be enabled on public web servers.
A "Browse..." button that will allow the user to upload a file from their local file system. This file will be handled using the JavaScript FileReader API, which means that we can monitor the file (in Chrome at least), and do a kind of updating that is non-optimal - meaning that we periodically re-upload the entire file, and then slice off the diff and append that to our data in the chart.
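
For the "Browse..." fallback, the "slice off the diff" step could look roughly like the sketch below. It assumes the re-read file is available as a text string with `\n` line endings and that rows are only ever appended; `takeNewRows` is a hypothetical helper name.

```javascript
// Hypothetical helper for the "Browse..." fallback: given the full text of a
// re-read file and the number of characters already consumed, return only the
// complete new lines and the updated position.
function takeNewRows(fullText, consumed) {
  const fresh = fullText.slice(consumed);
  // Keep only up to the last newline so a half-written row is left for next time.
  const end = fresh.lastIndexOf("\n");
  if (end === -1) {
    return { rows: [], consumed };
  }
  const rows = fresh.slice(0, end).split("\n").filter((line) => line !== "");
  return { rows, consumed: consumed + end + 1 };
}
```

The whole file is still read on every poll, so this only avoids re-parsing and re-plotting rows that are already in the chart.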

For all of these, I still need to test these methods out to make sure I am not missing something important that would prevent our success.

breznak commented 8 years ago

How about this setting:

without a proxy
- functionality as now.
- monitoring works only if the file provided is diffs, so we read it periodically and append to the plotted data
with proxy (can this be detected automatically?)
- works like you described (maybe we could disable monitoring when FileReader is used)
- (should security be a concern? running the server with privileges to access any file on the local FS, ...)

jefffohl commented 8 years ago

I am not sure what you mean - you want to make the server optional?

breznak commented 8 years ago

..I was thinking that. Would it be too much work? We could provide basic monitoring functionality by default, detect whether the proxy server is present, and if so use its features for file streaming.

jefffohl commented 8 years ago

It is more work. We have to have a server to serve the static resources anyway, so I don't see a benefit at this time. If, in the future, there appears to be a need for decoupling the app from the server then we can work on that feature at that time. As always, I would like development to be driven by real-world needs.

breznak commented 8 years ago

Esp. here I think we have clear real-world usecases: NuPIC live monitoring of a running model and RiverView...

htm-community / nupic.visualizations

As a developer, I need a proxy server, so that I can handle streaming data #78