Dummy script: output format

magsol commented 4 years ago

I think there's a good argument for JSON output format: it's very flexible, allows for hierarchies (which I think is good for embedding "original data" into the projected data), and it's kinda-sorta human-readable.

As for the actual layout of the JSON file, here's what I'm thinking:

top level of the hierarchy: a list of the projected data points
- x, y, and [optional] z components
- [optional] category labels (e.g., the actual number of the MNIST digits)
- link to or embedded binary blob of original data

Just a first stab. Anything else?

submyers commented 4 years ago

The following code generates a dictionary keyed by digit label, where each value is the list of points with that label on the generated 2D (or 3D) graph. So this could be the printJson function (for now assuming access to X_new and y is either global or passed as an argument to this function):

import codecs, json

M = {0:[],1:[],2:[],3:[],4:[],5:[],6:[],7:[],8:[],9:[]}
i = 0
X_new_list = X_new.tolist()
y_list = y.tolist()
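# Note: length = len(X_new_list) must be set before the loop (see the follow-up below).
while i < length:
    M[y_list[i]].append(X_new_list[i])
    i += 1
# Close the output file explicitly once the dump is done.
with codecs.open(args['output'], 'w', encoding='utf-8') as f:
    json.dump(M, f)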

This assumes X_new was generated by the function and arguments defined by the user through command line parameters. How does this look? You could also define M by looping through y -- let me know if you want that more general code too...
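The more general version mentioned above, where M is built by looping through y instead of hard-coding the ten digit keys, could look roughly like this. It is only a sketch: the function name group_by_label and the sample values are hypothetical, and it takes plain lists (X_new.tolist() and y.tolist() if the inputs are NumPy arrays).

```python
import json
from collections import defaultdict


def group_by_label(points, labels):
    """Group projected points by label without hard-coding the label set."""
    M = defaultdict(list)
    for point, label in zip(points, labels):
        # JSON object keys are always strings, so convert up front.
        M[str(label)].append(point)
    return dict(M)


# Hypothetical stand-ins for X_new.tolist() and y.tolist().
X_new_list = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y_list = [7, 3, 7]

M = group_by_label(X_new_list, y_list)
print(json.dumps(M, sort_keys=True))
```

This also works for label sets other than the digits 0 through 9, since a key is created the first time a label is seen.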

submyers commented 4 years ago

I've got to get used to this... submitting a minute ago dropped the indentation of my while loop.

submyers commented 4 years ago

Oops, I left out the line

length = len(X_new_list)

just before the while-loop begins.

magsol commented 4 years ago

This is a good starting point. Here's my initial feedback, in no particular order:

- In theory, you shouldn't need the codecs import to enforce utf-8; that should be built into Python 3.
- I don't think the top-level indexing should be label-based, because our downstream data isn't necessarily going to have labels. It'd be good to have this as an optional field in the JSON specification, but the way you have it structured right now, it's a required field.
- Let's stick with working with NumPy arrays wherever possible; they're a lot more efficient than Python lists.
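On the first point, a minimal sketch of writing the file with only the built-in open(), which accepts an encoding argument in Python 3. The helper name write_json, the temporary output path, and the sample records are all hypothetical.

```python
import json
import os
import tempfile


def write_json(obj, path):
    """Write obj as UTF-8 encoded JSON without the codecs module."""
    # ensure_ascii=False keeps non-ASCII characters as-is, so the
    # encoding given to open() is what actually ends up on disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)


# Hypothetical output location and records; the second record has no label.
out_path = os.path.join(tempfile.mkdtemp(), "projection.json")
records = [{"x": [0.1, 0.2], "y": 7}, {"x": [0.3, 0.4], "y": None}]
write_json(records, out_path)

with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

A missing label is written as null and comes back as None, which is one way to keep the label field optional.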

magsol commented 4 years ago

Specification as laid out in the first post could look something like this:

[
  {'x': <x, y [,z]>, 'y': <None, or label>, 'data': blob},
  {'x': <x, y [,z]>, 'y': <None, or label>, 'data': blob},
  ...
  {'x': <x, y [,z]>, 'y': <None, or label>, 'data': blob},
]

Which, as a Python function, would look something like this:

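import json
import numpy as np

def dumpJSON(X_proj, X_orig, outfile, y = None):
  # The json module cannot serialize NumPy arrays or scalars, so each
  # entry is converted to plain Python values just before writing.
  out = []
  for index, (x_proj, x_orig) in enumerate(zip(X_proj, X_orig)):
    d = {"x": np.asarray(x_proj).tolist(), "y": None, "data": np.asarray(x_orig).tolist()}
    if y is not None:
      d["y"] = np.asarray(y[index]).tolist()
    out.append(d)

  # Close the output file once the dump is done.
  with open(outfile, "w") as f:
    json.dump(out, f)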

This is obviously still an oversimplification because X_orig could potentially be a very complicated data structure. For the MNIST dataset, this will be a fairly simple 8x8 image, but for the cilia dataset this will be some kind of FxHxW, where F is the number of frames (usually at least 250), and H and W are the height and width of the video (usually 480x640 or similar). Embedding hundreds of full videos directly into a JSON file will be... problematic, to say the least. We'll likely have to come up with some kind of sym-link system whereby the dashboard can find the full data from some kind of reference stored in the JSON file.
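One way the reference idea could look, as a rough sketch only: each record stores a path relative to some base directory instead of the data itself. The function name build_records, the file paths, and the "normal"/"abnormal" labels are hypothetical, and the coordinates are assumed to be plain Python lists already (for example via .tolist() on a NumPy array).

```python
import json
import os


def build_records(X_proj, data_refs, y=None, base_dir="."):
    """One record per point; "data" holds a relative path, not the data."""
    records = []
    for index, (x_proj, ref) in enumerate(zip(X_proj, data_refs)):
        records.append({
            "x": list(x_proj),
            "y": None if y is None else y[index],
            # Store the reference relative to base_dir so the JSON file
            # stays valid if the whole data directory is moved.
            "data": os.path.relpath(ref, base_dir),
        })
    return records


# Hypothetical projected points, file locations, and labels.
records = build_records(
    [[0.1, 0.2], [0.3, 0.4]],
    ["/data/cilia/video_000.npy", "/data/cilia/video_001.npy"],
    y=["normal", "abnormal"],
    base_dir="/data",
)
print(json.dumps(records, indent=2))
```

The JSON file then stays small regardless of how large each video is, and the dashboard resolves the path only when it needs the full data.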

submyers commented 4 years ago

If you place the label ([,z]) upstream, it requires less space; the savings are linear relative to the size of the data set. Also, upstream ([,z]) values allow iterating through all sets in linear time, with fewer lookups of the colors/shapes assigned to each digit the MNIST data is expected to represent.

As for passing the full videos, I agree. I believe we could have a different API call for retrieving the correlated data, and we could do that by replacing the "data" value with an "index" value. JavaScript could make API calls asking for the raw data found at the specified "index" when a user clicks on one of the vertices in the graph. By that I mean we would create HTML objects with properties passed to Bokeh functions and other values defined by a jQuery (or a more recent, equivalent replacement) AJAX GET API call.
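A rough sketch of the server-side half of that idea, under the assumption that the raw data is addressable by position. The names build_points and get_raw_data are hypothetical, no particular web framework is implied, and the sample values are made up; a real endpoint would wrap get_raw_data in a GET route.

```python
import json


def build_points(X_proj, y=None):
    """Records for the plot: coordinates, optional label, and an index."""
    return [
        {"x": list(x), "y": None if y is None else y[i], "index": i}
        for i, x in enumerate(X_proj)
    ]


def get_raw_data(X_orig, index):
    """What a hypothetical GET endpoint could return for one clicked point."""
    if not 0 <= index < len(X_orig):
        raise IndexError(f"no data point with index {index}")
    return json.dumps({"index": index, "data": X_orig[index]})


# Hypothetical projected points, labels, and raw data (plain lists).
X_proj = [[0.1, 0.2], [0.3, 0.4]]
X_orig = [[0, 1, 2, 3], [4, 5, 6, 7]]

points = build_points(X_proj, y=[7, 3])
# Simulate a click on the second vertex of the graph.
response = get_raw_data(X_orig, points[1]["index"])
```

The plot only ever receives the small records from build_points; the raw data travels separately and only for the points a user actually clicks.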

How does this sound?

submyers commented 4 years ago

Thank you for fixing my first response, it looks good now!

submyers commented 4 years ago

I checked in a new version of the dummy script. Let me know of any concerns regarding style and/or functionality.

quasikyo commented 4 years ago

@submyers I have gone ahead and hooked the dummy_mnist.py script into a main.py script that generates a static .html file in the starting_point branch. It's still far from being dashboard worthy, but I feel that it's a good start.

quinngroup / ciliaweb-dashboard

Dummy script: output format #2