ohmage / server

The ohmage server application.

Redesigning JSON rows. #455

Closed jeroen closed 10 years ago

jeroen commented 11 years ago

For survey_response/read, all of the R stuff uses the json-columns output format, which is the most efficient. However, some clients prefer a record-based format, and the JSON NoSQL databases (e.g. MongoDB) also seem to output this format, so it is probably fair to say that row-based encoding is more conventional. See here for a simple example: http://www.mongodb.org/display/DOCS/Tutorial. Given that Ohmage is a datastore as well, it makes sense to support this kind of output. However, it is actually quite different from our current json-rows output format.

Our current json-rows format is very verbose and doesn't scale very well. I tested this by downloading data for the urn:campaign:lausd:Jefferson:SP2012:ECS_P6:Snack campaign in both column-based and row-based formats. The row-based format is more than 10x as large:

jeroen@jeroen-ubuntu:~/R$ ls -ltrh
-rw-rw-r-- 1 jeroen jeroen  49K Nov  2 13:24 columns_mini.json
-rw-rw-r-- 1 jeroen jeroen 757K Nov  2 13:16 rows_mini.json
-rw-rw-r-- 1 jeroen jeroen  70K Nov  2 13:25 columns_pretty.json
-rw-rw-r-- 1 jeroen jeroen 1.2M Nov  2 13:19 rows_pretty.json

The main reason for this is that all of the question meta-data is repeated for each response in the output. Here is an example of the first 2000 lines: http://pastebin.com/s06yyFva.

Perhaps we could consider designing another output format, that tries to put all meta-data together at the beginning of the response, and then have the actual data as an array of key-value pairs, just as e.g. the output from MongoDB.
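To make the idea concrete, here is a minimal Python sketch of what such a format could look like. The field names (`value`, `prompt_type`, `text`) and the `metadata`/`data` wrapper keys are assumptions for illustration only, not the actual json-rows layout.

```python
def compact_rows(verbose_rows):
    """Hoist repeated per-prompt meta-data into one shared block.

    `verbose_rows` is a list of responses. Each response maps a prompt id
    to a dict holding the answer under "value" plus meta-data fields
    (a hypothetical layout, chosen only for this sketch).
    """
    metadata = {}
    data = []
    for row in verbose_rows:
        record = {}
        for prompt_id, prompt in row.items():
            # Everything except the answer itself counts as meta-data here.
            meta = {k: v for k, v in prompt.items() if k != "value"}
            metadata.setdefault(prompt_id, meta)  # stored once per prompt
            record[prompt_id] = prompt["value"]
        data.append(record)
    return {"metadata": metadata, "data": data}


# Hypothetical sample input: the meta-data is repeated in every response.
verbose = [
    {"SnackPeriod": {"value": "Lunch", "prompt_type": "single_choice",
                     "text": "When did you snack?"}},
    {"SnackPeriod": {"value": "Dinner", "prompt_type": "single_choice",
                     "text": "When did you snack?"}},
]
compact = compact_rows(verbose)
```

With this shape, a client that only wants the values can read `compact["data"]` and ignore the rest, which is close to the MongoDB-style output.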

For example, the MongoDB collection used in the Snackboard dashboard outputs data like this: http://pastebin.com/wpjV4LfC. (This is actually BSON, but you get the point.) This is not a complete example, of course, because all of the meta-data about the surveys is hardcoded in the application. However, it would be nice to be able to retrieve only the output values without all of the meta-data. Perhaps we could have a separate call to get the meta-data for the prompts, which a client would only have to call once?

jojenki commented 11 years ago

I agree with most of this. I just did a quick glance at the parameters for survey_response/read, and the 'output_format' is a required parameter. We could use this to our advantage by making it not required and, if not given, make our "new format" the default.

I agree that it could use some cleanup. On one hand, maybe we should take the redundant stuff from the XML and not return it with the responses; on the other hand, maybe we should leave it in and rework the column list.

For Open mHealth, our "column_list" is defined as: "This value must be a comma-separated list of columns. Each column is defined as ":::...". For example, given a record with two top-level fields, "T1" and "T2", where the first field, "T1", has two sub-fields, "S1" and "S2", then supplying the following column list value, "T1:S2,T2", would return "T1" with only one sub-field, "S2", and all of "T2"."
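The "T1:S2,T2" example above can be sketched in Python roughly as follows. This is only an illustration of the selection rule as quoted, not the Open mHealth implementation; the function names `parse_column_list` and `select` are made up for the sketch.

```python
def parse_column_list(column_list):
    """Turn "T1:S2,T2" into a selection tree: {"T1": {"S2": {}}, "T2": {}}."""
    tree = {}
    for column in column_list.split(","):
        node = tree
        for part in column.strip().split(":"):
            node = node.setdefault(part, {})
    return tree


def select(record, tree):
    """Keep only the selected fields of a nested record.

    An empty subtree means "keep everything below this field".
    """
    if not tree or not isinstance(record, dict):
        return record
    return {key: select(record[key], sub)
            for key, sub in tree.items() if key in record}


# The record from the definition: T1 has sub-fields S1 and S2, T2 is kept whole.
record = {"T1": {"S1": 1, "S2": 2}, "T2": {"a": 3}}
result = select(record, parse_column_list("T1:S2,T2"))
```

Here `result` contains "T1" with only "S2", and all of "T2". How to treat a list that names both a field and one of its sub-fields (e.g. "T1,T1:S2") is left open in this sketch.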

If we reimplemented our column list in this way, we could maintain verbosity in the output and allow users to specify the sections that are relevant to them. Also, we can maintain backwards compatibility by checking if the first field in a column is "urn". If it is, then we use the old "columns_list" definition.
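A rough Python sketch of that backwards-compatibility check, assuming the first colon-separated field of each column decides which definition applies. The function name and the choice to reject mixed lists are assumptions for illustration, and the sample column names are only examples.

```python
def column_list_style(column_list):
    """Classify a column list as "legacy" (urn-based) or "hierarchical".

    Rejecting lists that mix both styles is a design choice made for this
    sketch, not something decided in the discussion.
    """
    columns = [c.strip() for c in column_list.split(",") if c.strip()]
    if not columns:
        raise ValueError("empty column list")
    # A column is legacy when its first colon-separated field is "urn".
    legacy = [c.split(":")[0] == "urn" for c in columns]
    if all(legacy):
        return "legacy"
    if not any(legacy):
        return "hierarchical"
    raise ValueError("column list mixes urn-based and hierarchical columns")


# Example column names, for illustration only.
old_style = column_list_style("urn:ohmage:user:id,urn:ohmage:prompt:response")
new_style = column_list_style("T1:S2,T2")
```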

jshslsky commented 11 years ago

@jeroenooms @hongsudt is the idea to have a generic dashboard that will be based on the current prototype work for the snack dashboard?

In general, I am all in favor of removing redundancy.

@jojenki do you mean that if we used omh/read for survey responses instead of survey_response/read that we'd get some of the functionality that Jeroen is asking for?

jojenki commented 11 years ago

@joshuaselsky That depends on which parameters he is using on survey_response/read, as they may not be available through omh/read, but that could definitely be an interim solution to try it out. I am not sure if we should get rid of survey_response/read, though.

jshslsky commented 11 years ago

I think we should keep survey_response/read -- it's in line with saying all omh DSUs can have their own parallel APIs.

jojenki commented 10 years ago

I believe this is simply dated and all of the issues brought up in here will be addressed in 3.0. @joshuaselsky, please reopen if I am incorrect.

jeroen commented 10 years ago

It is, although I do have some opinions on this for when you start working on the 3.0 equivalent of survey_response/read.

jojenki commented 10 years ago

Ok. I already have an alpha/beta version internally. Maybe we should start a new thread like, "Considerations when reading Survey Responses"?