skiadas / PanthR

Statistics front-end and webserver with R connection

API Review Needed (Data Objects section) #20

Open skiadas opened 11 years ago

skiadas commented 11 years ago

I made another pass at some parts of the API, looking for feedback. Specifically the general page at:

The pages to look at are in the Wiki, here and here

The data objects info there should cover all the types of data we'll want to deal with. Things like tests and other kinds of reports would be a different section. In particular, I'm looking for things I might have missed or overlooked. Leave comments here, check your name off the list and assign it to someone else.

We should start using our GitHub names

altermattw commented 11 years ago

Haris, what would you think about using the wiki to organize info like what you've got in the API webpage? I thought it might be easier to make comments or annotations than referring to an external site.

altermattw commented 11 years ago

I think it might be useful to define the objects less in terms of their origin (entered data vs. results) and more in terms of their structure. Many statistical tests will produce output that can be stored as a data frame, as you note: rows refer to variables (or terms in a linear equation) and columns refer to some operation on the variables.

Even more flexible is a list, with some elements being just a key and a value (e.g., dv = income) and other elements being data frames (predicted values, residuals). Lists seem to be the norm with R output, and much of my work with the RichOutput project involved finding the default print method for those lists and then modifying them into HTML. The nice thing about trying to use dataframes for both original data and results is that it would enable subsequent analysis of those results. The output could become input. For example, the user could open a dataframe of regression output and sort by the size of the coefficients to see which are the strongest, or filter out non-significant results using a filter.
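To make the "output could become input" idea concrete, here is a minimal TypeScript sketch. The `RegressionOutput` and `CoefficientRow` shapes, their field names, and the numbers are all made up for illustration; none of this is part of an existing PanthR API.

```typescript
// Hypothetical shapes, for illustration only (not part of any existing API).
interface CoefficientRow {
  term: string;      // variable or term in the linear equation
  estimate: number;  // fitted coefficient
  pValue: number;    // significance of the term
}

interface RegressionOutput {
  dv: string;                      // simple key/value element, e.g. dv = income
  coefficients: CoefficientRow[];  // data-frame-like element: one row per term
}

// Made-up numbers, not real results.
const regressionOutput: RegressionOutput = {
  dv: "income",
  coefficients: [
    { term: "age", estimate: 0.8, pValue: 0.03 },
    { term: "education", estimate: 2.5, pValue: 0.001 },
    { term: "height", estimate: -0.1, pValue: 0.6 },
  ],
};

// Sort a copy of the rows by absolute coefficient size, strongest first.
function sortByStrength(rows: CoefficientRow[]): CoefficientRow[] {
  return rows.slice().sort((a, b) => Math.abs(b.estimate) - Math.abs(a.estimate));
}

// Keep only the rows whose p-value falls below the chosen threshold.
function significantOnly(rows: CoefficientRow[], alpha: number): CoefficientRow[] {
  return rows.filter((row) => row.pValue < alpha);
}

const strongest = sortByStrength(regressionOutput.coefficients);
const significant = significantOnly(regressionOutput.coefficients, 0.05);
```

Because the coefficient table is just an array of uniform rows, the same sort and filter operations a user applies to entered data would apply to it unchanged.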

skiadas commented 11 years ago

On Feb 23, 2013, at 2:05 PM, Bill Altermatt wrote:

> Haris, what would you think about using the wiki to organize info like what you've got in the API webpage? I thought it might be easier to make comments or annotations than referring to an external site.

Hm, that's a thought. It would definitely make some things easier. It will take me some time to find the best way to set it up as a repository, so that I can keep the files where they are now, with the advantages that has, and also have them on the wiki, with the two sides synchronized. But it would make editing easier.

There's a lot of file interlinking right now, I'll have to see how that plays out on the wiki.

You can also view and edit the .rst files directly. They are under /docs/source and are fairly readable in a plain text editor. You can then easily add a comment there by, for instance, starting a new line and typing something like:

@altermattw: Maybe this should be so and so and ….

Then just do a commit with those file changes.

Won't look too pretty, but it works, and we can easily clean those comments up later.

skiadas commented 11 years ago

While lists can be used in such a flexible way, data frames are much more rigid in their structure. They are also tied to the specific way we expect a data frame to appear on the screen. Anything marked as a dataset would show up in the system as part of the list of all datasets, even though it's actually more of an output. If we make lists more prevalent, so that users can create and manipulate them, then maybe we can expose most structures as lists, which is in large part what R does. But I am not certain that is intuitive, especially to entry-level users. I would expect such users to be confused even by the presence of a single variable outside of a dataset, much less by more complex things.

But one important point here is that the API is not so much about the internal structure we use to implement these things, but more about what language should be used to talk about them. For instance should the resulting object from a test be identified as such, to distinguish it from a plain list? Wouldn't we expect the UI to show a test result object in some different way than a plain list object? The components of a test result object each would have a certain meaning to them, which the UI should be able to respect.

Even in R this happens. For instance, the result of a t-test is an object of class "htest", which at its heart is just a list, but R has methods for what to do with it: how to print it, how to make a summary of it, etc.

At their heart, all of those objects are lists; that is, after all, what every JavaScript object is.
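One way to picture "identifying" a test result while keeping it a plain object underneath is a tag field that the UI can dispatch on. The sketch below is only an assumption about how that might look: the `kind` tag, the `TestResult` and `PlainList` shapes, and `renderObject` are hypothetical names, not anything the API defines.

```typescript
// Hypothetical object shapes; the "kind" tag and field names are assumptions.
interface PlainList {
  kind: "list";
  items: Record<string, unknown>;
}

interface TestResult {
  kind: "testResult";
  method: string;     // e.g. "One Sample t-test"
  statistic: number;
  pValue: number;
}

type ApiObject = PlainList | TestResult;

// Pick a presentation based on the tag, loosely like R choosing a print
// method from an object's class.
function renderObject(obj: ApiObject): string {
  switch (obj.kind) {
    case "testResult":
      return obj.method + ": statistic = " + obj.statistic + ", p = " + obj.pValue;
    case "list":
      return "list with keys: " + Object.keys(obj.items).join(", ");
  }
}

// Made-up values for illustration.
const tTest: ApiObject = {
  kind: "testResult",
  method: "One Sample t-test",
  statistic: 2.1,
  pValue: 0.04,
};
const plainList: ApiObject = { kind: "list", items: { dv: "income", n: 40 } };

const tTestText = renderObject(tTest);
const plainListText = renderObject(plainList);
```

Both values are ordinary objects, but the tag lets the UI treat the components of a test result as meaningful rather than as arbitrary list entries.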

But you do bring up an important point: we should offer ways to turn the result objects into new data objects, if the user wants to, for further processing.

Here's a simple example of a crosstabs table that might be an output:

+----------+-----+------+
| Table    |  C  |   D  |
+==+=+=====+=====+======+
|  | |Count|  53 |  32  |
|  |M+-----+-----+------+
|  | |Perc |73.6%|48.5% |
|A +-+-----+-----+------+
|  | |Count|  19 |  34  |
|  |F+-----+-----+------+
|  | |Perc |26.4%|51.5% |
|  +-+-----+-----+------+
|  |Tot    |  72 |  66  |
+--+-------+-----+------+

I am not certain how to represent that as a data frame, and I'm not sure a list would do it justice. The mix of value types in the cells (counts and percentages) makes it not very appealing to treat as a dataset. But it could, for instance, be "exported" into a dataset, with something like:

A      C      D
----   ----   ----
M      53     32
F      19     34
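
A minimal TypeScript sketch of such an export, keeping the counts and dropping the percentages. The `Crosstab` shape and the `exportCounts` helper are hypothetical, introduced only to illustrate the idea; the numbers are the ones from the crosstabs table above.

```typescript
// Hypothetical crosstab shape: each cell carries a count and a column percentage.
interface CrosstabCell {
  count: number;
  percent: number;
}

interface Crosstab {
  rowVariable: string;      // "A" in the crosstabs table
  rowLevels: string[];      // ["M", "F"]
  columnLevels: string[];   // ["C", "D"]
  cells: CrosstabCell[][];  // cells[i][j]: row level i, column level j
}

// One exported row: the row level plus one count per column level.
type DatasetRow = Record<string, string | number>;

// "Export" the crosstab as a flat dataset, keeping counts only.
function exportCounts(table: Crosstab): DatasetRow[] {
  return table.rowLevels.map((level, i) => {
    const row: DatasetRow = { [table.rowVariable]: level };
    table.columnLevels.forEach((column, j) => {
      row[column] = table.cells[i][j].count;
    });
    return row;
  });
}

const crosstab: Crosstab = {
  rowVariable: "A",
  rowLevels: ["M", "F"],
  columnLevels: ["C", "D"],
  cells: [
    [{ count: 53, percent: 73.6 }, { count: 32, percent: 48.5 }],
    [{ count: 19, percent: 26.4 }, { count: 34, percent: 51.5 }],
  ],
};

const exported = exportCounts(crosstab);
// exported is [{ A: "M", C: 53, D: 32 }, { A: "F", C: 19, D: 34 }]
```

The totals row is left out here because it can be recomputed from the counts; whether an export should include it is an open design question.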

skiadas commented 11 years ago

Okay, I was able to sort of move everything to the Wiki. I'm not very happy with its capabilities, and when we're closer to actually building the documentation I'll move it back, but it will serve us well for now, I think. Feel free to edit the pages. They are still in reStructuredText format, the main points of which you can find here:

http://docutils.sourceforge.net/docs/user/rst/quickref.html