mani2012 / PathoStat

The purpose of this package is to perform statistical analysis on the report files generated by PathoScope.

Distinguish between discrete and continuous variables #7

Open tfaits opened 7 years ago

tfaits commented 7 years ago

In its current form, PathoStat accepts "batch" and "condition" as possible discrete variables, and gives the user the option to color/group data (in various plots) by either of those. However, we're adding functionality: PathoStat will accept any number of covariates, such as patient age, weight, race, or disease status. We still want to let users color/group data based on these covariates, but that doesn't make much sense for continuous variables. Without binning, how do you group people by weight? You can, however, order data by continuous variables. We want to at least distinguish between the two types, and we may want to add functionality specific to continuous variables.
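To make the distinction concrete, here is a minimal sketch in base R. It assumes the sample metadata is available as a plain data frame. The `meta` table and its column names are made up for illustration and are not part of PathoStat. Factors, characters, and logicals are treated as discrete, and remaining numeric columns as continuous.

```r
# Hypothetical sample metadata; the columns are illustrative only.
meta <- data.frame(
  batch     = factor(c("A", "A", "B", "B")),
  condition = factor(c("case", "control", "case", "control")),
  weight    = c(61.2, 75.0, 68.4, 82.9)
)

# Assumption: factors, characters and logicals count as discrete.
is_discrete <- vapply(
  meta,
  function(x) is.factor(x) || is.character(x) || is.logical(x),
  logical(1)
)
is_numeric <- vapply(meta, is.numeric, logical(1))

discrete_vars   <- names(meta)[is_discrete]
continuous_vars <- names(meta)[!is_discrete & is_numeric]

# Discrete variables can group samples directly ...
groups <- split(rownames(meta), meta$batch)

# ... while continuous variables can only order them (unless binned).
ordered_samples <- rownames(meta)[order(meta$weight)]
```

Binning with `cut()` would be one way to turn a continuous covariate into groups, at the cost of choosing breakpoints.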

mlbendall commented 7 years ago

I agree with this; I am running up against the same issue now. If you are just looking for the types as currently assigned, you can do this:

sapply(sample_variables(pstat), function(v) { class(sample_data(pstat)[[v]]) })

However, I think we need to be explicit in assigning types to sample variables. A function should be implemented that accepts user input to assign types, or attempts to infer from the data. Inferring may not be 100% accurate. For example, R (read.table or similar) interprets "Subject ID" as an integer, but it should be a factor, since there is no meaningful ordering to the subjects. Still, inferring from the data would be a good first step.
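A rough sketch of what such a function could look like, again on a plain data frame. The name `assign_variable_types` and the `overrides` argument are hypothetical and do not exist in PathoStat. The function first infers types (characters and logicals become factors), then lets the user override the inferred type of any variable.

```r
# Hypothetical helper: infer a type per column, then apply user overrides.
# `overrides` is a named list mapping variable name -> desired type.
assign_variable_types <- function(meta, overrides = list()) {
  # Inference step: characters and logicals become factors,
  # numeric columns are left as they are.
  for (v in names(meta)) {
    x <- meta[[v]]
    if (is.character(x) || is.logical(x)) {
      meta[[v]] <- factor(x)
    }
  }
  # Override step: explicit user choices win over inference.
  for (v in names(overrides)) {
    if (!v %in% names(meta)) {
      stop("unknown variable: ", v)
    }
    meta[[v]] <- switch(
      overrides[[v]],
      factor  = factor(meta[[v]]),
      ordered = factor(meta[[v]], ordered = TRUE),
      numeric = as.numeric(as.character(meta[[v]])),
      integer = as.integer(as.character(meta[[v]])),
      stop("unsupported type: ", overrides[[v]])
    )
  }
  meta
}

# Example: subject_id is read as an integer but is really a label.
meta <- data.frame(
  subject_id = c(101L, 102L, 103L),
  age        = c(34, 51, 47),
  status     = c("healthy", "sick", "sick"),
  stringsAsFactors = FALSE
)
typed <- assign_variable_types(meta, overrides = list(subject_id = "factor"))
sapply(typed, function(x) class(x)[1])
```

The override step is what handles cases like "Subject ID", where inference alone would get the type wrong.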

I propose we have more than two types. I think our types should follow the standard R data types:

These types will naturally suggest how to display them. For example, factors can be displayed using "select" inputs and qualitative color palettes, while ordered factors may also use "select" inputs but be displayed with sequential color palettes.
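As an illustration of how the type could drive the display, the sketch below maps a variable to an input widget and a palette family. The function name `display_hint` is hypothetical. The factor and ordered-factor cases follow the suggestion above, while the slider for numeric variables and the fallback branch are assumptions of this sketch.

```r
# Hypothetical helper: suggest an input widget and palette family
# based on the R type of a sample variable.
display_hint <- function(x) {
  # is.ordered() must be checked first: ordered factors are also factors.
  if (is.ordered(x)) {
    list(input = "select", palette = "sequential")
  } else if (is.factor(x)) {
    list(input = "select", palette = "qualitative")
  } else if (is.numeric(x)) {
    # Assumption: a range slider is a reasonable input for numeric values.
    list(input = "slider", palette = "sequential")
  } else {
    # Assumption: treat anything else as an unordered category.
    list(input = "select", palette = "qualitative")
  }
}

hint_factor  <- display_hint(factor(c("case", "control")))
hint_ordered <- display_hint(
  factor(c("low", "high"), levels = c("low", "high"), ordered = TRUE)
)
hint_numeric <- display_hint(c(61.2, 75.0))
```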

In addition, users should be able to indicate which covariates are "of interest". Perhaps there should be several categories, such as secondary/confounders, batch covariates, and random effects.