stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License
12 stars 12 forks source link

Implement user-friendly function in dscrutils to load a DSC result #182

Closed pcarbo closed 5 years ago

pcarbo commented 5 years ago

Here's an example usage of the function which I am proposing to call "dscload":

dsc.dir <- system.file("datafiles","one_sample_location","dsc_result",
                       package = "dscrutils")
dat <- dscquery(dsc.dir,targets = c("analyze","score"))
head(dat)
#   DSC analyze      analyze.output.file   score                   score.output.file
# 1   1    mean     mean/normal_1_mean_1 abs_err   abs_err/normal_1_mean_1_abs_err_1
# 2   1  median median/normal_1_median_1 abs_err abs_err/normal_1_median_1_abs_err_1
# 3   1    mean          mean/t_1_mean_1 abs_err        abs_err/t_1_mean_1_abs_err_1
# 4   1  median      median/t_1_median_1 abs_err      abs_err/t_1_median_1_abs_err_1
# 5   1    mean     mean/normal_1_mean_1  sq_err     sq_err/normal_1_mean_1_sq_err_1
# 6   1  median median/normal_1_median_1  sq_err   sq_err/normal_1_median_1_sq_err_1
outputfile <- dat$score.output.file[1]
print(outputfile)
# "abs_err/normal_1_mean_1_abs_err_1"
out <- dscload(dsc.dir,outfile)
names(out)
# "error" "DSC_DEBUG"

This function shouldn't be hard to implement,

pcarbo commented 5 years ago

This could potentially replace the (undocumented) function read_dsc.

gaow commented 5 years ago

So you are proposing a two step procedure where we first figure out the filenames then load it this way. I was thinking of something along the lines of loading all outputs at the dscquery interface. So if a user is very specific about the condition s/he is interested in, then it will end up a small (a few rows, or one row) tibble object where all outputs can be easily examined. Users do not have to deal with filenames in this case.

pcarbo commented 5 years ago

@gaow To me this seems like two different use cases, so having two different functions for this seems more appropriate. I agree that your approach is a bit more elegant, and avoids showing the file names, which is also nice. On the other hand, if you need to debug, sometimes it is more desirable to see directly what is going on under the hood.

My proposal also has the advantage that it would not involve redesigning dscquery, which we have already done once, and I am not excited about doing it again.

gaow commented 5 years ago

My proposal also has the advantage that it would not involve redesigning dscquery

I think we can get away with hardly any changes to dscquery function. But just extend it along the lines of your proposal #180 output.files, we have an additional parameter load.output to specify the modules to load. The inside dscquery instead of setting item = filename we set item = read_dsc(filename) -- change to the code will be minimal.

pcarbo commented 5 years ago

@gaow The problem, as I see it, is that we won't know in advance what are the targets. So dscquery will have to learn on the fly what the targets are, and then update the list / data frame accordingly. This is not how things are currently done, where the targets are known in advance. Further, since the outputs might be different from each file, we would have to have additional steps to fill in missing results with NAs. So yes I believe it would require a substantial redesign.

gaow commented 5 years ago

@pcarbo my point is we do not need to learn anything. Just dump all the output to one imaginary tibble cell. Specifically, what's the issue of just replacing item = filename we set item = read_dsc(filename) , isnt that doable?

pcarbo commented 5 years ago

Just dump all the output to one imaginary tibble cell.

But then it will not be in the right format for additional condition-based filtering (among other issues).

And we aren't currently using tibbles. Do you mean a list element?

gaow commented 5 years ago

But then it will not be in the right format for additional condition-based filtering (among other issues).

condition-based filtering for module parameters happens before this step I believe. Might be an issue for condition-based filtering module output but it is not different from currently asking for a list module output (non-atomic); that is, we are not creating a new problem, but rather fall into the same pattern of an existing limitation. This also can easily be done after uses convert list to tibble if our decision of not to use tibble by default remains.

And we aren't currently using tibbles. Do you mean a list element?

yes that is what i meant by "one imaginary tibble cell".

pcarbo commented 5 years ago

@gaow I guess this can be done. I think it does make the dscquery interface yet more complicated to explain, and will require some additional special cases to handle (e.g., conditions as I mentioned).

But how would we handle this use case (which I think would be quite common)?

I run dscquery, and I decide I would like to inspect the full set of results from row 17. How do I do this? If I know the file name, it is easy. How would I accomplish this in dscquery as you have proposed it?

gaow commented 5 years ago

I run dscquery, and I decide I would like to inspect the full set of results from row 17. How do I do this?

with your proposal, it'd be like:

dat = dscquery(dsc.dir, target = "...", output.file = "some_module")
res = read_dsc(dat[17,"some_module"])

with my proposal,

dat = as_tibble(dscquery(dsc.dir, target = "...", load.output = "some_module"))
res = dat[17, "some_module"]

and will require some additional special cases to handle (e.g., conditions as I mentioned).

We are already handling the case when condition is a module output. Indeed it is a special case, but an existing, rather than an additional special case if we decide to load all output.

pcarbo commented 5 years ago

@gaow But here you are loading all outputs. I ONLY what to load the output from row 17. (Suppose the outputs are very large, and wouldn't all fit into memory at once, which is a very realistic scenario.)

gaow commented 5 years ago

Well the reason you come up with a specific row number is curious and possibly unrealistic. I assume first round users will run it anyways without output.file. Then they decide to look at row 17 which corresponds to some condition on parameters they are interested in; and they can apply that condition in dscquery to avoid loading it all. Of course the constraint would be if this condition is based on module output then in order to be able to query on it, we will have to load everything first. But again this problem still persists for loading output.file, only that loading output.file (filenames) can be a lot lighter operation.

gaow commented 5 years ago

Just to clarify, am not against your proposal at all. Because all we need to do is to improve and document read_dsc. I was just saying adding my proposal is one line of change on the interface and two lines in the code -- although of course small changes to code does not always mean we should add the feature. I'm also cool if you table this and focus on documenting read_dsc

pcarbo commented 5 years ago

@gaow Here's my proposal:

  1. First we update dsc-query to address Issue #180 ("Output only requested targets").

  2. Remove the omit.filenames argument.

  3. We provide an argument for requesting file names (e.g., output.file) as an option for intermediate/advanced users who would like to access individual output files, and provide a more user-friendly version of read_dsc.

  4. We provide an argument for requesting file output (e.g., load.output).

I think this would cover all the use cases, and make the interface more straightforward than it was before.

If you are in agreement with this proposal, I can handle items 2–4 if you can take care of item 1.

gaow commented 5 years ago

Thanks @pcarbo

First we update dsc-query to address Issue #180 ("Output only requested targets").

I think it is best achieved in dscquery R code. dsc-query does not differentiate between required, not required or file targets. They are conceptually all the same. It fits in the scope of dscquery, utility to interpret the output of dsc-query (the single purpose principle in software programming). Unless i'm missing something, eg, current dsc-query to dscquery() communication misses some of the information?

Remove the omit.filenames argument. We provide an argument for requesting file names (e.g., output.file)

yes I agree.

We provide an argument for requesting file output (e.g., load.output).

So this is a boolean variable that only works with output.file?

pcarbo commented 5 years ago

I think it is best achieved in dscquery R code.

That's fine, we can do that.

We provide an argument for requesting file output (e.g., load.output).

Here's an example call:

out <- dscquery("linreg",
                targets        = "mse.err",
                targets.notreq = "fit.betaest",
                output.file    = "fit"
                load.output    = "predict")
gaow commented 5 years ago

Thanks @pcarbo the interface looks good. We might need to revisit targets in light of #184 -- I'm asking for more examples to back it up. We can discuss in person this afternoon.

pcarbo commented 5 years ago

One possibility would to remove "targets" entirely (and rename "targets.notreq" as "targets"), so that NAs are allowed all the time, then use conditions to filter out NAs as desired. This would streamline the interface without sacrificing flexibility.

stephens999 commented 5 years ago

I like this idea. I think it is clearer to filter out pipeline explicitly using conditions than implicitly via other arguments.

I don't understand what load.output would do though...

On Thu, Apr 25, 2019, 08:35 Peter Carbonetto notifications@github.com wrote:

One possibility would to remove "targets" entirely (and rename "targets.notreq" as "targets"), so that NAs are allowed all the time, then use conditions to filter out NAs as desired. This would streamline the interface without sacrificing flexibility.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stephenslab/dsc/issues/182#issuecomment-486724729, or mute the thread https://github.com/notifications/unsubscribe-auth/AANXRRONPOQ7YP57ONKOJKLPSHFUNANCNFSM4HGWLKCA .

gaow commented 5 years ago

I don't understand what load.output would do though...

This is for people to load the entire output object if they want to access something more complicated in the result of the module but not module output. This is similar to what you had in mind for load_example but it will in fact load many examples, and users can then convert the output to tibble table and query the specific examples to look into.

stephens999 commented 5 years ago

but what would out be in this example? would be a dataframe or something else?

It seems loading the outputs is a different task than returning . a simple dataframe/list... so I'm not sure why load.output is a parameter of dscquery

On Mon, Apr 29, 2019 at 1:22 PM gaow notifications@github.com wrote:

I don't understand what load.output would do though...

This is for people to load the entire output object if they want to access something more complicated in the result of the module but not module output. This is similar to what you had in mind for load_example but it will in fact load many examples, and users can then convert the output to tibble table and query the specific examples to look into.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

pcarbo commented 5 years ago

It seems loading the outputs is a different task than returning a simple dataframe/list...

That's what I thought, too, initially. (See the discussion above---I said nearly the same thing.)

But Gao's idea is that this would be equivalent to requesting all the outputs (and parameters?) for a given module or module group.

So the return value would always be a list (I believe), and the list would contain list elements that are also themselves lists (containing the full output from a single DSC pipeline).

pcarbo commented 5 years ago

Note that the alternative would be to provide this in a separate function, which is perhaps more intuitive, but has the disadvantage that it would be a two-step process: (1) get the output file name, using dscquery (2) load the results from the file using dscread.

@stephens999 I want to make sure you are okay with the interface change before I implement it since it is a fairly big change.

stephens999 commented 5 years ago

Let's talk in person

On Mon, Apr 29, 2019, 16:28 Peter Carbonetto notifications@github.com wrote:

Note that the alternative would be to provide this in a separate function, which is perhaps more intuitive, but has the disadvantage that it would be a two-step process: (1) get the output file name, using dscquery (2) load the results from the file using dscread.

@stephens999 https://github.com/stephens999 I want to make sure you are okay with the interface change before I implement it since it is a fairly big change.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stephenslab/dsc/issues/182#issuecomment-487750754, or mute the thread https://github.com/notifications/unsubscribe-auth/AANXRRONUGDK7DPALKINRWDPS5R7HANCNFSM4HGWLKCA .