Closed pcarbo closed 5 years ago
This could potentially replace the (undocumented) function read_dsc
.
So you are proposing a two step procedure where we first figure out the filenames then load it this way. I was thinking of something along the lines of loading all outputs at the dscquery
interface. So if a user is very specific about the condition
s/he is interested in, then it will end up a small (a few rows, or one row) tibble
object where all outputs can be easily examined. Users do not have to deal with filenames in this case.
@gaow To me this seems like two different use cases, so having two different functions for this seems more appropriate. I agree that your approach is a bit more elegant, and avoids showing the file names, which is also nice. On the other hand, if you need to debug, sometimes it is more desirable to see directly what is going on under the hood.
My proposal also has the advantage that it would not involve redesigning dscquery
, which we have already done once, and I am not excited about doing it again.
My proposal also has the advantage that it would not involve redesigning dscquery
I think we can get away with hardly any changes to dscquery
function. But just extend it along the lines of your proposal #180 output.files
, we have an additional parameter load.output
to specify the modules to load. The inside dscquery
instead of setting item = filename
we set item = read_dsc(filename)
-- change to the code will be minimal.
@gaow The problem, as I see it, is that we won't know in advance what are the targets. So dscquery will have to learn on the fly what the targets are, and then update the list / data frame accordingly. This is not how things are currently done, where the targets are known in advance. Further, since the outputs might be different from each file, we would have to have additional steps to fill in missing results with NAs. So yes I believe it would require a substantial redesign.
@pcarbo my point is we do not need to learn anything. Just dump all the output to one imaginary tibble
cell. Specifically, what's the issue of just replacing item = filename
we set item = read_dsc(filename)
, isnt that doable?
Just dump all the output to one imaginary tibble cell.
But then it will not be in the right format for additional condition-based filtering (among other issues).
And we aren't currently using tibbles. Do you mean a list element?
But then it will not be in the right format for additional condition-based filtering (among other issues).
condition-based filtering for module parameters happens before this step I believe. Might be an issue for condition-based filtering module output but it is not different from currently asking for a list module output (non-atomic); that is, we are not creating a new problem, but rather fall into the same pattern of an existing limitation. This also can easily be done after uses convert list to tibble
if our decision of not to use tibble
by default remains.
And we aren't currently using tibbles. Do you mean a list element?
yes that is what i meant by "one imaginary tibble cell".
@gaow I guess this can be done. I think it does make the dscquery
interface yet more complicated to explain, and will require some additional special cases to handle (e.g., conditions as I mentioned).
But how would we handle this use case (which I think would be quite common)?
I run dscquery
, and I decide I would like to inspect the full set of results from row 17. How do I do this? If I know the file name, it is easy. How would I accomplish this in dscquery as you have proposed it?
I run dscquery, and I decide I would like to inspect the full set of results from row 17. How do I do this?
with your proposal, it'd be like:
dat = dscquery(dsc.dir, target = "...", output.file = "some_module")
res = read_dsc(dat[17,"some_module"])
with my proposal,
dat = as_tibble(dscquery(dsc.dir, target = "...", load.output = "some_module"))
res = dat[17, "some_module"]
and will require some additional special cases to handle (e.g., conditions as I mentioned).
We are already handling the case when condition is a module output. Indeed it is a special case, but an existing, rather than an additional special case if we decide to load all output.
@gaow But here you are loading all outputs. I ONLY what to load the output from row 17. (Suppose the outputs are very large, and wouldn't all fit into memory at once, which is a very realistic scenario.)
Well the reason you come up with a specific row number is curious and possibly unrealistic. I assume first round users will run it anyways without output.file
. Then they decide to look at row 17 which corresponds to some condition
on parameters they are interested in; and they can apply that condition in dscquery
to avoid loading it all. Of course the constraint would be if this condition is based on module output then in order to be able to query on it, we will have to load everything first. But again this problem still persists for loading output.file
, only that loading output.file
(filenames) can be a lot lighter operation.
Just to clarify, am not against your proposal at all. Because all we need to do is to improve and document read_dsc
. I was just saying adding my proposal is one line of change on the interface and two lines in the code -- although of course small changes to code does not always mean we should add the feature. I'm also cool if you table this and focus on documenting read_dsc
@gaow Here's my proposal:
First we update dsc-query
to address Issue #180 ("Output only requested targets").
Remove the omit.filenames
argument.
We provide an argument for requesting file names (e.g., output.file
) as an option for intermediate/advanced users who would like to access individual output files, and provide a more user-friendly version of read_dsc
.
We provide an argument for requesting file output (e.g., load.output
).
I think this would cover all the use cases, and make the interface more straightforward than it was before.
If you are in agreement with this proposal, I can handle items 2–4 if you can take care of item 1.
Thanks @pcarbo
First we update dsc-query to address Issue #180 ("Output only requested targets").
I think it is best achieved in dscquery
R code. dsc-query
does not differentiate between required, not required or file targets. They are conceptually all the same. It fits in the scope of dscquery
, utility to interpret the output of dsc-query
(the single purpose principle in software programming). Unless i'm missing something, eg, current dsc-query
to dscquery()
communication misses some of the information?
Remove the omit.filenames argument. We provide an argument for requesting file names (e.g., output.file)
yes I agree.
We provide an argument for requesting file output (e.g., load.output).
So this is a boolean variable that only works with output.file
?
I think it is best achieved in dscquery R code.
That's fine, we can do that.
We provide an argument for requesting file output (e.g., load.output).
Here's an example call:
out <- dscquery("linreg",
targets = "mse.err",
targets.notreq = "fit.betaest",
output.file = "fit"
load.output = "predict")
Thanks @pcarbo the interface looks good. We might need to revisit targets
in light of #184 -- I'm asking for more examples to back it up. We can discuss in person this afternoon.
One possibility would to remove "targets" entirely (and rename "targets.notreq" as "targets"), so that NAs are allowed all the time, then use conditions
to filter out NAs as desired. This would streamline the interface without sacrificing flexibility.
I like this idea. I think it is clearer to filter out pipeline explicitly using conditions than implicitly via other arguments.
I don't understand what load.output would do though...
On Thu, Apr 25, 2019, 08:35 Peter Carbonetto notifications@github.com wrote:
One possibility would to remove "targets" entirely (and rename "targets.notreq" as "targets"), so that NAs are allowed all the time, then use conditions to filter out NAs as desired. This would streamline the interface without sacrificing flexibility.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stephenslab/dsc/issues/182#issuecomment-486724729, or mute the thread https://github.com/notifications/unsubscribe-auth/AANXRRONPOQ7YP57ONKOJKLPSHFUNANCNFSM4HGWLKCA .
I don't understand what load.output would do though...
This is for people to load the entire output object if they want to access something more complicated in the result of the module but not module output. This is similar to what you had in mind for load_example
but it will in fact load many examples, and users can then convert the output to tibble
table and query the specific examples to look into.
but what would out
be in this example?
would be a dataframe or something else?
It seems loading the outputs is a different task than returning . a
simple dataframe/list... so I'm not
sure why load.output
is a parameter of dscquery
On Mon, Apr 29, 2019 at 1:22 PM gaow notifications@github.com wrote:
I don't understand what load.output would do though...
This is for people to load the entire output object if they want to access something more complicated in the result of the module but not module output. This is similar to what you had in mind for load_example but it will in fact load many examples, and users can then convert the output to tibble table and query the specific examples to look into.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
It seems loading the outputs is a different task than returning a simple dataframe/list...
That's what I thought, too, initially. (See the discussion above---I said nearly the same thing.)
But Gao's idea is that this would be equivalent to requesting all the outputs (and parameters?) for a given module or module group.
So the return value would always be a list (I believe), and the list would contain list elements that are also themselves lists (containing the full output from a single DSC pipeline).
Note that the alternative would be to provide this in a separate function, which is perhaps more intuitive, but has the disadvantage that it would be a two-step process: (1) get the output file name, using dscquery
(2) load the results from the file using dscread
.
@stephens999 I want to make sure you are okay with the interface change before I implement it since it is a fairly big change.
Let's talk in person
On Mon, Apr 29, 2019, 16:28 Peter Carbonetto notifications@github.com wrote:
Note that the alternative would be to provide this in a separate function, which is perhaps more intuitive, but has the disadvantage that it would be a two-step process: (1) get the output file name, using dscquery (2) load the results from the file using dscread.
@stephens999 https://github.com/stephens999 I want to make sure you are okay with the interface change before I implement it since it is a fairly big change.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stephenslab/dsc/issues/182#issuecomment-487750754, or mute the thread https://github.com/notifications/unsubscribe-auth/AANXRRONUGDK7DPALKINRWDPS5R7HANCNFSM4HGWLKCA .
Here's an example usage of the function which I am proposing to call "dscload":
This function shouldn't be hard to implement,