gaow closed this issue 7 years ago
1a. First, I don't think it should be used (or at least not encouraged) in defining the output of modules. So not:
Omega = .logic(Omega if .P < .N else 0)
If the output of the module depends on the parameter settings then that should be encoded
in the executable of the module, not in the dsc module definition, which should really
serve as just a "wrapper" that says which variables go where.
In this case I think the pipeline variable $meta.n_glasso should be entirely removed - it does not make sense for that to be part of the simulation scenario, whether or not we allow logic operators! The modules glasso1, glasso2, glasso3 will have to make their own decision on what K to use, and that should be encoded in the executables, not in the module definition.
On a related note, it is cheating to pass $meta$K into flash if that is the true number of factors. It may be acceptable to pass it into a "gold standard" method that is acknowledged as cheating, but done as a way of comparing against a method that has additional information. But that should be a separate module (flash_cheat, say?)
And it seems you don't need to pass P and N in $meta either - they are implicitly part of the test and training data, and not used later as far as I can see?
And the simulation executables themselves should be responsible for returning NA for Omega if it does not exist.
minor: diagnoal typo
more later...
minor: it would be clearer if $Omega output from methods were called $Omega_est
minor: instead of defining "tiger" as the template, clearer to define "method-template" as a template (with no exec) and then have tiger (and other methods) inherit from that template.
minor: in score, return should be output, and score: should be $score:
I'm not sure what you are trying to achieve with the other logic statements, which occur in the module definitions for the simulation scenarios and in the define statements.
It seems there are a few things: running only a subset of pipelines; giving distinct names (tags?) to different subsets of modules.
I think both could be better achieved in different ways. For now I'll just say: i) Neither of them seem to require logic statements in the actual module definitions, so I would get rid of those. ii) Naming distinct subsets in different ways should probably be taken care of downstream in the extraction? iii) Running subsets of pipelines is pretty much analogous to extracting subsets of pipelines, so we should use the same syntax for running subsets as for extracting subsets.
Also, do we really need to run subsets here? Can we just run everything?
@stephens999 Thank you for the feedback!
But that should be a separate module (flash_cheat say?)

And it seems you don't need to pass P and N in $meta either

minor: it would be clearer if $Omega output from methods were called $Omega_est

minor: in score, return should be output, and score: should be $score:

minor: diagnoal typo
Fixed.
We are "cheating" a lot here mostly via .logic
because @NKweiwang think this DSC is more for "exploration", that we don't hard-code any rules in modules; rather we use DSC as a convenient interface to pass in parameters (based on some conditions from meta) and study the outcome to understand things better, before finalizing the module codes. The up-side for this is that the module code themselves would be less subjective in that it does not dictate choices of parameters based on input. I think the debate here is more fundamental than DSC sytnax itself. I'm going to split this post to DSC related
and Omega
related issues. I'm going to post DSC related here, and Omega related in the Omega repo.
.logic in input. Do we want .logic in output?

instead of defining "tiger" as the template, clearer to define "method-template" as a template
I think it is a matter of style: in object-oriented programming some would prefer that the base class do nothing, some would prefer that the base class itself be usable. I'm cool either way (I also program either way). So are we suggesting that if a module does not have exec then it is a template module? I take it that a template module can also template from another template module?
I'd like to add that we do want to be able to derive from some non-template modules, see:
https://github.com/stephenslab/dsc2-omega/blob/master/dsc2/omega.dsc#L87
So I'm viewing the notion of a template module as a bit redundant.
I'm not sure what you are trying to achieve with the other logic statements, It seems there are a few things: running only a subset of pipelines; giving distinct names (tags?) to different subsets of modules
Exactly. Sometimes we only want pairs of parameters to appear. We may well write:
np: (20, 100), (100, 20)
But this will make downstream queries difficult. Even storing the info in R would be difficult: imagine an np column storing a pair of values in a data.frame. I can only think of storing it as a string.
ii) naming distinct subsets in different ways should probably be taken care of downstream in the extraction ?
Same argument: yes we can, but that may make extraction difficult. A big motivation for adding tags is to make downstream queries easier, and .logic is a way to add such tags.
i) Neither of them seem to require logic statements in the actual module definitions,
iii) running subsets of pipelines is pretty much analogous to extracting subsets of pipelines, so we should use the same syntax for running subsets as extracting subsets.
I'm not following these 2 comments ... would you give an example at your convenience? I've addressed other concerns in this post and in https://github.com/stephenslab/dsc2-omega/issues/1, which will involve @NKweiwang 's input.
i) Neither of them seem to require logic statements in the actual module definitions
What I am saying here is that the two "legitimate" (to my mind) uses of logic statements are to i) name subsets of modules; ii) select subsets of pipelines to run. Neither of these legitimate uses requires logic statements within the module definitions (eg in input: or output:). Thus my initial suggestion is for dsc not to support logic statements within the module definitions. It seems that such statements are effectively the start of a "mini-language" that allows the module definitions to actually control the behavior of the modules, rather than serving their intended purpose, which is to map variables between pipeline and module executable. I think this is feature creep we want to avoid.
In the future, if we were to want this kind of functionality (adding code to an existing executable), I would suggest incorporating "prologues" and "epilogues" to the modules that run R or python code either before or after the code in the executable. Indeed, you effectively already have that kind of functionality when you use R() in the output... I'm not quite sure how general your current implementation is in terms of what R code can be run here. But for now, in any case, I think we defer this issue and remove the logic statements from within the modules in this example.
iii) running subsets of pipelines is pretty much analogous to extracting subsets of pipelines, so we should use the same syntax for running subsets as extracting subsets.
What I am saying is that, as we have discussed, we need syntax for extracting results for subsets of pipelines, and that we should use the same syntax for specifying which pipelines to run.
As I understand it the syntax for extracting results for subsets of pipelines is not yet implemented but we have discussed SQL-like syntax where you would have things like WHERE (simulate.n < 100 & simulate.K>10) to extract subsets of pipeline results.
So we can have the same type of syntax in defining subsets of pipelines to run... However, I don't have the syntax sufficiently clear in my head to give an example.
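To make the intended behavior concrete, here is a rough sketch in plain Python (the record keys and the WHERE clause are hypothetical; nothing like this is implemented yet):

```python
# Hypothetical sketch: treat pipeline results as records, and apply a
# condition like "WHERE (simulate.n < 100 & simulate.K > 10)" as a filter.
results = [
    {"simulate.n": 50,  "simulate.K": 20, "mse": 0.9},
    {"simulate.n": 200, "simulate.K": 20, "mse": 0.7},
    {"simulate.n": 50,  "simulate.K": 5,  "mse": 0.8},
]

def where(records, pred):
    """Return the subset of records satisfying the predicate."""
    return [r for r in records if pred(r)]

subset = where(results, lambda r: r["simulate.n"] < 100 and r["simulate.K"] > 10)
print(len(subset))  # 1
```

The same filtering operation could equally select which pipelines to run, which is the point about sharing syntax between running and extracting.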
Thanks @stephens999, it is clear to me now. Combining your i) and iii), I see that we should not use syntax such as logic in module definitions, but we certainly can achieve the same when we execute a pipeline (eg in the dsc-omega example we restrict runs to only allow the n<p case, or we set K according to the value of p -- though whether or not this "cheating" should be done here is a discussion for another ticket: https://github.com/stephenslab/dsc2-omega/issues/1). The question is the syntax, and whether it should be done via the command interface, DSC::define/run, or both. I'll update the omega example as we discuss more on the other ticket.
it seems that this is part of a larger issue: within the DSC section, what are we defining?
For example, in the run: section you are currently defining things (it seems to me).
Do we really need both run and define?
Maybe the point is that some things you define are not runnable, whereas others are?
I suggest that in the dsc file you just want to define things, and say what to run at command time.
Like a makefile defines targets, but you say what to build with make all etc.
The difference here is that you want to define things that are not targets?
Do we really need both run and define ?
Yes. define groups modules into ensembles and allows a chance to use some logic. run would be a very clear line of sequence to execute. Additionally, users can overwrite run from the command line. This is good in research mode when we want to test things out, and also, as with the case of dsc-omega, when we write up the entire benchmark but wish to execute it bit by bit.
Maybe the point is that some things you define are not runnable, whereas others are?
Only things involving modules that do not take pipeline variable inputs are runnable.
in the dsc file you just want to define things, and say what to run at command time.
This will jeopardize reproducibility, because in that case one cannot tell by looking at the script what is runnable. Even a makefile has to define a make all to explicitly say what all means. With file targets this logic can be implicit. But unfortunately here we need to be explicit, thus the need for run.
Let's start with run: What is the purpose of the run section: to define pipelines that could be run, or to actually define things that are to be run?
Or maybe both: to define pipelines that could be run, with the default being to do all of them?
That is, if we have
run:
s1: sim1 method score
s2: sim2 method score
then at run time, we have something like

dsc       # runs s1 and s2
dsc s1    # runs s1? (probably there is a flag I don't know?)
If this is the case I prefer to think of this as defining pipelines (groups/ensembles of pipelines? dscs?), not running them... that is, I think target: or pipeline: or dsc: might be better keywords.
I understand you also want to be able to run a subset of pipelines. It would seem to be desirable to be able to specify that at runtime...
eg:
dsc s1 WHERE sim1.n==100
?
This gets at a different issue though - it relates to the fact that we want a general syntax for specifying subsets of pipelines, and use the same syntax for running subsets and extracting results from subsets.
Matthew
that is I think target: or pipeline: or dsc: might be better keywords.
I confirm everything you said prior to this sentence is true.
that is I think target: or pipeline: or dsc: might be better keywords.
Certainly there will be more work to use a unified command interface for execution and extraction -- we decided to utilize an SQL interface, as well as some of its power, to make data extractions. But execution does not involve SQL.
Let's leave aside engineering issues for a moment and focus on the concepts. Also let's stick to the terms define and run for the sake of discussion. With logic I have a chance to define a more refined module, eg something like:
define:
big_data: data.logic(N < P)
This is not possible with the simple module specification, because we run all combinations of N and P by default. So the keyword define has two usages: grouping modules into ensembles, and refining a module with logic. This is why I see logic and define as useful. They make what we want to compare much clearer.
Now about run: yes, essentially it is just make all. And it can be left blank if you want: the official interface for run is:

dsc settings.dsc ... -s "sim * voom"

where you see I intentionally only run an incomplete pipeline. So if you prefer, you can leave the entire run section out. This is totally fine. I insisted on having run in dsc for the same reason I think it makes sense to define make all. But it is perhaps a different work style.
Whether or not to have the -s switch support WHERE statements is a different topic of discussion, and is an enhancement we can always make after we have implemented what we have discussed.
regarding:

With logic I can have a chance to define a more refined module. eg: something like:

define:
  big_data: data.logic(N < P)
Here I understand data is an ensemble (or group) of modules (this concept we have not yet fully defined, but roughly a collection of similar modules) - one for each (N,P) combination - and you are just subsetting that ensemble/group. That is, you are not "defining a more refined module" but subsetting an ensemble/group data to define a new (more refined) ensemble/group big_data.
If we treat a module as (logically) a special case of a pipeline (one with just one module) then the natural generalization of what you want to be able to do here is specify subsets of groups of pipelines (roughly a set of pipelines).
Effectively I think we have three operations that can be applied to existing groups of pipelines (which includes a module as the simplest possible unit) to create new groups of pipelines:
i) grouping: (p1,p2)
ii) concatenation: p1*p2
iii) subsetting: p1 WHERE xx (possible values of xx to be discussed?)
[Here p1 and p2 are groups of pipelines]. Here grouping creates a union of two groups and concatenation is the set of all (ordered) combinations of two groups (is there a math word for this? there must be...). Effectively these are all set operations (maybe we should use set instead of group or ensemble?)
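As a loose illustration (not DSC syntax or implementation - just plain Python, with pipelines modeled as tuples of module names), the three set operations might behave like this:

```python
from itertools import product

def group(*sets):
    # i) grouping: union of groups of pipelines
    return set().union(*sets)

def concat(a, b):
    # ii) concatenation: Cartesian product of two groups,
    # flattening each pair into one longer pipeline
    return {x + y for x, y in product(a, b)}

def subset(s, pred):
    # iii) subsetting: keep only pipelines satisfying a condition
    return {p for p in s if pred(p)}

simulate = {("normal",), ("t",), ("cauchy",)}
method = {("mean",), ("median",)}
score = {("mse",), ("mae",)}

all_pipelines = concat(concat(simulate, method), score)  # 3 * 2 * 2 = 12
mean_only = subset(all_pipelines, lambda p: p[1] == "mean")
print(len(all_pipelines), len(mean_only))  # 12 6
```

Grouping and subsetting are ordinary set union and filtering; concatenation is the Cartesian product with pairs flattened into longer pipelines.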
My vision is that iii) might support things like:
simulate: (normal,t,cauchy)
method: (mean,median)
score: (mse,mae)
all: simulate * method * score
small_n: all WHERE simulate.n<100
mean_only: all WHERE method.id == mean
hmm... maybe a lot to digest here... but maybe you see my point that effectively all we are doing ultimately is defining sets of pipelines (some of which may be runnable!) And that a similar idea is required for extracting results from subsets of pipelines, so we should try to use the same syntax...
Here I understand data is an ensemble (or group) of modules
It is not the case. Sorry, I should have written it in full:
data:
exec: simulate.R
input:
N: 200, 300, 400
P: 50, 100, 200, 1000
output:
$Xtrain: Xtrain
$Xtest: Xtest
...
define:
big_data: data.logic(N <= P)
regular_data: data.logic(N > P)
The big distinction between what you said and what I can do now is that I only refine a module, not (yet) subset pipelines. Conceptually this is important, because I'm not even interested in running data in full, but rather only bits of it. On the other hand, subsetting pipelines implies there is an interest in running all of them eventually; we just want to run different ones at different stages.
Similarly, defining un-runnable ensembles such as A * B may also mean that we would never run A separately. For example vcf2ped * plink_magic * ped2vcf, in which case we never intend to run vcf2ped separately. This is the biggest difference between define and run, because define makes up new concepts.
Effectively I think we have three operations that can be applied to existing groups of pipelines
Agreed. Also agreed that your iii) subsetting with a WHERE-like statement can possibly achieve what I want with .logic, but without calling them by names (big_data, regular_data) it becomes conceptually less clear.
That said, I totally agree we can support something like WHERE; as per my point in the previous post, it is an engineering problem, not a conceptual problem. I can see that conceptually you put it under run: in your example, which is exactly where I expect it to be. I also agree that running a subset of pipelines should support the same syntax as extraction. How that is implemented is a detail I will have to figure out (SQL vs no SQL ... there is a clear conflict here).
Now my question still remains: having seen the more detailed motivation of my .logic example, do you still see it as necessary to help with the concepts?
all (ordered) combinations of two groups (is there a math word for this? there must be...).
Cartesian product would imply ordered, right?
I should add that previously I had .logic inside modules. My example would look like this:
big_data:
exec: simulate.R
input:
N: 200, 300, 400
P: 50, 100, 200, 1000
.logic: N <= P
output:
$Xtrain: Xtrain
$Xtest: Xtest
small_data:
template: big_data
input:
.logic: N > P
It avoids the awkwardness of defining a module called data that will never be fully executed. I think there were complaints about having it inside modules, but I really like the concept, so I somehow later re-introduced it in define.
but maybe you see my point that effectively all we are doing ultimately is defining sets of pipelines (some of which may be runnable!)
Yes, I see that. But I do want to see something effectively like make all, ie to be explicit about what we want to run, from within DSC. I think this would be the concept of the entire DSC, which we have not formally discussed! We can use a separate ticket if it helps, but I'd say a DSC is a benchmark composed of all meaningful & runnable pipelines.
I think we are at odds on terminology.
In my terminology a module includes specific values of parameters. And an ensemble is a set of modules (possibly similar, but with different parameters).
So
data:
exec: simulate.R
input:
N: 200, 300, 400
P: 50, 100, 200, 1000
output:
$Xtrain: Xtrain
$Xtest: Xtest
defines an ensemble or set of modules, not a module.
And the logic operator is subsetting the ensemble.
I see the benefits of having an equivalent of "make all".
Yes, sorry, my bad (for dropping this conversation and for being at odds on terminology). We did agree on that aspect of the ensemble terminology, and the logic operator is indeed subsetting the ensemble. However, my example is one of the (maybe not so) rare cases where it is easier to create an ensemble by subsetting another ensemble than by building it from scratch :) How should we achieve that otherwise?
I suggest that we allow for this subsetting of ensembles in define:.
I suggest we use the WHERE syntax rather than the logic syntax, since (I believe) we are also going to use WHERE syntax in the extraction phase.
We can leave for now the implementation of subsetting arbitrary pipeline ensembles (groups of pipelines) and restrict to subsetting just "module ensembles" (groups of unit modules), if that helps. But we should keep an eye on the more general case.
I suggest that for now we do not allow the subsetting within the module definition, to reduce complexity of module definition.
Matthew
I see. All the suggestions sound good to me. I'll try to mimic SQL behavior in native Python. It should be simple if we keep the restrictions simple, but hard otherwise. Anyway, I'm glad we've reached agreement on the ideas, and I'll worry about implementation next. Closing this issue for now and marking #83 for later, as you suggested.
I've uploaded DSC for the Omega project in proposed syntax to:
https://github.com/stephenslab/dsc2-omega/blob/master/dsc2/omega.dsc
@stephens999 a couple of things we have not discussed before:
- the .logic operator to fine-tune conditions. It has to use simple Python logic operators or lambda functions. I'll give basic examples in the documentation so that people do not have to know Python.
- the template keyword. I'm not sure which of the two forms is better. The first one is shorter, yet is perhaps invalid YAML syntax that I will have to try to get to work. The second will introduce a new concept that we need to find a good term for (call it template for now).

I will not explain further, but rather leave you to review the Omega example to see if it is intuitive enough to achieve what we want.
@NKweiwang Please also review the script and tell me if it is everything you need for your benchmark, to the best of your knowledge at this point - and particularly whether the DSC::run section contains what you actually want to execute.