stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

DSC `.logic` operator in DSC section #79

Closed gaow closed 7 years ago

gaow commented 7 years ago

I've uploaded DSC for the Omega project in proposed syntax to:

https://github.com/stephenslab/dsc2-omega/blob/master/dsc2/omega.dsc

@stephens999 a couple of things we have not discussed before:

  1. the .logic operator to fine-tune conditions. It has to use simple Python logic operators or lambda functions. I'll give basic examples in the documentation so that people do not have to know Python.
  2. the template keyword. I'm not sure which is better:
    glasso: tiger
    exec: glasso.R

    Or:

    glasso:
    template: tiger
    exec: glasso.R

    The first one is shorter, yet it is perhaps invalid YAML syntax that I would have to work around. The second introduces a new concept for which we need a good term (I am calling it template for now).
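As an aside, a minimal Python sketch of how such template inheritance could resolve (purely hypothetical, not the actual DSC implementation; module names and fields are taken from the example above):

```python
# Hypothetical sketch of "template" resolution: the derived module starts
# from a copy of the base module's settings and overrides whatever it
# redefines. Not DSC's actual implementation.

def resolve_template(modules, name):
    """Return fully resolved settings for module `name`, recursively
    applying any `template` it derives from."""
    spec = dict(modules[name])
    base_name = spec.pop("template", None)
    if base_name is None:
        return spec
    resolved = resolve_template(modules, base_name)  # templates can chain
    resolved.update(spec)  # derived settings win over the base
    return resolved

modules = {
    "tiger": {"exec": "tiger.R", "input": {"lambda": 0.5}},
    "glasso": {"template": "tiger", "exec": "glasso.R"},
}

# glasso inherits `input` from tiger but overrides `exec`
print(resolve_template(modules, "glasso"))
```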

I will not explain further but rather leave you to review the Omega DSC to see whether it is intuitive enough to achieve what we want.

@NKweiwang Please also review the script and tell me whether it covers everything you need in your benchmark, to the best of your knowledge at this point. In particular, check the DSC::run section to see whether those are what you actually want to execute.

stephens999 commented 7 years ago
  1. I think the use of .logic needs some work or modification. It appears in multiple places and is confusing.

1a. First, I don't think it should be used (or at least not encouraged) in defining the output of modules. So not:

Omega = .logic(Omega if .P < .N else 0)

If the output of the module depends on the parameter settings then that should be encoded in the executable of the module, not in the dsc module definition, which should really serve as just a "wrapper" that says which variables go where. In this case I think the pipeline variable $meta.n_glasso should be entirely removed - it does not make sense for that to be part of the simulation scenario, whether or not we allow logic operators! The modules glasso1, glasso2, glasso3 will have to make their own decision on what K to use, and that should be encoded in the executables not in the module definition.

On a related note, it is cheating to pass $meta$K into flash if that is the true number of factors. It may be acceptable to pass it into a "gold standard" method that is acknowledged as cheating, as a way of comparing against a method that has additional information. But that should be a separate module (flash_cheat say?)

And it seems you don't need to pass P and N in $meta either - they are implicitly part of the test and training data, and not used later as far as I can see?

And the simulation executables themselves should be responsible for returning NA for Omega if it does not exist.

minor: diagnoal typo

more later...

stephens999 commented 7 years ago

minor: it would be clearer if $Omega output from methods were called $Omega_est

stephens999 commented 7 years ago

minor: instead of defining "tiger" as the template, clearer to define "method-template" as a template (with no exec) and then have tiger (and other methods) inherit from that template.

stephens999 commented 7 years ago

minor: in score, return should be output, and score: should be $score:

stephens999 commented 7 years ago

I'm not sure what you are trying to achieve with the other logic statements, which occur in the module definitions for the simulation scenarios and in the define statements.

It seems there are a few things: running only a subset of pipelines; giving distinct names (tags?) to different subsets of modules.

I think both could be better achieved in different ways. For now I'll just say: i) Neither of them seems to require logic statements in the actual module definitions, so I would get rid of those. ii) Naming distinct subsets in different ways should probably be taken care of downstream in the extraction? iii) Running subsets of pipelines is pretty much analogous to extracting subsets of pipelines, so we should use the same syntax for running subsets as for extracting subsets.

Also, do we really need to run subsets here? Can we just run everything?

gaow commented 7 years ago

@stephens999 Thank you for the feedback!

But that should be a separate module (flash_cheat say?)

And it seems you don't need to pass P and N in $meta either

minor: it would be clearer if $Omega output from methods were called $Omega_est

minor: in score, return should be output, and score: should be $score:

minor: diagnoal typo

Fixed.

We are "cheating" a lot here, mostly via .logic, because @NKweiwang thinks this DSC is more for "exploration": we don't hard-code any rules in modules; rather, we use DSC as a convenient interface to pass in parameters (based on some conditions from meta) and study the outcome to understand things better, before finalizing the module code. The upside is that the module code itself would be less subjective, in that it does not dictate choices of parameters based on input. I think the debate here is more fundamental than DSC syntax itself. I'm going to split this post into DSC related and Omega related issues: DSC related here, and Omega related in the Omega repo.

DSC related

  1. There seems to be no objection to .logic in input. Do we want .logic in output?
  2. instead of defining "tiger" as the template, clearer to define "method-template" as a template

I think it is a matter of style: in object oriented programming, some would prefer that the base class does nothing, others would prefer that the base class itself is usable. I'm cool either way (I also program either way). So are we suggesting that if a module does not have exec then it is a template module? I take it that a template module can also derive from another template module?

I'd like to add that we do want to be able to derive from some non-template modules, see:

https://github.com/stephenslab/dsc2-omega/blob/master/dsc2/omega.dsc#L87

So I'm viewing the notion of a template module as a bit redundant.

I'm not sure what you are trying to achieve with the other logic statements ... It seems there are a few things: running only a subset of pipelines; giving distinct names (tags?) to different subsets of modules

Exactly. Sometimes we only want pairs of parameters to appear. We may well write:

np: (20, 100), (100, 20)

But this will make downstream queries difficult. Even storing the info in R would be difficult: imagine an np column storing a pair of values in a data.frame. I can only think of storing it as a string.
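To illustrate the distinction in plain Python (values taken from the example above; this is illustrative only, not DSC code): running only pairs of parameters is like zipping, whereas the default behavior is the full Cartesian grid.

```python
from itertools import product

# Paired parameters vs the default full grid; values are illustrative.
n_values = (20, 100)
p_values = (100, 20)

paired = list(zip(n_values, p_values))         # only the two intended pairs
full_grid = list(product(n_values, p_values))  # all four (n, p) combinations

print(paired)     # [(20, 100), (100, 20)]
print(full_grid)  # [(20, 100), (20, 20), (100, 100), (100, 20)]
```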

ii) naming distinct subsets in different ways should probably be taken care of downstream in the extraction ?

Same argument: yes we can, but that may make extraction difficult. A big motivation for adding tags is to make downstream queries easier, and .logic is a way to add such tags.

gaow commented 7 years ago

i) Neither of them seem to require logic statements in the actual module definitions,

iii) running subsets of pipelines is pretty much analogous to extracting subsets of pipelines, so we should use the same syntax for running subsets as extracting subsets.

I'm not following these 2 comments ... would you give an example at your convenience? I've addressed other concerns in this post and in https://github.com/stephenslab/dsc2-omega/issues/1 which will involve @NKweiwang 's input.

stephens999 commented 7 years ago

i) Neither of them seem to require logic statements in the actual module definitions

What I am saying here is that the two "legitimate" (to my mind) uses of logic statements are to i) name subsets of modules; ii) select subsets of pipelines to run. Neither of these legitimate uses requires logic statements within the module definitions (eg in input: or output:). Thus my initial suggestion is for dsc not to support logic statements within the module definitions. It seems that such statements are effectively the start of a "mini-language" that allows the module definitions to actually control the behavior of the modules, rather than serving their intended purpose, which is to map variables between the pipeline and the module executable. I think this is feature creep we want to avoid.

In the future, if we were to want this kind of functionality (adding code to an existing executable), I would suggest incorporating "prologues" and "epilogues" to the modules that run R or python code either before or after the code in the executable. Indeed, you effectively already have that kind of functionality when you use R() in the output... I'm not quite sure how general your current implementation is in terms of what R code can be run here. But in any case, I think we should defer this issue for now, and remove the logic statements from within the modules in this example.

stephens999 commented 7 years ago

iii) running subsets of pipelines is pretty much analogous to extracting subsets of pipelines, so we should use the same syntax for running subsets as extracting subsets.

What I am saying is that, as we have discussed, we need syntax for extracting results for subsets of pipelines, and that we should use the same syntax for specifying which pipelines to run.

As I understand it the syntax for extracting results for subsets of pipelines is not yet implemented but we have discussed SQL-like syntax where you would have things like WHERE (simulate.n < 100 & simulate.K>10) to extract subsets of pipeline results.

So we can have the same type of syntax in defining subsets of pipelines to run... However, I don't have the syntax sufficiently clear in my head to give an example.

gaow commented 7 years ago

Thanks @stephens999, it is clear to me now. Combining your i) and iii), I see that we should not use syntax such as logic in module definitions, but we can certainly achieve the same when we execute a pipeline (eg in the dsc-omega example we restrict runs to the n<p case only, or we set K according to the value of p -- though whether or not this "cheating" should be done here is a discussion for another ticket, https://github.com/stephenslab/dsc2-omega/issues/1). The question is the syntax, and whether it should be done via the command interface or DSC::define/run, or both. I'll update the omega example as we discuss more on the other ticket.

stephens999 commented 7 years ago

it seems that this is part of a larger issue: within the DSC section, what are we defining?

For example, in the run: section you are currently defining things (it seems to me). Do we really need both run and define ?

Maybe the point is that some things you define are not runnable, whereas others are?

I suggest that in the dsc file you just want to define things, and say what to run at command time. Like a makefile defines targets, but you say what to build with make all etc.

The difference here is that you want to define things that are not targets?

gaow commented 7 years ago

Do we really need both run and define ?

Yes. define groups modules into ensembles and allows a chance to use some logic. run would be a very clear line of sequence to execute. Additionally, users can override run from the command line. This is good in research mode when we want to test things out, and also, as with the case of dsc-omega, when we write up the entire benchmark but wish to execute it bit by bit.

Maybe the point is that some things you define are not runnable, whereas others are?

Only things involving modules that do not take pipeline variable inputs are runnable.

in the dsc file you just want to define things, and say what to run at command time.

This will jeopardize reproducibility because in that case one cannot tell by looking at the script what is runnable. Even a makefile has to define a make all to explicitly say what all means. With file targets this logic can be implicit, but unfortunately here we need to be explicit, thus the need for run.

stephens999 commented 7 years ago

let's start with run: What is the purpose of the run section: To define pipelines that could be run, or to actually define things that are to be run?

Or maybe both: to define pipelines that could be run, with the default being to do all of them?

That is, if we have

run:
  s1: sim1 method score
  s2: sim2 method score

then at run time, we have something like:

dsc     # runs s1 and s2
dsc s1  # runs s1? (probably there is a flag I don't know?)

If this is the case, I prefer to think of this as defining pipelines (groups/ensembles of pipelines? dscs?), not running them... that is, I think target: or pipeline: or dsc: might be better keywords.

I understand you also want to be able to run a subset of pipelines. It would seem to be desirable to be able to specify that at runtime...

eg:

dsc s1 WHERE sim1.n==100

?

This gets at a different issue though - it relates to the fact that we want a general syntax for specifying subsets of pipelines, and use the same syntax for running subsets and extracting results from subsets.

Matthew


gaow commented 7 years ago

that is I think target: or pipeline: or dsc: might be better keywords.

I confirm everything you said prior to this sentence is true.

we want a general syntax for specifying subsets of pipelines, and use the same syntax for running subsets and extracting results from subsets.

Certainly there will be more work to use unified command interface for execution and extraction -- we decide to utilize SQL interface as well as some of its power to make data extractions. But execution does not involve SQL.

Let's leave engineering issues aside for a moment and focus on the concepts. Also, let's stick to the terms define and run for the sake of discussion. With logic I have a chance to define a more refined module, eg something like:

define: 
   big_data: data.logic(N < P)

This is not possible with a simple module specification because we run all combinations of N and P by default. So the keyword define has two usages:

  1. handle some logic to refine modules
  2. define ensembles. There may be cases when things would not make sense unless they are ensembles, eg voom + limma (this is not a good example but I hope I made my point).

This is why I see logic and define as useful. They make what we want to compare much clearer.

Now about run: yes, essentially it is just make all. And it can be left out if you want; the official interface for run is:

dsc settings.dsc ... -s "sim * voom"

where you see I intentionally run only an incomplete pipeline. So if you prefer, you can leave the entire run section out. This is totally fine. I insisted on having run in dsc for the same reason I think it makes sense to define make all. But it is perhaps a different work style.

Whether or not to have the -s switch support WHERE statements is a different topic of discussion, and is an enhancement we can always make after we implement what we have discussed.

stephens999 commented 7 years ago

regarding:

With logic I can have a chance to define a more refined module. eg: something like:

define: 
big_data: data.logic(N < P)

Here I understand data is an ensemble (or group) of modules (this concept we have not yet fully defined, but roughly a collection of similar modules), one for each (N,P) combination, and you are just subsetting that ensemble/group.

That is, you are not "defining a more refined module" but subsetting an ensemble/group data to define a new (more refined) ensemble/group big_data.

If we treat a module as (logically) a special case of a pipeline (one with just one module) then the natural generalization of what you want to be able to do here is specify subsets of groups of pipelines (roughly a set of pipelines).

Effectively I think we have three operations that can be applied to existing groups of pipelines (which includes a module as the simplest possible unit) to create new groups of pipelines:

i) grouping: (p1, p2)
ii) concatenation: p1 * p2
iii) subsetting: p1 WHERE xx (possible values of xx to be discussed?)

[Here p1 and p2 are groups of pipelines.] Grouping creates a union of two groups, and concatenation is the set of all (ordered) combinations of two groups (is there a math word for this? there must be...). Effectively these are all set operations (maybe we should use set instead of group or ensemble?)

My vision is that iii) might support things like:

simulate: (normal,t,cauchy)
method: (mean,median)
score: (mse,mae)

all: simulate * method * score
small_n: all WHERE simulate.n<100
mean_only: all WHERE method.id == mean
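As an illustrative sketch only (Python stand-ins with assumed semantics, not DSC syntax or implementation), the three operations might behave like:

```python
from itertools import product

# Hypothetical Python stand-ins for the three proposed operations on
# groups of pipelines; a pipeline is modeled as a tuple of module names.
simulate = [("normal",), ("t",), ("cauchy",)]
method = [("mean",), ("median",)]
score = [("mse",), ("mae",)]

# i) grouping: (p1, p2) -- the union of two groups
grouped = simulate + method

# ii) concatenation: p1 * p2 -- all ordered combinations (Cartesian product)
def concat(p1, p2):
    return [a + b for a, b in product(p1, p2)]

all_pipelines = concat(concat(simulate, method), score)  # 3 * 2 * 2 = 12

# iii) subsetting: p1 WHERE xx -- filter by a predicate on the pipeline
mean_only = [p for p in all_pipelines if p[1] == "mean"]

print(len(all_pipelines))  # 12
print(mean_only[0])        # ('normal', 'mean', 'mse')
```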

hmm... maybe a lot to digest here... but maybe you see my point that effectively all we are doing ultimately is defining sets of pipelines (some of which may be runnable!). And a similar idea is required for extracting results from subsets of pipelines, so we should try to use the same syntax...

gaow commented 7 years ago

Here I understand data is an ensemble (or group) of modules

It is not the case. Sorry, I should write it out in full:

data:
  exec: simulate.R
  input:
    N: 200, 300, 400
    P: 50, 100, 200, 1000
  output:
    $Xtrain: Xtrain
    $Xtest: Xtest

...

define: 
   big_data: data.logic(N <= P)
   regular_data: data.logic(N > P)

The big distinction between what you said and what I can do now is that I only refine a module, not (yet) subset pipelines. Conceptually this is important because I'm never even interested in running data in full, but rather only bits of it. On the other hand, subsetting pipelines implies there is an interest in running all of them eventually; we just want to run different ones at different stages.
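For illustration only (a hypothetical Python sketch, not DSC internals), subsetting the data ensemble by the N/P condition amounts to filtering the parameter grid:

```python
from itertools import product

# A module instance here is just an (N, P) pair from the grid in the
# `data` example above; .logic-style conditions become plain predicates.
N = (200, 300, 400)
P = (50, 100, 200, 1000)

data = list(product(N, P))  # the full ensemble: 12 combinations
big_data = [(n, p) for n, p in data if n <= p]
regular_data = [(n, p) for n, p in data if n > p]

# every instance lands in exactly one of the two refined ensembles
assert len(big_data) + len(regular_data) == len(data)
print(big_data)  # [(200, 200), (200, 1000), (300, 1000), (400, 1000)]
```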

Similarly, defining un-runnable ensembles such as A * B may also mean that we would never run A separately. For example, vcf2ped * plink_magic * ped2vcf, in which case we never intend to run vcf2ped separately. This is the biggest difference between define and run, because define makes up new concepts.

Effectively I think we have three operations that can be applied to existing groups of pipelines

Agreed. I also agree that your iii), subsetting with a WHERE-like statement, can possibly achieve what I want with .logic, but without giving them names (big_data, regular_data) it becomes conceptually less clear.

That said, I totally agree we can support something like WHERE; as I noted in the previous post, it is an engineering problem, not a conceptual one. I can see that conceptually you put it under run: in your example, which is exactly where I expect it to be. I also agree that running a subset of pipelines should support the same syntax as extraction. How that is implemented is a detail I will have to figure out (SQL vs no SQL ... there is a clear conflict here).

Now my question still remains: having seen the more detailed motivation for my .logic example, do you still see it as necessary to help with the concepts?

all (ordered) combinations of two groups (is there a math word for this? there must be...).

Cartesian product would imply ordered, right?

gaow commented 7 years ago

I should add that previously I had .logic inside modules. My example would look like this:

big_data:
  exec: simulate.R
  input:
    N: 200, 300, 400
    P: 50, 100, 200, 1000
    .logic: N <= P
  output:
    $Xtrain: Xtrain
    $Xtest: Xtest

small_data:
  template: big_data
  input:
    .logic: N > P

It avoids the awkwardness of defining a module called data that will never be fully executed. I think there were complaints about having it here inside modules, but I really like the concept, so I somehow later re-introduced it in define.

but maybe you see my point that effectively all we are doing ultimately is defining sets of pipelines (some of which may be runnable!)

Yes, I see that. But I do want to see something effectively like make all, ie to be explicit about what we want to run, from within DSC. I think this would be the concept of the entire DSC, which we have not formally discussed! We can use a separate ticket if it helps, but I'd say a DSC is a benchmark composed of all meaningful & runnable pipelines.

stephens999 commented 7 years ago

I think we are at odds on terminology.

In my terminology a module includes specific values of parameters. And an ensemble is a set of modules (may be similar, but different parameters).

So


data:
  exec: simulate.R
  input:
    N: 200, 300, 400
    P: 50, 100, 200, 1000
  output:
    $Xtrain: Xtrain
    $Xtest: Xtest

defines an ensemble or set of modules, not a module.

And the logic operator is subsetting the ensemble.


stephens999 commented 7 years ago

I see the benefits of having an equivalent of "make all".


gaow commented 7 years ago

Yes, sorry, my bad (for dropping this conversation and for being at odds on terminology). We did agree on that aspect of the ensemble terminology, and the logic operator is indeed subsetting the ensemble. However, my example is one of the (maybe not so) rare cases where it is easier to create an ensemble by subsetting another ensemble than by building it from scratch :) How should we achieve that otherwise?

stephens999 commented 7 years ago

I suggest that we allow for this subsetting of ensembles in define:. I suggest we use the WHERE syntax rather than logic syntax since (I believe) we are also going to use WHERE syntax in the extraction phase.

We can leave the implementation of subsetting arbitrary pipeline ensembles (groups of pipelines) for now and restrict to subsetting just "module ensembles" (groups of unit modules), if that helps. But we should keep an eye on the more general case.

I suggest that for now we do not allow the subsetting within the module definition, to reduce complexity of module definition.

Matthew


gaow commented 7 years ago

I see. All suggestions sound good to me. I'll try to mimic SQL behavior in native Python. It should be simple if we keep the conditions simple, but hard otherwise. Anyway, I'm glad we've reached agreement on the ideas, and I'll worry about implementation next. Closing this issue for now and marking #83 for later as you suggested.