Detect and disallow orphan downstream modules

zouyuxin commented 5 years ago

@pcarbo How to extract filenames using dscquery? I'm using dsc to benchmark susie summary stats version. For some simulated dataset, susie summary version fails to converge within a fixed number of iterations. So, I want to check the dataset to understand the problem.

dscout = dscquery('r_compare_add_z', targets = 'get_sumstats score_susie.converged', omit.filenames = FALSE)

used to work. But it gives ERROR now.

ERROR: Requested module get_sumstats is an orphan branch. Please consider using separate queries for information from this module.

Is there a way to extract filenames? Thanks.

pcarbo commented 5 years ago

@zouyuxin Which version of dscquery do you have? Does it work with omit.filenames = TRUE? Could you post the full error?

zouyuxin commented 5 years ago

I'm using dscrutils_0.3.8

gaow commented 5 years ago

@zouyuxin it looks like the bug (or "feature"?) is on dsc-query level. @pcarbo please hold on I'll take a look at this first now.

gaow commented 5 years ago

@zouyuxin the error is unfortunately a "feature" in dsc-query and the reason is exactly what it suggests. But I understand this can be confusing and sometimes unnecessarily complicated, even impossible, to develop the exact queries that ensures linear logic flow. Therefore I now downgraded the message to a WARNING so that your query will not be interrupted. The message now looks like:

WARNING: Requested module get_sumstats is an orphan branch with respect to susie; thus removed from sub-query involving susie.

Now you see that there is a complain in subquery susie which you do not care anyways.

Please test and let us know if you still have troubles extracting filenames!

(@pcarbo on a side note the output extraction is quite slow for @zouyuxin 's test example benchmark involving 12000 rows and 2 fields to extract)

pcarbo commented 5 years ago

@gaow Could you please explain this warning? I've seen this warning/error before, and I still don't quite understand it.

pcarbo commented 5 years ago

@pcarbo on a side note the output extraction is quite slow for @zouyuxin 's test example benchmark involving 12000 rows and 2 fields to extract.

Yes, if you have a lot of results, it will take a long time to load. No way around that!

gaow commented 5 years ago

No way around that!

We've yet to benchmark parallel I/O because I dont believe in a modern system it does not make a difference. And possibly 12,000 result is not a lot? I think the speed limit here is not just I/O but also the computation involve to process and load data.

There are some other design decisions we can possibly make, eg, different data storage structure or format. For example, it will be a lot faster if there is a way to store data in R and access only relevant keys in the disk not loading the entire object then extract relevant keys. But I'm not aware of such a solution to my current knowledge

Could you please explain this warning?

For example in a DSC two types of pipelines are executed simulate -> compute_summary_stats -> susie_with_summary_stats -> score and simulate -> susie_full_data -> score. Then a query asks for compute_summary_stats, susie, score. Then there will be a gap in the logic, that is, the flow is broken into susie -> score AND compute_summary_stats there is no way to connect these two. In this case I call compute_summary_stats an orphan. Current behavior is to look backwards from the end of pipelines, ignoring orphans along the way and give a warning.

pcarbo commented 5 years ago

We've yet to benchmark parallel I/O.

What is parallel I/O?

I think the speed limit here is not just I/O but also the computation involve to process and load data.

This will depend on many things. In general, assessing I/O load is difficult, and highly dependent on the computing infrastructure and the file system used. In short, I wouldn't bother with it.

pcarbo commented 5 years ago

@zouyuxin I'd like to take a look at your DSC, if that's okay. Where can I see it?

zouyuxin commented 5 years ago

It's at https://github.com/stephenslab/dsc-finemap/blob/dsc-R/r_compare.dsc

gaow commented 5 years ago

I take that this issue is resolved? We can use other tickets for discussions RE slowness in dscquery if we get more complains from users.

gaow commented 5 years ago

As one example requested by @pcarbo, the DSC is:

A: R()
  $out1: 1

B: R()
  $out2: 1

C: R()
  out1: $out1
  $out3: 1

DSC:
  run: A*B*C

and use dsc test.dsc to run, then:

> dscrutils::dscquery('test', targets='A C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c97e249c8a.csv --target "A C" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC A.output.file C.output.file
1   1         A/A_1     C/A_1_C_1
> dscrutils::dscquery('test', targets='A B')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bd1501f.csv --target "A B" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module A is an orphan branch with respect to module B; thus removed from sub-query involving module B.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC B.output.file
1   1         B/B_1
> dscrutils::dscquery('test', targets='A B C')
Calling dsc-query.
Running shell command:
dsc-query test -o /tmp/RtmpE24m5m/file39c96bb482ae.csv --target "A B C" --force
INFO: Loading database ...
INFO: Running queries ...
WARNING: Requested module B is an orphan branch with respect to module C; thus removed from sub-query involving module C.
INFO: Extraction complete!
Importing dsc-query output.
Reading DSC outputs.
  DSC A.output.file C.output.file
1   1         A/A_1     C/A_1_C_1

The first query will work, but the 2nd and 3rd will give warnings. I'll try not to explain it but let you decide if it is reasonable behavior? That way we know if the prompted message makes sense and how we can possibly improve it :) Note you'll need the current master version to reproduce, not the current release.

gaow commented 5 years ago

@pcarbo The warning text is now changed to "WARNING: Requested module B is not connected to module C;". The word "orphan" is from the term "orphan process" whose "parent" (upper level) process is not available. But it is a phrase we should avoid anyways.

pcarbo commented 5 years ago

Consider the following very simple pipeline:

DSC:
  run: A * B * C

Suppose that module C does not use any of the outputs from module B, and only uses outputs from module A. (This may seem like an odd thing to do, but in practice it could happen.)

Due to the way this would be run internally, this odd situation creates several downstream issues, particularly in querying the results. (I'll spare you the details, but happy to provide them if you want.)

This is my proposed rule:

In DSC, we should require that module C take as input at least one output from module B.

There's some additional motivation for this rule:

When C does not depend on B, internally this would be run as:

(A * B, A * C)

Which could make querying the DSC results difficult if you aren't expecting this.

I'm not saying that this is the best way to do this, but in light of this fact, I was expediently suggesting this rule to avoid downstream querying issues.

stephenslab / dsc

Detect and disallow orphan downstream modules #183