@gaow As we discussed in person, I think the best way to approach this is to provide more information to the user about what `dscquery` is doing, and its progress. The interface should also provide better guidance to the user about how to use `dscquery` effectively.
@pcarbo I implemented a simple progress bar that shows the percentage of tasks completed and the estimated time remaining:
```
> dscout <- dscquery(dsc.outdir = "dsc_result",
+ targets = c("simulate","analyze","score.error"))
dsc-query dsc_result -o /tmp/Rtmp4CNpsR/file63a3384d1981.csv --target "simulate analyze score.error" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Populating DSC output table of dimension 8 by 7.
- Loading targets [==========================] 100% eta: 0s
```
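For reference, a minimal sketch of how a bar like the one above could be driven, assuming the progress package (whose `:percent` and `:eta` tokens match the output shown); the loop body is just a stand-in for reading one target:

```r
library(progress)

n <- 8  # illustration only: number of rows in the output table above
pb <- progress_bar$new(
  format = "- Loading targets [:bar] :percent eta: :eta",
  total  = n, clear = FALSE)
for (i in seq_len(n)) {
  Sys.sleep(0.1)  # stand-in for reading one target from an .rds file
  pb$tick()       # advance the bar by one step
}
```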
It might be worth adding to it, for internal diagnostics, some monitoring stats such as CPU usage and disk I/O status, to see whether there are other improvements we can make.
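As a rough, base-R-only sketch of the kind of diagnostics that could be recorded around the loading step (disk I/O would need an OS-level tool or an extra package, so only CPU and wall-clock time are shown):

```r
t0 <- proc.time()
# ... loading step to be measured goes here ...
dt <- proc.time() - t0
message(sprintf("user: %.1fs  system: %.1fs  elapsed: %.1fs",
                dt["user.self"], dt["sys.self"], dt["elapsed"]))
```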
@gaow Very nice! That is certainly an improvement.
I'm testing out @fmorgante's example myself. I noticed that even running a regular query via the `dsc-query` command takes a very long time to complete -- roughly 5 minutes before it can even start loading the output. The resulting matrix to fill is 720,000 x 12, i.e., fewer than a million module instances, which is a reasonable scale for a benchmark. It seems we need to work out the performance issues here, both for the Python program `dsc-query` and for the subsequent step of loading the results in R.
@gaow One thing that might be helpful here would be to establish a "lower bound" on runtime. For instance, suppose I load matrices from hundreds of `.rds` files into R and combine them in a "smart" way. How long does this take? How does it compare to doing the same thing in `dscquery`? It shouldn't be hard to come up with a simple comparison.
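A possible shape for that comparison, under the assumption that the outputs live under `dsc_result` and that each `.rds` file holds a one-row data frame of results (both assumptions for illustration only):

```r
# Time a "naive but smart" baseline: read every .rds file and stack the results.
rds_files <- list.files("dsc_result", pattern = "\\.rds$",
                        recursive = TRUE, full.names = TRUE)
baseline <- system.time({
  results  <- lapply(rds_files, readRDS)
  combined <- do.call(rbind, results)  # assumes one-row data frames per file
})
print(baseline)  # compare against the runtime of the equivalent dscquery call
```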
`dsc-query`, the Python program, does more than just lump together those `.rds` files. And that took 5 minutes for 720,000 rows, which can be improved, but it is not as bad as loading the result into R, which takes forever (>3 hrs) before it got killed (as our users run out of patience). I'm using @fmorgante's example and the progress bar to identify which code chunk is the culprit.
One thing I notice is that lines 571 to 582 of the current `dscquery.R` do something even before loading any data; that is, it hangs there before hitting the progress bar. If you take this dataset `test.csv.gz` (generated from the `dsc-query` command that took 5 min), decompress it to `test.csv`, and run this query:
```r
dsc <- dscrutils::dscquery("./", c("score.err", "score", "simulate", "fit_pred",
                                   "simulate.n_traits", "simulate.pve",
                                   "simulate.n_signal", "small_data.subsetN"),
                           return.type = "list", verbose = TRUE, cache = "test.csv")
```
Since I added the `cache = "test.csv"` option in my latest commit, this should bring you right to the lines in question in that function, without having to have @fmorgante's DSC outputs in place. You'll see it get stuck in that two-layer for loop before hitting the progress bar. By "stuck" I mean 2 hours as of now, and still counting!
Improving this could be a good first issue for someone with some computational background. Still, it would be nice if you could verify it. I suspect it might be easier (at least for me) to deal with it at the level of the `dsc-query` Python program, but I'm sure better R code can also help.
@gaow One bottleneck was `read.csv`; the performance instantly improved when I replaced it with `fread` from the data.table package.
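Roughly, the swap looks like this (file name assumed for illustration; the result is converted back to a data frame so downstream code that expects one keeps working):

```r
library(data.table)

# Before (slow on a ~720,000-row table):
# dat <- read.csv("test.csv", stringsAsFactors = FALSE)

# After: fread is much faster at parsing large CSV files.
dat <- as.data.frame(fread("test.csv"))
```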
There are some other places where the code is unnecessarily slow due to a naive implementation. I will continue to work on this.
In any case, this is a very useful test case. (And it is the first time I've tried running `dscquery` on a table with 700,000 rows.)
@fmorgante complains about the low performance of `dscquery` for the scale of DSC he's working on. @fmorgante, it would be helpful if you can tell us:

Also, since we now use RDS and PKL files to save output, we have to load the entire file to extract a specific quantity. This is a limitation that we cannot resolve unless we switch to another data storage solution, as has long been discussed.
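To make that limitation concrete, a small hypothetical example (file and element names invented): extracting a single quantity from an RDS file still requires deserializing everything stored in it.

```r
out <- readRDS("score/score_1.rds")  # the whole object is loaded into memory
err <- out$error                     # even though only this one value is needed
```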