stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

dscquery takes long time to load data #203

Open gaow opened 4 years ago

gaow commented 4 years ago

@fmorgante complains about the low performance of dscquery at the scale of the DSC he's working on. @fmorgante, it would be helpful if you could tell us:

  1. Link to your DSC (ideally a specific commit on github)
  2. the query you run
  3. the exact time it took for you to get the output table (note to self: we need to add code to report the time elapsed)
  4. the dimension of your output table
  5. a list of column names in your output table (need that to determine how many datasets were loaded)

Also, since we now use RDS and PKL files to save output, we have to load an entire file to extract any specific quantity from it. This is a limitation we cannot resolve unless we switch to another data storage solution, as has long been discussed.
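
For concreteness, this is what the limitation looks like on the R side (a minimal sketch; the file and element names are hypothetical):

```r
# readRDS() must deserialize the whole file, even to extract one element:
out  <- readRDS("simulate/simulate_1.rds")  # hypothetical module output file
beta <- out$beta                            # the rest of 'out' was loaded for nothing
```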

pcarbo commented 4 years ago

@gaow As we discussed in person, I think the best way to approach this is to provide more information to the user about what dscquery is doing, and its progress. The interface should also provide better guidance to the user about how to use dscquery effectively.

gaow commented 4 years ago

@pcarbo I implemented a simple progress bar that shows the percentage of targets loaded and the estimated time remaining:

```
> dscout <- dscquery(dsc.outdir = "dsc_result",
+                    targets    = c("simulate","analyze","score.error"))

dsc-query dsc_result -o /tmp/Rtmp4CNpsR/file63a3384d1981.csv --target "simulate analyze score.error" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Populating DSC output table of dimension 8 by 7.
- Loading targets [==========================] 100% eta:  0s
```

It might be worth adding to it, for internal diagnosis, some monitoring stats such as CPU usage and disk I/O, to see if there are other improvements we can make.
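
For reference, a bar in this style can be produced with the progress package (a sketch of the general technique, not necessarily how dscrutils implements it):

```r
library(progress)

n  <- 8  # number of targets to load
pb <- progress_bar$new(
  format = "- Loading targets [:bar] :percent eta: :eta",
  total  = n)
for (i in seq_len(n)) {
  Sys.sleep(0.1)  # stand-in for loading one target
  pb$tick()       # advance the bar by one step
}
```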

pcarbo commented 4 years ago

@gaow Very nice! That is certainly an improvement.

gaow commented 4 years ago

I'm testing out @fmorgante's example myself. I noticed that even a regular query via the dsc-query command takes a very long time to complete -- roughly 5 minutes before it can even start loading the output. The resulting matrix to fill is 720,000 x 12, i.e., fewer than a million module instances, which is a reasonable scale for a benchmark. It seems we need to work out the performance issues here, both in the Python program dsc-query and in the subsequent loading step in R.

pcarbo commented 4 years ago

@gaow One thing that might be helpful here would be to establish a "lower bound" on runtime. For instance, suppose I load matrices from hundreds of .rds files into R, and combine them in a "smart" way. How long does this take? How does this compare to doing this in dscquery? It shouldn't be hard to come up with a simple comparison.
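
Something along these lines, for example (a sketch; the directory layout and the "scores" element are hypothetical):

```r
# Naive baseline: read every .rds file under the output directory and
# row-bind one element from each, timing the whole operation.
files <- list.files("dsc_result", pattern = "\\.rds$",
                    recursive = TRUE, full.names = TRUE)
system.time({
  objs <- lapply(files, readRDS)
  res  <- do.call(rbind, lapply(objs, function (x) x$scores))
})
```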

gaow commented 4 years ago

The Python program dsc-query does more than just lump together those RDS files. That step took 5 minutes for 720,000 rows, which can be improved, but it is nowhere near as bad as loading the result in R, which ran for a long time (>3 hrs) before it got killed (as our users ran out of patience). I'm using @fmorgante's example and the progress bar to identify which code chunk is the culprit.

gaow commented 4 years ago

One thing I notice is that lines 571 to 582 of the current dscquery.R do some work even before loading any data; that is, it hangs there before hitting the progress bar. If you take this dataset, test.csv.gz (generated from the dsc-query command that took 5 minutes), decompress it to test.csv, and run this query:

```r
dsc <- dscrutils::dscquery("./",
                           c("score.err", "score", "simulate", "fit_pred",
                             "simulate.n_traits", "simulate.pve",
                             "simulate.n_signal", "small_data.subsetN"),
                           return.type = "list", verbose = TRUE,
                           cache = "test.csv")
```

Since I added the cache argument in my latest commit, this should bring you right to the lines in question in that function, without having to have @fmorgante's DSC outputs in place. You'll see it get stuck in that two-layer for loop before hitting the progress bar.

By "stuck" I'm talking about 2 hours as of now, and still counting!

Improving this could be a good first issue for someone with some computational background. Still, it would be nice if you could verify it. I suspect it might be easier (at least for me) to deal with it at the level of the dsc-query Python program, but I'm sure better R code could also help.
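
For anyone picking this up: a generic pattern that often makes a two-layer for loop this slow in R is growing an object inside the loop, which is quadratic in the number of iterations; preallocating makes the same computation linear. (This is a sketch of the general pitfall, not a diagnosis of what dscquery.R actually does.)

```r
# Quadratic: 'out' is reallocated and copied on every iteration.
slow <- function (n) {
  out <- character(0)
  for (i in 1:n)
    for (j in 1:10)
      out <- c(out, paste(i, j))
  out
}

# Linear: 'out' is allocated once and filled in place.
fast <- function (n) {
  out <- character(10 * n)
  k   <- 1
  for (i in 1:n)
    for (j in 1:10) {
      out[k] <- paste(i, j)
      k <- k + 1
    }
  out
}
```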

pcarbo commented 4 years ago

@gaow One bottleneck was read.csv; the performance instantly improved when I replaced it with fread from the data.table package.
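
For reference, the swap looks roughly like this (a sketch; the actual call sites in dscquery.R may pass other arguments):

```r
# Before: read.csv is slow on large files.
dat <- read.csv("test.csv", stringsAsFactors = FALSE)

# After: data.table::fread is typically much faster on large files, and
# returns a plain data frame when data.table = FALSE.
dat <- data.table::fread("test.csv", data.table = FALSE)
```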

There are some other places where the code is unnecessarily slow due to naive implementations. I will continue to work on this.

In any case, this is a very useful test case. (And it is the first time I've tried running dscquery on a table with 700,000 rows.)