h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

investigate cuDF memory usage #94

Closed jangorecki closed 4 years ago

jangorecki commented 5 years ago

Grouping benchmark is running fine for cuDF on 1e7 rows (0.5 GB csv) data. When trying to run 1e8 rows (5 GB csv) data it fails during/after loading the data with

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

This is surprising because there should be enough memory; we have roughly 22 GB of GPU memory across two cards (11 GB each):

> nvidia-smi
Wed Aug 21 02:33:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
| 23%   33C    P8    16W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:81:00.0 Off |                  N/A |
| 23%   37C    P8    12W / 250W |      1MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

cuDF 0.8.0

Related issue https://github.com/rapidsai/cudf/issues/2478

datametrician commented 4 years ago

cuDF is single-GPU only but can use both GPUs with Dask (Dask-cuDF). With a 5 GB csv you're probably running OOM on a single 11 GB card. With Dask-cuDF you can break the loading into smaller chunks, which will make this fit on a 1080 Ti.

jangorecki commented 4 years ago

Thanks @datametrician, this is very valuable information, and also a surprising one: I did not expect cuDF not to have that built-in. I filed a question in https://github.com/rapidsai/cudf/issues/3374 and opened a new issue https://github.com/h2oai/db-benchmark/issues/116

datametrician commented 4 years ago

All of our algorithms and kernels use Dask for scaling to multiple GPUs within a machine and across multiple nodes. In a way this is no different from pandas and scikit-learn being single-core or single-node by default, where you need something like Dask to scale them out. This makes the programming model more straightforward: single-node multi-GPU code is the same as multi-node multi-GPU code.

jangorecki commented 4 years ago

It is surely an advantage when you might want to scale to multiple machines, but when you are only interested in a single-node environment you are likely to get better performance from parallelism developed specifically for a single node. According to the linked cudf question it is not on the roadmap, and dask-cudf is going to be the standard way to parallelise on a single machine as well.

datametrician commented 4 years ago

I actually don't think you would get more performance even on a single node. UCX optimizes transport with RDMA, so the performance on a single node should be identical, plus it also scales as an upside :)

jangorecki commented 4 years ago

I investigated the limits of what we are currently able to compute using our 2x GeForce GTX 1080 Ti.

I used the following query while running various data sizes (note that cardinality didn't change).

nvidia-smi --query-gpu=timestamp,index,memory.total,memory.used,memory.free --format=csv,noheader,nounits -lms 100 -f gpu_G1_1e7_1e2.out

cudf

Using 11 GB we are able to compute the groupby benchmark up to 5e7 rows (2.3 GB csv), and we start to run out of memory at 6e7 rows (2.8 GB csv). Groupby on 1e7 rows (0.45 GB csv) required 2.25 GB of memory. Using dask-cudf (#116) may or may not allow computing the 1e8 rows (4.7 GB csv) data; assuming memory requirements scale linearly with row count, it is likely to run out of memory.

[plot: max_mem-cudf, GPU memory usage by data size]
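A quick back-of-the-envelope check of that linear-scaling assumption (the 2.25 GB measurement for 1e7 rows comes from the numbers above; the extrapolation itself is an assumption, not a measurement):

```python
# Extrapolate cuDF groupby memory usage linearly from the measured
# 2.25 GB at 1e7 rows, and compare against a single 11 GB GTX 1080 Ti.
def required_gb(rows, measured_rows=1e7, measured_gb=2.25):
    """Assume memory need is proportional to row count."""
    return measured_gb * rows / measured_rows

gpu_gb = 11  # memory of one GTX 1080 Ti
need = required_gb(1e8)
print(f"1e8 rows: ~{need:.1f} GB estimated, vs {gpu_gb} GB on one GPU")
# → 1e8 rows: ~22.5 GB estimated, vs 11 GB on one GPU
```

Even split perfectly across both 11 GB cards, the ~22.5 GB estimate leaves no headroom, which matches the expectation that 1e8 would OOM.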

datametrician commented 4 years ago

With Dask-cuDF you can read a csv in chunks, so you can squeeze more in. You can read 1M rows at a time and should be able to fit more in memory (plus you can spill to system memory, but that hurts performance).

jangorecki commented 4 years ago

@datametrician Thanks for the info. Any idea whether processing by chunks requires micromanaging (like manually merging groups from different chunks, i.e. re-aggregating), or is it just a matter of setting a parameter? I recently added support for dask on-disk data storage using parquet. The only change was to use a different function to load the data; no extra post-query chunk micromanaging was required. As a result dask can compute the 1e9 groupby data size. The only problem is that it is up to 100 times slower than other tools.
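To illustrate what that chunk micromanaging would look like if done by hand: for simple aggregates it is a two-phase scheme, partial aggregates per chunk followed by a merge. A minimal stdlib-Python sketch (hypothetical data and column roles; Dask/Dask-cuDF perform this merge internally, which is why a single parameter suffices):

```python
from collections import defaultdict

def chunked_group_sum(rows, chunk_size=2):
    """Two-phase 'sum v1 by id1': partial sums per chunk, then merged.
    `rows` is an iterable of (id1, v1) pairs; chunk_size is tiny for demo."""
    total = defaultdict(float)

    def flush(chunk):
        partial = defaultdict(float)       # phase 1: aggregate within the chunk
        for key, val in chunk:
            partial[key] += val
        for key, val in partial.items():   # phase 2: merge into the global result
            total[key] += val

    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)
    return dict(total)

rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("a", 5.0)]
print(chunked_group_sum(rows))  # → {'a': 9.0, 'b': 6.0}
```

Sums, counts, and min/max merge trivially like this; mean needs (sum, count) pairs per group and median needs much more state, which is where letting Dask manage the re-aggregation pays off.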

datametrician commented 4 years ago

Just a matter of setting a parameter. Dask, in this case, will use system memory instead of disk. Hopefully it's not 100x slower. That being said, I think the easiest thing to do is upgrade the GPUs to RTX 8000s; this is what the NVIDIA DSWS ships with.

jangorecki commented 4 years ago

Posting the code I used in this issue for future reference.

# log GPU memory usage every 100 ms while the benchmark runs
nvidia-smi --query-gpu=timestamp,index,memory.total,memory.used,memory.free --format=csv,noheader,nounits -lms 100 -f gpu_G1_1e7_1e2.out
library(data.table)
# read all nvidia-smi logs, tagging each with the data size parsed from the file name
d = rbindlist(
  lapply(setNames(nm=list.files(pattern="^gpu_G1_.*\\.out$")), fread,
         col.names=c("timestamp","gpu","total","used","free")),
  idcol="in_rows"
)[, in_rows:=sapply(strsplit(in_rows, "_", fixed=TRUE), `[[`, 3L)]
d[, timestamp:=as.POSIXct(timestamp)]

d[in_rows=="1e7", plot(type="l", x=timestamp, y=used)]

# trim leading/trailing samples where the GPU was idle (used <= 1 MB)
subset.used = function(dt) {
  r = range(which(dt$used > 1))
  dt[r[1L]:r[2L]]
}
dd = d[in_rows %in% c("1e7","3e7","5e7","6e7"), subset.used(.SD), by="in_rows"]
dd[, timestamp:=seq_along(timestamp)]  # re-index runs onto a common x axis

lattice::xyplot(
  used ~ timestamp | in_rows,
  dd, type="l", grid=TRUE, groups=gpu,
  main="cudf GPU memory usage",
  xlab = "timestamp",
  ylab = "MB",
  scales=list(y=list(relation="free")),
  auto.key=list(points=FALSE, lines=TRUE)
)

#grep G1_5e7 time.csv | cut -d"," -f 5,7,10,14,15
data.table::fread("
G1_5e7_1e2_0_0,sum v1 by id1,cudf,1,0.057
G1_5e7_1e2_0_0,sum v1 by id1,cudf,2,0.05
G1_5e7_1e2_0_0,sum v1 by id1:id2,cudf,1,0.057
G1_5e7_1e2_0_0,sum v1 by id1:id2,cudf,2,0.058
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,cudf,1,0.278
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,cudf,2,0.278
G1_5e7_1e2_0_0,mean v1:v3 by id4,cudf,1,0.125
G1_5e7_1e2_0_0,mean v1:v3 by id4,cudf,2,0.117
G1_5e7_1e2_0_0,sum v1:v3 by id6,cudf,1,0.303
G1_5e7_1e2_0_0,sum v1:v3 by id6,cudf,2,0.303
G1_5e7_1e2_0_0,sum v3 count by id1:id6,cudf,1,0.707
G1_5e7_1e2_0_0,sum v3 count by id1:id6,cudf,2,0.705
G1_5e7_1e2_0_0,sum v1 by id1,data.table,1,0.54
G1_5e7_1e2_0_0,sum v1 by id1,data.table,2,0.405
G1_5e7_1e2_0_0,sum v1 by id1:id2,data.table,1,0.533
G1_5e7_1e2_0_0,sum v1 by id1:id2,data.table,2,0.486
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,data.table,1,0.672
G1_5e7_1e2_0_0,sum v1 mean v3 by id3,data.table,2,0.639
G1_5e7_1e2_0_0,mean v1:v3 by id4,data.table,1,0.837
G1_5e7_1e2_0_0,mean v1:v3 by id4,data.table,2,0.854
G1_5e7_1e2_0_0,sum v1:v3 by id6,data.table,1,0.704
G1_5e7_1e2_0_0,sum v1:v3 by id6,data.table,2,0.697
G1_5e7_1e2_0_0,median v3 sd v3 by id4 id5,data.table,1,5.003
G1_5e7_1e2_0_0,median v3 sd v3 by id4 id5,data.table,2,4.822
G1_5e7_1e2_0_0,max v1 - min v2 by id3,data.table,1,2.397
G1_5e7_1e2_0_0,max v1 - min v2 by id3,data.table,2,2.388
G1_5e7_1e2_0_0,largest two v3 by id6,data.table,1,4.414
G1_5e7_1e2_0_0,largest two v3 by id6,data.table,2,4.342
G1_5e7_1e2_0_0,regression v1 v2 by id2 id4,data.table,1,2.697
G1_5e7_1e2_0_0,regression v1 v2 by id2 id4,data.table,2,2.232
G1_5e7_1e2_0_0,sum v3 count by id1:id6,data.table,1,3.196
G1_5e7_1e2_0_0,sum v3 count by id1:id6,data.table,2,2.828",
col.names=c("data","question","solution","run","time")) -> d
d[, iquestion:=as.integer(factor(question, levels=unique(question)))]
# pair data.table and cudf timings for each question and run
dd = d[solution=="data.table"][d[solution=="cudf"], on=c("data","iquestion","run"), nomatch=NULL][
  , .(iquestion, run, dt=time, cudf=i.time)]
ddd = dd[, cudf2dt := cudf/dt][]
ddd
# append totals and the mean cudf-to-data.table time ratio
rbind(ddd, list(NA, NA, sum(ddd$dt), sum(ddd$cudf), mean(ddd$cudf2dt)))
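As a quick sanity check on the summary computed above, the cudf-to-data.table ratio for a single question can be verified by hand (timings for "sum v1 by id1", run 1, copied from the pasted csv):

```python
# "sum v1 by id1", run 1, from the timings pasted above
cudf_time = 0.057
dt_time = 0.54

cudf2dt = cudf_time / dt_time  # the cudf2dt column in the R code
print(f"cudf/data.table ratio: {cudf2dt:.3f}")         # → 0.106
print(f"speedup: {dt_time / cudf_time:.1f}x")          # cudf ~9.5x faster here
```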