CUDA error when running the Quiver benchmark example with the ogbn-papers100M dataset

quiver-team / torch-quiver

PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.

https://torch-quiver.readthedocs.io/en/latest/

Apache License 2.0

293 stars, 36 forks

Closed. SoupFree closed this issue 2 years ago

SoupFree commented 2 years ago

Hello, I found a bug: running the benchmark example for ogbn-papers100M fails with a CUDA error (CUBLAS_STATUS_NOT_INITIALIZED). I have, of course, already carried out the step suggested in the documentation.

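By debugging I found exactly where it goes wrong: the call to cudaHostRegister returns a CUDA error, but the return value is never checked. The returned error code is 1, which I looked up and found to be cudaErrorInvalidValue.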
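To illustrate what checking that return value could look like, here is a minimal Python sketch. The names `check_cuda`, `CudaRuntimeError`, and `CUDA_ERROR_NAMES` are hypothetical and are not part of torch-quiver; the table lists only a few well-known CUDA runtime status codes, and the status value is passed in by hand rather than obtained from a real CUDA call.

```python
# Hypothetical helper for illustration only; not torch-quiver code.
# Only a few well-known CUDA runtime status codes are listed here.
CUDA_ERROR_NAMES = {
    0: "cudaSuccess",
    1: "cudaErrorInvalidValue",
    2: "cudaErrorMemoryAllocation",
}


class CudaRuntimeError(RuntimeError):
    """Raised when a CUDA runtime call reports a non-zero status."""


def check_cuda(code, what):
    """Raise instead of silently ignoring a failed CUDA runtime call."""
    code = int(code)
    if code != 0:
        name = CUDA_ERROR_NAMES.get(code, "unknown CUDA error")
        raise CudaRuntimeError(f"{what} failed with code {code} ({name})")


# Example with the status code reported in this issue.
try:
    check_cuda(1, "cudaHostRegister")
except CudaRuntimeError as err:
    print(err)  # cudaHostRegister failed with code 1 (cudaErrorInvalidValue)
```

With a check like this, the failure would surface at the cudaHostRegister call itself rather than later as an unrelated-looking cuBLAS error.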

From the logs, the failure occurs as soon as cudaHostRegister has host-mapped 30000000000 bytes, which is rather strange.

Could the developers please take a look at this problem? Thank you!

eedalong commented 2 years ago

It looks like there is not enough memory. Could you try making shm larger and see whether that helps?

SoupFree commented 2 years ago

We have enough memory, and we also set the shm size to 256G, but it still fails: mount -o remount,size=256G /dev/shm

However, top shows only 25G of shared memory; I don't know whether that is the cause.
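One way to confirm whether the remount actually took effect is to query the mount's size directly. The following is a minimal sketch using only the Python standard library; `shm_report` and `REQUIRED_BYTES` are hypothetical names, and the 30000000000-byte threshold is simply the figure from the log above. Whether the /dev/shm size is really the cause is not confirmed in this thread.

```python
import os
import shutil

# Size at which cudaHostRegister was reported to fail in this issue.
REQUIRED_BYTES = 30_000_000_000


def shm_report(path="/dev/shm", required=REQUIRED_BYTES):
    """Return total/free size of the filesystem at `path` in GiB,
    and whether `required` bytes would fit into the free space."""
    usage = shutil.disk_usage(path)
    gib = 2**30
    return {
        "total_gib": usage.total / gib,
        "free_gib": usage.free / gib,
        "required_gib": required / gib,
        "fits": usage.free >= required,
    }


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    print(shm_report())
```

If `total_gib` is still far below 256 after the remount, the new size was not applied to the mount the process actually uses (for example inside a container with its own /dev/shm).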

ZenoTan commented 2 years ago

eedalong commented 2 years ago

You could try this approach and see whether it works: https://stackoverflow.com/questions/58804022/how-to-resize-dev-shm

@SoupFree Did this approach solve the problem?

eedalong commented 2 years ago

OK, got it.