gpgpu-sim / gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.

GPU memory size over 4GB #71

Open cyk0521 opened 6 years ago

cyk0521 commented 6 years ago

I'm trying to run a very large-scale application that uses more than 4GB of GPU memory on the GPGPU-Sim dev branch with the GTX1080Ti config. (CUDA 7.5, gcc-4.4.7, g++-4.4.7, Ubuntu 14.04)

However, it fails with GPU memory corruption (possibly an overflow). So I inspected the GPGPU-Sim source code and found the "address_type" typedef in src/abstract_hardware_model.h.

src/abstract_hardware_model.h: line 70, typedef unsigned address_type;

I thought this might cause the memory overflow, so I changed "unsigned" to "unsigned long long" for 64-bit addressing, but that causes new problems with my GPU kernel. (My CUDA application works fine on a native GTX1080Ti GPU.)

Is it correct that the current version of GPGPU-Sim has a <4GB limitation? If so, how can I overcome it?
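To illustrate the concern raised above, here is a minimal sketch of what happens when an address above 4GB is stored in a 32-bit "unsigned". Only the typedef of "unsigned" comes from the thread; the names address_type_32, truncate_address, and fits_in_32_bits are hypothetical and are not part of GPGPU-Sim. The sketch assumes a platform where "unsigned" is 32 bits wide, which the static_assert makes explicit.

```cpp
#include <cstdint>

// Assumption: `unsigned` is 32 bits wide, as on common Linux x86-64 toolchains.
static_assert(sizeof(unsigned) == 4, "this sketch assumes a 32-bit unsigned");

// Hypothetical stand-in for the typedef quoted from src/abstract_hardware_model.h.
typedef unsigned address_type_32;

// Stores a 64-bit address in the 32-bit type. The conversion keeps only
// the low 32 bits, so anything at or above 4GB wraps around.
address_type_32 truncate_address(std::uint64_t addr) {
    return static_cast<address_type_32>(addr);
}

// Returns true if the address survives a round trip through the 32-bit type.
bool fits_in_32_bits(std::uint64_t addr) {
    return static_cast<std::uint64_t>(truncate_address(addr)) == addr;
}
```

For example, truncate_address(0x100000010ULL) yields 0x10, so an address just past the 4GB boundary wraps around to a low address. Widening the typedef alone may not be sufficient if other code also assumes 32-bit addresses, which could be one reason the change described above caused new problems; the thread does not confirm the cause.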

abhi1212 commented 6 years ago

What type of application are you trying to run?

cyk0521 commented 6 years ago

I have ported darknet (https://github.com/pjreddie/darknet) to GPGPU-Sim (cuBLAS and cuRAND were replaced by my own kernels) and ran VGGnet-16 (inference).

It works fine on my native GTX1080Ti GPU, but it reports a buffer overflow while loading weight data from a file (528MB). GPGPU-Sim consumed more than 4GB of memory while loading the VGGnet-16 weight file, and it reported a buffer overflow that does not occur on the native GPU.

abhi1212 commented 6 years ago

Which cfg file are you using? How many layers execute on the simulator before you get an overflow error?

cyk0521 commented 6 years ago

I'm using vgg-16.cfg and imagenet1k.data to validate an input image. Here is my command line: ./darknet classifier predict cfg/imagenet1k.data cfg/vgg-16.cfg vgg-16.weights data/eagle.jpg

I hit the overflow error before any neural network layer executes. GPU memory contents were corrupted while loading the VGGnet-16 weight file.

From what I observed, no GPU kernel was executed before the memory corruption.

EdwarDu commented 6 years ago

@cyk0521 sorry for hijacking the thread, but would it be possible to share your port of darknet? We are also trying to use darknet with GPGPU-Sim and are facing the same issue with cuBLAS and cuRAND, though we are using tiny YOLO and have not hit the memory overflow issue (for now).

abhi1212 commented 6 years ago

Yes, I would also like to take a look at it. Have you posted it on git?

gangmul12 commented 5 years ago

While running ResNet-50, I faced the same issue. The variable that holds the address value overflows, so functional simulation fails.
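As a rough illustration of why an overflowing address variable can break functional simulation, the sketch below uses a toy byte-addressed memory keyed by a 32-bit address. ToyMemory32 is a hypothetical class written for this example; it is not GPGPU-Sim's memory model, and the thread does not say where in the simulator the overflow occurs.

```cpp
#include <cstdint>
#include <map>

// Hypothetical toy memory: one byte per location, keyed by a 32-bit address.
class ToyMemory32 {
public:
    // The 64-bit address is narrowed to 32 bits, dropping the high bits.
    void write(std::uint64_t addr, std::uint8_t value) {
        bytes_[static_cast<std::uint32_t>(addr)] = value;
    }

    // Unwritten locations read back as 0.
    std::uint8_t read(std::uint64_t addr) const {
        auto it = bytes_.find(static_cast<std::uint32_t>(addr));
        return it == bytes_.end() ? 0 : it->second;
    }

private:
    std::map<std::uint32_t, std::uint8_t> bytes_;
};
```

In this toy model, a write to 0x100001000 lands on the same key as 0x1000, so data stored earlier at the low address is silently overwritten. That kind of aliasing would look like memory corruption to the application even though no kernel misbehaved.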

[Attached screenshot: capture]