sakjain92 / Fractional-GPUs

Splits a single Nvidia GPU into multiple partitions with complete compute and memory isolation (with respect to performance) between the partitions
148 stars 34 forks

Problems when running as Reverse Engineering mode #4

Open legical opened 1 year ago

legical commented 1 year ago

Build env: Ubuntu 16 (4.15.0-142-generic), CUDA 9.1, modified Nvidia driver, cmake 3.22, gcc/g++ 5 or 6.

I used scripts/evaluation.sh to run in reverse engineering mode and got the following output:

FGPU:Couldn't open shmem
fgpu_init() failed
Init failed
Error: Reverse engineering code failed

After reading doc/FAQ.md, I tried to build with two configurations:

* "FGPU_COMP_COLORING_ENABLE=ON" "FGPU_MEM_COLORING_ENABLED=ON" "FGPU_TEST_MEM_COLORING_ENABLED=ON"
* "FGPU_COMP_COLORING_ENABLE=ON" "FGPU_MEM_COLORING_ENABLED=OFF" "FGPU_TEST_MEM_COLORING_ENABLED=ON"

But neither works correctly.

Can you tell me how I should solve this problem in order to make the project work properly?

sakjain92 commented 1 year ago

It's been a long time since I have worked on this project.

One thing I notice is that the first error is "FGPU:Couldn't open shmem".

The FAQ has an entry for this: "My application is complaining that 'FGPU:Couldn't open shmem'". It says: "This indicates that the fgpu_server is not running." Have you checked whether the fgpu_server is actually running properly (via the ps command, maybe)?

See https://github.com/sakjain92/Fractional-GPUs/blob/master/doc/PORT.md#running-an-application also

Edit: I see that the common.sh script already starts the server in the fgpu_init function (called by the evaluation.sh script). But it seems I am not checking whether the command $BIN_PATH/$SERVER &> /dev/null & actually starts without error, since it runs in the background. Can you manually check whether this command executes without error?
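A background launch like the one described above can be checked with a small wrapper. This is only a minimal sketch, assuming a POSIX-style shell; the function name start_and_check and the grace period argument are hypothetical and are not part of common.sh:

```shell
# Sketch: start a command in the background, wait a short grace period,
# then report whether the process is still alive.
# Usage: start_and_check <grace-seconds> <command> [args...]
start_and_check() {
    local grace="$1"
    shift
    "$@" > /dev/null 2>&1 &
    local pid=$!
    sleep "$grace"
    # kill -0 sends no signal; it only tests that the process still exists.
    if kill -0 "$pid" 2> /dev/null; then
        echo "running (pid $pid)"
        return 0
    fi
    # The process already exited, so collect its exit status for the message.
    local status=0
    wait "$pid" || status=$?
    echo "exited early with status $status"
    return 1
}
```

For example, `start_and_check 1 "$BIN_PATH/$SERVER"` would report whether the server survived its first second. Running `pgrep -x fgpu_server` afterwards is another way to look for the process.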

legical commented 1 year ago

Sorry to bother you and thank you very much for your work four years ago!

After reading scripts/evaluation.sh, I found that this happened because the X server had not been stopped successfully. I ran F-GPU from a tty shell (Ubuntu 16.04, GTX 1070) after running sudo service lightdm stop. It started running successfully, but eventually errors appeared:

********************************
Running reverse engineering code
********************************
Finding DRAM Banks hash function
Finding threshold
Done:100.0%
Access Time: Threshold is: 561.023540 cycles, (Max: 579.000000 cycles, Min:483.000000 cycles)
***********************************************************************
Outputting DRAM access time trendline's raw data to /tmp/tmp.lH02oq0PhJ
***********************************************************************
***********************************************************************
Outputting DRAM access time histogram's raw data to /tmp/tmp.cKtBqNp97l
***********************************************************************
Finding DRAM row size (Might take a while)
GPUcheck: misaligned address /home/legical/Workspace/Fractional-GPUs/reverse_engineering/gpu.cu 404
Error: Reverse engineering code failed

This appears to have failed while finding the DRAM bank hash function, but it did successfully allocate contiguous memory and return the starting physical address using the device_allocate_contigous function.

Actually, I need to implement both Compute partitioning and the device_allocate_contigous. But according to the doc, device_allocate_contigous is only enabled in reverse engineering mode. Are these two functions possible in one mode? If so, do you remember how to compile this project to achieve these two functions? Thank you very much!

My kernel version and other parameters attached:

Description:    Ubuntu 16.04.7 LTS

Linux version 4.15.0-142-generic (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)) #146~16.04.1-Ubuntu SMP Tue Apr 13 09:27:15 UTC 2021

linux-image-4.15.0-142-generic             4.15.0-142.146~16.04.1                          linux-image-generic-hwe-16.04              4.15.0.142.137
linux-headers-4.15.0-142
linux-headers-4.15.0-142-generic
linux-source-4.4.0

NVIDIA-SMI 390.48                 Driver Version: 390.48 
Cuda compilation tools, release 9.1, V9.1.85

cmake-3.24.0

sakjain92 commented 1 year ago

GPUcheck: misaligned address /home/legical/Workspace/Fractional-GPUs/reverse_engineering/gpu.cu 404

This is a weird error to have come up. Line 404 is: gpuErrAssert(cudaDeviceSynchronize());

I don't know why this function should fail. Can you try to make this error go away by modifying the code, or figure out why it's actually failing? (It could be a bug in the changes I made to the kernel driver, but I'm not sure, since my code worked for me on a GTX 1070 without issues.)

As for having both compute partitioning & device_allocate_contiguous, as per BUILD.md, I think you need to have FGPU_COMP_COLORING_ENABLE & FGPU_TEST_MEM_COLORING_ENABLED enabled.

See this note:

* **FGPU_COMP_COLORING_ENABLE**
    * Default - Enabled
    * Disabling this disables compute partitioning. In this case, each application utilizes whole GPU.
    * Enabling this enables compute partitioning. In this case, each application utilizes only subsets of total SMs in a GPU.

* **FGPU_TEST_MEM_COLORING_ENABLED**
    * Default - Disabled.
    * Enabling this enables contiguous memory allocation when using fgpu_memory_allocate() API.
    * This feature is useful only when reverse engineering a new GPU.
    * To be kept disabled during production mode (i.e. when running actual applications/benchmarks).
    * This requires both compute and memory coloring to be enabled.

* **Reverse engineering** (Not to be used to run external application)
    * *FGPU_COMP_COLORING_ENABLE* is enabled.
    * *FGPU_MEM_COLORING_ENABLED* is disabled.
    * *FGPU_TEST_MEM_COLORING_ENABLED* is enabled.
    * Only reverse engineering code is intended to run (one reverse engineering application at a time) in this scenario.
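
The reverse engineering settings listed above could be passed to CMake roughly as follows. This is a sketch only: the option names come from the BUILD.md excerpt, but it assumes they are ordinary CMake cache variables set with -D, and the helper fgpu_cmake_flags and the mode name "reverse" are hypothetical:

```shell
# Sketch: map a build mode to the CMake options quoted from BUILD.md.
# Only the documented reverse engineering combination is covered here.
fgpu_cmake_flags() {
    case "$1" in
        reverse)
            echo "-DFGPU_COMP_COLORING_ENABLE=ON -DFGPU_MEM_COLORING_ENABLED=OFF -DFGPU_TEST_MEM_COLORING_ENABLED=ON"
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}
```

From the build directory this would be used as `cmake $(fgpu_cmake_flags reverse) .. && make`. CMake keeps earlier option values in CMakeCache.txt, so passing all three options explicitly (or starting from an empty build directory) avoids picking up a stale setting.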

RavanN700 commented 6 months ago

Hello dear author(s),

First, thank you a lot for this amazing work!

I also tried running the project in reverse engineering mode. However, it yields the same errors:

FGPU:Couldn't open shmem
fgpu_init() failed
Init failed
Error: Reverse engineering code failed

I set the build parameters as described in the Build document:

* Reverse engineering (Not to be used to run external application)
    * FGPU_COMP_COLORING_ENABLE is enabled.
    * FGPU_MEM_COLORING_ENABLED is disabled.
    * FGPU_TEST_MEM_COLORING_ENABLED is enabled.
    * Only reverse engineering code is intended to run (one reverse engineering application at a time) in this scenario.

I also looked at the FAQ and at PORT.md, where you mention:

Prior to running an application that uses FGPU, Nvidia MPS and FGPU server needs to be running.

sudo $PROJ_DIR/scripts/mps_init.sh
cd $PROJ_DIR
./fgpu_server

However, it is not able to find fgpu_server in $PROJ_DIR.

I ran the project from a tty shell after stopping the X server with the "sudo service lightdm stop" command.

Could you help me figure this out?

Thanks in advance.

sakjain92 commented 6 months ago

To build fgpu_server, you need to build the source code. Please refer to BUILD.md in the doc folder, specifically this section:

Build

To build the FGPU code, follow these steps

cd $PROJ_DIR
mkdir build
cd build
cmake ..
make

After these steps, in the build directory, following files should be present:

  • libfractional_gpu.so - Link external applications with this library
  • fgpu_server - Server that is required by FGPU applications.
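
Whether the build produced the two files listed above can be checked with a few lines of shell. A minimal sketch, assuming only the file names quoted from BUILD.md; the function check_fgpu_build is hypothetical:

```shell
# Sketch: report which of the expected build outputs are missing.
# Returns 0 when both files exist in the given directory, 1 otherwise.
check_fgpu_build() {
    local build_dir="$1"
    local missing=0
    local f
    for f in libfractional_gpu.so fgpu_server; do
        if [ ! -e "$build_dir/$f" ]; then
            echo "missing: $build_dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_fgpu_build "$PROJ_DIR/build"` prints nothing and returns 0 once both files are present.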

RavanN700 commented 6 months ago

Thanks a lot for the response.

Now the fgpu_server is available under $PROJ_DIR/build. When I run the fgpu_server, I get the following error:

FGPU:Unknown Cuda device
FGPU:Server Terminating. Waiting for device to be free

I think this is due to an unsupported GPU. However, I already added the name of my GPU, GeForce GTX 1060 3GB, to the list of supported GPUs in the common.sh script file.

I think I am missing some other place where the GPU name needs to be added. Could you help me figure out this issue?

Thanks in advance.

sakjain92 commented 6 months ago

See the ChatGPT response here: https://chat.openai.com/share/07c4b3f3-a778-4e95-82fc-299d8bebb65e

Try running nvidia-smi command and checking if Nvidia driver detects your GPU in the system.
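
Once the driver reports the GPU, the reported name can be compared against a list of supported names. The sketch below is only an illustration of an exact string match; the function gpu_in_list is hypothetical, and the names passed to it are supplied by the caller rather than taken from the project's real list:

```shell
# Sketch: succeed if the first argument exactly matches one of the
# remaining arguments (the candidate list of supported GPU names).
gpu_in_list() {
    local name="$1"
    local candidate
    shift
    for candidate in "$@"; do
        [ "$candidate" = "$name" ] && return 0
    done
    return 1
}
```

The name itself can be read with `nvidia-smi --query-gpu=name --format=csv,noheader`. Because the comparison here is exact, "GeForce GTX 1060 3GB" and "GeForce GTX 1060" would count as different devices.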

RavanN700 commented 6 months ago

Thanks a lot! I have been able to successfully run the tool. The only thing I need to do is to make my GPU supported in the tool.

Best regards.