Open legical opened 1 year ago
It's been a long time since I have worked on this project.
One thing which I notice is that the first error is "FGPU:Couldn't open shmem"
The FAQ has an entry for this: "My application is complaining that 'FGPU:Couldn't open shmem'".
It says that "This indicates that the fgpu_server is not running."
Have you checked whether the fgpu_server is actually running properly (via the ps command, maybe)?
See https://github.com/sakjain92/Fractional-GPUs/blob/master/doc/PORT.md#running-an-application also
Edit: I see that the common.sh script already starts the server in the fgpu_init function (called by the evaluation.sh script). But since the command $BIN_PATH/$SERVER &> /dev/null & runs in the background with its output discarded, I never check whether it actually started without error. Can you run this command manually and check that it executes without error?
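A minimal sketch of how a script could verify that a backgrounded command survived startup instead of discarding its output. The helper name start_and_check, the log path, and the one-second grace period are assumptions for illustration; none of this is taken from common.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of common.sh): start a command in the
# background, keep its output in a log file, and confirm it is still alive
# after a short grace period.
start_and_check() {
    local log_file="$1"; shift
    "$@" &> "$log_file" &
    local pid=$!
    sleep 1    # assumed grace period; a real server may need longer
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pid"    # still running: report its PID to the caller
        return 0
    fi
    # The process already exited: collect its status and point at the log.
    local status=0
    wait "$pid" || status=$?
    echo "command exited early with status $status; see $log_file" >&2
    return 1
}
```

With the variables from the thread, a call might look like `start_and_check /tmp/fgpu_server.log "$BIN_PATH/$SERVER" || exit 1`. Unlike `&> /dev/null`, the log file keeps any message such as "FGPU:Unknown Cuda device" available for inspection.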
Sorry to bother you and thank you very much for your work four years ago!
After reading scripts/evaluation.sh, I found that this happened because the script did not successfully stop the X server.
I then ran F-GPU from a tty shell (Ubuntu 16.04 and GTX 1070) after running sudo service lightdm stop. It started running successfully, but eventually errors appeared:
********************************
Running reverse engineering code
********************************
Finding DRAM Banks hash function
Finding threshold
Done:100.0%
Access Time: Threshold is: 561.023540 cycles, (Max: 579.000000 cycles, Min:483.000000 cycles)
***********************************************************************
Outputting DRAM access time trendline's raw data to /tmp/tmp.lH02oq0PhJ
***********************************************************************
***********************************************************************
Outputting DRAM access time histogram's raw data to /tmp/tmp.cKtBqNp97l
***********************************************************************
Finding DRAM row size (Might take a while)
GPUcheck: misaligned address /home/legical/Workspace/Fractional-GPUs/reverse_engineering/gpu.cu 404
Error: Reverse engineering code failed
The run appears to have failed while finding the DRAM bank hash function, although it had already allocated contiguous memory and returned the starting physical address using the device_allocate_contigous function.
Actually, I need both compute partitioning and device_allocate_contigous. But according to the doc, device_allocate_contigous is only enabled in reverse engineering mode. Can both features be used in one mode? If so, do you remember how to compile the project to get both? Thank you very much!
My kernel version and other parameters attached:
Description: Ubuntu 16.04.7 LTS
Linux version 4.15.0-142-generic (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)) #146~16.04.1-Ubuntu SMP Tue Apr 13 09:27:15 UTC 2021
linux-image-4.15.0-142-generic 4.15.0-142.146~16.04.1 linux-image-generic-hwe-16.04 4.15.0.142.137
linux-headers-4.15.0-142
linux-headers-4.15.0-142-generic
linux-source-4.4.0
NVIDIA-SMI 390.48 Driver Version: 390.48
Cuda compilation tools, release 9.1, V9.1.85
cmake-3.24.0
GPUcheck: misaligned address /home/legical/Workspace/Fractional-GPUs/reverse_engineering/gpu.cu 404
This is a weird error to have come up. On line 404, I got this: gpuErrAssert(cudaDeviceSynchronize());
I don't know why this function should fail. Can you try to make this error go away by modifying the code, or figure out why it's actually failing? (It could be a bug in the changes I made to the kernel driver, but I'm not sure, since my code worked for me on a GTX 1070 without issues.)
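One general CUDA point may help narrow this down: cudaDeviceSynchronize() also reports errors raised by kernels that were launched earlier and ran asynchronously, so a failure at this line usually points to a kernel launched before it. A minimal sketch of rerunning a binary with synchronous kernel launches, so the error surfaces closer to the offending launch, follows. The wrapper name is hypothetical, and the binary to run is left as an argument because its path is not given in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command with CUDA_LAUNCH_BLOCKING=1, which makes
# kernel launches synchronous so an error is reported near the launch that
# caused it rather than at a later synchronization point.
run_with_blocking_launches() {
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# The CUDA 9.x toolkit also ships cuda-memcheck, which can report the kernel
# and address of a misaligned access:
#   cuda-memcheck <reverse-engineering binary> [args...]
```

Whether this isolates the problem here is not certain, since the failure may also come from the modified driver, as noted above.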
As for having both compute partitioning & device_allocate_contiguous, as per BUILD.md, I think you need to have FGPU_COMP_COLORING_ENABLE & FGPU_TEST_MEM_COLORING_ENABLED enabled
See this note:
* **FGPU_COMP_COLORING_ENABLE**
    * Default - Enabled
    * Disabling this disables compute partitioning. In this case, each application utilizes whole GPU.
    * Enabling this enables compute partitioning. In this case, each application utilizes only subsets of total SMs in a GPU.
* **FGPU_TEST_MEM_COLORING_ENABLED**
    * Default - Disabled.
    * Enabling this enables contiguous memory allocation when using fgpu_memory_allocate() API.
    * This feature is useful only when reverse engineering a new GPU.
    * To be kept disabled during production mode (i.e. when running actual applications/benchmarks).
    * This requires both compute and memory coloring to be enabled.
* **Reverse engineering** (Not to be used to run external application)
    * *FGPU_COMP_COLORING_ENABLE* is enabled.
    * *FGPU_MEM_COLORING_ENABLED* is disabled.
    * *FGPU_TEST_MEM_COLORING_ENABLED* is enabled.
    * Only reverse engineering code is intended to run (one reverse engineering application at a time) in this scenario.
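A minimal sketch of passing the "Reverse engineering" settings quoted above to cmake. It assumes these options are ordinary cmake cache variables that accept ON/OFF through -D, which is how they are written elsewhere in this thread; BUILD.md remains the authority on how they are actually set. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Flags for the "Reverse engineering" configuration quoted above.
# Assumption: the options are cmake cache variables that accept ON/OFF.
reverse_engineering_flags() {
    printf '%s\n' \
        "-DFGPU_COMP_COLORING_ENABLE=ON" \
        "-DFGPU_MEM_COLORING_ENABLED=OFF" \
        "-DFGPU_TEST_MEM_COLORING_ENABLED=ON"
}

# Print the build command instead of running it, so it can be reviewed first.
print_build_command() {
    local proj_dir="$1"
    echo "cd $proj_dir/build && cmake $(reverse_engineering_flags | tr '\n' ' ').. && make"
}
```

Calling `print_build_command "$PROJ_DIR"` prints a single command line that can be pasted into a shell once the build directory exists.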
Hello dear author(s),
First, thank you a lot for this amazing work!
I also tried running the project in reverse engineering mode. However, it yields the same errors:
FGPU:Couldn't open shmem
fgpu_init() failed
Init failed
Error: Reverse engineering code failed
I set the parameters as described in the Build document:
Reverse engineering (Not to be used to run external application)
* FGPU_COMP_COLORING_ENABLE is enabled.
* FGPU_MEM_COLORING_ENABLED is disabled.
* FGPU_TEST_MEM_COLORING_ENABLED is enabled.
* Only reverse engineering code is intended to run (one reverse engineering application at a time) in this scenario.
Also, I looked at the FAQ and at Port, where you mention:
Prior to running an application that uses FGPU, Nvidia MPS and FGPU server needs to be running.
sudo $PROJ_DIR/scripts/mps_init.sh
cd $PROJ_DIR
./fgpu_server
However, I cannot find fgpu_server in $PROJ_DIR.
I run the project from a tty shell after stopping the X server with the "sudo service lightdm stop" command.
Could you help me figure this out?
Thanks in advance.
To get fgpu_server, you need to build the source code. Please refer to BUILD.md in the doc folder, specifically this section:
Build
To build the FGPU code, follow these steps
cd $PROJ_DIR
mkdir build
cd build
cmake ..
make
After these steps, in the build directory, following files should be present:
- libfractional_gpu.so - Link external applications with this library
- fgpu_server - Server that is required by FGPU applications.
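A minimal sketch for confirming that the two files listed above exist after the build. The helper name check_build_outputs is hypothetical; the file names are the ones listed in BUILD.md.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the build directory contains the two
# artifacts listed above. Prints each missing file and returns non-zero if
# anything is absent.
check_build_outputs() {
    local build_dir="$1"
    local missing=0
    local name
    for name in libfractional_gpu.so fgpu_server; do
        if [ ! -e "$build_dir/$name" ]; then
            echo "missing: $build_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_build_outputs "$PROJ_DIR/build" && "$PROJ_DIR/build/fgpu_server"` only starts the server when both artifacts are present.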
Thanks a lot for the response.
Now fgpu_server is available under $PROJ_DIR/build. When I run fgpu_server, I get the following error:
FGPU:Unknown Cuda device
FGPU:Server Terminating. Waiting for device to be free
I think this is due to an unsupported GPU. However, I have already added the name of my GPU, GeForce GTX 1060 3GB, to the list of supported GPUs in the common.sh script.
I think I am missing another place where the GPU name needs to be added. Could you help me figure out this issue?
Thanks in advance.
See chatgpt response here: https://chat.openai.com/share/07c4b3f3-a778-4e95-82fc-299d8bebb65e
Try running the nvidia-smi command and checking whether the Nvidia driver detects your GPU in the system.
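A minimal sketch of comparing the name the driver reports against a list of supported names. Both helper names are hypothetical, and the list passed in is supplied by the caller; this does not reproduce how common.sh or fgpu_server actually check for support, which this thread does not show.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the detected GPU name exactly matches one of
# the names passed after it.
is_supported_gpu() {
    local detected="$1"; shift
    local name
    for name in "$@"; do
        if [ "$name" = "$detected" ]; then
            return 0
        fi
    done
    return 1
}

# Name of the first GPU as reported by the Nvidia driver.
detected_gpu_name() {
    nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1
}
```

A call such as `is_supported_gpu "$(detected_gpu_name)" "GeForce GTX 1070" "GeForce GTX 1060 3GB"` shows whether the reported string matches an entry exactly. A small difference in the name would make an exact comparison fail even when the GPU is detected correctly.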
Thanks a lot! I have been able to successfully run the tool. The only thing I need to do is to make my GPU supported in the tool.
Best regards.
Build env: Ubuntu 16 (4.15.0-142-generic), CUDA 9.1, modified Nvidia driver, cmake 3.22, gcc/g++ 5 or 6.
I used scripts/evaluation.sh to run in reverse engineering mode and got the following output:
FGPU:Couldn't open shmem
fgpu_init() failed
Init failed
Error: Reverse engineering code failed
After reading doc/FAQ.md, I tried building with "FGPU_COMP_COLORING_ENABLE=ON" "FGPU_MEM_COLORING_ENABLED=ON" "FGPU_TEST_MEM_COLORING_ENABLED=ON", and with "FGPU_COMP_COLORING_ENABLE=ON" "FGPU_MEM_COLORING_ENABLED=OFF" "FGPU_TEST_MEM_COLORING_ENABLED=ON".
But neither works correctly.
Can you tell me how I should solve this problem in order to make the project work properly?