jeongminpark417 / GIDS


CMake error after successfully installing the BaM block driver and NVM block benchmark with CUDA 12.6 #22

Open · gaowayne opened 1 month ago

gaowayne commented 1 month ago

Hello experts, I am hitting the CMake error below; could you please take a look? I can build BaM and run its benchmark fine on Ubuntu 20.04.3 with NVIDIA driver 560 and CUDA 12.6 installed.

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# cmake ..
-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is unknown
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/bin/nvcc
-- Check for working CUDA compiler: /usr/bin/nvcc -- broken
CMake Error at /usr/share/cmake-3.16/Modules/CMakeTestCUDACompiler.cmake:46 (message):
  The CUDA compiler

    "/usr/bin/nvcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/make cmTC_8f6f1/fast && /usr/bin/make -f CMakeFiles/cmTC_8f6f1.dir/build.make CMakeFiles/cmTC_8f6f1.dir/build
    make[1]: Entering directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_8f6f1.dir/main.cu.o
    /usr/bin/nvcc     -x cu -c /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_8f6f1.dir/main.cu.o
    ptxas fatal   : Value 'sm_30' is not defined for option 'gpu-name'
    make[1]: *** [CMakeFiles/cmTC_8f6f1.dir/build.make:66: CMakeFiles/cmTC_8f6f1.dir/main.cu.o] Error 255
    make[1]: Leaving directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    make: *** [Makefile:121: cmTC_8f6f1/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:2 (PROJECT)

-- Configuring incomplete, errors occurred!
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeOutput.log".
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeError.log".
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# 
WWWzq-01 commented 1 month ago

Please check the version of /usr/bin/nvcc. Its version should match the nvcc version you used to compile BaM.
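
For example, compare the nvcc that CMake picks up by default with the one on your PATH:

   /usr/bin/nvcc -V
   which nvcc && nvcc -V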

gaowayne commented 1 month ago

Please check the version of /usr/bin/nvcc. Its version should match the nvcc version you used to compile BaM.

yes, I confirmed it matched.

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:18:05_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# nvidia-smi
Sun Oct 20 04:00:31 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:8A:00.0 Off |                    0 |
| N/A   29C    P8             32W /  350W |      23MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2499      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A      3542      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# 
WWWzq-01 commented 1 month ago

Please execute which nvcc to see if the output is /usr/bin/nvcc. Alternatively, run /usr/bin/nvcc -V to verify the version of nvcc that is actually being called during the cmake process.

gaowayne commented 1 month ago

Please execute which nvcc to see if the output is /usr/bin/nvcc. Alternatively, run /usr/bin/nvcc -V to verify the version of nvcc that is actually being called during the cmake process.

man, you are quite correct :) What is the best way to fix this?

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# which nvcc
/usr/local/cuda/bin/nvcc
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# /usr/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# 
gaowayne commented 1 month ago

@WWWzq-01 buddy, I manually copied the 12.6 nvcc into /usr/bin, and now that CUDA error is gone. Next I got a GCC version error. I can build BaM with GCC 9, which is the default on Ubuntu 20.04.3.

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# cmake ..
-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is unknown
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/bin/nvcc
-- Check for working CUDA compiler: /usr/bin/nvcc -- broken
CMake Error at /usr/share/cmake-3.16/Modules/CMakeTestCUDACompiler.cmake:46 (message):
  The CUDA compiler

    "/usr/bin/nvcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/make cmTC_740a5/fast && /usr/bin/make -f CMakeFiles/cmTC_740a5.dir/build.make CMakeFiles/cmTC_740a5.dir/build
    make[1]: Entering directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_740a5.dir/main.cu.o
    /usr/bin/nvcc     -x cu -c /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_740a5.dir/main.cu.o
    In file included from /usr/include/cuda_runtime.h:83,
                     from <command-line>:
    /usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
          |  ^~~~~
    make[1]: *** [CMakeFiles/cmTC_740a5.dir/build.make:66: CMakeFiles/cmTC_740a5.dir/main.cu.o] Error 1
    make[1]: Leaving directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    make: *** [Makefile:121: cmTC_740a5/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:2 (PROJECT)

-- Configuring incomplete, errors occurred!
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeOutput.log".
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeError.log".
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# 
gaowayne commented 1 month ago

@WWWzq-01 I fixed the include headers by copying the CUDA 12.6 headers into /usr/include, and that error is gone. I hit a new error below:

cicc not found.

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# cmake .. -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))")  -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))")
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
-- The CUDA compiler identification is unknown
-- Check for working CUDA compiler: /usr/bin/nvcc
-- Check for working CUDA compiler: /usr/bin/nvcc -- broken
CMake Error at /usr/share/cmake-3.16/Modules/CMakeTestCUDACompiler.cmake:46 (message):
  The CUDA compiler

    "/usr/bin/nvcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/make cmTC_66f66/fast && /usr/bin/make -f CMakeFiles/cmTC_66f66.dir/build.make CMakeFiles/cmTC_66f66.dir/build
    make[1]: Entering directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_66f66.dir/main.cu.o
    /usr/bin/nvcc     -x cu -c /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_66f66.dir/main.cu.o
    sh: 1: cicc: not found
    make[1]: *** [CMakeFiles/cmTC_66f66.dir/build.make:66: CMakeFiles/cmTC_66f66.dir/main.cu.o] Error 127
    make[1]: Leaving directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    make: *** [Makefile:121: cmTC_66f66/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:2 (PROJECT)

-- Configuring incomplete, errors occurred!
gaowayne commented 1 month ago

I have fixed the cicc tool path problem. Now I see the link error below:

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# cmake .. -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))")  -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))")
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is unknown
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/bin/nvcc
-- Check for working CUDA compiler: /usr/bin/nvcc -- broken
CMake Error at /usr/share/cmake-3.16/Modules/CMakeTestCUDACompiler.cmake:46 (message):
  The CUDA compiler

    "/usr/bin/nvcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/make cmTC_23d39/fast && /usr/bin/make -f CMakeFiles/cmTC_23d39.dir/build.make CMakeFiles/cmTC_23d39.dir/build
    make[1]: Entering directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_23d39.dir/main.cu.o
    /usr/bin/nvcc     -x cu -c /root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_23d39.dir/main.cu.o
    Linking CUDA executable cmTC_23d39
    /usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_23d39.dir/link.txt --verbose=1
    ""   CMakeFiles/cmTC_23d39.dir/main.cu.o -o cmTC_23d39 
    Error running link command: No such file or directory
    make[1]: *** [CMakeFiles/cmTC_23d39.dir/build.make:87: cmTC_23d39] Error 2
    make[1]: Leaving directory '/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeTmp'
    make: *** [Makefile:121: cmTC_23d39/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:2 (PROJECT)

-- Configuring incomplete, errors occurred!
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeOutput.log".
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeError.log".
gaowayne commented 1 month ago

I fixed the linker problem by copying bin/crt/link.stub. Here is the new error:

Failed to parse CUDA nvcc implicit link information:

Failed to parsed CUDA nvcc implicit link information:
    #$ _THERE_=/usr/bin
    #$ _TARGET_SIZE_=
    #$ _TARGET_DIR_=
    #$ _TARGET_SIZE_=64
    #$ rm tmp/a_dlink.reg.c

Failed to parsed CUDA nvcc implicit link information:
    #$ _THERE_=/usr/bin
    #$ _TARGET_SIZE_=
    #$ _TARGET_DIR_=
    #$ _TARGET_SIZE_=64
    #$ rm tmp/a_dlink.reg.c

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:2 (PROJECT)

Failed to parsed CUDA nvcc implicit link information:
Failed to parsed CUDA nvcc implicit link information:
    #$ _THERE_=/usr/bin
    #$ _TARGET_SIZE_=
    #$ _TARGET_DIR_=
    #$ _TARGET_SIZE_=64
    #$ rm tmp/a_dlink.reg.c

Failed to parsed CUDA nvcc implicit link information:
Failed to parsed CUDA nvcc implicit link information:
    #$ _THERE_=/usr/bin
    #$ _TARGET_SIZE_=
    #$ _TARGET_DIR_=
    #$ _TARGET_SIZE_=64
    #$ rm tmp/a_dlink.reg.c

    #$ _CUDART_=cudart
    #$ _HERE_=/usr/bin
    #$ _THERE_=/usr/bin
    #$ _TARGET_SIZE_=
    #$ _TARGET_DIR_=
    #$ _TARGET_SIZE_=64
    #$ rm tmp/a_dlink.reg.c
    #$ gcc -D__CUDA_ARCH_LIST__=520 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__   -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=77 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "CMakeCUDACompilerId.cu" -o "tmp/CMakeCUDACompilerId.cpp4.ii"
    #$ cudafe++ --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "CMakeCUDACompilerId.cu" --orig_src_path_name "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/3.16.3/CompilerIdCUDA/CMakeCUDACompilerId.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "tmp/CMakeCUDACompilerId.cudafe1.cpp" --stub_file_name "CMakeCUDACompilerId.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "tmp/CMakeCUDACompilerId.module_id" "tmp/CMakeCUDACompilerId.cpp4.ii"
    #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__   -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=77 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "CMakeCUDACompilerId.cu" -o "tmp/CMakeCUDACompilerId.cpp1.ii"
    #$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "CMakeCUDACompilerId.cu" --orig_src_path_name "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/3.16.3/CompilerIdCUDA/CMakeCUDACompilerId.cu" --allow_managed  -arch compute_52 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "CMakeCUDACompilerId.fatbin.c" -tused --module_id_file_name "tmp/CMakeCUDACompilerId.module_id" --gen_c_file_name "tmp/CMakeCUDACompilerId.cudafe1.c" --stub_file_name "tmp/CMakeCUDACompilerId.cudafe1.stub.c" --gen_device_file_name "tmp/CMakeCUDACompilerId.cudafe1.gpu"  "tmp/CMakeCUDACompilerId.cpp1.ii" -o "tmp/CMakeCUDACompilerId.ptx"
    #$ ptxas -arch=sm_52 -m64 "tmp/CMakeCUDACompilerId.ptx"  -o "tmp/CMakeCUDACompilerId.sm_52.cubin"
    #$ fatbinary --create="tmp/CMakeCUDACompilerId.fatbin" -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=52,file=tmp/CMakeCUDACompilerId.sm_52.cubin" "--image3=kind=ptx,sm=52,file=tmp/CMakeCUDACompilerId.ptx" --embedded-fatbin="tmp/CMakeCUDACompilerId.fatbin.c"
    #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -D__NV_LEGACY_LAUNCH -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi -m64 "tmp/CMakeCUDACompilerId.cudafe1.cpp" -o "tmp/CMakeCUDACompilerId.o"
    #$ nvlink -m64 --arch=sm_52 --register-link-binaries="tmp/a_dlink.reg.c"  -cpu-arch=X86_64 "tmp/CMakeCUDACompilerId.o"  -lcudadevrt  -o "tmp/a_dlink.sm_52.cubin" --host-ccbin "gcc"
    #$ fatbinary --create="tmp/a_dlink.fatbin" -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -link "--image3=kind=elf,sm=52,file=tmp/a_dlink.sm_52.cubin" --embedded-fatbin="tmp/a_dlink.fatbin.c"
    #$ gcc -D__CUDA_ARCH_LIST__=520 -D__NV_LEGACY_LAUNCH -c -x c++ -DFATBINFILE="\"tmp/a_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"tmp/a_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__  -Wno-psabi  -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=77 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -m64 "/usr/bin/crt/link.stub" -o "tmp/a_dlink.o"
    #$ g++ -D__CUDA_ARCH_LIST__=520 -D__NV_LEGACY_LAUNCH -m64 -Wl,--start-group "tmp/a_dlink.o" "tmp/CMakeCUDACompilerId.o"  -lcudadevrt  -lcudart_static  -lrt -lpthread  -ldl  -Wl,--end-group -o "a.out"
gaowayne commented 1 month ago

Enabling the CMake trace, we can see this error. Need help!

/snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake(130):  else()
   Called from: [3]     /snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake
                [2]     /snap/cmake/1417/share/cmake-3.30/Modules/CMakeDetermineCUDACompiler.cmake
                [1]     /root/wayne/gids/GIDS/gids_module/CMakeLists.txt
/snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake(131):  message(CONFIGURE_LOG Failed to parse CUDA nvcc implicit link information:\n${_nvcc_log}\n\n )
   Called from: [3]     /snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake
                [2]     /snap/cmake/1417/share/cmake-3.30/Modules/CMakeDetermineCUDACompiler.cmake
                [1]     /root/wayne/gids/GIDS/gids_module/CMakeLists.txt
/snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake(133):  message(FATAL_ERROR Failed to extract nvcc implicit link line. )
   Called from: [3]     /snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake
                [2]     /snap/cmake/1417/share/cmake-3.30/Modules/CMakeDetermineCUDACompiler.cmake
                [1]     /root/wayne/gids/GIDS/gids_module/CMakeLists.txt
CMake Error at /snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake:133 (message):
  Failed to extract nvcc implicit link line.
Call Stack (most recent call first):
  /snap/cmake/1417/share/cmake-3.30/Modules/CMakeDetermineCUDACompiler.cmake:242 (cmake_nvcc_parse_implicit_info)
  CMakeLists.txt:2 (PROJECT)

   Called from: [3]     /snap/cmake/1417/share/cmake-3.30/Modules/Internal/CMakeNVCCParseImplicitInfo.cmake
                [2]     /snap/cmake/1417/share/cmake-3.30/Modules/CMakeDetermineCUDACompiler.cmake
                [1]     /root/wayne/gids/GIDS/gids_module/CMakeLists.txt
-- Configuring incomplete, errors occurred!
WWWzq-01 commented 1 month ago

You can specify the CUDA version directly during the CMake process with the following command:

cmake -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc

This command sets the path to the desired CUDA toolkit and nvcc compiler, ensuring that CMake uses the specified version during the build process.
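
Equivalently, CMake reads the CUDACXX environment variable to choose the CUDA compiler, so you can export it before the first configure (clear the build directory first so the old detection result is not cached):

   export CUDACXX=/usr/local/cuda-12.6/bin/nvcc
   cmake ..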

If you still encounter GCC errors, you can specify the GCC version directly using the following command:

cmake -DCMAKE_C_COMPILER=/usr/bin/gcc-9 -DCMAKE_CXX_COMPILER=/usr/bin/g++-9 -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc

Here, replace /usr/bin/gcc-9 with the desired GCC version. To find the default GCC path, you can use the following command:

ll $(which gcc)

This will display the path of the default gcc version, which you can then use in your CMake command.

gaowayne commented 1 month ago
cmake -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc

hello buddy, your first method works great now; I only see a Python-related error. :)

It is very promising now. Can you shed more light on this? :)

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# cmake -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc .. \
> -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))")  \
> -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))")
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is NVIDIA 12.6.77
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda-12.6/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-12.6/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
CMake Error at CMakeLists.txt:29 (FIND_PACKAGE):
  By not providing "Findpybind11.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "pybind11",
  but CMake did not find one.

  Could not find a package configuration file provided by "pybind11" with any
  of the following names:

    pybind11Config.cmake
    pybind11-config.cmake

  Add the installation prefix of "pybind11" to CMAKE_PREFIX_PATH or set
  "pybind11_DIR" to a directory containing one of the above files.  If
  "pybind11" provides a separate development package or SDK, be sure it has
  been installed.

-- Configuring incomplete, errors occurred!
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMake
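
A common way past this pybind11 error, assuming pybind11 is pip-installed for the same python3, is to point CMake at its config directory:

   pip install pybind11
   cmake .. -Dpybind11_DIR=$(python3 -m pybind11 --cmakedir)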
gaowayne commented 1 month ago

This is the latest error:

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# cmake -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc .. -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))")  -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))")
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
-- Found PythonInterp: /usr/bin/python3.8 (found version "3.8.10") 
-- Found PythonInterp: /usr/bin/python3.8 (found suitable version "3.8.10", minimum required is "3") 
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find PythonLibs (missing: PYTHON_INCLUDE_DIRS) (Required is at
  least version "3")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindPythonLibs.cmake:310 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:45 (FIND_PACKAGE)

-- Configuring incomplete, errors occurred!
See also "/root/wayne/gids/GIDS/gids_module/build/CMakeFiles/CMakeOutput.log".
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/gids_module/build# 
WWWzq-01 commented 1 month ago

To make the python command point to python3, you can create a symbolic link as follows:

   sudo ln -s /usr/bin/python3 /usr/bin/python

Or, you can modify the command to use python3 instead of python.
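
On Ubuntu 20.04 the same symlink is also provided by the python-is-python3 package that the error message itself suggests:

   sudo apt install python-is-python3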

gaowayne commented 1 month ago

@WWWzq-01 all of that is great. Now I am trying to run the unit tests, but I still hit some Python dependency errors under Ubuntu 20.04.

Any good idea on this libssl.so.3 error?

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./gids_unit_test.sh
/usr/local/lib/python3.8/dist-packages/scipy/__init__.py:143: UserWarning: A NumPy version >=1.19.5 and <1.27.0 is required for this version of SciPy (detected version 1.17.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Traceback (most recent call last):
  File "GIDS_unit_test.py", line 2, in <module>
    import dgl
  File "/usr/local/lib/python3.8/dist-packages/dgl/__init__.py", line 16, in <module>
    from . import (
  File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/__init__.py", line 13, in <module>
    from .dataloader import *
  File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/dataloader.py", line 27, in <module>
    from ..distributed import DistGraph
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/__init__.py", line 5, in <module>
    from .dist_graph import DistGraph, DistGraphServer, edge_split, node_split
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 11, in <module>
    from .. import backend as F, graphbolt as gb, heterograph_index
  File "/usr/local/lib/python3.8/dist-packages/dgl/graphbolt/__init__.py", line 8, in <module>
    from .base import *
  File "/usr/local/lib/python3.8/dist-packages/dgl/graphbolt/base.py", line 8, in <module>
    from torchdata.datapipes.iter import IterDataPipe
  File "/usr/local/lib/python3.8/dist-packages/torchdata/datapipes/__init__.py", line 9, in <module>
    from torchdata import _extension  # noqa: F401
  File "/usr/local/lib/python3.8/dist-packages/torchdata/__init__.py", line 29, in __getattr__
    return importlib.import_module("." + name, __name__)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.8/dist-packages/torchdata/_extension.py", line 34, in <module>
    _init_extension()
  File "/usr/local/lib/python3.8/dist-packages/torchdata/_extension.py", line 31, in _init_extension
    from torchdata import _torchdata as _torchdata
  File "/usr/local/lib/python3.8/dist-packages/torchdata/__init__.py", line 29, in __getattr__
    return importlib.import_module("." + name, __name__)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libssl.so.3: cannot open shared object file: No such file or directory
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# 
gaowayne commented 1 month ago

@WWWzq-01 I already fixed the libssl issue by adding the library path to LD_LIBRARY_PATH (sketched below). I tried this next one, but it is really hard; could you please check?
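
A minimal sketch of that LD path fix, assuming libssl.so.3 lives under a manually installed OpenSSL 3 prefix (the exact path is machine-specific):

   # expose a directory containing libssl.so.3 to the dynamic linker
   export LD_LIBRARY_PATH=/path/to/openssl-3/lib:$LD_LIBRARY_PATH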

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./gids_unit_test.sh 
/usr/local/lib/python3.8/dist-packages/torchdata/datapipes/__init__.py:18: UserWarning: 
################################################################################
WARNING!
The 'datapipes', 'dataloader2' modules are deprecated and will be removed in a
to learn more and leave feedback.
################################################################################

  deprecation_warning()
Traceback (most recent call last):
  File "GIDS_unit_test.py", line 2, in <module>
    import dgl
  File "/usr/local/lib/python3.8/dist-packages/dgl/__init__.py", line 16, in <module>
    from . import (
  File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/__init__.py", line 13, in <module>
    from .dataloader import *
  File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/dataloader.py", line 27, in <module>
    from ..distributed import DistGraph
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/__init__.py", line 5, in <module>
    from .dist_graph import DistGraph, DistGraphServer, edge_split, node_split
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 11, in <module>
    from .. import backend as F, graphbolt as gb, heterograph_index
  File "/usr/local/lib/python3.8/dist-packages/dgl/graphbolt/__init__.py", line 55, in <module>
    load_graphbolt()
  File "/usr/local/lib/python3.8/dist-packages/dgl/graphbolt/__init__.py", line 45, in load_graphbolt
    raise FileNotFoundError(
FileNotFoundError: Cannot find DGL C++ graphbolt library at /usr/local/lib/python3.8/dist-packages/dgl/graphbolt/libgraphbolt_pytorch_2.4.1.so
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./gids_unit_test.sh 
Traceback (most recent call last):
  File "GIDS_unit_test.py", line 2, in <module>
    import dgl
ModuleNotFoundError: No module named 'dgl'
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# pip show dgl
WARNING: Package(s) not found: dgl
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# pip install dgl-cu126
ERROR: Could not find a version that satisfies the requirement dgl-cu126 (from versions: none)
ERROR: No matching distribution found for dgl-cu126
WWWzq-01 commented 1 month ago

There is probably no CUDA 12.6 version of DGL. You can refer to this documentation to install DGL and try installing the 12.4 version instead.

gaowayne commented 1 month ago

There is probably no CUDA 12.6 version of DGL. You can refer to this documentation to install DGL and try installing the 12.4 version instead.

I installed the CUDA 12.4 DGL Python package from the link above. That is done, but I get the error below.

In the path below, there is no such file as /usr/local/lib/python3.8/dist-packages/dgl/graphbolt/libgraphbolt_pytorch_2.4.0.so:

root@salab-hpedl380g11-01:/usr/local/lib/python3.8/dist-packages/dgl/graphbolt# ls -l
total 537620
-rw-r--r-- 1 root staff     16274 Oct 20 13:05 base.py
-rw-r--r-- 1 root staff      7013 Oct 20 13:05 dataloader.py
drwxr-sr-x 3 root staff      4096 Oct 20 13:05 datapipes
-rw-r--r-- 1 root staff      2751 Oct 20 13:05 dataset.py
-rw-r--r-- 1 root staff      5831 Oct 20 13:05 external_utils.py
-rw-r--r-- 1 root staff      9672 Oct 20 13:05 feature_fetcher.py
-rw-r--r-- 1 root staff     10578 Oct 20 13:05 feature_store.py
drwxr-sr-x 3 root staff      4096 Oct 20 13:05 impl
-rw-r--r-- 1 root staff      4181 Oct 20 13:05 __init__.py
drwxr-sr-x 3 root staff      4096 Oct 20 13:05 internal
-rw-r--r-- 1 root staff     12142 Oct 20 13:05 internal_utils.py
-rw-r--r-- 1 root staff     24537 Oct 20 13:05 item_sampler.py
-rw-r--r-- 1 root staff     16115 Oct 20 13:05 itemset.py
-rwxr-xr-x 1 root staff 275158000 Oct 20 13:05 libgraphbolt_pytorch_2.3.0.so
-rwxr-xr-x 1 root staff 275158000 Oct 20 13:05 libgraphbolt_pytorch_2.3.1.so
-rw-r--r-- 1 root staff     15590 Oct 20 13:05 minibatch.py
-rw-r--r-- 1 root staff      1109 Oct 20 13:05 minibatch_transformer.py
-rw-r--r-- 1 root staff      3292 Oct 20 13:05 negative_sampler.py
drwxr-sr-x 2 root staff      4096 Oct 20 13:05 __pycache__
-rw-r--r-- 1 root staff     16693 Oct 20 13:05 sampled_subgraph.py
-rw-r--r-- 1 root staff      2295 Oct 20 13:05 sampling_graph.py
-rw-r--r-- 1 root staff     10783 Oct 20 13:05 subgraph_sampler.py
root@salab-hpedl380g11-01:/usr/local/lib/python3.8/dist-packages/dgl/graphbolt# 

Here is the error log:

Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.8/dist-packages (from pydantic>=2.0->dgl) (2.23.4)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from pydantic>=2.0->dgl) (0.7.0)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas->dgl) (1.14.0)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.8/dist-packages (from sympy->torch<=2.4.0->dgl) (1.3.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.8/dist-packages (from nvidia-cusolver-cu12==11.4.5.107; platform_system == "Linux" and platform_machine == "x86_64"->torch<=2.4.0->dgl) (12.4.99)
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./gids_unit_test.sh 
Traceback (most recent call last):
  File "GIDS_unit_test.py", line 16, in <module>
    import GIDS
  File "/usr/local/lib/python3.8/dist-packages/GIDS/__init__.py", line 2, in <module>
    from .GIDS import GIDS
  File "/usr/local/lib/python3.8/dist-packages/GIDS/GIDS.py", line 24, in <module>
    from dgl.distributed import DistGraph
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/__init__.py", line 10, in <module>
    from .dist_graph import DistGraph, DistGraphServer, edge_split, node_split
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 12, in <module>
    from .. import backend as F, graphbolt as gb, heterograph_index
  File "/usr/local/lib/python3.8/dist-packages/dgl/graphbolt/__init__.py", line 81, in <module>
    load_graphbolt()
  File "/usr/local/lib/python3.8/dist-packages/dgl/graphbolt/__init__.py", line 66, in load_graphbolt
    raise FileNotFoundError(
FileNotFoundError: Unable to locate the DGL C++ GraphBolt library at /usr/local/lib/python3.8/dist-packages/dgl/graphbolt/libgraphbolt_pytorch_2.4.0.so. This error typically occurs due to a version mismatch between the installed DGL and the PyTorch version you are currently using. Please ensure that your DGL installation is compatible with your PyTorch version. For more information, refer to the installation guide at https://www.dgl.ai/pages/start.html.
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# 
WWWzq-01 commented 1 month ago

Did you match the corresponding PyTorch version when installing DGL? You can use the following command to display the list of installed packages with pip:

pip list
gaowayne commented 1 month ago

Did you match the corresponding PyTorch version when installing DGL? You can use the following command to display the list of installed packages with pip:

pip list

thank you so much man for your great help!~~

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# pip list
Package                  Version             
------------------------ --------------------
annotated-types          0.7.0               
apturl                   0.5.2               
attrs                    19.3.0              
Automat                  0.8.0               
bcrypt                   3.1.7               
blinker                  1.4                 
Brlapi                   0.7.0               
certifi                  2019.11.28          
chardet                  3.0.4               
Click                    7.0                 
cloud-init               24.3.1              
cmake-cpp-pybind11       0.1.1               
colorama                 0.4.3               
command-not-found        0.3                 
configobj                5.0.6               
constantly               15.1.0              
cryptography             2.8                 
cupshelpers              1.0                 

louis                    3.12.0              
macaroonbakery           1.3.1               
Mako                     1.1.0               
MarkupSafe               1.1.0               
monotonic                1.5                 
more-itertools           4.2.0               
mpmath                   1.3.0               
netifaces                0.10.4              
networkx                 3.1                 
numpy                    1.24.4              
nvidia-cublas-cu12       12.4.2.65           
nvidia-cuda-cupti-cu12   12.4.99             
nvidia-cuda-nvrtc-cu12   12.4.99             
nvidia-cuda-runtime-cu12 12.4.99             
nvidia-cudnn-cu12        9.1.0.70            
nvidia-cufft-cu12        11.2.0.44           
nvidia-curand-cu12       10.3.5.119          
nvidia-cusolver-cu12     11.6.0.99           
nvidia-cusparse-cu12     12.3.0.142          
nvidia-nccl-cu12         2.20.5              
nvidia-nvjitlink-cu12    12.4.99             
nvidia-nvtx-cu12         12.4.99             
nvtx                     0.2.10              
oauthlib                 3.1.0               
olefile                  0.46                
packaging                24.1                
pandas                   2.0.3               
paramiko                 2.6.0               
pexpect                  4.6.0               
Pillow                   7.0.0               
pip                      20.0.2              
protobuf                 3.6.1               
psutil                   6.1.0               
pyasn1                   0.4.2               
pyasn1-modules           0.2.1               

requests                 2.22.0              
requests-unixsocket      0.2.0               
scikit-learn             1.3.2               
scipy                    1.10.1              
screen-resolution-extra  0.0.0               
SecretStorage            2.3.1               
service-identity         18.1.0              
setuptools               45.2.0              
simplejson               3.16.0              
six                      1.14.0              
sos                      4.5.6               
ssh-import-id            5.10                
sympy                    1.13.3              
systemd-python           234                 
threadpoolctl            3.5.0               
torch                    2.4.1+cu124         
torchaudio               2.4.1+cu124         
torchdata                0.8.0               
torchvision              0.19.1+cu124        
tqdm                     4.66.5              
triton                   3.0.0               
Twisted                  18.9.0              
typing-extensions        4.12.2              
tzdata                   2024.2              
ubuntu-drivers-common    0.0.0               
ubuntu-pro-client        8001                
ufw                      0.36                
unattended-upgrades      0.1                 
urllib3                  1.25.8              
usb-creator              0.3.7               
wadllib                  1.3.3               
wheel                    0.34.2              
xkit                     0.0.0               
zipp                     1.0.0               
zope.interface           4.7.1 
WWWzq-01 commented 1 month ago

Did you install DGL using pip? It doesn’t show the DGL package here.

Did you match the corresponding PyTorch version when installing DGL?

gaowayne commented 1 month ago

Did you install DGL using pip? It doesn’t show the DGL package here.

Did you match the corresponding PyTorch version when installing DGL?

I installed both PyTorch and DGL with pip.

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# pip list | grep dgl
dgl                      2.4.0+cu121         
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# pip list | grep pytorch
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# pip list | grep torch
torch                    2.4.1+cu124         
torchaudio               2.4.1+cu124         
torchdata                0.8.0               
torchvision              0.19.1+cu124        
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# 
WWWzq-01 commented 1 month ago

Try installing DGL built for CUDA 12.4?

gaowayne commented 1 month ago

Try installing DGL built for CUDA 12.4?

Thank you so much, man. I uninstalled DGL and then installed it again as below. Now the unit test can start; there is a runtime error, but the dependencies look fine. :)

  pip uninstall dgl
  pip install dgl -f https://data.dgl.ai/wheels/torch-2.4/cu124/repo.html
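
A quick sanity check that the matching builds are now in place (both should report cu124 builds):

  python3 -c "import dgl, torch; print(dgl.__version__, torch.__version__)"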
msharmavikram commented 1 month ago

@gaowayne what did I teach in BaM? Never create a hack for a dependency. Please follow the standard version-matching mechanism.

@jeongminpark417 can you create a Dockerfile that manages these dependencies automatically? I believe users should not do these things manually. This is a broken approach and will fail.

gaowayne commented 1 month ago

@msharmavikram @WWWzq-01 @jeongminpark417 thank you all so much. It is working on my side now; I think I just need to change some hard-coded dataset paths to run all the way through. :)

I am downloading the full dataset.

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./test1.sh
GIDS DataLoader Setting
GIDS:  True
CPU Feature Buffer:  True
Window Buffering:  True
Storage Access Accumulator:  True
Dataset: IGB
Traceback (most recent call last):
  File "heterogeneous_train.py", line 282, in <module>
    dataset = IGBHeteroDGLDatasetMassive(args)
  File "/root/wayne/gids/GIDS/evaluation/dataloader.py", line 377, in __init__
    super().__init__(name='IGB260M')
  File "/usr/local/lib/python3.8/dist-packages/dgl/data/dgl_dataset.py", line 112, in __init__
    self._load()
  File "/usr/local/lib/python3.8/dist-packages/dgl/data/dgl_dataset.py", line 203, in _load
    self.process()
  File "/root/wayne/gids/GIDS/evaluation/dataloader.py", line 381, in process
    paper_paper_edges = torch.from_numpy(np.load(osp.join(self.dir, self.args.dataset_size, 'processed', 
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/nvme1n1/full/processed/paper__cites__paper/edge_index.npy'
gaowayne commented 1 month ago

@WWWzq-01 buddy

Where can I get this pr_full.pt file?

root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./test1.sh
GIDS DataLoader Setting
GIDS:  True
CPU Feature Buffer:  True
Window Buffering:  True
Storage Access Accumulator:  True
Dataset: IGB
SSD are not assigned
ssd list:  None
SSD index: 0
SQs: 255        CQs: 255        n_qps: 128
Ctrl sizes: 1
n pages: 1048576
page size: 4096
num elements: 563200000000
n_ranges_bits: 6
n_ranges_mask: 63
pages_dma: 0x7fb238010000       220020410000
HEREN
Cond1
100000 8 1 100000
Finish Making Page Cache
Number of required storage accesses:  854.0499999999993
Traceback (most recent call last):
  File "heterogeneous_train.py", line 312, in <module>
    track_acc_GIDS(g, category, args, device, labels, key_offset)
  File "heterogeneous_train.py", line 68, in track_acc_GIDS
    pr_ten = torch.load(args.pin_file)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 1065, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 468, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 449, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/nvme1n1/pr_full.pt'
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# 
gaowayne commented 1 month ago

Hello, I found I can use page_rank_node_list_gen.py to create the pr_full.pt file. Then I got the result below.

I have two questions:

  1. Why does my output not have sampling time and feature aggregation time, only train time and e2e time?
  2. Why can we not show the BaM effective BW and effective IOPS?
root@salab-hpedl380g11-01:~/wayne/gids/GIDS/evaluation# ./test1.sh
GIDS DataLoader Setting
GIDS:  True
CPU Feature Buffer:  True
Window Buffering:  True
Storage Access Accumulator:  True
Dataset: IGB
SSD are not assigned
ssd list:  None
SSD index: 0
SQs: 255        CQs: 255        n_qps: 128
Ctrl sizes: 1
n pages: 1048576
page size: 4096
num elements: 563200000000
n_ranges_bits: 6
n_ranges_mask: 63
pages_dma: 0x7ef768010000       220020410000
HEREN
Cond1
100000 8 1 100000
Finish Making Page Cache
Number of required storage accesses:  854.0499999999993
  0%|                                                                                                                                                                | 0/1 [00:00<?, ?it/s]
warp up done
GIDS time:  35.0621292591095
WB time:  0.11368942260742188
print stats: 
print array reset: #READ IOs: 0 #Accesses:1318947840    #Misses:1024407136      Miss Rate:0.776685      #Hits: 294540704        Hit Rate:0.223315       CLSize:4096     Debug Cnt: 0
*********************************

print ctrl reset 0: ------------------------------------
#SSDAccesses:   32012723

Kernel Time:     28573.6
Total Access:    175142339
Performance for 100 iteration after 1000 iteration
GIDS time:  3.439724922180176
WB time:  0.011293411254882812
print stats: 
print array reset: #READ IOs: 0 #Accesses:115118784     #Misses:85275584        Miss Rate:0.740762      #Hits: 29843200 Hit Rate:0.259238       CLSize:4096     Debug Cnt: 0
*********************************

print ctrl reset 0: ------------------------------------
#SSDAccesses:   2664862

Kernel Time:     2847.87
Total Access:    17468327
transfer time:  0.04842567443847656
train time:  0.7668819427490234
e2e time:  4.265716314315796
  0%|                                                                                                                                                                | 0/1 [00:47<?, ?it/s
gaowayne commented 1 month ago

Hello, also: where can I get the extended dataset files referenced below?

        elif self.size == 'large' or self.size == 'full':
            num_nodes = self.num_nodes()
            if self.num_classes == 19:
                path = '/mnt/nvme16/IGB260M_part_2/full/processed/paper/node_label_19_extended.npy'
                if(self.in_memory):
                    node_labels = np.memmap(path, dtype='float32', mode='r',  shape=(num_nodes)).copy()
                else:
                    node_labels = np.memmap(path, dtype='float32', mode='r',  shape=(num_nodes))
                # Actual number 227130858
            else:
                path = '/mnt/nvme16/IGB260M_part_2/full/processed/paper/node_label_2K_extended.npy'

                if(self.in_memory):
                    node_labels = np.load(path)
                else:
                    node_labels = np.memmap(path, dtype='float32', mode='r',  shape=(num_nodes))
jeongminpark417 commented 1 month ago

Hi @gaowayne, sorry for the late response. The dataset can be downloaded from the IGB dataset repo: https://github.com/IllinoisGraphBenchmark/IGB-Datasets. The feature aggregation time is the Kernel Time (ms).

It currently does not directly show BaM bandwidth and IOPS, but you can simply calculate them from the number of accesses and the kernel time.
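
For example, a rough back-of-the-envelope from the 100-iteration stats above, assuming each SSD access moves one 4 KiB page (per the "page size: 4096" line):

  # numbers copied from the log: 2,664,862 SSD accesses during a 2847.87 ms kernel
  awk 'BEGIN { acc = 2664862; page = 4096; kt = 2847.87 / 1000;
    printf "effective IOPS: %.0f\n", acc / kt;
    printf "effective BW:   %.2f GB/s\n", acc * page / kt / 1e9 }'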

gaowayne commented 1 month ago

Hi @gaowayne, sorry for the late response. The dataset can be downloaded from the IGB dataset repo: https://github.com/IllinoisGraphBenchmark/IGB-Datasets. The feature aggregation time is the Kernel Time (ms).

It currently does not directly show BaM bandwidth and IOPS, but you can simply calculate them from the number of accesses and the kernel time.

For BW and IOPS, I saw there is the BaM.fsstat code; maybe we can calculate them there?

Thank you so much. How about the sampling time? :) Also, I just downloaded the full package from the IGB-Datasets bash scripts as mentioned, but it does not contain any of the extended files that the GIDS dataloader.py looks for in the code above (node_label_19_extended.npy); the current full package has no extended files. Could you please shed some light on why the current code looks for extended files?

root@salab-hpedl380g11-01:/mnt/nvme1n1/full/processed# tree
.
├── paper
│   ├── node_feat.npy
│   ├── node_label_19.npy
│   ├── node_label_2K.npy
│   └── paper_id_index_mapping.npy
└── paper__cites__paper
    └── edge_index.npy
gaowayne commented 1 month ago

Guys, I also do not have the files below:

#        self.graph = dgl.graph((node_edges[:, 0],node_edges[:, 1]), num_nodes=node_features.shape[0])
        if self.args.dataset_size == 'full':
            edge_row_idx = torch.from_numpy(np.load(cur_path + '/paper__cites__paper/edge_index_csc_row_idx.npy'))
            edge_col_idx = torch.from_numpy(np.load(cur_path + '/paper__cites__paper/edge_index_csc_col_idx.npy'))
            edge_idx = torch.from_numpy(np.load(cur_path + '/paper__cites__paper/edge_index_csc_edge_idx.npy'))
            self.graph = dgl.graph(('csc', (edge_col_idx,edge_row_idx,edge_idx)), num_nodes=node_features.shape[0])
            self.graph  = self.graph.formats('csc')
        else:
gaowayne commented 1 month ago

Guys, I found another way to try: I will run ./download_igbh600m.sh and then run IGBHeteroDGLDatasetMassive; it looks like everything matches. :)

jeongminpark417 commented 1 month ago

The sampling time should be almost identical to what you get if you subtract the feature aggregation time and the training time from the epoch time. A more accurate sampling time can be measured by timing next(it) in the GIDS.py file.
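
As a rough check with the numbers reported earlier, treating the reported per-iteration e2e time as the epoch time (an assumption):

  # e2e 4.2657 s, kernel 2847.87 ms, train 0.7669 s, taken from the log above
  awk 'BEGIN { printf "approx. sampling time: %.3f s\n", 4.2657 - 2847.87/1000 - 0.7669 }'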

gaowayne commented 1 month ago

The sampling time should be almost identical to what you get if you subtract the feature aggregation time and the training time from the epoch time. A more accurate sampling time can be measured by timing next(it) in the GIDS.py file.

Thank you so much, man.

Kernel Time: 2848.38  -> this is the feature aggregation time, in ms
Total Access: 17512908
transfer time: 0.04809975624084473
train time: 0.7011768817901611
e2e time: 4.2234179973602295

May I know what the epoch time is? :) sampling = epoch - kernel - train; I guess this is correct.

Also, ./download_igbh600m.sh is very big; downloading it will take 1-2 days. But ./download_igbh260m.sh is missing the files needed to run full mode. :( Can you shed some light on this too?

100000 8 1 100000
Finish Making Page Cache
Number of required storage accesses:  854.0499999999993
  0%|                                                                                                                                                                | 0/1 [00:00<?, ?it/s]warp up done
GIDS time:  35.200151681900024
WB time:  0.11409378051757812
print stats: 
print array reset: #READ IOs: 0 #Accesses:1319791808    #Misses:1025061280      Miss Rate:0.776684      #Hits: 294730528        Hit Rate:0.223316       CLSize:4096     Debug Cnt: 0
*********************************

print ctrl reset 0: ------------------------------------
#SSDAccesses:   32033165

Kernel Time:     28529.1
Total Access:    175257719
Performance for 100 iteration after 1000 iteration
GIDS time:  3.4651525020599365
WB time:  0.011265754699707031
print stats: 
print array reset: #READ IOs: 0 #Accesses:115226464     #Misses:85321312        Miss Rate:0.740466      #Hits: 29905152 Hit Rate:0.259534       CLSize:4096     Debug Cnt: 0
*********************************

print ctrl reset 0: ------------------------------------
#SSDAccesses:   2666291

Kernel Time:     2848.38
Total Access:    17512908
transfer time:  0.04809975624084473
train time:  0.7011768817901611
e2e time:  4.2234179973602295
gaowayne commented 1 month ago

@jeongminpark417 @WWWzq-01 guys, if I would like to dump the effective bandwidth and IOPS of GIDS training vs. the baseline, how do I do that?