Open pvelesko opened 2 months ago
@pvelesko - what is the full output when you run a single, failing test (e.g., Kokkos_CoreUnitTest_Serial1)?
What do the results look like if you explicitly provide the target architecture?
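One way to capture the full output of a single failing test is to run it on its own from the build directory, either through CTest or by calling the test executable directly with a GoogleTest filter. The sketch below assumes the build tree lives in `$KOKKOS_DIR/build` and that each test executable sits under `core/unit_test/` with the same name as the CTest entry; the helper names `run_ctest_single` and `run_gtest_filtered` are made up for illustration.

```shell
# Hypothetical helpers; BUILD_DIR is an assumed location of the Kokkos build tree.
BUILD_DIR="${KOKKOS_DIR:-$HOME/kokkos-build/kokkos}/build"

# Run one CTest entry by exact name and print its complete output, even on failure.
run_ctest_single() {
  (cd "$BUILD_DIR" && ctest -R "^$1\$" --output-on-failure)
}

# Run the test executable directly, restricted to the GoogleTest cases
# that match the given filter pattern.
run_gtest_filtered() {
  "$BUILD_DIR/core/unit_test/$1" --gtest_filter="$2"
}

# Example calls (not executed here):
#   run_ctest_single Kokkos_CoreUnitTest_Serial1
#   run_gtest_filtered Kokkos_CoreUnitTest_Serial1 'serial.atomic_operations_*'
```

Running the executable directly avoids CTest's timeout and prints the GoogleTest output unfiltered, which may be easier to paste into an issue.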
@pvelesko - what is the full output when you run a single, failing test (e.g., Kokkos_CoreUnitTest_Serial1) ?
I provided the full log output in the original post.
What do the results look like if you explicitly provide the target architecture?
With -DKokkos_ARCH_INTEL_GEN=ON: this flag seems to help a lot, but I would have assumed that JIT is the default. What does Kokkos try to do when this flag is not explicitly specified?
83% tests passed, 9 tests failed out of 52
Total Test time (real) = 628.25 sec
The following tests FAILED:
1 - Kokkos_CoreUnitTest_Serial1 (Failed)
4 - Kokkos_CoreUnitTest_SYCL1A (Subprocess aborted)
5 - Kokkos_CoreUnitTest_SYCL1B (Failed)
6 - Kokkos_CoreUnitTest_SYCL2A (Failed)
10 - Kokkos_CoreUnitTest_SYCL3 (Failed)
14 - Kokkos_CoreUnitTest_Default (Timeout)
29 - Kokkos_CoreUnitTest_DeviceAndThreads (Failed)
31 - Kokkos_ContainersUnitTest_SYCL (Subprocess killed)
47 - Kokkos_AlgorithmsUnitTest_StdSet_Team_I (Failed)
LastTest-DDKokkos_ARCH_INTEL_GEN.txt
Tried Kokkos_ARCH_INTEL_XEHP, but it failed to compile:
Could not determine device target: 12.50.4.
Error: Cannot get HW Info for device 12.50.4.
A770 is Xe HPG - not sure if that was correct. https://www.intel.com/content/www/us/en/products/sku/229151/intel-arc-a770-graphics-16gb/specifications.html
Using Kokkos_ARCH_INTEL_DG1 was compiling, but after multiple hours it seems to have stopped making progress, or it is taking forever.
Build succeeded.
[778/779] Linking CXX executable core/unit_test/Kokkos_CoreUnitTest_SYCL1A
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
With -DKokkos_ARCH_INTEL_GEN=ON - This flag seems to help a lot but I would have assumed that JIT is default. Not sure what Kokkos tries to do when this flag is not explicitly specified?
Without it, Kokkos doesn't pass any architecture flag, which should be equivalent to requesting JIT compilation (the latter is just explicit about compiling to SPIR-V).
Tried Kokkos_ARCH_INTEL_XEHP but it failed to compile
We might need to update the flags then. The ones we are using were recommended on the testbeds some time ago but I guess the AOT compiler can now handle named options. For reference, the A750 seems to be a dg2(-g12?) in the xe-hpg family in ocloc speak, see https://github.com/intel/compute-runtime/blob/014720fc29c59432188a49bebe1aec5aecb5d4f0/shared/source/dll/devices/devices_base.inl#L56.
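To make the JIT versus AOT distinction concrete, here is a rough sketch of the two kinds of `icpx` invocations outside of Kokkos. The helper names and the source file argument are hypothetical, and the thread does not show which flags Kokkos actually emits for each `Kokkos_ARCH_*` option, so treat this as an illustration of the compiler options rather than a description of the Kokkos build system. The ocloc device name is left as a parameter because the set of accepted names depends on the installed ocloc version.

```shell
# Hypothetical helpers; $1 is any SYCL source file you supply (e.g. hello.cpp).

# JIT: embed SPIR-V and let the runtime compile it for whatever GPU is present.
compile_jit() {
  icpx -fsycl -fsycl-targets=spir64 "$1" -o "${1%.cpp}_jit"
}

# AOT: pass a device name to the offline compiler (ocloc) at build time.
# $2 is the ocloc device name; which names are accepted is an assumption
# you need to check against your ocloc installation.
compile_aot() {
  icpx -fsycl -fsycl-targets=spir64_gen \
       -Xsycl-target-backend "-device $2" \
       "$1" -o "${1%.cpp}_aot"
}

# Example calls (not executed here):
#   compile_jit hello.cpp
#   compile_aot hello.cpp <ocloc-device-name>
```

If the AOT variant fails with an error like "Could not determine device target", the device name passed to ocloc is the first thing to check.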
The list of failing tests is:
[ FAILED ] serial.atomic_operations_double (0 ms)
[ FAILED ] serial.atomic_operations_float (0 ms)
[ FAILED ] 2 tests, listed below:
[ FAILED ] serial.atomic_operations_double
[ FAILED ] serial.atomic_operations_float
2 FAILED TESTS
Error regular expression found in output. Regex=[ FAILED ]

[ FAILED ] sycl.atomic_operations_complexdouble (2481 ms)
[ FAILED ] sycl.atomic_operations_double (2344 ms)
[ FAILED ] sycl.atomic_operations_float (3 ms)
Error regular expression found in output. Regex=[ FAILED ]

[ FAILED ] sycl_host_usm.view_allocation_large_rank (0 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] sycl_host_usm.view_allocation_large_rank
1 FAILED TEST
Error regular expression found in output. Regex=[ FAILED ]

FAILED (errors=1, skipped=1)
[ FAILED ] sycl.reducers_int8_t (9 ms)
[ FAILED ] sycl.reducers_point_t (6 ms)
[ FAILED ] sycl.reducers_bool (1052 ms)
[ FAILED ] sycl.int_combined_reduce (1924 ms)
[ FAILED ] sycl.mdrange_combined_reduce (0 ms)
[ FAILED ] sycl.int_combined_reduce_mixed (0 ms)
[ FAILED ] sycl.reduction_deduction (0 ms)
[ FAILED ] sycl.reduce_device_view_range_policy (7283 ms)
[ FAILED ] sycl.reduce_device_view_mdrange_policy (2361 ms)
[ FAILED ] sycl.reduce_device_view_team_policy (2268 ms)
[ FAILED ] 10 tests, listed below:
[ FAILED ] sycl.reducers_int8_t
[ FAILED ] sycl.reducers_point_t
[ FAILED ] sycl.reducers_bool
[ FAILED ] sycl.int_combined_reduce
[ FAILED ] sycl.mdrange_combined_reduce
[ FAILED ] sycl.int_combined_reduce_mixed
[ FAILED ] sycl.reduction_deduction
[ FAILED ] sycl.reduce_device_view_range_policy
[ FAILED ] sycl.reduce_device_view_mdrange_policy
[ FAILED ] sycl.reduce_device_view_team_policy
10 FAILED TESTS
Error regular expression found in output. Regex=[ FAILED ]

[ FAILED ] sycl.TeamThreadMDRangeParallelReduce (21 ms)
[ FAILED ] sycl.ThreadVectorMDRangeParallelReduce (13 ms)
[ FAILED ] sycl.TeamVectorMDRangeParallelReduce (13 ms)
[ FAILED ] sycl.multi_level_scratch (3504 ms)
FAILED teamvector_parallel_reduce 0 0 54103.000000 0.000000 24
FAILED teamvector_parallel_reduce with shared result 0 0 54103.000000 0.000000 24
[ FAILED ] sycl.team_teamvector_range (3175 ms)
[ FAILED ] sycl.view_allocation_large_rank (0 ms)
[ FAILED ] 6 tests, listed below:
[ FAILED ] sycl.TeamThreadMDRangeParallelReduce
[ FAILED ] sycl.ThreadVectorMDRangeParallelReduce
[ FAILED ] sycl.TeamVectorMDRangeParallelReduce
[ FAILED ] sycl.multi_level_scratch
[ FAILED ] sycl.team_teamvector_range
[ FAILED ] sycl.view_allocation_large_rank
6 FAILED TESTS
Error regular expression found in output. Regex=[ FAILED ]

[ FAILED ] std_algorithms_reduce_team_test.test (5732 ms)
[ FAILED ] std_algorithms_transform_reduce_team_test.test (5333 ms)
[ FAILED ] 2 tests, listed below:
[ FAILED ] std_algorithms_reduce_team_test.test
[ FAILED ] std_algorithms_transform_reduce_team_test.test
2 FAILED TESTS
Error regular expression found in output. Regex=[ FAILED ]
So there seem to be some problems with shuffles and device_global variables. It's hard to look into them without access to that architecture, though.
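A list like this can be pulled out of a CTest log mechanically. The following is a minimal sketch, assuming the log contains GoogleTest `[ FAILED ] suite.test` markers; the function name `failed_gtests` is hypothetical. It reads the log on stdin and prints each failing test name once, skipping summary lines such as "2 tests, listed below:".

```shell
# Hypothetical helper: print the unique failing GoogleTest names found on stdin.
failed_gtests() {
  # Match "[ FAILED ] suite.test"; summary lines have no "suite.test" token
  # right after the marker, so they are not matched.
  grep -oE '\[ +FAILED +\] [A-Za-z0-9_]+\.[A-Za-z0-9_/]+' \
    | sed -E 's/^\[ +FAILED +\] //' \
    | LC_ALL=C sort -u
}

# Example call (the log file name is taken from the attachment mentioned earlier):
#   failed_gtests < LastTest-DDKokkos_ARCH_INTEL_GEN.txt
```

Because `grep -o` prints every match on a line, this also works on logs where several entries ended up on one line.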
So there seem to be some problems with shuffles and device_global variables. It's hard to look into them without access to that architecture, though.
It's my personal server, I can set you up with ssh if you'd like.
Also, I ran these tests on an iGPU which is available on the same system:
85% tests passed, 8 tests failed out of 52
Total Test time (real) = 731.19 sec
The following tests FAILED:
1 - Kokkos_CoreUnitTest_Serial1 (Failed)
4 - Kokkos_CoreUnitTest_SYCL1A (Subprocess aborted)
6 - Kokkos_CoreUnitTest_SYCL2A (Failed)
10 - Kokkos_CoreUnitTest_SYCL3 (Failed)
14 - Kokkos_CoreUnitTest_Default (Timeout)
29 - Kokkos_CoreUnitTest_DeviceAndThreads (Failed)
38 - Kokkos_AlgorithmsUnitTest_StdSet_E (Subprocess aborted)
47 - Kokkos_AlgorithmsUnitTest_StdSet_Team_I (Subprocess aborted)
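To see which failures the discrete GPU and the iGPU have in common, the two CTest summaries can be compared directly. This is a small sketch under the assumption that each summary has been saved to a text file in the `N - TestName (Reason)` form shown above; the function names and file names are hypothetical.

```shell
# Hypothetical helper: print failed CTest entry names from a summary on stdin.
ctest_failed_names() {
  sed -nE 's/^[[:space:]]*[0-9]+ - ([^ ]+) \(.*\)[[:space:]]*$/\1/p' \
    | LC_ALL=C sort -u
}

# Hypothetical helper: print the test names that failed in both summaries.
common_failures() {
  a="$(mktemp)"; b="$(mktemp)"
  ctest_failed_names < "$1" > "$a"
  ctest_failed_names < "$2" > "$b"
  LC_ALL=C comm -12 "$a" "$b"   # keep only lines present in both sorted lists
  rm -f "$a" "$b"
}

# Example call (file names are placeholders):
#   common_failures dgpu_summary.txt igpu_summary.txt
```

Tests that fail on both devices are more likely to point at the toolchain or the Kokkos backend than at one specific GPU.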
@ajpowelsnl
Kokkos 4.1.00 + oneapi/2023.2.4 on Intel(R) UHD Graphics 770
75% tests passed, 10 tests failed out of 40
Total Test time (real) = 619.60 sec
The following tests FAILED:
4 - Kokkos_CoreUnitTest_SYCL1A (Failed)
6 - Kokkos_CoreUnitTest_SYCL2A (Timeout)
10 - Kokkos_CoreUnitTest_SYCL3 (Failed)
11 - Kokkos_CoreUnitTest_SYCLInterOpInit (Failed)
12 - Kokkos_CoreUnitTest_SYCLInterOpInit_Context (Failed)
13 - Kokkos_CoreUnitTest_SYCLInterOpStreams (Failed)
14 - Kokkos_CoreUnitTest_Default (Timeout)
32 - Kokkos_ContainersUnitTest_SYCL (Timeout)
33 - Kokkos_UnitTest_Sort (Timeout)
39 - Kokkos_AlgorithmsUnitTest_StdSet_E (Subprocess aborted)
Does anyone need access to the machine for debugging?
@pvelesko -- were you able to sign up for the Kokkos Slack Channel? We have attempted to contact you there to address your HIP and SYCL issues.
@ajpowelsnl yes, I'm on a thread
I see lots of timeouts and failures to find the kernel.
Please include the following for a minimal reproducer:
Compilers (with versions): OneAPI 2024.1 icpx
Kokkos release or commit used (i.e., the sha1 number): tag 4.3.0
Platform, architecture and backend: Intel A770 Discrete GPU
CMake configure command:
rm -rf ${KOKKOS_DIR}/build && mkdir -p ${KOKKOS_DIR}/build && cd ${KOKKOS_DIR}/build && rm -f CMakeCache.txt
git checkout HEAD -f && git checkout ${KOKKOS_VER}
cmake -DKokkos_ENABLE_SYCL=ON \
  -DCMAKE_CXX_COMPILER=icpx \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DKokkos_ENABLE_TESTS=ON \
  -DCMAKE_INSTALL_PREFIX=${PREFIX} ..
ninja install
─pvelesko@cupcake ~/kokkos-build/kokkos/build ‹4.3.00●›
╰─$ export KOKKOS_DIR=~/kokkos-build/kokkos
export KOKKOS_KERNELS_DIR=~/kokkos-build/kokkos-kernels
export KOKKOS_VER=4.3.00
export ONEAPI_VER=2024.1.0
export PREFIX=/space/pvelesko/install/kokkos/${KOKKOS_VER}/oneapi/$ONEAPI_VER
module purge
module load oneapi/$ONEAPI_VER
rm -rf ${KOKKOS_DIR}/build && mkdir -p ${KOKKOS_DIR}/build && cd ${KOKKOS_DIR}/build && rm -f CMakeCache.txt
git checkout HEAD -f && git checkout ${KOKKOS_VER}
cmake -DKokkos_ENABLE_SYCL=ON \
  -DCMAKE_CXX_COMPILER=icpx \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DKokkos_ENABLE_TESTS=ON \
  -DCMAKE_INSTALL_PREFIX=${PREFIX} ..
Loading oneapi/2024.1.0
Loading requirement: opencl/ocl-icd-loader
HEAD is now at 486cc745c Merge pull request #6908 from ndellingwood/master-release-4.3.00
-- Setting default Kokkos CXX standard to 17
-- The CXX compiler identification is IntelLLVM 2024.1.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/pvelesko/miniconda3/envs/oneapi-2024.1.0/bin/icpx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Kokkos version: 4.3.0
-- The project name is: Kokkos
-- Using gtest found in /usr/lib/x86_64-linux-gnu/cmake/GTest
-- Configured git information in /home/pvelesko/kokkos-build/kokkos/build/generated/Kokkos_Version_Info.cpp
-- SERIAL backend is being turned on to ensure there is at least one Host space. To change this, you must enable another host execution space and configure with -DKokkos_ENABLE_SERIAL=OFF or change CMakeCache.txt
-- Using -std=gnu++17 for C++17 extensions as feature
-- Looking for SYCL_EXT_ONEAPI_DEVICE_GLOBAL
-- Looking for SYCL_EXT_ONEAPI_DEVICE_GLOBAL - found
-- Built-in Execution Spaces:
-- Device Parallel: Kokkos::Experimental::SYCL
-- Host Parallel: NoTypeDefined
-- Host Serial: SERIAL
-- Architectures:
-- Found TPLLIBDL: /usr/include
-- Looking for C++ include oneapi/dpl/execution
-- Looking for C++ include oneapi/dpl/execution - found
-- Looking for C++ include oneapi/dpl/algorithm
-- Looking for C++ include oneapi/dpl/algorithm - found
-- Performing Test KOKKOS_NO_TBB_CONFLICT
-- Performing Test KOKKOS_NO_TBB_CONFLICT - Success
-- Using internal desul_atomics copy
-- Found Python3: /usr/bin/python3.10 (found version "3.10.12") found components: Interpreter
-- Kokkos Backends: SERIAL;SYCL
-- Configuring done
-- Generating done
-- Build files have been written to: /home/pvelesko/kokkos-build/kokkos/build