riscv64 fails to build Python/perf_jit_trampoline.c: Unsupported target architecture #121201

Open vstinner opened 2 weeks ago

vstinner commented 2 weeks ago

build: https://buildbot.python.org/all/#/builders/1379/builds/625

gcc -c -fno-strict-overflow -fstack-protector-strong -Wtrampolines -Wsign-compare -DNDEBUG -g -O3 -Wall    -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc  -I. -I./Include    -DPy_BUILD_CORE -o Python/perf_trampoline.o Python/perf_trampoline.c
gcc -c -fno-strict-overflow -fstack-protector-strong -Wtrampolines -Wsign-compare -DNDEBUG -g -O3 -Wall    -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal -I./Include/internal/mimalloc  -I. -I./Include    -DPy_BUILD_CORE -o Python/perf_jit_trampoline.o Python/perf_jit_trampoline.c
Python/perf_jit_trampoline.c:375:6: error: #error "Unsupported target architecture"
  375 | #    error "Unsupported target architecture"
      |      ^~~~~

cc @pablogsal

vstinner commented 2 weeks ago

cc @furkanonder

vstinner commented 2 weeks ago

configure says:

checking for the platform triplet based on compiler characteristics... riscv64-linux-gnu

checking perf trampoline... yes

pablogsal commented 2 weeks ago

Hummm, looks like we are missing the definitions for the registers for riscv64 here:

https://github.com/python/cpython/blob/af8c3d7a26d605099f5b3406a8d33ecddb77e8fb/Python/perf_jit_trampoline.c#L352-L376

@furkanonder can you take a look? Otherwise we may need to deactivate riscv64 support while we figure out the DWARF definitions.

pablogsal commented 2 weeks ago

I think it may be enough to add riscv here: https://github.com/python/cpython/blob/af8c3d7a26d605099f5b3406a8d33ecddb77e8fb/Python/perf_jit_trampoline.c#L371

but I am not sure about the numbers. They seem to match aarch64, but I would need a RISC-V machine to try them out.

pablogsal commented 2 weeks ago

A way to check the numbers is to generate DWARF on RISC-V for the same function and read off the numeric values that correspond to DWRF_REG_SP and DWRF_REG_RA.

pablogsal commented 2 weeks ago

Deactivated for now in https://github.com/python/cpython/pull/121328

terryjreedy commented 2 weeks ago

The backport automerge failed and is still open.

furkanonder commented 2 weeks ago

> I think it may be enough to add riscv here:
>
> https://github.com/python/cpython/blob/af8c3d7a26d605099f5b3406a8d33ecddb77e8fb/Python/perf_jit_trampoline.c#L371
>
> but I am not sure about the numbers. They seem to match aarch64, but I would need a RISC-V machine to try them out.

According to Table 18.2 (RISC-V calling convention register usage), the relevant registers are:

| Register | ABI Name | Description    | Saver  |
|----------|----------|----------------|--------|
| x1       | ra       | Return address | Caller |
| x2       | sp       | Stack pointer  | Callee |

Therefore, I set DWRF_REG_RA to 1 and DWRF_REG_SP to 2.

$ git diff
diff --git a/Python/perf_jit_trampoline.c b/Python/perf_jit_trampoline.c
index 0a8945958b..6e30ed2865 100644
--- a/Python/perf_jit_trampoline.c
+++ b/Python/perf_jit_trampoline.c
@@ -371,6 +371,9 @@ enum {
 #elif defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
     DWRF_REG_SP = 31,
     DWRF_REG_RA = 30,
+#elif defined(__riscv)
+    DWRF_REG_RA = 1,
+    DWRF_REG_SP = 2,
 #else
 #    error "Unsupported target architecture"
 #endif

I then got another error here. I am not sure what the extra registers should be, so I reused the aarch64 branch:

@@ -477,7 +480,7 @@ elf_init_ehframe(ELFObjectContext* ctx)
                  DWRF_U8(DWRF_CFA_advance_loc | 6);
                  DWRF_U8(DWRF_CFA_def_cfa_offset); DWRF_UV(8);
     /* Extra registers saved for JIT-compiled code. */
-#elif defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
+#elif (defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)) || defined(__riscv)
                  DWRF_U8(DWRF_CFA_advance_loc | 1);
                  DWRF_U8(DWRF_CFA_def_cfa_offset); DWRF_UV(16);
                  DWRF_U8(DWRF_CFA_offset | 29); DWRF_UV(2);

Following these changes, the build was completed successfully. I didn't encounter any failed test cases.

 ./python -X perf_jit  -Wdefault -bb -E -m test -rwW -uall -j2 --timeout=2400 -j4
== Tests result: SUCCESS ==

14 tests skipped:
    test.test_asyncio.test_windows_events
    test.test_asyncio.test_windows_utils test_android test_devpoll
    test_free_threading test_kqueue test_launcher test_msvcrt
    test_startfile test_winapi test_winconsoleio test_winreg
    test_winsound test_wmi

3 tests skipped (resource denied):
    test_tkinter test_ttk test_zipfile64

461 tests OK.

Total duration: 49 min 31 sec
Total tests: run=44,299 skipped=1,730
Total test files: run=475/478 skipped=14 resource_denied=3
Result: SUCCESS

pablogsal commented 2 weeks ago

@furkanonder Can you create another PR with the previous changes and the changes you just made? We can then test that against the buildbot.

Also, can you please show me the output of python -m test test_perf_profiler -v?

furkanonder commented 2 weeks ago

> @furkanonder Can you create another PR with the previous changes and the changes you just made? We can then test that against the buildbot.
>
> Also, can you please show me the output of python -m test test_perf_profiler -v?

PR for the changes.

$ ./python -m test test_perf_profiler -v

Output:

== CPython 3.14.0a0 (heads/main-dirty:1dc9a4f6b2, Jul 2 2024, 23:29:42) [GCC 13.2.0]
== Linux-6.1.81-riscv64-with-glibc2.38 little-endian
== Python build: release
== cwd: /home/dietpi/desktop/cpython/build/test_python_worker_53029æ
== CPU count: 4
== encodings: locale=UTF-8 FS=utf-8
== resources: all test resources are disabled, use -u option to unskip tests

Using random seed: 3537377140
0:00:00 load avg: 4.39 Run 1 test sequentially in a single process
0:00:00 load avg: 4.39 [1/1] test_perf_profiler
test_pre_fork_compile (test.test_perf_profiler.TestPerfProfiler.test_pre_fork_compile) ... skipped "perf command doesn't work"
test_python_calls_appear_in_the_stack_if_perf_activated (test.test_perf_profiler.TestPerfProfiler.test_python_calls_appear_in_the_stack_if_perf_activated) ... skipped "perf command doesn't work"
test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated (test.test_perf_profiler.TestPerfProfiler.test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated) ... skipped "perf command doesn't work"
test_python_calls_appear_in_the_stack_if_perf_activated (test.test_perf_profiler.TestPerfProfilerWithDwarf.test_python_calls_appear_in_the_stack_if_perf_activated) ... skipped "perf command doesn't work"
test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated (test.test_perf_profiler.TestPerfProfilerWithDwarf.test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated) ... skipped "perf command doesn't work"
test_sys_api (test.test_perf_profiler.TestPerfTrampoline.test_sys_api) ... ok
test_sys_api_get_status (test.test_perf_profiler.TestPerfTrampoline.test_sys_api_get_status) ... ok
test_sys_api_with_existing_trampoline (test.test_perf_profiler.TestPerfTrampoline.test_sys_api_with_existing_trampoline) ... ok
test_sys_api_with_invalid_trampoline (test.test_perf_profiler.TestPerfTrampoline.test_sys_api_with_invalid_trampoline) ... ok
test_trampoline_works (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works) ... ok
test_trampoline_works_with_forks (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works_with_forks) ... ok

----------------------------------------------------------------------
Ran 11 tests in 1.052s

OK (skipped=5)

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2.4 sec
Total tests: run=11 skipped=5
Total test files: run=1/1
Result: SUCCESS

I followed the "Python support for the Linux perf profiler" guide to test perf profiling. I think the perf tool is not well supported on my buildbot.

dietpi@DietPi:~/desktop/cpython$ cat my_script.py
def foo(n):
    result = 0
    for _ in range(n):
        result += 1
    return result

def bar(n):
    foo(n)

def baz(n):
    bar(n)

if __name__ == "__main__":
    baz(1000000)
dietpi@DietPi:~/desktop/cpython$ perf record -F 9999 -g -o perf.data -a ./python my_script.py
Error:
cycles:P: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
dietpi@DietPi:~/desktop/cpython$
dietpi@DietPi:~/desktop/cpython$ perf stat -e cycles -o perf.data ./python my_script.py
dietpi@DietPi:~/desktop/cpython$ cat perf.data
# started on Fri Jul  5 00:40:49 2024

 Performance counter stats for './python my_script.py':

         742098135      cycles

       0.547787000 seconds time elapsed

       0.485775000 seconds user
       0.010120000 seconds sys

dietpi@DietPi:~/desktop/cpython$
$ perf list
List of pre-defined events (to be used in -e or -M):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  cgroup-switches                                    [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]
  duration_time                                      [Tool event]
  user_time                                          [Tool event]
  system_time                                        [Tool event]

cpu:
  L1-dcache-loads OR cpu/L1-dcache-loads/
  L1-dcache-load-misses OR cpu/L1-dcache-load-misses/
  L1-dcache-stores OR cpu/L1-dcache-stores/
  L1-dcache-store-misses OR cpu/L1-dcache-store-misses/
  L1-dcache-prefetches OR cpu/L1-dcache-prefetches/
  L1-dcache-prefetch-misses OR cpu/L1-dcache-prefetch-misses/
  L1-icache-loads OR cpu/L1-icache-loads/
  L1-icache-load-misses OR cpu/L1-icache-load-misses/
  L1-icache-prefetches OR cpu/L1-icache-prefetches/
  L1-icache-prefetch-misses OR cpu/L1-icache-prefetch-misses/
  LLC-loads OR cpu/LLC-loads/
  LLC-load-misses OR cpu/LLC-load-misses/
  LLC-stores OR cpu/LLC-stores/
  LLC-store-misses OR cpu/LLC-store-misses/
  LLC-prefetches OR cpu/LLC-prefetches/
  LLC-prefetch-misses OR cpu/LLC-prefetch-misses/
  dTLB-loads OR cpu/dTLB-loads/
  dTLB-load-misses OR cpu/dTLB-load-misses/
  dTLB-stores OR cpu/dTLB-stores/
  dTLB-store-misses OR cpu/dTLB-store-misses/
  dTLB-prefetches OR cpu/dTLB-prefetches/
  dTLB-prefetch-misses OR cpu/dTLB-prefetch-misses/
  iTLB-loads OR cpu/iTLB-loads/
  iTLB-load-misses OR cpu/iTLB-load-misses/
  branch-loads OR cpu/branch-loads/
  branch-load-misses OR cpu/branch-load-misses/
  node-loads OR cpu/node-loads/
  node-load-misses OR cpu/node-load-misses/
  node-stores OR cpu/node-stores/
  node-store-misses OR cpu/node-store-misses/
  node-prefetches OR cpu/node-prefetches/
  node-prefetch-misses OR cpu/node-prefetch-misses/
  (null)                                             [Kernel PMU event]

firmware:
  fw_access_load
       [Load access trap event]
  fw_access_store
       [Store access trap event]
  fw_fence_i_received
       [Received FENCE.I request from other HART event]
  fw_fence_i_sent
       [Sent FENCE.I request to other HART event]
  fw_hfence_gvma_received
       [Received HFENCE.GVMA request from other HART event]
  fw_hfence_gvma_sent
       [Sent HFENCE.GVMA request to other HART event]
  fw_hfence_gvma_vmid_received
       [Received HFENCE.GVMA with VMID request from other HART event]
  fw_hfence_gvma_vmid_sent
       [Sent HFENCE.GVMA with VMID request to other HART event]
  fw_hfence_vvma_asid_received
       [Received HFENCE.VVMA with ASID request from other HART event]
  fw_hfence_vvma_asid_sent
       [Sent HFENCE.VVMA with ASID request to other HART event]
Error: failed to open tracing events directory
  fw_hfence_vvma_received
       [Received HFENCE.VVMA request from other HART event]
  fw_hfence_vvma_sent
       [Sent HFENCE.VVMA request to other HART event]
  fw_illegal_insn
       [Illegal instruction trap event]
  fw_ipi_received
       [Received IPI from other HART event]
  fw_ipi_sent
       [Sent IPI to other HART event]
  fw_misaligned_load
       [Misaligned load trap event]
  fw_misaligned_store
       [Misaligned store trap event]
  fw_set_timer
       [Set timer event]
  fw_sfence_vma_asid_received
       [Received SFENCE.VMA with ASID request from other HART event]
  fw_sfence_vma_received
       [Sent SFENCE.VMA with ASID request to other HART event]
  fw_sfence_vma_sent
       [Sent SFENCE.VMA request to other HART event]

instructions:
  atomic_memory_retired
       [Atomic memory operation retired]
  conditional_branch_retired
       [Conditional branch retired]
  exception_taken
       [Exception taken]
  fp_addition_retired
       [Floating-point addition retired]
  fp_div_sqrt_retired
       [Floating-point division or square-root retired]
  fp_fusedmadd_retired
       [Floating-point fused multiply-add retired]
  fp_load_retired
       [Floating-point load instruction retired]
  fp_multiplication_retired
       [Floating-point multiplication retired]
  fp_store_retired
       [Floating-point store instruction retired]
  integer_arithmetic_retired
       [Integer arithmetic instruction retired]
  integer_division_retired
       [Integer division instruction retired]
  integer_load_retired
       [Integer load instruction retired]
  integer_multiplication_retired
       [Integer multiplication instruction retired]
  integer_store_retired
       [Integer store instruction retired]
  jal_instruction_retired
       [JAL instruction retired]
  jalr_instruction_retired
       [JALR instruction retired]
  other_fp_retired
       [Other floating-point instruction retired]
  system_instruction_retired
       [System instruction retired]

memory:
  data_tlb_miss
       [Data TLB miss]
  dcache_miss_mmio_accesses
       [Data cache miss or memory-mapped I/O access]
  dcache_writeback
       [Data cache write-back]
  icache_retired
       [Instruction cache miss]
  inst_tlb_miss
       [Instruction TLB miss]
  utlb_miss
       [UTLB miss]

microarch:
  addressgen_interlock
       [Address-generation interlock]
  branch_direction_misprediction
       [Branch direction misprediction]
  branch_target_misprediction
       [Branch/jump target misprediction]
  csr_read_interlock
       [CSR read interlock]
  dcache_dtim_busy
       [Data cache/DTIM busy]
  fp_interlock
       [Floating-point interlock]
  icache_itim_busy
       [Instruction cache/ITIM busy]
  integer_multiplication_interlock
  longlat_interlock
       [Long-latency interlock]
  pipe_flush_csr_write
       [Pipeline flush from CSR write]
  pipe_flush_other_event
       [Pipeline flush from other event]
  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
       [(see 'man perf-list' on how to encode it)]
  mem:<addr>[/len][:access]                          [Hardware breakpoint]

pablogsal commented 6 days ago

> I followed the "Python support for the Linux perf profiler" guide to test perf profiling. I think the perf tool is not well supported on my buildbot.

Unfortunately, without testing on a system that has a working perf, we won't be able to merge the PR because we cannot validate that it works. I am not comfortable merging these changes without being able to corroborate the functionality :(