gezalore closed this issue 4 years ago.
Maybe try an experiment where the trace write() is commented out. If that gives you maybe 30% versus your 35%, as I suspect (it seems similar to what you tried, since I think the formatting is a small overhead), then just using asynchronous file IO (aio) with write() and double (or more) buffering the output would get us the 30% even without threads. Note not all OSes have aio, so this would need appropriate ifdefs.
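For illustration, here is a minimal sketch of what write()-based double buffering with POSIX aio could look like. Everything here is hypothetical (the class, buffer size, and demo payload are made up), and real code would need the OS ifdefs mentioned above:

```cpp
#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Double-buffered output: while the kernel writes one buffer via
// aio_write(), the simulation keeps filling the other one.
class DoubleBufferedWriter {
    static constexpr size_t BUFSZ = 4096;
    int m_fd;
    char m_buf[2][BUFSZ];
    size_t m_fill = 0;       // bytes in the buffer currently being filled
    int m_active = 0;        // index of the buffer being filled
    off_t m_offset = 0;      // file offset for the next aio_write
    aiocb m_cb{};
    bool m_pending = false;  // is an aio_write outstanding?

public:
    explicit DoubleBufferedWriter(int fd) : m_fd(fd) {}

    void put(const char* data, size_t len) {
        while (len) {
            const size_t n = std::min(len, BUFSZ - m_fill);
            std::memcpy(m_buf[m_active] + m_fill, data, n);
            m_fill += n; data += n; len -= n;
            if (m_fill == BUFSZ) flush();
        }
    }

    void flush() {
        if (!m_fill) return;
        waitPending();  // the single outstanding write must finish first
        std::memset(&m_cb, 0, sizeof(m_cb));
        m_cb.aio_fildes = m_fd;
        m_cb.aio_buf = m_buf[m_active];
        m_cb.aio_nbytes = m_fill;
        m_cb.aio_offset = m_offset;
        const int rc = aio_write(&m_cb);
        assert(rc == 0);
        (void)rc;
        m_pending = true;
        m_offset += static_cast<off_t>(m_fill);
        m_active ^= 1;  // keep filling the other buffer while the IO runs
        m_fill = 0;
    }

    void waitPending() {
        if (!m_pending) return;
        const aiocb* list[1] = {&m_cb};
        while (aio_error(&m_cb) == EINPROGRESS) aio_suspend(list, 1, nullptr);
        const ssize_t rc = aio_return(&m_cb);
        assert(rc == static_cast<ssize_t>(m_cb.aio_nbytes));
        (void)rc;
        m_pending = false;
    }
};

// Demo: write fake trace lines, read them back, return true if intact.
bool aioDemo(const char* path) {
    const int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0) return false;
    std::string expected;
    {
        DoubleBufferedWriter w(fd);
        const std::string line = "#100 1! b101 @\n";  // fake VCD-ish payload
        for (int i = 0; i < 1000; ++i) {
            w.put(line.data(), line.size());
            expected += line;
        }
        w.flush();
        w.waitPending();
    }
    std::string got(expected.size(), '\0');
    const ssize_t n = pread(fd, &got[0], got.size(), 0);
    close(fd);
    return n == static_cast<ssize_t>(got.size()) && got == expected;
}
```

The key property is that only one aio_write is ever outstanding, so the simulation thread blocks only if it fills a buffer before the previous one finished writing.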
Alternatively/in addition also try an "strace" with timing flags.
The numbers I quoted were with the file being written to a ramdisk. Here is some hard data based on what you suggested.
Measured with:
```
sudo perf stat -r 3 -dd taskset 1 chrt -f 80 ./obj_dir/Vtb_top +dumpon
```
Baseline (master as of 847c8d5d):
```
      24,869.06 msec task-clock              #    0.999 CPUs utilized           ( +- 0.62% )
             14      context-switches        #    0.001 K/sec                   ( +- 4.12% )
              1      cpu-migrations          #    0.000 K/sec
          2,670      page-faults             #    0.107 K/sec                   ( +- 0.06% )
 96,306,094,647      cycles                  #    3.873 GHz                     ( +- 0.61% )  (30.76%)
 62,148,608,870      stalled-cycles-frontend #   64.53% frontend cycles idle    ( +- 0.57% )  (30.77%)
 85,662,975,314      instructions            #    0.89  insn per cycle
                                             #    0.73  stalled cycles per insn ( +- 0.78% )  (38.47%)
  7,918,922,528      branches                #  318.425 M/sec                   ( +- 1.56% )  (38.48%)
    612,832,818      branch-misses           #    7.74% of all branches         ( +- 0.08% )  (38.49%)
 20,044,500,112      L1-dcache-loads         #  806.001 M/sec                   ( +- 0.86% )  (30.76%)
    730,401,524      L1-dcache-load-misses   #    3.64% of all L1-dcache hits   ( +- 0.59% )  (15.38%)
    227,196,348      LLC-loads               #    9.136 M/sec                   ( +- 1.33% )  (15.37%)
  5,047,936,497      L1-icache-load-misses                                      ( +- 0.15% )  (23.06%)
 20,061,540,635      dTLB-loads              #  806.687 M/sec                   ( +- 0.94% )  (23.04%)
     25,453,753      dTLB-load-misses        #    0.13% of all dTLB cache hits  ( +- 6.39% )  (15.38%)
    144,126,990      iTLB-loads              #    5.795 M/sec                   ( +- 1.33% )  (15.38%)
     32,734,500      iTLB-load-misses        #   22.71% of all iTLB cache hits  ( +- 1.77% )  (23.07%)

         24.897 +- 0.151 seconds time elapsed  ( +- 0.61% )
```
With the write to the file here https://github.com/verilator/verilator/blob/dc5c259069f63b718ef3b7909724571c97beff53/include/verilated_vcd_c.cpp#L145 replaced with `return len;`:
```
      23,112.02 msec task-clock              #    0.998 CPUs utilized           ( +- 0.43% )
             16      context-switches        #    0.001 K/sec                   ( +- 5.40% )
              1      cpu-migrations          #    0.000 K/sec
          2,670      page-faults             #    0.116 K/sec                   ( +- 0.02% )
 89,469,493,309      cycles                  #    3.871 GHz                     ( +- 0.40% )  (30.75%)
 57,573,902,061      stalled-cycles-frontend #   64.35% frontend cycles idle    ( +- 0.31% )  (30.76%)
 80,988,426,756      instructions            #    0.91  insn per cycle
                                             #    0.71  stalled cycles per insn ( +- 0.85% )  (38.46%)
  7,107,220,713      branches                #  307.512 M/sec                   ( +- 1.79% )  (38.47%)
    580,847,111      branch-misses           #    8.17% of all branches         ( +- 0.23% )  (38.48%)
 18,529,190,257      L1-dcache-loads         #  801.712 M/sec                   ( +- 0.98% )  (30.75%)
    529,066,400      L1-dcache-load-misses   #    2.86% of all L1-dcache hits   ( +- 0.54% )  (15.39%)
    143,452,344      LLC-loads               #    6.207 M/sec                   ( +- 1.41% )  (15.39%)
  4,916,078,529      L1-icache-load-misses                                      ( +- 0.18% )  (23.08%)
 18,530,147,654      dTLB-loads              #  801.754 M/sec                   ( +- 0.98% )  (23.06%)
     18,727,452      dTLB-load-misses        #    0.10% of all dTLB cache hits  ( +- 6.68% )  (15.38%)
    144,610,677      iTLB-loads              #    6.257 M/sec                   ( +- 2.34% )  (15.37%)
     27,244,562      iTLB-load-misses        #   18.84% of all iTLB cache hits  ( +- 15.59% ) (23.06%)

         23.152 +- 0.107 seconds time elapsed  ( +- 0.46% )
```
That is a bigger saving than the syscall time strace reports on the baseline would suggest:
```
strace -c ./obj_dir/Vtb_top +dumpon
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.52    0.350573       29214        12         3 openat
 29.46    0.146466           1     99300           write
  0.01    0.000037           1        60           brk
  0.00    0.000012           1        18           mmap
  0.00    0.000010           0        41           read
  0.00    0.000008           1        10           fstat
  0.00    0.000005           1         9           close
  0.00    0.000003           3         1           lseek
  0.00    0.000000           0        10           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         6         6 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.497114                 99471         9 total
```
With the hacked up write to buffer version:
```
      17,603.76 msec task-clock              #    0.997 CPUs utilized           ( +- 0.26% )
             15      context-switches        #    0.001 K/sec                   ( +- 5.75% )
              1      cpu-migrations          #    0.000 K/sec
          2,710      page-faults             #    0.154 K/sec                   ( +- 0.01% )
 68,155,221,067      cycles                  #    3.872 GHz                     ( +- 0.25% )  (30.75%)
 46,663,153,952      stalled-cycles-frontend #   68.47% frontend cycles idle    ( +- 0.38% )  (30.76%)
 55,180,537,125      instructions            #    0.81  insn per cycle
                                             #    0.85  stalled cycles per insn ( +- 0.02% )  (38.46%)
  3,265,472,754      branches                #  185.499 M/sec                   ( +- 0.15% )  (38.47%)
    103,437,751      branch-misses           #    3.17% of all branches         ( +- 2.17% )  (38.47%)
 12,564,002,142      L1-dcache-loads         #  713.711 M/sec                   ( +- 0.06% )  (30.73%)
    766,630,856      L1-dcache-load-misses   #    6.10% of all L1-dcache hits   ( +- 0.06% )  (15.39%)
    109,530,431      LLC-loads               #    6.222 M/sec                   ( +- 0.97% )  (15.39%)
  4,489,986,619      L1-icache-load-misses                                      ( +- 0.22% )  (23.08%)
 12,562,130,358      dTLB-loads              #  713.605 M/sec                   ( +- 0.04% )  (23.05%)
     13,845,278      dTLB-load-misses        #    0.11% of all dTLB cache hits  ( +- 3.63% )  (15.37%)
    102,533,186      iTLB-loads              #    5.825 M/sec                   ( +- 1.36% )  (15.37%)
     17,419,785      iTLB-load-misses        #   16.99% of all iTLB cache hits  ( +- 8.26% )  (23.06%)

        17.6532 +- 0.0447 seconds time elapsed  ( +- 0.25% )
```
Note the instruction count difference. About 18% of the baseline profile is still inside `VerilatedVcd::*`, which is mostly the formatting; the rest of the difference comes from not checking for value changes in the `chg*` trace functions, which are inlined.
I'm fairly confident my hacked up version is accurate. Here is a function when compiled as in master:
```
000000000002bb00 <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)>:
   2bb00: mov  rax, [rdi+0x10]
   2bb04: add  edx, 0x20
   2bb07: mov  rcx, [rsi+0xc8]
   2bb0e: mov  eax, [rax+0x14bc]
   2bb14: cmp  [rcx+rdx*4], eax
   2bb17: jne  2bb1a <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)+0x1a>
   2bb19: ret
   2bb1a: lea  rcx, [rcx+rdx*4]
   2bb1e: mov  rdi, rsi
   2bb21: mov  rsi, rcx
   2bb24: mov  edx, eax
   2bb26: jmp  2bb2b <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)+0x2b>
   2bb27: R_X86_64_PC32  void VerilatedVcd::fullBus<28>(unsigned int*, unsigned int)-0x4
```
And here is my version. Note the three stores via `rcx` into the hypothetical trace buffer, and the write-back of the incremented buffer pointer via `rdx`. That is all that runs in the main thread: no change tests, no write-back of the new value, no formatting, no branches:
```
000000000003ab60 <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)>:
   3ab60: mov  rax, [rdi+0x10]
   3ab64: add  edx, 0x20
   3ab67: shl  rdx, 0x2
   3ab6b: add  rdx, [rsi+0xc8]
   3ab72: mov  rcx, [rsi+0x118]
   3ab79: mov  eax, [rax+0x14bc]
   3ab7f: mov  [rcx], 0x1
   3ab86: mov  [rcx+0x8], rdx
   3ab8a: lea  rdx, [rcx+0x18]
   3ab8e: mov  [rsi+0x118], rdx
   3ab95: mov  [rcx+0x10], rax
   3ab99: ret
```
I agree AIO would help when writing to regular storage, or, god forbid, NFS, but to me it seems there is scope for much more.
Interesting. AIO is "free" in that it has almost zero overhead, while anything needing a new thread adds overhead on both the current and the other thread, harming total throughput, plus the bandwidth to move the data across CPUs. But if you are interested, it does seem worth exploring, e.g. pushing a code and raw bytes to another thread that formats and calls AIO.
Here are the numbers I think matter; I find them most intuitive measured per thousand instructions (/ki).
```
64.53%   frontend cycles idle
 7.17/ki branch-misses          #  8.17% of all branches
 6.53/ki L1-dcache-load-misses  #  2.86% of all L1-dcache hits
 0.23/ki dTLB-load-misses       #  0.10% of all dTLB cache hits
60.70/ki L1-icache-load-misses
 0.33/ki iTLB-load-misses       # 18.84% of all iTLB cache hits
 1.77/ki LLC-loads
```
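For clarity, the /ki figures are just the raw event counts of the no-write run above divided by its instruction count, times a thousand (plain arithmetic, not Verilator code):

```cpp
#include <cassert>

// Instruction count from the "write replaced with return len" perf run.
constexpr double kInsns = 80988426756.0;

// Events per thousand instructions (/ki).
inline double perKi(double events) { return events / kInsns * 1000.0; }

// perKi(580847111.0)   ->  7.17/ki  branch-misses
// perKi(529066400.0)   ->  6.53/ki  L1-dcache-load-misses
// perKi(4916078529.0)  -> 60.70/ki  L1-icache-load-misses
```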
The frontend stalls are the real problem: 60/ki icache misses is crazy high, and that is why the frontend is stalled so much. As we already knew, the icache footprint is the real problem.
Actually.....
I wonder if a trace implementation that was totally data driven would be better? That is, instead of generated code calling the trace functions, generate a data table with pointers to the data. Then loop through it and call the chg check on each entry. To avoid pointer following taking all of the time, you'll need to VL_PREFETCH ahead many entries, and also prefetch the pointers inside the entries.
I think I agree with pretty much all of the above. Threading does harm throughput in the sense that total work done per unit time at 100% system utilization is lower due to the threading overhead. I am mostly interested in the threading option for the day-to-day use case where one of my tests has failed and I want a trace to debug sooner rather than later. In that case throughput is secondary to latency, from a user's perspective (at least around my end). For nightly batch jobs, indeed just run single threaded for maximum throughput, though I rarely run those with tracing.
The data driven trace idea is interesting. One complication is that not all traced expressions are named and stored directly in the model, e.g.:
```cpp
vcdp->chgBus(c+257, ((7U & ((IData)(vlTOPp->tb_top__DOT__rvtop__DOT__lsu_freeze_dc3)
                            ? (IData)(vlTOPp->tb_top__DOT__rvtop__DOT__mem__DOT__Gen_dccm_enable__DOT__dccm__DOT____Vcellout__rd_addr_lo_ff__dout)
                            : (vlTOPp->tb_top__DOT__rvtop__DOT__swerv__DOT__lsu__DOT__lsu_lsc_ctl__DOT__full_addr_dc1
                               >> 2U)))), 3);
```
I think these would still have to be traced as they are today, as the only alternative I see is computing them in eval and putting them in the symbol table as named values, which would hurt performance with tracing compiled in but turned off (I know that's not a preferred use case).
Agree that interactive use is an important case also.
For expressions in the tracing, these would presumably have to be calculated into temporary storage in the main trace thread, and then passed by pointer (to the temp) in the data structure.
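That might look something like the following sketch (all names hypothetical; the expression mirrors the `chgBus` example above):

```cpp
#include <cassert>
#include <cstdint>

using IData = uint32_t;  // Verilator's 32-bit data type

// Hypothetical per-cycle temporaries for traced expressions that have no
// named storage in the model.
struct TraceTemps {
    IData rdAddrLoMux;  // slot for the muxed address expression
};

// Evaluated by the main thread right before handing the data table to the
// trace sweep; the corresponding table entry's value pointer would target
// &temps.rdAddrLoMux instead of a model member.
inline void evalTraceTemps(TraceTemps& temps, IData freezeDc3,
                           IData rdAddrLoFfDout, IData fullAddrDc1) {
    temps.rdAddrLoMux = 7U & (freezeDc3 ? rdAddrLoFfDout : (fullAddrDc1 >> 2U));
}
```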
OK, now I'm a bit confused again. I think we need to keep the trace threading and data driven tracing ideas separate (maybe a new issue for the latter?). I thought the idea with data driven tracing was to have the main thread sweep the symbol table via the pointers (the same way it does today with code). I don't think you can usefully offload the sweeping to a different thread, because the main thread at that point would have nothing to do (assuming only one Verilated model exists in the execution), as it cannot modify the symbol table until the trace sweep is done. Though that last assumption in the parentheses gives me another idea...
Data driven or not is orthogonal to threaded or not, but the implementation of threading may differ greatly based on which way it's done, so I'd suggest figuring out whether data driven is worth it with some experiments before trying threading.
As to threading the data-driven tracing, here's a possible flow:
main thread:
trace thread:
Note a read-only copy of the new-value data has to move from the main thread's cache to the trace thread's cache, but it isn't invalidated from the main thread's cache nor evicted by other data (as it is now, when tracing runs on the main thread).
In reality there doesn't need to be a single lock (A/B/C); instead there can be a lock for each mtask of data, so in theory, if properly scheduled, the main thread should have to wait very little.
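A toy illustration of the per-mtask locking idea (names and chunk layout hypothetical; the two phases are serialized here for determinism, whereas in reality they would overlap, with the trace thread trailing the main thread chunk by chunk):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// One lock per mtask-sized chunk of trace data, so the main thread only
// blocks if the trace thread is still reading the specific chunk it wants
// to overwrite, not on one global lock for the whole trace.
struct Chunk {
    std::mutex lock;
    std::vector<uint32_t> values;
};

// Returns the sum the trace thread observed across all chunks.
uint64_t demoPerChunkLocks(int nchunks, int nvals) {
    std::vector<Chunk> chunks(nchunks);
    for (auto& c : chunks) c.values.assign(nvals, 0);

    // Main thread: fill each chunk, holding only that chunk's lock.
    std::thread mainThread([&] {
        for (int i = 0; i < nchunks; ++i) {
            std::lock_guard<std::mutex> g(chunks[i].lock);
            for (auto& v : chunks[i].values) v = static_cast<uint32_t>(i + 1);
        }
    });
    mainThread.join();  // serialized for a deterministic demo

    // Trace thread: read each chunk under its own lock and "format" it
    // (here just summed, standing in for the formatting work).
    uint64_t sum = 0;
    std::thread traceThread([&] {
        for (int i = 0; i < nchunks; ++i) {
            std::lock_guard<std::mutex> g(chunks[i].lock);
            for (uint32_t v : chunks[i].values) sum += v;
        }
    });
    traceThread.join();
    return sum;
}
```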
@gezalore thanks for the awesome patches for this; I think we can consider the performance even better than expected, so this is closable. (I'm presuming - if there's more here, that's fine too; feel free to pull against this or reopen.)
While working on #2257, I had a couple of ideas for general multi-threaded tracing. Currently we support `--trace-fst-thread`, but that relies on a feature of the FST API.

One method for general threaded tracing would be to just copy the whole model state when you want to call `tfp->dump`, and run the tracing code in a different thread against the copy. I did an experiment sticking a `memcpy` of the model next to every trace dump call on SweRV, and found that the simulation time increased by 4x, just because of the extra memcpy. I think the slowdown is due to the actual trace activity sets being very sparsely activated, so this sledgehammer method is hopeless.

A different avenue would be to implement a new "tracing format", which implements the `full*` and `chg*` tracing functions by writing a very simple entry into a linear trace buffer, containing the current value of the signal, an indication of the trace code, and a command describing the type and size. The Verilator-generated tracing functions only call these for signals whose activity set was triggered on that cycle. The trace buffer is then processed on a separate thread. This means that checking the new value against the old to detect changes, and subsequently generating the trace file content, does not happen in the simulation thread. To evaluate this, I replaced all `chg*` functions in `VerilatedVcd` with just the required writes into a buffer. This has shown a speedup of about 35% compared to tracing, so this strategy might be worthwhile. It is generally applicable to all trace formats, so FST would benefit as well, probably quite a bit more than VCD.

Please let me know if you have any particular thoughts about this. I feel the hardest bit is actually getting the user interface/API and architecture right, in a way where offloading to a separate thread is composable with trace formats, while both single- and multi-threaded tracing remain as performant as they can be.
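The linear trace buffer entry described above could be sketched roughly like this (hypothetical names, showing only one command type; the three plain stores and the pointer bump in the emit path match the disassembly shown earlier in the thread):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One record in the linear trace buffer: a command describing type/size,
// a pointer to the old-value slot for the trace code, and the new value.
struct TraceRecord {
    uint64_t cmd;    // command: type and size of the value
    uint32_t* oldp;  // old-value slot for this trace code
    uint64_t value;  // current value of the signal
};

constexpr uint64_t kCmdChgBus = 1;

// Simulation-thread side: what the chg* replacement does. Three plain
// stores and a write pointer bump, no compares, no branches.
inline void emitChgBus(TraceRecord*& wp, uint32_t* oldp, uint32_t value) {
    wp->cmd = kCmdChgBus;
    wp->oldp = oldp;
    wp->value = value;
    ++wp;
}

// Trace-thread side: replay the buffer. Change detection, old-value
// update, and formatting (elided) happen here, off the simulation thread.
// Returns the number of records whose value actually changed.
size_t drain(TraceRecord* rp, TraceRecord* end) {
    size_t nChanged = 0;
    for (; rp != end; ++rp) {
        if (rp->cmd != kCmdChgBus) continue;  // other commands elided
        const uint32_t newVal = static_cast<uint32_t>(rp->value);
        if (*rp->oldp != newVal) {
            *rp->oldp = newVal;  // formatting of the change would go here
            ++nChanged;
        }
    }
    return nChanged;
}
```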