verilator / verilator

Verilator open-source SystemVerilog simulator and lint system
https://verilator.org
GNU Lesser General Public License v3.0

General multi-threaded tracing #2259

Closed gezalore closed 4 years ago

gezalore commented 4 years ago

While working on #2257, I had a couple of ideas for general multi-threaded tracing. Currently we support --trace-fst-thread but that relies on a feature of the FST API.

One method for general threaded tracing would be to just copy the whole model state when you want to call tfp->dump, and run the tracing code in a different thread against the copy. I did an experiment sticking a memcpy of the model next to every trace dump call on SweRV, and found that the simulation time increased by 4x, just because of the extra memcpy. I think the slowdown is due to the actual trace activity sets being very sparsely activated, so this sledgehammer method is hopeless.

A different avenue would be to implement a new "tracing format", which implements the full* and chg* tracing functions by writing a very simple entry into a linear trace buffer, containing the current value of the signal, an indication of the trace code, and a command describing the type and size. The Verilator-generated tracing functions only call these for signals whose activity set was triggered on that cycle. The trace buffer is then processed on a separate thread. This means the checking of the new value against the old to detect a change, and subsequently generating the trace file content, does not happen in the simulation thread. To evaluate this, I replaced all chg* functions in VerilatedVcd with just the required writes into a buffer. This has shown a speedup of about 35% compared to tracing, so this strategy might be worthwhile. It is generally applicable to all trace formats, so FST would benefit as well, probably quite a bit more than VCD.
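To make that concrete, here is a minimal sketch of what such a buffer entry write could look like; the TraceEntry layout, field names, and the chgBus helper below are hypothetical, not the actual implementation:

#include <cstdint>

// Hypothetical trace buffer entry: a command, the trace code, and the raw
// new value. A separate thread later replays these entries, performs the
// old/new comparison, and emits the actual VCD/FST output.
struct TraceEntry {
    uint32_t cmd;    // e.g. "bus change" plus the signal width
    uint32_t code;   // trace code identifying the signal
    uint64_t value;  // raw new value (wider signals would need more slots)
};

// Called from the generated trace functions in the simulation thread, only
// for signals whose activity set triggered this cycle. No change detection
// and no formatting happens here.
inline void chgBus(TraceEntry*& bufp, uint32_t code, uint32_t newval, int bits) {
    bufp->cmd = (1u /*CHG_BUS*/ << 8) | static_cast<uint32_t>(bits);
    bufp->code = code;
    bufp->value = newval;
    ++bufp;  // bump the write pointer; the consumer thread drains the buffer
}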

Please let me know if you have any particular thoughts about this. I feel the hardest bit is actually getting the user interface/API and architecture right, so that offloading to a separate thread is composable with the trace formats while both single- and multi-threaded tracing remain as performant as they can be.

wsnyder commented 4 years ago

Maybe try an experiment where the trace write() is commented out. If that gives you maybe 30% versus your 35% as I suspect (it seems similar to what you tried, since I think the formatting is a small overhead), then just using asynchronous file IO (aio) with write() and double (or more) buffering of the output will give us the 30% even without threads. Note not all OSes have aio, so it would need appropriate ifdefs.
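For reference, a minimal sketch of the double-buffered aio idea, assuming POSIX aio (aio_write/aio_suspend); the class name, buffer sizes, and the lack of error handling are all just for illustration:

#include <aio.h>
#include <cstring>
#include <vector>

// Two output buffers: while one is in flight via aio_write(), the
// simulation thread keeps filling the other. Assumes each append fits in
// a buffer; error handling omitted; may need -lrt on older glibc.
struct DoubleBufferedWriter {
    int m_fd;
    std::vector<char> m_buf[2];
    size_t m_used = 0;
    int m_cur = 0;
    off_t m_offset = 0;
    aiocb m_cb{};
    bool m_pending = false;

    explicit DoubleBufferedWriter(int fd, size_t size = 1 << 20) : m_fd(fd) {
        m_buf[0].resize(size);
        m_buf[1].resize(size);
    }

    void append(const char* datap, size_t len) {
        if (m_used + len > m_buf[m_cur].size()) flush();
        std::memcpy(m_buf[m_cur].data() + m_used, datap, len);
        m_used += len;
    }

    void flush() {
        if (m_pending) {  // wait for the previous asynchronous write to land
            const aiocb* listp[1] = {&m_cb};
            aio_suspend(listp, 1, nullptr);
            aio_return(&m_cb);
        }
        std::memset(&m_cb, 0, sizeof(m_cb));
        m_cb.aio_fildes = m_fd;
        m_cb.aio_buf = m_buf[m_cur].data();
        m_cb.aio_nbytes = m_used;
        m_cb.aio_offset = m_offset;
        aio_write(&m_cb);  // returns immediately; the IO completes in the background
        m_pending = true;
        m_offset += m_used;
        m_cur ^= 1;  // switch buffers so the simulation can keep going
        m_used = 0;
    }
};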

wsnyder commented 4 years ago

Alternatively/in addition, also try strace with timing flags.

gezalore commented 4 years ago

The numbers I quoted were with the file being written to a ramdisk. Here is some hard data based on what you suggested.

Measured with:

 sudo perf stat -r 3 -dd taskset 1 chrt -f 80 ./obj_dir/Vtb_top +dumpon

Baseline (master as of 847c8d5d):

     24,869.06 msec task-clock                #    0.999 CPUs utilized            ( +-  0.62% )
            14      context-switches          #    0.001 K/sec                    ( +-  4.12% )
             1      cpu-migrations            #    0.000 K/sec                  
         2,670      page-faults               #    0.107 K/sec                    ( +-  0.06% )
96,306,094,647      cycles                    #    3.873 GHz                      ( +-  0.61% )  (30.76%)
62,148,608,870      stalled-cycles-frontend   #   64.53% frontend cycles idle     ( +-  0.57% )  (30.77%)
85,662,975,314      instructions              #    0.89  insn per cycle         
                                              #    0.73  stalled cycles per insn  ( +-  0.78% )  (38.47%)
 7,918,922,528      branches                  #  318.425 M/sec                    ( +-  1.56% )  (38.48%)
   612,832,818      branch-misses             #    7.74% of all branches          ( +-  0.08% )  (38.49%)
20,044,500,112      L1-dcache-loads           #  806.001 M/sec                    ( +-  0.86% )  (30.76%)
   730,401,524      L1-dcache-load-misses     #    3.64% of all L1-dcache hits    ( +-  0.59% )  (15.38%)
   227,196,348      LLC-loads                 #    9.136 M/sec                    ( +-  1.33% )  (15.37%)
 5,047,936,497      L1-icache-load-misses                                         ( +-  0.15% )  (23.06%)
20,061,540,635      dTLB-loads                #  806.687 M/sec                    ( +-  0.94% )  (23.04%)
    25,453,753      dTLB-load-misses          #    0.13% of all dTLB cache hits   ( +-  6.39% )  (15.38%)
   144,126,990      iTLB-loads                #    5.795 M/sec                    ( +-  1.33% )  (15.38%)
    32,734,500      iTLB-load-misses          #   22.71% of all iTLB cache hits   ( +-  1.77% )  (23.07%)

        24.897 +- 0.151 seconds time elapsed  ( +-  0.61% )

With the write to the file here https://github.com/verilator/verilator/blob/dc5c259069f63b718ef3b7909724571c97beff53/include/verilated_vcd_c.cpp#L145 replaced with return len:

     23,112.02 msec task-clock                #    0.998 CPUs utilized            ( +-  0.43% )
            16      context-switches          #    0.001 K/sec                    ( +-  5.40% )
             1      cpu-migrations            #    0.000 K/sec                  
         2,670      page-faults               #    0.116 K/sec                    ( +-  0.02% )
89,469,493,309      cycles                    #    3.871 GHz                      ( +-  0.40% )  (30.75%)
57,573,902,061      stalled-cycles-frontend   #   64.35% frontend cycles idle     ( +-  0.31% )  (30.76%)
80,988,426,756      instructions              #    0.91  insn per cycle         
                                              #    0.71  stalled cycles per insn  ( +-  0.85% )  (38.46%)
 7,107,220,713      branches                  #  307.512 M/sec                    ( +-  1.79% )  (38.47%)
   580,847,111      branch-misses             #    8.17% of all branches          ( +-  0.23% )  (38.48%)
18,529,190,257      L1-dcache-loads           #  801.712 M/sec                    ( +-  0.98% )  (30.75%)
   529,066,400      L1-dcache-load-misses     #    2.86% of all L1-dcache hits    ( +-  0.54% )  (15.39%)
   143,452,344      LLC-loads                 #    6.207 M/sec                    ( +-  1.41% )  (15.39%)
 4,916,078,529      L1-icache-load-misses                                         ( +-  0.18% )  (23.08%)
18,530,147,654      dTLB-loads                #  801.754 M/sec                    ( +-  0.98% )  (23.06%)
    18,727,452      dTLB-load-misses          #    0.10% of all dTLB cache hits   ( +-  6.68% )  (15.38%)
   144,610,677      iTLB-loads                #    6.257 M/sec                    ( +-  2.34% )  (15.37%)
    27,244,562      iTLB-load-misses          #   18.84% of all iTLB cache hits   ( +- 15.59% )  (23.06%)

        23.152 +- 0.107 seconds time elapsed  ( +-  0.46% )

That is a bigger saving than the syscall time strace reports on the baseline:

strace -c ./obj_dir/Vtb_top +dumpon

...

% time     seconds  usecs/call     calls    errors syscall
 70.52    0.350573       29214        12         3 openat
 29.46    0.146466           1     99300           write
  0.01    0.000037           1        60           brk
  0.00    0.000012           1        18           mmap
  0.00    0.000010           0        41           read
  0.00    0.000008           1        10           fstat
  0.00    0.000005           1         9           close
  0.00    0.000003           3         1           lseek
  0.00    0.000000           0        10           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         6         6 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
100.00    0.497114                 99471         9 total

With the hacked up write to buffer version:

     17,603.76 msec task-clock                #    0.997 CPUs utilized            ( +-  0.26% )
            15      context-switches          #    0.001 K/sec                    ( +-  5.75% )
             1      cpu-migrations            #    0.000 K/sec                  
         2,710      page-faults               #    0.154 K/sec                    ( +-  0.01% )
68,155,221,067      cycles                    #    3.872 GHz                      ( +-  0.25% )  (30.75%)
46,663,153,952      stalled-cycles-frontend   #   68.47% frontend cycles idle     ( +-  0.38% )  (30.76%)
55,180,537,125      instructions              #    0.81  insn per cycle         
                                              #    0.85  stalled cycles per insn  ( +-  0.02% )  (38.46%)
 3,265,472,754      branches                  #  185.499 M/sec                    ( +-  0.15% )  (38.47%)
   103,437,751      branch-misses             #    3.17% of all branches          ( +-  2.17% )  (38.47%)
12,564,002,142      L1-dcache-loads           #  713.711 M/sec                    ( +-  0.06% )  (30.73%)
   766,630,856      L1-dcache-load-misses     #    6.10% of all L1-dcache hits    ( +-  0.06% )  (15.39%)
   109,530,431      LLC-loads                 #    6.222 M/sec                    ( +-  0.97% )  (15.39%)
 4,489,986,619      L1-icache-load-misses                                         ( +-  0.22% )  (23.08%)
12,562,130,358      dTLB-loads                #  713.605 M/sec                    ( +-  0.04% )  (23.05%)
    13,845,278      dTLB-load-misses          #    0.11% of all dTLB cache hits   ( +-  3.63% )  (15.37%)
   102,533,186      iTLB-loads                #    5.825 M/sec                    ( +-  1.36% )  (15.37%)
    17,419,785      iTLB-load-misses          #   16.99% of all iTLB cache hits   ( +-  8.26% )  (23.06%)

       17.6532 +- 0.0447 seconds time elapsed  ( +-  0.25% )

Note the instruction count difference. About 18% of the baseline profile is still inside VerilatedVcd::, which is mostly the formatting, but the rest of the difference comes from no longer checking for value changes in the inlined chg* calls within the generated trace functions.

I'm fairly confident my hacked up version is accurate. Here is a function when compiled as in master:

000000000002bb00 <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)>:
   2bb00:    mov                   rax, [rdi+0x10]
   2bb04:    add                   edx, 0x20
   2bb07:    mov                   rcx, [rsi+0xc8]
   2bb0e:    mov                   eax, [rax+0x14bc]
   2bb14:    cmp           [rcx+rdx*4], eax
   2bb17:    jne          2bb1a <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)+0x1a>
   2bb19:    ret
   2bb1a:    lea                   rcx, [rcx+rdx*4]
   2bb1e:    mov                   rdi, rsi
   2bb21:    mov                   rsi, rcx
   2bb24:    mov                   edx, eax
   2bb26:    jmp          2bb2b <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)+0x2b>
            2bb27: R_X86_64_PC32    void VerilatedVcd::fullBus<28>(unsigned int*, unsigned int)-0x4

And here is my version; note the 3 stores via rcx to the hypothetical trace buffer, and the write-back of the incremented pointer via rdx. That's all that runs in the main thread: no tests, no write-back of the new value, no formatting, no branches:

000000000003ab60 <Vtb_top::traceChgThis__4(Vtb_top__Syms*, VerilatedVcd*, unsigned int)>:
   3ab60:    mov                   rax, [rdi+0x10]
   3ab64:    add                   edx, 0x20
   3ab67:    shl                   rdx, 0x2
   3ab6b:    add                   rdx, [rsi+0xc8]
   3ab72:    mov                   rcx, [rsi+0x118]
   3ab79:    mov                   eax, [rax+0x14bc]
   3ab7f:    mov                 [rcx], 0x1
   3ab86:    mov             [rcx+0x8], rdx
   3ab8a:    lea                   rdx, [rcx+0x18]
   3ab8e:    mov           [rsi+0x118], rdx
   3ab95:    mov            [rcx+0x10], rax
   3ab99:    ret

I agree AIO would help when storing to regular storage, or god forbid NFS, but to me it seems there is scope for much more.

wsnyder commented 4 years ago

Interesting. Aio is "free" in that there's almost zero overhead, while anything needing a new thread will have some overhead on both the current and the other thread, harming total throughput, plus the bandwidth to move the data across CPUs. But if you are interested it does seem worth exploring, e.g. pushing a trace code and the raw bytes to another thread that formats and calls AIO.

Here are the numbers I think matter, which I find most intuitive when measured per thousand instructions (/ki).

64.53% frontend cycles idle  

  7.17/ki    branch-misses         #    8.17% of all branches
  6.53/ki   L1-dcache-load-misses  #    2.86% of all L1-dcache hits
  0.23/ki      dTLB-load-misses    #    0.10% of all dTLB cache hits
 60.70/ki   L1-icache-load-misses
  0.33/ki      iTLB-load-misses    #   18.84% of all iTLB cache hits
  1.77/ki   LLC-loads
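(For reference on the conversion: these /ki figures appear to be derived from the run with write() stubbed out, e.g. 4,916,078,529 L1-icache-load-misses / 80,988,426,756 instructions x 1000 = about 60.7/ki.)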

The frontend stalls are the real problem; 60/ki icache misses is crazy high, and that is why the frontend is stalled so much. As we already knew, the icache footprint is the underlying issue.

Actually.....

I wonder if a trace implementation that was totally data driven would be better? That is, instead of generated code calling the trace functions, generate a data table with pointers to the data. Then loop through and call the chg check on it. To avoid pointer following taking all of the time you'll need to VL_PREFETCH-ahead many entries, and VL_PREFETCH-ahead the pointers in the entries.
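A rough illustration of that shape (the entry layout, prefetch distance, and emit hook are all made up, and __builtin_prefetch stands in for VL_PREFETCH):

#include <cstddef>
#include <cstdint>

// Hypothetical data-driven trace table: the generator emits one entry per
// traced signal, and a generic loop sweeps it instead of generated code
// calling the trace functions.
struct TraceTableEntry {
    const uint32_t* valuep;  // where the signal lives in the model
    uint32_t* oldvalp;       // previously traced value, for change detection
    uint32_t code;           // trace code
    uint8_t bits;            // signal width
};

// Emit is whatever actually records the change (the existing chg* machinery,
// or a trace-buffer write handed to another thread).
template <typename Emit>
void traceSweep(const TraceTableEntry* tablep, size_t n, Emit emit) {
    constexpr size_t kAhead = 8;  // prefetch distance, made up for illustration
    for (size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
            // __builtin_prefetch standing in for the VL_PREFETCH idea above:
            // prefetch both the upcoming entry and the data it points at.
            __builtin_prefetch(&tablep[i + kAhead]);
            __builtin_prefetch(tablep[i + kAhead].valuep);
        }
        const TraceTableEntry& e = tablep[i];
        const uint32_t newval = *e.valuep;
        if (newval != *e.oldvalp) {  // the chg check
            *e.oldvalp = newval;
            emit(e.code, newval, e.bits);
        }
    }
}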

gezalore commented 4 years ago

I think I agree with pretty much all of the above. Threading does harm throughput in the sense that the total work done per unit time at 100% system utilization is lower due to the threading overhead. I am mostly interested in the threading option for the day-to-day use case when one of my tests has failed and I want a trace to debug sooner rather than later. In that case throughput is secondary to latency from a user perspective (at least around my end). For nightly batch jobs, indeed just run single-threaded for maximum throughput, though I rarely run those with tracing.

The data driven trace idea is interesting. One complication is that not all traced expressions are named and stored directly in the model, e.g.:

vcdp->chgBus(c+257,((7U & ((IData)(vlTOPp->tb_top__DOT__rvtop__DOT__lsu_freeze_dc3)
                            ? (IData)(vlTOPp->tb_top__DOT__rvtop__DOT__mem__DOT__Gen_dccm_enable__DOT__dccm__DOT____Vcellout__rd_addr_lo_ff__dout)
                            : (vlTOPp->tb_top__DOT__rvtop__DOT__swerv__DOT__lsu__DOT__lsu_lsc_ctl__DOT__full_addr_dc1
                                        >> 2U)))),3);

I think these would still have to be traced as they are today, as the only alternative I see is computing them in eval and putting them in the symbol table as named values, which would hurt performance with tracing compiled in but turned off (I know that's not a preferred use case).

wsnyder commented 4 years ago

Agree that interactive use is an important case also.

For expressions in the tracing, these would presumably have to be calculated into temporary storage in the main trace thread, and then passed by pointer (to the temp) in the data structure.
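As a sketch of that idea (the scratch array, its index, and the short member names standing in for the long dotted identifiers in the example above are all made up):

#include <cstdint>

// Stand-in for the generated model class; the real members are the long
// __DOT__ identifiers shown in the example above.
struct ModelState {
    uint32_t lsu_freeze_dc3;
    uint32_t rd_addr_lo_ff_dout;
    uint32_t full_addr_dc1;
};

// Hypothetical scratch storage owned by the trace code, one slot per traced
// expression that has no named storage in the model.
static uint32_t traceScratch[1024];

// In the main trace thread, evaluate each such expression into its slot...
void traceEvalExprs(const ModelState* topp) {
    traceScratch[257] = 7U & (topp->lsu_freeze_dc3
                                  ? topp->rd_addr_lo_ff_dout
                                  : (topp->full_addr_dc1 >> 2U));
}
// ...and the data structure entry for that trace code then carries
// &traceScratch[257], so the consumer (table sweep or trace thread) reads
// the value through the pointer without re-evaluating the expression.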

gezalore commented 4 years ago

OK, now I'm a bit confused again. I think we need to keep the trace threading and data-driven tracing ideas separate (maybe a new issue for the latter?). I thought the idea with the data-driven tracing was to have the main thread sweep the symbol table via the pointers (the same way as it does today with code). I don't think you can usefully offload the sweeping to a different thread, because the main thread at that point would have nothing to do (assuming only one verilated model exists in the execution), as it cannot modify the symbol table until the trace sweep is done. Though that last assumption in the parentheses gives me another idea...

wsnyder commented 4 years ago

Data driven or not is orthogonal to threaded or not, but the implementation of threading may differ greatly based on which way it's done, so I'd suggest figuring out whether data driven is worth it with some experiments before trying threading.

As to data-driven threading, here's a possible flow:

main thread:

trace thread:

Note a read copy of the new value data has to move out of the main thread's cache into the trace thread's cache, but isn't invalidated from the main thread nor evicted by other data (as it is now when tracing runs on the main thread).

In reality there doesn't need to be a single lock (A/B/C); instead there can be a lock for each mtask of data, so in theory, if properly scheduled, the main thread should have to wait very little.
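Purely as an illustration of the per-mtask locking granularity (not the actual proposed flow; the hand-off ordering between the two threads is omitted and all names are made up):

#include <cstddef>
#include <mutex>
#include <vector>

// One lock per mtask's worth of signal data, rather than a single lock for
// the whole model.
static std::vector<std::mutex> mtaskLocks(64);  // count made up

// Main thread: hold the mtask's lock while its signals are being updated
// this cycle, then release it so the trace thread may read them.
void mainThreadRunMtask(size_t id) {
    std::lock_guard<std::mutex> lock(mtaskLocks[id]);
    // ... evaluate the mtask, updating the model signals ...
}

// Trace thread: take each mtask's lock in turn and read the new values
// directly out of the model (a shared read, so the main thread's cached
// copy is not invalidated), then do change detection, formatting and IO.
void traceThreadConsumeMtask(size_t id) {
    std::lock_guard<std::mutex> lock(mtaskLocks[id]);
    // ... compare against old values and emit VCD/FST output ...
}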

wsnyder commented 4 years ago

@gezalore thanks for the awesome patches for this & I think we can consider the performance even better than expected, so this is closable. (I'm presuming - if there's more here, that's fine too; feel free to pull against this or reopen.)