risinglightdb / risinglight

An educational OLAP database system.
Apache License 2.0
1.61k stars, 214 forks

feat: remove useless spawn_blocking #840

Closed xiaguan closed 7 months ago

xiaguan commented 7 months ago

This usage of spawn_blocking is useless, and it wastes some performance, since tokio may create a thread for it.

skyzh commented 7 months ago

But won't the disk I/O occupy the runtime thread and block other tasks from being scheduled?

skyzh commented 7 months ago

note that it's sync I/O here instead of async I/O inside the spawn_blocking

xiaguan commented 7 months ago

Here is a simple perf stat result on TPC-H 10 GB data, for this query:

select
    sum(l_extendedprice) as sum_base_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem;

Remove spawn_blocking

         14,113.10 msec task-clock                       #    1.173 CPUs utilized             
             7,085      context-switches                 #  502.016 /sec                      
                64      cpu-migrations                   #    4.535 /sec                      
           772,980      page-faults                      #   54.770 K/sec                     
    22,703,318,259      cycles                           #    1.609 GHz                         (82.99%)
     1,784,793,934      stalled-cycles-frontend          #    7.86% frontend cycles idle        (83.49%)
     6,680,958,786      stalled-cycles-backend           #   29.43% backend cycles idle         (83.24%)
    36,582,277,643      instructions                     #    1.61  insn per cycle            
                                                  #    0.18  stalled cycles per insn     (83.48%)
     6,194,729,232      branches                         #  438.935 M/sec                       (83.25%)
       194,448,934      branch-misses                    #    3.14% of all branches             (83.63%)

      12.033058573 seconds time elapsed

       8.076494000 seconds user
       6.023971000 seconds sys

The main branch

         23,975.43 msec task-clock                       #    1.418 CPUs utilized             
           356,000      context-switches                 #   14.849 K/sec                     
               281      cpu-migrations                   #   11.720 /sec                      
           657,275      page-faults                      #   27.415 K/sec                     
    29,943,693,477      cycles                           #    1.249 GHz                         (83.17%)
     2,389,946,991      stalled-cycles-frontend          #    7.98% frontend cycles idle        (83.23%)
     9,854,147,750      stalled-cycles-backend           #   32.91% backend cycles idle         (83.00%)
    37,148,855,213      instructions                     #    1.24  insn per cycle            
                                                  #    0.27  stalled cycles per insn     (83.37%)
     6,336,456,023      branches                         #  264.290 M/sec                       (83.31%)
       217,676,506      branch-misses                    #    3.44% of all branches             (83.95%)

      16.905472146 seconds time elapsed

      11.701711000 seconds user
      11.605966000 seconds sys

block_in_place

         17,933.10 msec task-clock                       #    1.329 CPUs utilized             
            50,364      context-switches                 #    2.808 K/sec                     
               120      cpu-migrations                   #    6.692 /sec                      
           958,696      page-faults                      #   53.460 K/sec                     
    28,158,279,564      cycles                           #    1.570 GHz                         (83.46%)
     2,583,071,171      stalled-cycles-frontend          #    9.17% frontend cycles idle        (83.29%)
     8,491,632,892      stalled-cycles-backend           #   30.16% backend cycles idle         (83.55%)
    38,026,410,766      instructions                     #    1.35  insn per cycle            
                                                  #    0.22  stalled cycles per insn     (83.13%)
     6,471,120,813      branches                         #  360.848 M/sec                       (83.45%)
       240,410,333      branch-misses                    #    3.72% of all branches             (83.22%)

      13.490252587 seconds time elapsed

       9.807809000 seconds user
       8.062861000 seconds sys
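
To make the three runs easier to compare, here is a small sketch that derives ratios from the elapsed-time and context-switch figures reported above. The `Run` struct and the helper names are made up for illustration; only the numbers come from the perf output. It suggests that removing spawn_blocking is roughly 1.4x faster than main, and block_in_place roughly 1.25x faster, on this single query and single machine.

```rust
/// One `perf stat` run; the figures are copied from the output above.
struct Run {
    name: &'static str,
    elapsed_s: f64,
    context_switches: u64,
}

/// The three runs, in order: main branch, spawn_blocking removed, block_in_place.
fn runs() -> [Run; 3] {
    [
        Run { name: "main (spawn_blocking)", elapsed_s: 16.905472146, context_switches: 356_000 },
        Run { name: "spawn_blocking removed", elapsed_s: 12.033058573, context_switches: 7_085 },
        Run { name: "block_in_place", elapsed_s: 13.490252587, context_switches: 50_364 },
    ]
}

/// Wall-clock speedup of `other` relative to `base` (values above 1 mean faster).
fn speedup(base: &Run, other: &Run) -> f64 {
    base.elapsed_s / other.elapsed_s
}

/// How many times fewer context switches `other` incurred than `base`.
fn ctx_switch_ratio(base: &Run, other: &Run) -> f64 {
    base.context_switches as f64 / other.context_switches as f64
}

fn main() {
    let [main_branch, no_spawn, in_place] = runs();
    for run in [&no_spawn, &in_place] {
        println!(
            "{}: {:.2}x faster than {}, {:.1}x fewer context switches",
            run.name,
            speedup(&main_branch, run),
            main_branch.name,
            ctx_switch_ratio(&main_branch, run),
        );
    }
}
```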

xiaguan commented 7 months ago

By the way, the CPU utilization rate of risinglight is very low even on some complex queries.

xiaguan commented 7 months ago

All of the scan executors' I/O is handled by get_block. This function's performance is crucial for us; should we consider refactoring it?

wangrunji0408 commented 7 months ago

By the way, the CPU utilization rate of risinglight is very low even on some complex queries.

Aggregation is currently limited to a single CPU core. We don't partition the data and process it in parallel yet.

skyzh commented 7 months ago

Would you please try block_in_place and see if it improves perf?

xiaguan commented 7 months ago

Would you please try block_in_place and see if it improves perf?

Updated

skyzh commented 7 months ago

Thanks for your work. Based on the benchmark results, I would rather opt for block_in_place. The problem with doing I/O on a tokio runtime thread is that it prevents other futures from being scheduled. This would only be a temporary relief until we implement full parallelism in the system.