✨ performance target: filter 1 billion rows per minute

as a starting point, reading just under 0.5 billion records of existing Tweet data in 30 minutes.... There's a long way to get to the target.

SELECT count(*) FROM minio.opteryx.tweets FOR DATES BETWEEN '2021-01-05' AND '2022-12-31' WHERE user_verified IS false

┌───┬───────────┐
│   │  COUNT(*) │
│   │  INTEGER  │
╞═══╪═══════════╡
│ 1 │ 490952012 │
└───┴───────────┘
(1, 1)
{
  "blobs_read": 8334,
  "bytes_processed": 61923015,
  "bytes_read": 47175346935,
  "cache_evictions": 1,
  "cache_misses": 8334,
  "cache_oversize": 1099,
  "columns_read": 1,
  "rows_read": 494599469,
  "rows_seen": 494599469,
  "stalls_reading_from_read_buffer": 129,
  "stalls_writing_to_read_buffer": 55279,
  "time_evaluating": 3.337147,
  "time_exiting": 2.2667e-05,
  "time_planning": 1.013283,
  "time_planning_ast_rewriter": 3.7375e-05,
  "time_planning_binder": 0.994843792,
  "time_planning_optimizer": 0.000640625,
  "time_planning_physical_planner": 0.000203625,
  "time_planning_sql_rewriter": 0.000129709,
  "time_reading_blobs": 9051.378091245,
  "time_selecting": 5.725753,
  "time_total": 1970.092488,
  "messages": []
}
0.16.5
Execution time: 1970.098755 seconds
Peak CPU usage: 416.10%
Peak memory usage: 2270.45 MB
Memory Pool (ReadBuffer <minio.opteryx.tweets>) <size=1717986918, commits=8334 (55279), reads=9443, releases=8334, L1=3225, L2=2522>
Memcached <hits=0 misses=0 sets=0 skips=0 errors=0 touches=0>

These stats show a huge number of stalls writing to the read buffer, which indicates that the readers are currently saturating the processing. That's not suggesting that improvements to read are not required, but it is suggesting that improvements to the readers will only cause more stalls rather than improvements to query times.

Although, what is interesting, is the lack of other statistics which would account for the time. This could be the aggregation step.

mabel-dev / opteryx

✨ performance target: filter 1 billion rows per minute #1901