matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: Incorrect stats in exec_plan #14107

Open sukki37 opened 10 months ago

sukki37 commented 10 months ago

Is there an existing issue for the same bug?

Branch Name

main, 1.1-dev

Commit ID

3ac9acb8d

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

  1. S3 Scan Bytes vs Count Discrepancy: In cases where the byte count for an S3 scan is significantly large, the s3count is reported as 0. This seems to be an incorrect representation of the actual data processed.
  2. Disk IO Always Zero: The disk IO metric consistently shows a value of 0, regardless of the actual disk activity. This suggests that the disk IO is not being tracked or reported correctly.
  3. Stats Block Count vs Input Rows Inconsistency: There have been instances where the stats section reported a block count of around 3000, but the number of input rows was over 40 million. This contradicts the principle that each block can contain a maximum of 8192 rows, indicating a discrepancy in the stats reporting.
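A quick sanity check of point 3: with at most 8192 rows per block, roughly 3000 blocks cannot account for 40 million input rows. A minimal sketch of the arithmetic (Go; `maxRows` is an illustrative helper, not a MatrixOne function):

```go
package main

import "fmt"

// maxRowsPerBlock is the stated upper bound on rows per block.
const maxRowsPerBlock = 8192

// maxRows returns the largest row count a given block count can hold.
func maxRows(blocks int) int {
	return blocks * maxRowsPerBlock
}

func main() {
	// ~3000 blocks hold at most ~24.6M rows, far below the
	// 40M+ input rows reported, so the stats are inconsistent.
	fmt.Println(maxRows(3000)) // 24576000
}
```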

Expected Behavior

No response

Steps to Reproduce

Run some SQL statements and check the `exec_plan` output.

Additional information

No response

Morranto commented 9 months ago

Working on a fix.

Morranto commented 9 months ago

1. The original S3 Scan logic effectively counted all scanned data, which is wrong and needs to change. However, due to interface limitations, S3 scan information cannot be recorded per node; it can only be aggregated per SQL statement.
2. Disk IO was originally a reserved field and nothing was ever recorded in it. Likewise, it cannot be tracked per node and can only be aggregated per SQL statement.

Morranto commented 9 months ago

3. Stats Block Count is an estimate used during the optimizer phase and should not be treated as actual data. For details, consult the optimizer team (@badboynt1).

Morranto commented 8 months ago

working on moc2152

Morranto commented 8 months ago

working on https://github.com/matrixorigin/MO-Cloud/issues/1804 https://github.com/matrixorigin/matrixone/issues/12126

Morranto commented 8 months ago

working on moc 1804

Morranto commented 7 months ago

not working on it

Morranto commented 6 months ago

working on 15572

ouyuanning commented 6 months ago

Working on the prepare refactor.

ouyuanning commented 6 months ago

Working on the prepare refactor; haven't had time to look at this yet.

ouyuanning commented 5 months ago

Xiliang, please take over and follow up on this.

aressu1985 commented 5 months ago

Discussed with Xiliang; fixing this within the 1.2.1 cycle carries significant risk, so it is delayed to 1.3.0.

qingxinhome commented 5 months ago

not working on it

qingxinhome commented 3 months ago

Refactoring the analyze code.

qingxinhome commented 2 months ago

The analyze-code refactor PR has been submitted.

qingxinhome commented 2 months ago

The operator-analyzer refactor PR has been merged; we will observe whether the problem still occurs.

qingxinhome commented 1 month ago

The PR has been merged.

qingxinhome commented 1 month ago
  1. S3IOInputCount / S3IOOutputCount:

    S3IOInputCount and S3IOOutputCount count S3 IO requests. A counterSet is embedded in the Compile Context as an observer, and the FileService uses it to tally S3 IO requests during SQL execution.

    Currently, S3 IO statistics cannot be attributed to individual operators, so S3IOInputCount and S3IOOutputCount on a plan Node are always 0. Because they cannot be tied to a specific operator or Node, they can only describe the total consumption of the whole SQL execution.

    Proposed fix: file a feature request to have the FileService attribute S3IOInputCount and S3IOOutputCount to individual operators.

    Dr. Tian considers it a bug if S3 IO cannot be counted per operator.

  2. S3IOByte:

    This is counted by the tableScan operator and attributed to the corresponding Table_Scan Node. MO has always used the size of the data read by the table_scan reader as the value of S3IOByte, but the data the reader reads does not necessarily come from S3; it may also come from memory. Therefore:

    • S3IOByte has no direct correspondence with S3IOInputCount or S3IOOutputCount

    • The name S3IOByte is misleading; renaming is recommended

    For reference, Snowflake calls the amount of data scanned "Bytes scanned":

{
    "name": "Bytes scanned",
    "value": 262144,
    "unit": "bytes"
}

    **Fix**: the `S3IOByte` statistic has been renamed to ScanBytes; the PR has been merged (resolved).

  3. InputBlock:

    InputBlock is the count, kept by the tableScan operator, of blocks read by the reader. Data is read from the reader in units of blocks; a block comes either from S3 or from memory, but a block is not necessarily full (8192 rows).

    No change needed.

  4. Disk IO: this attribute has always been unused; neither the operators nor the FileService record Disk IO, so its value is always 0.

    Also, MO operators cannot detect the storage medium of the data; my understanding is that in a cloud environment MO only uses memory and S3 as storage media.

    Proposed fix: have the FileService record Disk IO. Dr. Tian considers it a bug if Disk IO cannot be counted per operator.

    @sukki37

qingxinhome commented 1 month ago

Partially done.

qingxinhome commented 1 month ago

In progress.

qingxinhome commented 3 weeks ago

The second-round refactor PR has been submitted.

qingxinhome commented 2 weeks ago
  1. S3IOInputCount and S3IOOutputCount: these count S3 IO requests. A counterSet is embedded in the Compile Context as an observer, and the FileService uses it to tally S3 IO requests during SQL execution. Currently these statistics cannot be attributed to individual operators, so S3IOInputCount and S3IOOutputCount on a plan Node are always 0; they can only describe the total consumption of the whole SQL execution.

Fix: per the meeting's decision, S3 usage is no longer folded into S3IOInputCount and S3IOOutputCount. It is now counted by S3 request type (List, Head, Put, Get, Delete, DeleteMul), tracked per operator, and shown on the corresponding Plan Node.

  2. S3IOByte: counted by the tableScan operator and attributed to the corresponding Table_Scan Node. MO has always used the size of the data read by the table_scan reader as the value of S3IOByte, but that data does not necessarily come from S3; it may also come from memory. Therefore:

    • S3IOByte has no direct correspondence with S3IOInputCount or S3IOOutputCount
    • The name S3IOByte is misleading; the meeting recommended renaming it. Fix: S3IOByte has been renamed to ScanBytes.
  3. InputBlock:
    InputBlock is the count, kept by the tableScan operator, of blocks read by the reader. Data is read from the reader in units of blocks; a block comes either from S3 or from memory, but a block is not necessarily full (8192 rows). Conclusion: no change needed.

  4. Disk IO: this attribute has always been unused; neither the operators nor the FileService recorded Disk IO, so its value was always 0. MO operators also cannot detect the storage medium; in a cloud environment MO only uses memory and S3. Fix: Disk IO is now recorded by the FileService; when disk is used as the storage medium, each operator's Disk IO usage is counted in bytes and shown on the corresponding Plan Node.

The issue has been resolved; please verify and test.
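The per-operator, per-method counting described in point 1 might look roughly like this (Go; `OperatorS3Stats` and `Record` are hypothetical illustrations, and only the method names List, Head, Put, Get, Delete, DeleteMul come from the comment above):

```go
package main

import "fmt"

// S3Method enumerates the request categories the fix introduces.
type S3Method int

const (
	List S3Method = iota
	Head
	Put
	Get
	Delete
	DeleteMul
	numMethods
)

// OperatorS3Stats tallies S3 requests per method for one operator;
// in the fix, each operator's tally would be surfaced on its Plan Node.
type OperatorS3Stats struct {
	Counts [numMethods]int64
}

// Record bumps the counter for one S3 request of the given method.
func (s *OperatorS3Stats) Record(m S3Method) { s.Counts[m]++ }

func main() {
	var scan OperatorS3Stats // per-operator tally, e.g. for a table scan
	scan.Record(Get)
	scan.Record(Get)
	scan.Record(List)
	fmt.Println(scan.Counts[Get], scan.Counts[List]) // 2 1
}
```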