[METRICS] Topic Metric Store collects metrics incompletely

chinghongfang commented 1 year ago

I ran two experiments with the CostAwareAssignor implemented in #1524, one using the local metric store and one using the topic metric store. The consumer using the topic metric store started consuming noticeably later.

Command used:

JMX_PORT=7091 ACCOUNT=chinghongfang VERSION=mergeAssignor HEAP_OPTS=-Xmx24G /home/kafka/astraea/docker/start_app.sh performance --bootstrap.servers 192.168.103.185:9092 --value.size 10KiB --value.distribution fixed --run.until 10m --producers 1 --consumers 1 --read.idle 20m --topics simple --partitioner org.astraea.common.partitioner.StrictCostPartitioner --configs metric.store=local,jmx.port=7091,partition.assignment.strategy=org.astraea.common.assignor.CostAwareAssignor,max.retry.time=10m,metric.store.expiration.duration=3m

The cluster has 6 brokers and a single topic with 12 partitions. I used a stopwatch to time the interval from launching the performance tool to the consumer starting to consume:

Consumer with local metric store: ~6 sec
Consumer with topic metric store: ~46 sec

The topic metric store clearly needs more time to finish collecting metrics and leave MetricStore#wait.

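During the experiment I did a preliminary check, from inside the cost function, of how metrics were arriving. The topicPartition values were being updated, but some of them stayed at 0.0. My guess is therefore that the topic metric store or the metric publisher is dropping metrics, so MetricStore#wait never sees complete data.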

I will keep investigating to see whether the problem is in the publisher or in the metric store.

chinghongfang commented 1 year ago

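After investigating, I suspect two causes:

1. When there is no traffic, the computed partition cost comes out as NaN.
2. The topic metric store starts consuming from the earliest offset.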

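In the experiment, all topics are first deleted and recreated, and then the metric publisher is started to collect metrics (once per second). The BrokerTopicMetrics BytesInPerSec OneMinuteRate values stored at the beginning are therefore 0.0, which makes the subsequent partition cost NaN. In other words, the metrics were not missing; the calculation produced NaN. Once traffic reaches the brokers, a non-NaN value can be computed.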
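To illustrate how all-zero rates turn into NaN, here is a minimal sketch. It assumes the partition cost is each partition's rate divided by the total rate on its broker; the class and method names are hypothetical and this is not Astraea's actual cost function.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NanNormalizeDemo {
  // Hypothetical normalization: each partition's rate divided by the broker's total rate.
  static Map<String, Double> normalize(Map<String, Double> rates) {
    double total = rates.values().stream().mapToDouble(Double::doubleValue).sum();
    Map<String, Double> costs = new LinkedHashMap<>();
    // Floating-point division does not throw: 0.0 / 0.0 evaluates to NaN.
    rates.forEach((partition, rate) -> costs.put(partition, rate / total));
    return costs;
  }

  public static void main(String[] args) {
    // Freshly recreated topic: every recorded rate is still 0.0.
    Map<String, Double> idle = new LinkedHashMap<>();
    idle.put("simple-0", 0.0);
    idle.put("simple-1", 0.0);
    System.out.println(normalize(idle));

    // Once traffic arrives, the same calculation gives ordinary ratios.
    Map<String, Double> busy = new LinkedHashMap<>();
    busy.put("simple-0", 30.0);
    busy.put("simple-1", 10.0);
    System.out.println(normalize(busy));
  }
}
```

The first call yields NaN for every partition even though every metric was collected, which matches the observation that the data was present but the computed cost was not usable.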

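As for the longer time before consumption can start (that is, before the assignor finishes assigning), the long wait (~45 seconds) is probably caused by the topic metric store consuming from the earliest offset. After switching to the latest offset, the wait dropped considerably (~16 seconds). I originally chose the earliest offset because I thought historical data might be needed, so collection started from the beginning. If that slows down collection, though, I think performance should come first.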
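For reference, the earliest/latest choice corresponds to the standard Kafka consumer setting `auto.offset.reset`. The sketch below only builds the properties. Whether the topic metric store exposes this setting, and under what name, is an assumption; the class, method, and `replayHistory` flag are hypothetical.

```java
import java.util.Properties;

class MetricConsumerConfigSketch {
  // Builds consumer properties for a hypothetical metric-reading consumer.
  static Properties consumerProps(String bootstrapServers, boolean replayHistory) {
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", bootstrapServers);
    // "earliest" replays everything still retained in the topic;
    // "latest" only sees records published after the consumer joins.
    // The setting applies only when the group has no committed offset.
    props.setProperty("auto.offset.reset", replayHistory ? "earliest" : "latest");
    return props;
  }

  public static void main(String[] args) {
    System.out.println(consumerProps("192.168.103.185:9092", false));
  }
}
```

Starting from `latest` trades away historical metrics for a shorter warm-up, which is the trade-off described above.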

This round also surfaced another issue: when no traffic is entering any broker, the wait in CostAwareAssignor keeps waiting until it either times out or traffic arrives.
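The waiting behavior can be pictured as a polling loop with a deadline. This is a minimal sketch under the assumption that readiness means "the cost is no longer NaN"; `waitUntil` and its parameters are hypothetical and do not reproduce MetricStore#wait.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class WaitSketch {
  // Polls `ready` until it returns true or `timeout` elapses; returns whether it became ready.
  static boolean waitUntil(BooleanSupplier ready, Duration timeout, Duration interval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (ready.getAsBoolean()) return true;
      if (System.nanoTime() - deadline >= 0) return false;
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        // Preserve the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    double idleCost = 0.0 / 0.0; // NaN, as on a cluster with no incoming traffic
    boolean ready =
        waitUntil(() -> !Double.isNaN(idleCost), Duration.ofMillis(200), Duration.ofMillis(20));
    System.out.println("ready=" + ready);
  }
}
```

With a cost that never leaves NaN, the loop can only end at the deadline, which is why an idle cluster holds the assignor until timeout.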

chia7712 commented 1 year ago

> This round also surfaced another issue: when no traffic is entering any broker, the wait in CostAwareAssignor keeps waiting until it either times out or traffic arrives.

@harryteng9527 any feedback?

harryteng9527 commented 1 year ago

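The current way of deciding whether to wait is to have the assignor call partitionCost and check whether the computed value is NaN. I found that the NaN comes from normalization, where 0.0 ends up being divided by 0.0.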

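I see two possible directions for improvement:

1. Keep the current check, but look at whether the total cost on a node is 0; if it is, set every partition cost on that node to 0.0.
2. Hand the responsibility for deciding whether the data is complete to the cost function. If it cannot fetch the complete set of beanObjects, it throws an exception up to the assignor, which then waits. With this change, though, we need to work out how to confirm that the fetched clusterBean is complete.

Which change would you suggest?

I lean toward 1., because checking the completeness of clusterBean could be fairly complicated: each cost function fetches a different number of beans, so with 2. we might have to compute the expected number of beans for every cost function.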
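Option 1 could look roughly like the sketch below. It assumes the same ratio-style normalization discussed in this thread; the class and method names are hypothetical and this is not the actual CostAwareAssignor code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ZeroTotalGuardSketch {
  // Normalizes per-partition rates on one node, guarding against a zero total.
  static Map<String, Double> normalize(Map<String, Double> rates) {
    double total = rates.values().stream().mapToDouble(Double::doubleValue).sum();
    Map<String, Double> costs = new LinkedHashMap<>();
    if (total == 0.0) {
      // No traffic on this node: report 0.0 for every partition instead of 0.0 / 0.0.
      rates.keySet().forEach(partition -> costs.put(partition, 0.0));
      return costs;
    }
    rates.forEach((partition, rate) -> costs.put(partition, rate / total));
    return costs;
  }

  public static void main(String[] args) {
    Map<String, Double> idle = new LinkedHashMap<>();
    idle.put("simple-0", 0.0);
    idle.put("simple-1", 0.0);
    System.out.println(normalize(idle));
  }
}
```

One consequence of this guard is that an idle node and a node whose metrics have not arrived yet both produce 0.0, so NaN can no longer serve as the signal to keep waiting.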

chia7712 commented 1 year ago

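> Keep the current check, but look at whether the total cost on a node is 0; if it is, set every partition cost on that node to 0.0.

So option 1 does not wait at all? The distribution chosen by the assignor will stay in effect for quite a while, so we should try to avoid making a "hasty" decision.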

opensource4you / astraea

[METRICS] Topic Metric Store collects metrics incompletely #1810