juzibot / donut-tester


DDR calculation rules for monitoring alerts #33

Open su-chang opened 4 years ago

su-chang commented 4 years ago

DDR (ding-dong rate) calculation rules

Rules

  1. When a WeChat account logs in for the first time, its data is initialized:

    DingDongObject {
     dingNum: number;
     dongNum: number;
     warnNum: number;
     onlineTime: number;
     offlineTime: number;
    }
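  2. Every minute, the official account BotSentry sends a #ding message and the corresponding WeChat account automatically replies dong. This counts as one complete statistics cycle.

  3. If no dong reply arrives within 90 seconds, it is counted as one timeout.

  4. After 3 consecutive timeouts, the official account sends an alert to the administrator.

  5. Each re-login inherits the previous data, and the offline duration is converted, minute by minute, into a number of #ding messages sent.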
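Rules 2 to 4 can be sketched in TypeScript as follows. This is only an illustration, not the actual BotSentry code: `CycleState`, `recordCycle`, `TIMEOUT_SECONDS`, and `ALERT_THRESHOLD` are hypothetical names, and it assumes that `warnNum` counts alerts sent, that a late dong is not counted, and that the timeout streak resets after an alert.

```typescript
// Hypothetical sketch of rules 2-4; names and the meaning of warnNum are assumptions.
interface DingDongObject {
  dingNum: number;
  dongNum: number;
  warnNum: number;
  onlineTime: number;
  offlineTime: number;
}

const TIMEOUT_SECONDS = 90; // rule 3: no dong within 90 s counts as a timeout
const ALERT_THRESHOLD = 3;  // rule 4: 3 consecutive timeouts trigger an alert

interface CycleState {
  stats: DingDongObject;
  consecutiveTimeouts: number;
}

// Records one #ding cycle. `replyDelaySeconds` is null when no dong arrived at all.
// Returns true when the administrator should be alerted.
function recordCycle(state: CycleState, replyDelaySeconds: number | null): boolean {
  state.stats.dingNum += 1;
  const timedOut = replyDelaySeconds === null || replyDelaySeconds > TIMEOUT_SECONDS;
  if (!timedOut) {
    state.stats.dongNum += 1;
    state.consecutiveTimeouts = 0;
    return false;
  }
  state.consecutiveTimeouts += 1;
  if (state.consecutiveTimeouts >= ALERT_THRESHOLD) {
    state.stats.warnNum += 1;
    state.consecutiveTimeouts = 0; // assumption: start counting again after an alert
    return true;
  }
  return false;
}
```

Calling `recordCycle` once per minute with the observed reply delay keeps `dingNum` and `dongNum` up to date and tells the caller when to notify the administrator.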

Formula

DING_NUM = (NOW_TIME - OFFLINE_TIME) / 60 + PRE_DING_NUM
DDR = DONG_NUM / DING_NUM

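Data description

NOW_TIME: the time at which the WeChat account logs in
OFFLINE_TIME: the time at which the WeChat account went offline
PRE_DING_NUM: the total number of #ding messages the official account sent to the WeChat account during the previous login session
DING_NUM: the total number of #ding messages the official account has sent to the WeChat account
DONG_NUM: the total number of dong replies the WeChat account has sent to the official account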
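The formula above can be sketched in TypeScript as follows. The function names are hypothetical. The sketch assumes the timestamps are in seconds (which the division by 60 suggests), rounds partial minutes down, and returns 0 when no ding has been sent yet; the original formula specifies none of these.

```typescript
// Sketch of the re-login formula; times are assumed to be timestamps in seconds.
function dingNumAfterRelogin(nowTime: number, offlineTime: number, preDingNum: number): number {
  // Offline seconds are converted into the dings that would have been sent, one per minute.
  // Rounding down is an assumption; the original formula does not specify it.
  const offlineDings = Math.floor((nowTime - offlineTime) / 60);
  return offlineDings + preDingNum;
}

function dingDongRate(dongNum: number, dingNum: number): number {
  // Assumption: report 0 when no ding has been sent yet, to avoid dividing by zero.
  return dingNum === 0 ? 0 : dongNum / dingNum;
}

// Example: 100 dings, all answered, followed by a 50-minute (3000 s) outage.
const dingNumNow = dingNumAfterRelogin(9000, 6000, 100); // 50 + 100 = 150
console.log(dingDongRate(100, dingNumNow).toFixed(3)); // prints "0.667"
```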

huan commented 4 years ago

Thanks for sharing your algorithm for calculating the DDR!

I'd like to suggest removing PRE_DING_NUM and OFFLINE_TIME from the formula, because what we are trying to optimize is the service availability of our bots.

  1. I believe PRE_DING_NUM is not necessary, because we should focus on the latest status of our DDR.
  2. The result will be more meaningful if we treat OFFLINE_TIME as FAIL:
    1. If the bot is owned by us: we should bring the bot back online as soon as possible when it goes offline.
    2. If the bot is owned by our alpha testers: they should bring the bot back online as soon as possible when it goes offline. We will revoke the testing token from the bottom 10% of them (10% is just an example).

su-chang commented 4 years ago

One case:

Suppose a bot first logs in on Monday, WeChat drops offline on Wednesday, and the bot logs in again on Friday.

Currently we only recalculate the DDR at the next login, so:

  1. #ddr on Thursday: 100%, offline
  2. #ddr on Friday: 50%, online

After removing PRE_DING_NUM and OFFLINE_TIME, maybe we should calculate the DDR of every bot whenever the #ddr command is issued:

  1. #ddr on Thursday: 66.6%, offline
  2. #ddr on Friday: 100%, online

Which one do you prefer? @huan @lijiarui @windmemory

huan commented 4 years ago

Your case is exactly what we should expect because the DDR is a test to check whether the bot is online or not.

If the bot is offline, it will not be able to respond. The longer it stays offline, the lower its DDR will be.

So, what we need to do is: when a bot goes offline, we do our best to bring it back as soon as possible.

Does that make sense?

su-chang commented 4 years ago

Yes, I agree!

I will change the algorithm later. Thank you very much!

huan commented 4 years ago

You are welcome!

And great to know that you agree with me, cheers!

windmemory commented 4 years ago

Let's say for this case:

image

The bot is offline for the whole of Thursday and Friday, from 0:00 to 24:00, and online the rest of the time.

Q1: when we check the DDR at Sun 24:00, what is the expected rate?

  1. 5 / 7 ≈ 71%
  2. 2 / 2 = 100%

In this case, PRE_DING_NUM is the number of dings from Mon to Wed, and I think we should count it. OFFLINE_TIME is Wed 24:00. I think we should include both of these arguments in the formula. Using

DING_NUM = (NOW_TIME - OFFLINE_TIME) / 60 + PRE_DING_NUM
DDR = DONG_NUM / DING_NUM

gives the first result; removing PRE_DING_NUM and OFFLINE_TIME gives the second. I prefer the first one.
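The two candidate results can be checked with a short TypeScript calculation. It is a sketch under stated assumptions: one ding per minute while online, a dong for every ding actually sent, and dings from the Sat to Sun session added on top of the DING_NUM computed at re-login. All variable names are hypothetical.

```typescript
// Worked example of the Mon-Sun case, assuming one ding per minute while online
// and a dong for every ding that is actually sent.
const MINUTES_PER_DAY = 24 * 60;

const preDingNum = 3 * MINUTES_PER_DAY;       // Mon-Wed, online
const offlineSeconds = 2 * 24 * 60 * 60;      // Thu-Fri, offline
const lastSessionDings = 2 * MINUTES_PER_DAY; // Sat-Sun, online

const dongTotal = preDingNum + lastSessionDings;

// Option 1: offline minutes are converted into dings and previous dings are inherited.
const dingWithHistory = offlineSeconds / 60 + preDingNum + lastSessionDings;
const ddrWithHistory = dongTotal / dingWithHistory; // 7200 / 10080 = 5 / 7

// Option 2: only the session since the last login is counted.
const ddrLatestOnly = lastSessionDings / lastSessionDings; // 2880 / 2880 = 1

console.log(ddrWithHistory.toFixed(2), ddrLatestOnly.toFixed(2)); // prints "0.71 1.00"
```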

huan commented 4 years ago

I prefer the first one too: 5 / 7 ≈ 71%

windmemory commented 4 years ago

@huan so does this formula look good to you?

DING_NUM = (NOW_TIME - OFFLINE_TIME) / 60 + PRE_DING_NUM
DDR = DONG_NUM / DING_NUM

huan commented 4 years ago

To be honest, I do not think that the Ding Dong Rate is related to any time variables.

We can just count the number of dings and dongs.

So I'd like to remove all the time-related concepts from our formula first, then I believe we will be good.

windmemory commented 4 years ago

> To be honest, I do not think that the Ding Dong Rate is related to any time variables.
>
> We can just count the number of dings and dongs.
>
> So I'd like to remove all the time-related concepts from our formula first, then I believe we will be good.

That would be the ideal design. The difference is that the current design does not send ding to the bot while the bot is offline, so purely counting the dings sent will not give the rate we expect. That's why those time concepts appear in the formula. There are two reasons why we don't send ding to the bot while it is offline:

  1. We are using a WeChat Official Account to monitor the status. If the Official Account keeps sending ding while the bot is offline, the messages will be blocked. Then, when the bot gets back online, we will not be able to send ding to it correctly, which results in unexpected rate data. The message blocking is not documented, but we've seen it in our tests.
  2. If the developer manually logs the bot out from Wechaty and the bot sentry keeps sending ding to the bot, it will be really annoying.

So we stop sending ding to the bot while it is offline, which brings the time-related concepts into the formula. Does this make sense to you? @huan

huan commented 4 years ago

Yes, that makes sense to me.

However, we can simply ignore all those problems by introducing a new concept: effective ding number, EDN.

The EDN is the number of ding that we SHOULD emit. Then we will get everything done beautifully:

DDR = Dong Number Received / EDN

windmemory commented 4 years ago

> However, we can simply ignore all those problems by introducing a new concept: effective ding number, EDN.
>
> The EDN is the number of ding that we SHOULD emit. Then we will get everything done beautifully:
>
> DDR = Dong Number Received / EDN

Sure, let's use the concept of EDN.

Then let's make it clear how we get the EDN:

To calculate the EDN, we use DING_NUM (the total number of dings sent) together with ODN (the offline ding number). DING_NUM is an actual count, whereas ODN is calculated from the offline duration.

ODN = (ONLINE_TIME - OFFLINE_TIME) / 60
EDN = DING_NUM + ODN

Is this okay?
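The ODN and EDN formulas above can be sketched in TypeScript as follows. The function names are hypothetical, and the sketch assumes timestamps in seconds and one ding per minute.

```typescript
// Sketch of the ODN / EDN proposal; timestamps are assumed to be in seconds.
function offlineDingNum(onlineTime: number, offlineTime: number): number {
  // Dings that would have been sent during the outage, at one per minute.
  return (onlineTime - offlineTime) / 60;
}

function effectiveDingNum(dingNum: number, odn: number): number {
  // Dings actually sent plus the dings skipped while the bot was offline.
  return dingNum + odn;
}

// 120 dings actually sent, then a 30-minute outage (1800 s).
const odn = offlineDingNum(5400, 3600);
const edn = effectiveDingNum(120, odn);
console.log(odn, edn); // prints "30 150"
```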

huan commented 4 years ago

I believe there's an easier and simpler way to get this number:

EDN = DURATION / INTERVAL

However, any formula is OK, as long as it calculates the right EDN.
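As a rough TypeScript sketch of this simpler approach, the DDR can be computed from the monitored duration alone. The function names are hypothetical. It assumes DURATION covers the whole monitored period, including offline time, that DURATION and INTERVAL use the same unit (seconds here), and that partial intervals are rounded down.

```typescript
// Sketch of DDR based on EDN = DURATION / INTERVAL.
// Assumptions: duration and interval share one unit (seconds here), and DURATION is
// the whole monitored period, including time spent offline.
function ednFromDuration(durationSeconds: number, intervalSeconds: number): number {
  if (intervalSeconds <= 0) throw new RangeError("interval must be positive");
  return Math.floor(durationSeconds / intervalSeconds);
}

function ddrFromEdn(dongReceived: number, ednValue: number): number {
  // Assumption: report 0 when no ding was due yet, to avoid dividing by zero.
  return ednValue === 0 ? 0 : dongReceived / ednValue;
}

// One week monitored at a 60-second interval, with dongs received on 5 of the 7 days.
const weekEdn = ednFromDuration(7 * 24 * 3600, 60); // 10080
const weekDdr = ddrFromEdn(5 * 24 * 60, weekEdn);   // 7200 / 10080
console.log(weekEdn, weekDdr.toFixed(2)); // prints "10080 0.71"
```

With this form no offline or online timestamps need to be stored; only the start of the monitored period and the count of dongs received.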

windmemory commented 4 years ago

> EDN = DURATION / INTERVAL

This looks good to me, we can go with this one.