penguin-statistics / backend-next

The refactored Penguin Statistics v3 Backend. Built with Go, fiber, bun and go.uber.org/fx. Uses NATS as the message queue and Redis for state synchronization.

Report Z-test Validations & `DropInfo` Self-adoption #53

Open · GalvinGao opened this issue 2 years ago

GalvinGao commented 2 years ago

Currently there are only a few simple, if not naive, approaches to report validation. We previously proposed a Z-test mechanism and implemented it on the previous backend. However, due to the MongoDB evaluation bottleneck on the previous backend, we unfortunately had to disable that feature because of its heavy performance cost.

The backend-next project now has both the flexibility and the performance headroom to let us relaunch such a mechanism for checking reports.
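For concreteness, here is a minimal sketch of what such a check could look like, assuming an accumulated per-stage drop rate is already available; the package and function names are illustrative and not part of backend-next:

```go
// Hypothetical sketch of the z-test report check discussed above: compare a
// drop's observed rate in a recent batch of reports against the rate
// accumulated so far. Names and thresholds are assumptions, not a design.
package validation

import "math"

// OneProportionZ computes the z statistic for an observed count k out of n
// trials against a hypothesized proportion p0.
func OneProportionZ(k, n int64, p0 float64) float64 {
	if n == 0 || p0 <= 0 || p0 >= 1 {
		return 0
	}
	pHat := float64(k) / float64(n)
	se := math.Sqrt(p0 * (1 - p0) / float64(n))
	return (pHat - p0) / se
}

// SuspiciousBatch reports whether a recent sample deviates from the
// accumulated drop rate by more than the given z threshold (e.g. 3.0), so
// the batch can be flagged for review rather than silently rejected.
func SuspiciousBatch(recentDrops, recentTimes int64, accumulatedRate, zThreshold float64) bool {
	z := OneProportionZ(recentDrops, recentTimes, accumulatedRate)
	return math.Abs(z) > zThreshold
}
```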

Moreover, the DropInfo section is currently decided somewhat artificially and might not be suitable for the first several hundred reports, since we cannot predict in advance the finite set of drop possibilities. As a result, there have previously been several issues where DropInfo was not applied properly at first, potentially introducing deviations into the dataset. Although we have been actively fixing those by hand, that is time-consuming and not an optimal solution at all.

Therefore, there could also be a mechanism where DropInfo itself adapts continuously as the report dataset grows. However, the implementation details of that adaptation are still a huge topic to discuss.
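One very rough illustration of what self-adoption could mean, assuming DropInfo records per-item quantity bounds; every name and the sample-size threshold here are assumptions for discussion, not a proposed design:

```go
// Sketch: once an item's observed quantity range has been seen across enough
// reports, widen the recorded bounds to cover it, so a handful of bad
// reports cannot move the bounds on their own.
package dropinfo

type Bounds struct {
	Lower int
	Upper int
}

// Adapt widens bounds to include observed quantities, but only after the
// item has appeared in at least minSamples reports.
func Adapt(current Bounds, observedMin, observedMax, sampleCount, minSamples int) Bounds {
	if sampleCount < minSamples {
		return current
	}
	if observedMin < current.Lower {
		current.Lower = observedMin
	}
	if observedMax > current.Upper {
		current.Upper = observedMax
	}
	return current
}
```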


Just to note down here that those statistics-based tests are all quite susceptible to attacks in which the attacker submits several hundred to a thousand false reports in the very first moments after a stage opens, causing the dataset to converge to a skewed result. Any reports afterwards would then be considered invalid, and the true reports would be rejected. Such an attack could be mitigated by randomly sampling reports across different accounts and IPs, and by carefully designing the threshold at which the Z-test kicks in, to minimize the effect such an attack could have.
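A possible shape for that mitigation, sketched under assumptions: build the baseline from at most one report per (account, IP) pair, and only enable the test once enough distinct sources have reported. Types, field names, and the threshold are illustrative only:

```go
// Sketch: de-duplicate reports by source before forming the baseline the
// z-test compares against, so a flood from one attacker counts only once.
package validation

type Report struct {
	AccountID string
	IP        string
	Drops     int64
	Times     int64
}

// SampleDistinctSources keeps at most one report per account/IP pair and
// reports whether enough distinct sources exist to enable the z-test.
func SampleDistinctSources(reports []Report, minSources int) ([]Report, bool) {
	seen := make(map[string]bool)
	var sampled []Report
	for _, r := range reports {
		key := r.AccountID + "|" + r.IP
		if seen[key] {
			continue
		}
		seen[key] = true
		sampled = append(sampled, r)
	}
	return sampled, len(sampled) >= minSources
}
```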

FlandiaYingman commented 2 years ago

From the point of view of attack prevention, we could run a two-proportion z-test, which allows us to compare two proportions (in our case, reports grouped by IP or account) to see whether they are the same. If the result shows that the two proportions are not the same, we have reason to suspect that one of them is false.
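A minimal sketch of that test, assuming each proportion is drop count over report count for one group of reports (e.g. all reports sharing an IP or account); this uses the standard pooled-proportion form, and the function name is illustrative:

```go
// Two-proportion z-test: compare the drop rates of two report groups using
// the pooled-proportion standard error.
package validation

import "math"

// TwoProportionZ returns the z statistic for comparing the drop proportions
// of two groups; |z| > 1.96 indicates a significant difference at roughly
// the 5% level, and such a group could then be flagged as potentially false.
func TwoProportionZ(drops1, total1, drops2, total2 int64) float64 {
	if total1 == 0 || total2 == 0 {
		return 0
	}
	p1 := float64(drops1) / float64(total1)
	p2 := float64(drops2) / float64(total2)
	pooled := float64(drops1+drops2) / float64(total1+total2)
	se := math.Sqrt(pooled * (1 - pooled) * (1/float64(total1) + 1/float64(total2)))
	if se == 0 {
		return 0
	}
	return (p1 - p2) / se
}
```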

Also, rather than stating the null hypothesis based on the first n reports, we could state it based on the first n groups of reports, where reports are grouped by IP, account, or report method (recognition or manual). That way, our dataset would not be affected by a huge number of reports from a single source.
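Sketched out, the grouping could look roughly like this, assuming a minimal report shape; each group contributes a single proportion regardless of how many reports it contains, and all names are assumptions:

```go
// Sketch: aggregate reports per (IP, account, method) source and return one
// drop proportion per group; the null hypothesis would then be stated from
// the first n such groups instead of the first n raw reports.
package validation

type SourcedReport struct {
	IP      string
	Account string
	Method  string // "recognition" or "manual"
	Drops   int64
	Times   int64
}

// GroupProportions returns one drop proportion per distinct source group.
func GroupProportions(reports []SourcedReport) []float64 {
	type agg struct{ drops, times int64 }
	groups := make(map[[3]string]*agg)
	for _, r := range reports {
		key := [3]string{r.IP, r.Account, r.Method}
		if groups[key] == nil {
			groups[key] = &agg{}
		}
		groups[key].drops += r.Drops
		groups[key].times += r.Times
	}
	props := make([]float64, 0, len(groups))
	for _, g := range groups {
		if g.times > 0 {
			props = append(props, float64(g.drops)/float64(g.times))
		}
	}
	return props
}
```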