torshie opened 6 months ago
Hi @torshie, that's a great question -- I don't think there is a clear consensus in the community yet, but as a starting point you can study the thresholds used in the literature (e.g. the Gopher rules or the rules used in RefinedWeb). Studying data quality and data mixes is an active area of research, and getting such an understanding is one of the core motivations behind RPv2.
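To give a rough idea of what such literature thresholds look like, here is a minimal sketch of Gopher-style heuristics. The cutoff values are the ones I recall from the Gopher paper and should be checked against it before use. The function name `passes_gopher_style_rules` is hypothetical, and the sketch computes everything from raw text with simple whitespace tokenization instead of reading the precomputed RPv2 quality signals, so treat it as an illustration, not a reference implementation.

```python
# Stop words used by the "at least two stop words" rule (as recalled from the Gopher paper).
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def passes_gopher_style_rules(text: str) -> bool:
    """Return True if `text` passes a set of Gopher-style quality heuristics.

    All thresholds are assumptions taken from memory of the Gopher paper;
    treat them as starting points to tune, not as fixed values.
    """
    words = text.split()
    n_words = len(words)

    # Rule 1: document length between 50 and 100,000 words.
    if not 50 <= n_words <= 100_000:
        return False

    # Rule 2: mean word length between 3 and 10 characters.
    mean_word_length = sum(len(w) for w in words) / n_words
    if not 3 <= mean_word_length <= 10:
        return False

    # Rule 3: symbol-to-word ratio at most 0.1, for hashes and for ellipses.
    hash_ratio = text.count("#") / n_words
    ellipsis_ratio = (text.count("...") + text.count("…")) / n_words
    if hash_ratio > 0.1 or ellipsis_ratio > 0.1:
        return False

    # Rules 4 and 5: at most 90% of lines start with a bullet,
    # and at most 30% of lines end with an ellipsis.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullet_frac = sum(ln.startswith(("-", "*", "•")) for ln in lines) / len(lines)
    ellipsis_frac = sum(ln.endswith(("...", "…")) for ln in lines) / len(lines)
    if bullet_frac > 0.9 or ellipsis_frac > 0.3:
        return False

    # Rule 6: at least 80% of words contain an alphabetic character.
    alpha_frac = sum(any(c.isalpha() for c in w) for w in words) / n_words
    if alpha_frac < 0.8:
        return False

    # Rule 7: at least two distinct stop words appear in the document.
    normalized = {w.lower().strip(".,;:!?\"'()") for w in words}
    return len(STOP_WORDS & normalized) >= 2
```

A document is kept only if every rule passes, so a single failing signal is enough to drop it. Whether that all-or-nothing combination is right for a given data mix is exactly the kind of question that is still open.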
I am eager to know the reference thresholds of all signals too!
After all quality signals are generated, what thresholds are used to classify a document as good or bad for each quality signal?