Open tsaglam opened 1 year ago
As discussed: A metric regarding the number of tokens in submissions, marking outliers, would be interesting for the report viewer.
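One way such outlier marking could work is sketched below. It assumes the common 1.5 × IQR rule, which is not something decided in this issue, and the names `token_count_outliers` and `token_counts` are hypothetical rather than part of the existing code base.

```python
from statistics import quantiles


def token_count_outliers(token_counts, k=1.5):
    """Return the names of submissions whose token count lies outside
    the interquartile-range fences (assumed rule: k * IQR, with k = 1.5)."""
    if len(token_counts) < 4:
        # Too few submissions for meaningful quartiles.
        return []
    q1, _, q3 = quantiles(list(token_counts.values()), n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sorted(
        name for name, count in token_counts.items()
        if count < lower or count > upper
    )


# Hypothetical token counts per submission.
counts = {"s1": 100, "s2": 110, "s3": 95, "s4": 105, "s5": 102, "s6": 900}
outliers = token_count_outliers(counts)
```

The report viewer could then highlight the returned submissions, e.g. to spot empty or unusually large submissions.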
- Symmetric similarity: $\frac{2m}{a+b}$ (better name required)
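Note: $\frac{2m}{a+b}$ is the Sørensen-Dice coefficient rather than the Jaccard index. The Jaccard index (also known as the Jaccard similarity coefficient, or intersection over union) would be $\frac{m}{a+b-m}$; the two are monotonically related but not equal.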
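For comparison, here is a minimal sketch that computes the symmetric similarity $\frac{2m}{a+b}$ next to an intersection-over-union style value $\frac{m}{a+b-m}$. Here $a$ and $b$ are the token counts of the two submissions and $m$ is the number of matched tokens. The function names are hypothetical, and treating matched tokens as a set intersection is an assumption made only for illustration.

```python
def symmetric_similarity(a, b, m):
    """2m / (a + b): matched tokens relative to the mean submission length."""
    total = a + b
    return 2 * m / total if total else 0.0


def intersection_over_union(a, b, m):
    """m / (a + b - m): assumes matched tokens play the role of the intersection."""
    union = a + b - m
    return m / union if union else 0.0


# Example values chosen for illustration only.
a, b, m = 100, 200, 60
sym = symmetric_similarity(a, b, m)      # 120 / 300
iou = intersection_over_union(a, b, m)   # 60 / 240
```

Both values rank submission pairs in the same order, but the absolute numbers differ, which matters if thresholds are shown in the report viewer.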
Also, it would be interesting to see more similarities that take into account the length of the matches (apart from "Longest Match") and maybe even the full sample of matches in the whole set of submissions.
One idea: "Longest Match" and "Overlap" both belong to a family of related similarities. Let $t$ be the vector of match lengths (in tokens), so that $\|t\|_1 = m$. Every vector norm applied to $t$ induces a similarity: the max norm induces "Longest Match" and the sum norm (1-norm) induces "Overlap". p-norms with $p > 1$ might be particularly interesting because they would assign higher similarities to submissions with a few long matches than to submissions with many short matches. This might allow choosing smaller minimum match lengths.
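The norm idea can be illustrated with a small sketch. It assumes the match lengths of one comparison are available as a plain list of token counts; the function name `norm_similarity` and the example data are hypothetical, and how the result would be normalized is left open here.

```python
import math


def norm_similarity(match_lengths, p):
    """p-norm of the vector of match lengths (in tokens).

    p = 1 yields the total number of matched tokens ("Overlap"),
    p = math.inf yields the longest match ("Longest Match").
    """
    if not match_lengths:
        return 0.0
    if math.isinf(p):
        return float(max(match_lengths))
    return sum(length ** p for length in match_lengths) ** (1.0 / p)


# Two hypothetical comparisons with the same number of matched tokens (60).
many_short = [10] * 6
few_long = [30, 30]

overlap_short = norm_similarity(many_short, 1)
overlap_long = norm_similarity(few_long, 1)
euclid_short = norm_similarity(many_short, 2)
euclid_long = norm_similarity(few_long, 2)
longest_short = norm_similarity(many_short, math.inf)
longest_long = norm_similarity(few_long, math.inf)
```

With $p = 1$ both comparisons score the same, while $p = 2$ and the max norm rank the comparison with fewer, longer matches higher.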
Let there be two submissions $F_a$ and $F_b$ with token lengths (number of tokens per submission) $a$ and $b$ as well as matched tokens $m$.
Currently, we only support two similarity metrics:
We could add different metrics:
To implement this, both the core and the report viewer need to support the new metrics. As part of this issue, we could also revisit where the similarity metric is currently used and where it is not.