Add more similarity metrics

tsaglam commented 1 year ago

Let there be two submissions $F_a$ and $F_b$ with token lengths (number of tokens per submission) $a$ and $b$ as well as matched tokens $m$.

Currently, we only support two similarity metrics:

average similarity: $\frac{\frac{m}{a} + \frac{m}{b}}{2}$
maximum similarity: $\max(\frac{m}{a}, \frac{m}{b})$ (good if one file is shorter)
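
A minimal sketch of the two existing metrics, using only the quantities defined above. The function names are illustrative and are not JPlag's actual API, and both submissions are assumed to be non-empty ($a, b > 0$).

```python
def average_similarity(m: int, a: int, b: int) -> float:
    """Mean of the two coverage ratios m/a and m/b (assumes a, b > 0)."""
    return (m / a + m / b) / 2


def maximum_similarity(m: int, a: int, b: int) -> float:
    """Larger of the two coverage ratios, i.e. the one of the shorter submission."""
    return max(m / a, m / b)


# Hypothetical pair: 50 matched tokens, submissions of 100 and 200 tokens.
print(average_similarity(50, 100, 200))  # 0.375
print(maximum_similarity(50, 100, 200))  # 0.5
```

The example shows why the maximum helps when one submission is shorter: the longer submission pulls the average down to 0.375, while the maximum still reports that half of the shorter submission is matched.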

We could add different metrics:

Symmetric similarity: $\frac{2m}{a + b}$ (better name required)
Overlap: $m$ (good if both students add junk files to their submissions)
Longest Match
Overall submission length
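
To make the first three proposals concrete, here is a rough sketch. It assumes the matches of a comparison are available as a list of token lengths (`match_lengths` is a hypothetical name), so that $m$ is their sum. "Overall submission length" is left out because the issue does not define how it would be computed.

```python
from typing import Sequence


def symmetric_similarity(m: int, a: int, b: int) -> float:
    """2m / (a + b); assumes at least one submission is non-empty."""
    return 2 * m / (a + b)


def overlap(match_lengths: Sequence[int]) -> int:
    """Total number of matched tokens m, not normalized by submission length."""
    return sum(match_lengths)


def longest_match(match_lengths: Sequence[int]) -> int:
    """Token length of the single longest match (0 if there are no matches)."""
    return max(match_lengths, default=0)


# Hypothetical comparison: three matches between submissions of 100 and 200 tokens.
matches = [30, 15, 5]
m = overlap(matches)
print(m)                                   # 50
print(longest_match(matches))              # 30
print(symmetric_similarity(m, 100, 200))   # one third
```

Note that "Overlap" and "Longest Match" are absolute token counts rather than values in $[0, 1]$, which the report viewer would need to handle when displaying them.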

To implement this, both the core and the report viewer need to support the new metrics. As part of this issue, we could also revisit where the similarity metric is currently used and where it is not.

uuqjz commented 1 year ago

As discussed: a metric based on the number of tokens per submission that marks outliers would be interesting for the report viewer.
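
One possible way to mark such outliers, purely as an illustration: the comment does not say how outliers should be detected, so the modified z-score based on the median absolute deviation (MAD), the threshold of 3.5, and all names below are assumptions.

```python
from statistics import median
from typing import List, Mapping


def token_count_outliers(token_counts: Mapping[str, int],
                         threshold: float = 3.5) -> List[str]:
    """Return submissions whose token count deviates strongly from the median.

    Uses the modified z-score 0.6745 * |x - median| / MAD (an assumed choice).
    """
    counts = list(token_counts.values())
    center = median(counts)
    mad = median([abs(c - center) for c in counts])
    if mad == 0:
        # At least half of the counts equal the median; the score is undefined,
        # so this sketch conservatively reports no outliers.
        return []
    return [name for name, count in token_counts.items()
            if 0.6745 * abs(count - center) / mad > threshold]


# Hypothetical token counts per submission.
counts = {"s1": 480, "s2": 510, "s3": 495, "s4": 505, "s5": 2000}
print(token_count_outliers(counts))  # ['s5']
```

A median-based rule is used here because a single very long submission would distort a mean and standard deviation computed over a small course.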

SimDing commented 10 months ago

> Symmetric similarity: $\frac{2m}{a+b}$ (better name required)

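With $m$ read as the size of the intersection of the two submissions, this is the Sørensen-Dice coefficient. It is closely related to, but not the same as, the Jaccard index (a.k.a. the Jaccard similarity coefficient, or intersection over union), which would be $\frac{m}{a+b-m}$. The two are monotonically related ($J = \frac{D}{2-D}$), so they rank submission pairs identically.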

Also, it would be interesting to see more similarities that take into account the length of the matches (apart from "Longest Match") and maybe even the full sample of matches in the whole set of submissions.

One idea: "Longest Match" and "Overlap" both belong to a family of related similarities. Let $t$ be the vector of the token lengths of the individual matches, so that $\|t\|_1 = m$. Every vector norm applied to $t$ induces a similarity: the max norm induces "Longest Match" and the sum norm (1-norm) induces "Overlap". $p$-norms with $p > 1$ might be particularly interesting because, for the same total $m$, they assign higher similarities to submissions with a few long matches than to submissions with many short matches. This might allow choosing a smaller minimum match length.
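
The norm family can be sketched as follows. This assumes the match lengths are available as a list, the function name is hypothetical, and the result is a raw norm in tokens; how to normalize it into a similarity in $[0, 1]$ is not settled in this discussion.

```python
from typing import Sequence


def match_norm(match_lengths: Sequence[int], p: float) -> float:
    """p-norm of the match-length vector t, for p >= 1.

    p = 1 gives "Overlap" (sum norm); p = float("inf") gives "Longest Match".
    """
    if not match_lengths:
        return 0.0
    if p == float("inf"):
        return float(max(match_lengths))
    return sum(length ** p for length in match_lengths) ** (1 / p)


# Two hypothetical comparisons with the same number of matched tokens, m = 40.
one_long = [40]
many_short = [10, 10, 10, 10]

for p in (1, 2, float("inf")):
    print(p, match_norm(one_long, p), match_norm(many_short, p))
# 1   40.0 40.0
# 2   40.0 20.0
# inf 40.0 10.0
```

With $p = 1$ the two comparisons are indistinguishable, while any $p > 1$ ranks the single long match higher, which is the behavior described above.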

jplag / JPlag

Add more similarity metrics #1134