jplag / JPlag

State-of-the-Art Software Plagiarism & Collusion Detection
https://jplag.github.io/JPlag/
GNU General Public License v3.0

Add more similarity metrics #1134

Open tsaglam opened 1 year ago

tsaglam commented 1 year ago

Let there be two submissions $F_a$ and $F_b$ with token lengths (number of tokens per submission) $a$ and $b$ as well as matched tokens $m$.

Currently, we only support two similarity metrics:

We could add different metrics:

To implement this, both the core and the report viewer need to support the new metrics. With this issue, we could also revisit the similarity metric itself: where it is used and where it is not.

uuqjz commented 1 year ago

As discussed: a metric based on the number of tokens per submission that marks outliers would be interesting for the report viewer.

SimDing commented 10 months ago

  • Symmetric similarity: $\frac{2m}{a+b}$ (better name required)

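If $m$ is read as the size of the intersection of two token sets, this is the Sørensen-Dice coefficient $D$. It is closely related to the Jaccard index $J$ (aka the Jaccard similarity coefficient, or intersection over union), which would be $\frac{m}{a+b-m}$: since $J = \frac{D}{2-D}$, both rank submission pairs in the same order.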
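To see how the symmetric formula relates to intersection over union numerically, here is a minimal Java sketch. It assumes $a$, $b$, and $m$ are plain token counts with $m \le \min(a, b)$, and it treats $m$ as an intersection size, which is a simplification for token sequences. The class and method names are hypothetical and not part of the JPlag API.

```java
// Sketch only: names are hypothetical, not JPlag API.
final class SymmetricSimilaritySketch {
    private SymmetricSimilaritySketch() {
    }

    /** Symmetric similarity from the issue: 2m / (a + b). */
    static double symmetric(int a, int b, int m) {
        check(a, b, m);
        // Two empty submissions: define the similarity as 0 to avoid 0/0.
        return (a + b == 0) ? 0.0 : 2.0 * m / (a + b);
    }

    /** Intersection over union under the set reading: m / (a + b - m). */
    static double intersectionOverUnion(int a, int b, int m) {
        check(a, b, m);
        int union = a + b - m;
        return (union == 0) ? 0.0 : (double) m / union;
    }

    // Assumption: matched tokens never exceed the shorter submission.
    private static void check(int a, int b, int m) {
        if (a < 0 || b < 0 || m < 0 || m > Math.min(a, b)) {
            throw new IllegalArgumentException("expected 0 <= m <= min(a, b)");
        }
    }
}
```

For $a = 100$, $b = 60$, $m = 40$, `symmetric` gives 0.5 and `intersectionOverUnion` gives about 0.333, which matches $0.5 / (2 - 0.5)$.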

Also, it would be interesting to see more similarities that take into account the length of the matches (apart from "Longest Match") and maybe even the full sample of matches in the whole set of submissions.

One idea: "Longest Match" and "Overlap" both belong to a family of related similarities. Let $t$ be the vector of match lengths in tokens, so that $\|t\|_1 = m$. Every vector norm applied to $t$ induces a similarity: the max norm $\|t\|_\infty$ induces "Longest Match" and the sum norm $\|t\|_1$ induces "Overlap". The p-norms with $p > 1$ might be particularly interesting because, for the same total $m$, they would assign higher similarities to submissions with a few long matches than to submissions with many short matches. This might allow choosing smaller minimum match sizes.
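The norm family can be sketched in a few lines of Java. This is an illustration under assumptions, not JPlag code: the class and method names are hypothetical, match lengths are taken as non-negative token counts, and the result is the raw (unnormalized) norm value.

```java
// Sketch only: names are hypothetical, not JPlag API.
final class MatchNormSketch {
    private MatchNormSketch() {
    }

    /**
     * p-norm of the match-length vector t.
     * p = 1 gives the sum of match lengths ("Overlap"),
     * p = Double.POSITIVE_INFINITY gives the maximum ("Longest Match").
     */
    static double norm(int[] matchLengths, double p) {
        if (Double.isNaN(p) || p < 1.0) {
            throw new IllegalArgumentException("p must be >= 1");
        }
        double result = 0.0;
        if (Double.isInfinite(p)) {
            // Max norm: length of the longest match.
            for (int length : matchLengths) {
                result = Math.max(result, length);
            }
            return result;
        }
        // General case: (sum of length^p)^(1/p).
        for (int length : matchLengths) {
            result += Math.pow(length, p);
        }
        return Math.pow(result, 1.0 / p);
    }
}
```

With four matches of 10 tokens versus a single match of 40 tokens, both have a sum norm of 40, but the 2-norm is 20 for the four short matches and 40 for the one long match, so the long match is weighted more heavily.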