Closed by sebinside 2 years ago
Nearly done, I only need the markdown code from https://github.com/SimDing/JPlag/issues/7, as we do not have edit rights on this fork. Quote reply gives me the markdown, but the table is broken. We should copy it before the release.
@tsaglam this should be the code :) gh api /repos/SimDing/JPlag/issues/7
By default, JPlag is configured to perform a clustering of the submissions. The clustering partitions the set of submissions into groups of similar submissions. The resulting clusters can be used as candidates for potentially colluding groups. Each cluster has a strength score that measures how suspicious the cluster is compared to other clusters.
Clustering can take a long time when there is a large number of submissions. Users who are not interested in the clustering can safely disable it:
On the CLI: the `--cluster-skip` option
Programmatically:
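// Analyze the submissions below the given root directory as Java code.
JPlagOptions options = new JPlagOptions("/path/to/rootDir", LanguageOption.JAVA);
// Switch the clustering step off entirely.
options.setClusteringOptions(new ClusteringOptions.Builder().enabled(false).build());
JPlag jplag = new JPlag(options);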
Clustering can be configured either using the CLI options or programmatically using the `ClusteringOptions` class. Both approaches work analogously and share the same default values.
The clustering is designed to work out of the box for roughly 50 to 500 submissions, but it can be tweaked when problems occur. For more submissions it might be necessary to increase `Max-Runs` or `Bandwidth` so that an appropriate number of clusters can be determined.
Group | Option | Description | Default |
---|---|---|---|
General | Enable | Controls whether the clustering is run at all. | true |
General | Algorithm | Which clustering algorithm to use. | Spectral Clustering |
General | Metric | The similarity score between submissions to use during clustering. Each score is expressed in terms of the size of the submissions A and B and the size of their matched intersection A ∩ B. | MAX |
Spectral | Bandwidth | For Spectral Clustering, Bayesian Optimization is used to determine a fitting number of clusters. If a good clustering result is found during the search, numbers of clusters that differ from it by an amount within the range of the bandwidth are also expected to be good. Low values result in more exploration of the search space, high values in more exploitation of known results. | 20.0 |
Spectral | Noise | The result of each k-Means run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant. | 0.0025 |
Spectral | Min-Runs | Minimum number of k-Means executions for spectral clustering. These initial runs explore different clustering sizes. | 5 |
Spectral | Max-Runs | Maximum number of k-Means executions during spectral clustering. Any execution after the initial (min-) runs tries to balance exploration of unknown clustering sizes against exploitation of clustering sizes known to be good. | 50 |
Spectral | K-Means Iterations | Maximum number of iterations during each execution of the k-Means algorithm. | 200 |
Agglomerative | Threshold | Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. | 0.2 |
Agglomerative | inter-cluster-similarity | How to measure the similarity of two clusters during agglomerative clustering. | AVERAGE |
Preprocessing | Pre-Processor | How the similarities are preprocessed prior to clustering. Spectral Clustering will probably not produce good results without it. | CDF |
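To make the Metric row more concrete, here is a minimal sketch of how scores could be derived from the sizes of A, B and A ∩ B. The class name, the method names and the exact formulas are assumptions for illustration only; the table itself only names `MAX` as the default and does not spell out the formulas.

```java
// Illustrative only: class, method names and formulas are assumptions,
// not taken from the JPlag code base.
final class SimilarityMetricSketch {

    /** Assumed AVG: matched size relative to the mean size of both submissions. */
    static double avg(int sizeA, int sizeB, int matched) {
        return 2.0 * matched / (sizeA + sizeB);
    }

    /** Assumed MIN: matched size relative to the larger submission. */
    static double min(int sizeA, int sizeB, int matched) {
        return (double) matched / Math.max(sizeA, sizeB);
    }

    /** Assumed MAX: matched size relative to the smaller submission. */
    static double max(int sizeA, int sizeB, int matched) {
        return (double) matched / Math.min(sizeA, sizeB);
    }

    public static void main(String[] args) {
        int sizeA = 100;  // size of submission A
        int sizeB = 40;   // size of submission B
        int matched = 30; // size of the matched intersection
        System.out.printf("AVG=%.3f MIN=%.3f MAX=%.3f%n",
                avg(sizeA, sizeB, matched),
                min(sizeA, sizeB, matched),
                max(sizeA, sizeB, matched));
    }
}
```

Under these assumed definitions, a MAX-style score stays high when a short submission is almost fully contained in a longer one, which would be one plausible reason to prefer it as a default for clustering.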
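The interplay of the agglomerative Threshold and the AVERAGE inter-cluster-similarity can be sketched as follows. This is a simplified, self-contained illustration of threshold-bounded agglomerative merging, not JPlag's implementation; the class and method names are hypothetical, and the similarity matrix is made-up example data.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a naive agglomerative clustering with average linkage.
final class AgglomerativeSketch {

    /** Average pairwise similarity between the members of two clusters. */
    static double averageSimilarity(List<Integer> a, List<Integer> b, double[][] sim) {
        double sum = 0;
        for (int i : a) {
            for (int j : b) {
                sum += sim[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }

    /** Repeatedly merges the most similar pair of clusters while it exceeds the threshold. */
    static List<List<Integer>> cluster(double[][] sim, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i); // every submission starts in its own cluster
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestA = -1;
            int bestB = -1;
            double best = threshold; // only similarities strictly above the threshold qualify
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageSimilarity(clusters.get(a), clusters.get(b), sim);
                    if (s > best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (bestA < 0) {
                break; // no remaining pair is similar enough to merge
            }
            // bestB > bestA, so removing bestB does not shift the index of bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up similarities: submissions 0/1 and 2/3 resemble each other.
        double[][] sim = {
            {1.0, 0.9, 0.1, 0.1},
            {0.9, 1.0, 0.1, 0.1},
            {0.1, 0.1, 1.0, 0.8},
            {0.1, 0.1, 0.8, 1.0}
        };
        System.out.println(cluster(sim, 0.2));
    }
}
```

In this sketch a higher threshold leaves more submissions unmerged, while a lower threshold lets loosely related clusters merge into larger groups.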
✅ Documentation incorporated into the wiki and the repo!
The JPlag Documentation and README should contain the new information from the PRs #287 and #281.