Check for Duplicate Submissions

mxhdev commented 8 years ago

Introduction

After checking all submissions, the algorithm should compare the comments of all submissions in order to find duplicate submissions.

General Process

Parse all comments (either per-task tag or for the complete file)
Build all pairs of submissions (s_i, s_j) such that i != j
Compare the pairs of submissions and calculate a similarity score.
Write the pairs with their similarity score to a file. This file may contain the similarity for all pairs, ordered by the similarity score in descending order.
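
The steps above could be sketched roughly as follows. This is only an illustration: `similarity`, `rank_pairs` and `write_report` are hypothetical names, `difflib.SequenceMatcher` is a stand-in for whichever score is chosen in the end, and unordered pairs are used on the assumption that the score is symmetric.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Placeholder score in [0, 1]; any string similarity measure could go here."""
    return SequenceMatcher(None, a, b).ratio()


def rank_pairs(comments):
    """Score every unordered pair of submissions, most similar first.

    `comments` maps a submission name to its extracted comment text.
    """
    scored = [
        (name_a, name_b, similarity(comments[name_a], comments[name_b]))
        for name_a, name_b in combinations(sorted(comments), 2)
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


def write_report(rows, path):
    """Write one tab-separated line per pair: name, name, score."""
    with open(path, "w", encoding="utf-8") as handle:
        for name_a, name_b, score in rows:
            handle.write(f"{name_a}\t{name_b}\t{score:.3f}\n")
```

With n submissions this compares n * (n - 1) / 2 pairs, which should be fine for a single course but grows quadratically.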

Problems / Things to Consider

- In a naive implementation, submissions without any comments might be seen as equal, even though their SQL statements could be completely different.
  Solution 1: Encourage the students to write comments.
  Solution 2: Account for this in the score calculation.
  Solution 3: Let the algorithm also check for submissions which don't contain any comments and report those submissions as invalid.
- No matter which library is used, there will be several algorithms for calculating a similarity score for two strings. A decision has to be made on which algorithm should be used.
  Solution 1: Perform a small benchmark and define the desired score for some sample strings. Choose the algorithm which comes closest to this scoring.
  Solution 2: Allow the user to define which algorithm should be used. The program should also set a reasonable default value.
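
A minimal sketch of how "report comment-less submissions as invalid" and "let the user pick the algorithm, with a default" might fit together. All names here (`ALGORITHMS`, `DEFAULT_ALGORITHM`, `split_by_comments`, `get_algorithm`) are hypothetical, and the two scoring functions are only examples, not a recommendation.

```python
from difflib import SequenceMatcher


def ratio_similarity(a, b):
    """Character-based score from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard_similarity(a, b):
    """Share of distinct words that both texts have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # nothing to compare, so do not claim a match
    return len(words_a & words_b) / len(words_a | words_b)


# Assumed registry of selectable algorithms and an assumed default.
ALGORITHMS = {"ratio": ratio_similarity, "jaccard": jaccard_similarity}
DEFAULT_ALGORITHM = "ratio"


def split_by_comments(comments):
    """Separate submissions that contain comments from those that do not."""
    valid = {name: text for name, text in comments.items() if text.strip()}
    invalid = sorted(name for name, text in comments.items() if not text.strip())
    return valid, invalid


def get_algorithm(name=None):
    """Return the requested scoring function, or the default if none is given."""
    key = name or DEFAULT_ALGORITHM
    if key not in ALGORITHMS:
        raise ValueError(f"unknown algorithm {key!r}, expected one of {sorted(ALGORITHMS)}")
    return ALGORITHMS[key]
```

Filtering out the empty submissions before building pairs keeps them from showing up as false duplicates of each other.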

mxhdev commented 8 years ago

Regular expressions might be useful for finding all the comments

One solution (does not work for "#" comments): http://stackoverflow.com/questions/21017075/regex-to-find-sql-comments
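
As a rough illustration (not the pattern from the linked answer), a regular expression that also covers "#" comments could look like the sketch below. `COMMENT_RE` and `extract_comments` are hypothetical names. The pattern assumes that comment markers never occur inside string literals; a marker inside a quoted string would be wrongly picked up as a comment.

```python
import re

# "-- ..." and "# ..." run to the end of the line, "/* ... */" may span lines.
COMMENT_RE = re.compile(r"--[^\n]*|#[^\n]*|/\*.*?\*/", re.DOTALL)


def extract_comments(sql):
    """Return the text of every comment in `sql`, without the comment markers."""
    comments = []
    for match in COMMENT_RE.finditer(sql):
        text = match.group(0)
        if text.startswith("/*"):
            text = text[2:-2]  # drop the opening and closing block markers
        elif text.startswith("--"):
            text = text[2:]
        else:
            text = text[1:]  # "#" comment
        comments.append(text.strip())
    return comments
```

For example:

```python
sql = "-- task 1\nSELECT 1; # inline\n/* multi\nline */ SELECT 2;"
print(extract_comments(sql))  # ['task 1', 'inline', 'multi\nline']
```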

timoei commented 8 years ago

Functionality was added with commit 3d9778778deeb776f1701c65c7921114e1be819d. Tasks without comments are documented as "no comment found for [Submission]" in the report. The best algorithm for calculating the similarity score still has to be determined. At the moment, Levenshtein distance is implemented.
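
For reference, a textbook Levenshtein distance and one common way to turn it into a score between 0 and 1 are sketched below. This is not the code from the commit; the function names and the normalization by the longer string's length are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Two empty strings count as identical here, which is exactly why
        # submissions without comments need to be handled separately.
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```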

timoei commented 8 years ago

Changed to cosine similarity with commit f7a043b722573dfd358004b38fe6a4cfc4a18983.
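
To illustrate what a cosine similarity over comments can look like, here is a small sketch based on word counts. It is not taken from the commit; `tokenize`, `cosine_similarity` and the choice of lowercased word tokens as vector dimensions are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so report no similarity
    return dot / (norm_a * norm_b)
```

Unlike Levenshtein distance, a word-count version like this ignores word order, so reordered but otherwise identical comments still score close to 1.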

mxhdev / SQLChecker

Check for Duplicate Submissions #6
