Text plagiarism checker

Project description

Plagiarism is when you rip off someone else's ideas or work and pretend it's your own. The previous sentence is an example of plagiarism.

Detecting plagiarism isn’t easy. Here are some problems:

Just because some text matches another source doesn’t mean it has been plagiarized. The text could be enclosed in quotation marks, or it could be in a quote block without quotation marks. In a DOCX/ODT document, strings inside quotation marks (single or double) can be safely ignored.
There is no clear line regarding what should be considered plagiarism. A match of a single word or two consecutive words is unlikely to be plagiarism, and even matches of three or more consecutive words can be very common. Take idioms, or large numbers written out in words.
Should there be a measure of the proportion of plagiarism? Imagine someone wrote a 100-page document, and 99% of it was found not to be plagiarized, but there was a positive detection on page 29. Simply saying the document was plagiarized or contains plagiarism wouldn’t capture the full picture. That’s why companies like Unicheck devised their own formulas that give an overall score.
There are a variety of methods for getting around a plagiarism detector: inserting quotation marks at the start and end of the document and coloring them white, so they are invisible to readers but visible to the detector, which then skips checking the whole document; rasterizing a DOCX/ODT document; replacing Latin characters with Cyrillic look-alikes or special characters; inserting random white-colored (hence invisible) letters inside words; the list goes on and on. An accurate plagiarism detector has to take these circumvention methods and others into account. What should happen, though? The best approach is a report section that details where these manipulations occur. That way, even if the plagiarism itself cannot be detected, the white-on-white letters, Cyrillic characters, etc. will be, and the person checking will notice their presence and wonder why. The same goes for a document that has a large proportion of its text enclosed in quotes. (Edit-distance measures such as Levenshtein distance can help here, since they show when a tampered string is only a few character edits away from a known one.)
PDFs are a pain. There are ways to detect plagiarism in text embedded in PDFs, but if an open-source plagiarism detector can work on just TXT/DOCX/ODT files, I think that’s pretty darn good. PDFs are ubiquitous, but they are very outdated in a world of responsive design.
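
To make the matching and scoring problems above a bit more concrete, here is a minimal sketch in Python. It is only one possible approach, not a finished design. It drops text inside double quotation marks, compares runs of n consecutive words, and reports what fraction of the suspect text's runs also appear in a source. The function names, the default of n=4, and the decision to handle only double quotes (single quotes collide with apostrophes) are all assumptions made for illustration.

```python
import re

# Assumption: only straight and curly double quotes delimit quotations.
# Single quotes are skipped here because they collide with apostrophes.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def words(text):
    """Drop quoted passages, lowercase the rest, and split it into words."""
    unquoted = QUOTED.sub(" ", text)
    return re.findall(r"\w+", unquoted.lower())


def shingles(tokens, n):
    """Return the set of all runs of n consecutive words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(suspect, source, n=4):
    """Fraction of the suspect's n-word runs that also occur in the source."""
    suspect_runs = shingles(words(suspect), n)
    if not suspect_runs:
        # Text shorter than n words (or fully quoted) has nothing to compare.
        return 0.0
    source_runs = shingles(words(source), n)
    return len(suspect_runs & source_runs) / len(suspect_runs)
```

Raising n makes common idioms less likely to count as matches, at the cost of missing short copied phrases. A real checker would also want to report how much of the text was skipped as quoted, since a large quoted proportion is itself suspicious.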

Relevant Technology

I tried searching for programming languages that are better suited to something like plagiarism detection. I keep seeing Java. However, this is probably meaningless: any other language, from Python to Go, can work.
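
As a rough illustration that nothing exotic is needed, the sketch below uses only the Python standard library to flag words that mix scripts (for example a Cyrillic letter hidden in a Latin word) and to compute Levenshtein distance. The function names are hypothetical, and the script check assumes the first word of a character's Unicode name identifies its script. Detecting white-on-white text is not shown, since that requires reading formatting out of the DOCX/ODT file.

```python
import re
import unicodedata


def script_of(char):
    """Rough script label taken from the first word of the Unicode name."""
    # Assumption: names look like "LATIN SMALL LETTER A" or
    # "CYRILLIC SMALL LETTER A", so the first word identifies the script.
    return unicodedata.name(char, "UNKNOWN").split()[0]


def mixed_script_words(text):
    """Return words whose letters come from more than one script."""
    flagged = []
    for word in re.findall(r"[^\W\d_]+", text):  # runs of letters only
        if len({script_of(c) for c in word}) > 1:
            flagged.append(word)
    return flagged


def levenshtein(a, b):
    """Edit distance: insertions, deletions, substitutions, each costing 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]
```

The output of `mixed_script_words` could feed the manipulation report directly, while a small `levenshtein` distance between a suspect word and a dictionary word hints that characters were swapped rather than the word being genuinely different.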

Complexity and required time

Complexity

[ ] Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
[ ] Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
[x] Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)

[ ] Little work - A couple of days
[ ] Medium work - A week or two
[x] Much work - The project will take more than a couple of weeks and serious planning is required

Text plagiarism checker #323