open-source-ideas / ideas

💡 Looking for inspiration for your next open source project? Or perhaps you've got a brilliant idea you can't wait to share with others? Open Source Ideas is a community built specifically for this! 👋

Text plagiarism checker #323

Open Mennaruuk opened 2 years ago

Mennaruuk commented 2 years ago

Project description

Plagiarism is when you rip off someone else's ideas or work and pretend it's your own. The previous sentence is an example of plagiarism.

Detecting plagiarism isn’t easy. Here are some problems:

  1. Just because some text matches another doesn’t mean the text has been plagiarized. The text could be enclosed by quotation marks, or it could be in a quote block without quotation marks. In a DOCX/ODT document, strings inside quotation marks (single or double) can be safely ignored.

  2. There is no clear line regarding what should be considered plagiarism. A match of a single word or two consecutive words is unlikely to be plagiarism, and even runs of three or more identical words can be very common. Take the example of idioms or large numbers in word form.

  3. Should there be a measure of the proportion of plagiarism? Imagine someone wrote a 100-page document, and 99% of it was detected as not plagiarized, but there was a positive detection on page 29. Simply saying the document was plagiarized or contains plagiarism wouldn’t capture the full picture. That’s why companies like Unicheck devised their own formulas that give an overall score.

  4. There are many ways to get around a plagiarism detector: inserting quotation marks at the start and end of the document and coloring them white, so they are invisible to the reader but still seen by the detector, which then skips the whole document; rasterizing the DOCX/ODT; replacing Latin characters with Cyrillic look-alikes or special characters; inserting random white-colored (hence invisible) letters inside words; and the list goes on. An accurate plagiarism detector has to take these and other circumvention methods into account. What should happen, though? The best approach is a report section that details where these manipulations occur. That way, even if the plagiarism itself cannot be detected, the white-on-white letters, Cyrillic characters, etc. will be, and the person checking will notice them and wonder why. The same goes for a document that has a large proportion of its text enclosed in quotes. (Edit-distance measures such as Levenshtein distance may help with matching words that have had characters swapped or inserted, since the altered word stays close to the original.)

  5. PDFs are a pain. There may be ways to detect plagiarism in text embedded in PDFs, but if an open-source plagiarism detector can work on just TXT/DOCX/ODT files, I think that’s pretty darn good. PDFs are ubiquitous, but the format is very outdated in a world of responsive design.
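To make problems 2 and 3 a bit more concrete, here is a rough sketch of one possible scoring approach: compare runs of n consecutive words (word n-grams) and report the fraction of the submission's n-grams that also appear in a source. The function names and the default of n=5 are assumptions for illustration only, not how Unicheck or any existing tool works.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(submission, source, n=5):
    """Fraction of the submission's n-grams that also occur in the source."""
    sub = word_ngrams(submission, n)
    if not sub:
        return 0.0
    return len(sub & word_ngrams(source, n)) / len(sub)


if __name__ == "__main__":
    submission = "the quick brown fox jumps over the lazy dog today"
    source = "the quick brown fox jumps over a fence"
    print(f"{overlap_score(submission, source):.2f}")
```

A larger n ignores short common phrases such as idioms, at the cost of missing short copied fragments. The score is a proportion rather than a yes/no verdict, so a single copied passage in a long document shows up as a small number instead of a blanket "plagiarized".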
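For the Cyrillic-for-Latin trick in problem 4, a minimal sketch of the "report" idea could flag words whose letters come from more than one script. This assumes the first word of a character's Unicode name (for example LATIN or CYRILLIC) is a good enough stand-in for its script, which is only a heuristic; the function names are hypothetical.

```python
import re
import unicodedata


def scripts_in(word):
    """Return the set of script labels used by the letters in word.

    Assumption: the first word of the Unicode character name
    (e.g. 'LATIN', 'CYRILLIC') approximates the script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixed_script_words(text):
    """Return the words in text whose letters span more than one script."""
    return [w for w in re.findall(r"\w+", text) if len(scripts_in(w)) > 1]


if __name__ == "__main__":
    # The 'a' in the last word is U+0430 (Cyrillic), not the Latin 'a'.
    suspicious = "This is pl\u0430giarism"
    for word in mixed_script_words(suspicious):
        print(repr(word), sorted(scripts_in(word)))
```

A word written entirely in Cyrillic is not flagged, only words that mix scripts, so legitimate non-English text would not trigger the report. White-on-white text would need a separate check against the document's formatting, which this sketch does not cover.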

Relevant Technology

I tried searching for programming languages that are a better fit for something like plagiarism detection. I keep seeing Java. However, this is probably meaningless: any other language, from Python to Go, can work.

ZigRazor commented 2 years ago

Good idea. I can offer my support: I know C++ and Python well, and I also know PyTorch, which can be useful for machine learning tasks. If you want to start the project, take me into consideration! You can contact me at zigrazor@gmail.com