jplag / JPlag

State-of-the-Art Software Plagiarism & Collusion Detection
https://jplag.github.io/JPlag/
GNU General Public License v3.0

ArrayIndexOutOfBoundsException within C Parser #1679

Open ShreyasKallingal opened 3 months ago

ShreyasKallingal commented 3 months ago

Hi, I'm trying to run JPlag 5.0.0 on a set of C submissions. Parsing gets about halfway through and then fails with an ArrayIndexOutOfBoundsException in the underlying JavaCC-generated parser. I wanted to debug further and isolate the submission that triggers it, but the progress bar lags behind and I could not figure out how to view trace-level logs in the terminal. Any pointers? Thanks for the help!

Command: java -jar lib/jplag-5.0.0-jar-with-dependencies.jar -l c -r results -bc skeleton -d submissions

Output:

Loading Submissions   100% [======================================================================] 409/409 (0:00:00 / 0:00:00) 
2024-03-30-01:53:04_401 [INFO] SubmissionSetBuilder - Basecode directory "skeleton" will be used.
Parsing Submissions    52% [====================================                                  ] 216/409 (0:00:19 / 0:00:16) 
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 4096 out of bounds for length 4096
        at de.jplag.c.JavaCharStream.readByte(JavaCharStream.java:69)
        at de.jplag.c.JavaCharStream.readChar(JavaCharStream.java:112)
        at de.jplag.c.CPPScannerTokenManager.getNextToken(CPPScannerTokenManager.java:1688)
        at de.jplag.c.CPPScanner.jj_ntk_f(CPPScanner.java:1485)
        at de.jplag.c.CPPScanner.scan(CPPScanner.java:40)
        at de.jplag.c.CPPScanner.scanFile(CPPScanner.java:27)
        at de.jplag.c.Scanner.scan(Scanner.java:29)
        at de.jplag.c.CLanguage.parse(CLanguage.java:45)
        at de.jplag.Submission.parse(Submission.java:255)
        at de.jplag.SubmissionSet.parseSubmissions(SubmissionSet.java:159)
        at de.jplag.SubmissionSet.parseAllSubmissions(SubmissionSet.java:111)
        at de.jplag.SubmissionSet.<init>(SubmissionSet.java:49)
        at de.jplag.SubmissionSetBuilder.buildSubmissionSet(SubmissionSetBuilder.java:102)
        at de.jplag.JPlag.run(JPlag.java:73)
        at de.jplag.cli.CLI.runJPlag(CLI.java:132)
        at de.jplag.cli.CLI.main(CLI.java:90)

ShreyasKallingal commented 3 months ago

I've narrowed it down to large multi-line comments. Not quite sure if it's a JPlag bug or something upstream in JavaCC. I'll try to find a fix.

ShreyasKallingal commented 3 months ago

This does appear to be upstream in JavaCC. I reproduced it by building JPlag from source and found that two files generated by JavaCC are involved: AbstractCharStream and JavaCharStream. These are built from https://github.com/javacc/javacc/blob/master/src/main/resources/templates/JavaCharStream.template (I think). However, I have no idea how or why AbstractCharStream is generated. Where is the source for it?

The bug comes from using the single variable maxNextCharInd to keep track of two different buffers. Strangely, the expandBuff method in AbstractCharStream changes the value of maxNextCharInd, which JavaCharStream relies on to keep track of its fixed-size m_aNextCharBuf. Presumably, once that value exceeds the buffer's length, the read index is allowed to run past the end, which would match the "Index 4096 out of bounds for length 4096" message above. The hacky fix is to give JavaCharStream a separate variable of its own and to place the class in the normal sources so that it overrides the generated one (https://www.mojohaus.org/javacc-maven-plugin/faq.html#custom-sources). I'm not sure where to fix this in JavaCC itself, though; maybe I'll file an issue with them.
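To make the shared-index problem concrete, here is a minimal, self-contained sketch of the failure mode described above. It is a simplified model written for illustration, not the JavaCC-generated source: the class SharedIndexSketch, the helper countChars, the separateLimit flag and the readBufLimit field are all hypothetical names, and the buffer growth step of 2048 is an assumption. Only maxNextCharInd, expandBuff and the fixed 4096-element read buffer are taken from the discussion.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class SharedIndexSketch {
    static final int READ_BUF_SIZE = 4096; // length reported in the stack trace

    /** Simplified model of a char stream with two buffers (NOT the real JavaCC code). */
    static class CharStreamModel {
        private final Reader in;
        private final boolean separateLimit;                        // true = "hacky fix" variant
        private final char[] nextCharBuf = new char[READ_BUF_SIZE]; // fixed-size read buffer
        private char[] buffer = new char[READ_BUF_SIZE];            // growable token buffer
        private int bufpos = -1;        // last written index in the token buffer
        private int nextCharInd = -1;   // last consumed index in the read buffer
        private int maxNextCharInd = 0; // shared between both buffers in the buggy variant
        private int readBufLimit = 0;   // hypothetical separate fill level for the fix

        CharStreamModel(String text, boolean separateLimit) {
            this.in = new StringReader(text);
            this.separateLimit = separateLimit;
        }

        private int limit() { return separateLimit ? readBufLimit : maxNextCharInd; }

        private void setLimit(int value) {
            if (separateLimit) readBufLimit = value; else maxNextCharInd = value;
        }

        /** Grows the token buffer and, as described in the issue, also rewrites maxNextCharInd. */
        private void expandBuff() {
            buffer = Arrays.copyOf(buffer, buffer.length + 2048);
            maxNextCharInd = bufpos; // harmless only if the read buffer has its own limit
        }

        /** Refills the read buffer; returns false at end of input. */
        private boolean fillBuff() throws IOException {
            if (limit() == READ_BUF_SIZE) { setLimit(0); nextCharInd = 0; }
            int n = in.read(nextCharBuf, limit(), READ_BUF_SIZE - limit());
            if (n == -1) return false;
            setLimit(limit() + n);
            return true;
        }

        /** Returns the next char, or -1 at end of input. */
        int readChar() throws IOException {
            if (++nextCharInd >= limit() && !fillBuff()) return -1;
            char c = nextCharBuf[nextCharInd]; // overruns once limit() exceeds 4096
            if (++bufpos == buffer.length) expandBuff();
            buffer[bufpos] = c;
            return c;
        }
    }

    /** Reads the whole text as one long "token" and returns the number of chars read. */
    public static int countChars(String text, boolean separateLimit) {
        try {
            CharStreamModel stream = new CharStreamModel(text, separateLimit);
            int count = 0;
            while (stream.readChar() != -1) count++;
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String longComment = "/*" + "x".repeat(20_000) + "*/";
        System.out.println("separate limit: read " + countChars(longComment, true) + " chars");
        try {
            countChars(longComment, false);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("shared index: " + e.getMessage());
        }
    }
}
```

In this model, short inputs behave the same in both variants. Only a token long enough to grow the token buffer past the size of the read buffer triggers the overrun, which is consistent with the failure showing up on large multi-line comments.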

TwoOfTwelve commented 3 months ago

As a workaround, you can try parsing the files with the cpp language module. Most C code actually works with that.