Open Freakrider opened 1 week ago
https://www.ibm.com/blog/prevent-prompt-injection/
There some other solutions like Delimiter
A second Agent could possible improve stuff like // Logic for valid Output to prevent /improve something like getting full points for one word as well.
Excellent, can I ad your name to the changelog as someone who has given advice and made a suggestion. It will appear in this file. https://github.com/marcusgreen/moodle-qtype_aitext/blob/main/changelog.md
Feel free to add my name to the changelog.
Alexander Mikasch of Moodle.NRW (https://moodlenrw.de/)
If I find time I will introduce the qtype to my team and inspect it in more detail. Good work!
Thanks Alex, I have been giving it a lot of thought since yesterday. Can you email me at marcusavgreen at gmail.com
Issue
Describe the bug The system can be manipulated by user input to always return full marks and predefined feedback, which compromises the integrity of the automated grading process. The user can input a specific prompt that causes the LLM to disregard all previous inputs and system prompts, resulting in a JSON object that always gives full marks.
To Reproduce Steps to reproduce the behavior:
Expected behavior The system should correctly evaluate the user's input based on the provided criteria and marking scheme, without being influenced by manipulation prompts. The LLM should not be able to be overridden by specific user inputs that force it to give full marks.
Screenshots![Bildschirmfoto 2024-06-26 um 17 26 22](https://github.com/marcusgreen/moodle-qtype_aitext/assets/13573847/43fe6052-3cc1-4f37-9c06-65aa724fb673)
Desktop (please complete the following information): -all
Additional context This issue allows users to bypass the intended grading mechanism, resulting in unfair assessments and undermining the reliability of the automated grading process. Implementing stricter input validation and prompt handling can help prevent this exploitation.
Yes, the problem could potentially be solved using two agents. The idea is that the first agent processes the user's input and generates a preliminary score and feedback. The second agent then reviews the output of the first agent for manipulation attempts and ensures that the feedback and score adhere to the expected criteria. Here's an overview of how this could work:
Solution with Two Agents
First Agent (Scoring and Feedback Generator):
Second Agent (Validation and Security Check):
Example Workflow
User Input:
First Agent:
Second Agent:
Pseudocode
Here is a simplified pseudocode example of how this could be implemented: JUST A GETTING STARTED IDEA
This implementation provides an additional layer of security and ensures that the assessments are fair and accurate. Ofc its just an example that you get the idea.