How exactly is this an attack?

Hi, thanks for your interesting question!

Knowledge databases can be used to enhance an LLM to generate more accurate answers. However, it also introduces a new attack surface, which can be exploited by an attacker to perform various types of attacks. One example we used in our paper was an attacker injecting misinformation into the database to cause the RAG system to output incorrect answers. This was meant to help readers grasp the concept of this new attack surface. However, attackers could go beyond misinformation or incorrect answers and inject malicious instructions or other adversarial content by compromising the knowledge source (this is also shown by many recent studies). Our work highlights the security risks associated with the source of the knowledge itself. If we don’t carefully monitor the integrity of the data in knowledge databases, the entire RAG system can be compromised.

As discussed in our paper, the key aspect here is the intentional corruption of the knowledge base by an adversary. In our PoisonedRAG attack, the LLM behaves as expected when it processes the retrieved texts, but those texts have been maliciously injected. This is what makes it an "attack"—the LLM is being manipulated to generate attacker-chosen responses based on poisoned input, rather than factual or trustworthy information.

In this scenario, the LLM is not hallucinating or making random errors; it is using deliberately crafted misinformation as context to generate a wrong answer. This exposes a vulnerability in the RAG system, where the retriever pulls in the malicious texts, and the LLM has no way of detecting that the data has been tampered with. The anomalous behavior comes from the fact that the LLM is relying on corrupted knowledge injected by the attacker, as we detail in Section 3.1 and Section 4 of the paper.

One example that illustrates this is shown in Figure 1. Imagine a company providing RAG services using a Wikipedia snapshot as its knowledge database. Since Wikipedia can be publicly edited, an attacker could maliciously insert well-crafted misinformation into a Wikipedia page[37]. If this page is later included in a snapshot by a dataset provider like Wikimedia/Wikipedia and used by the RAG system, the attacker’s misinformation would become part of the system’s knowledge base. If a user then asks a related question (for example, Who is the CEO of OpenAI?), the RAG system might retrieve the corrupted information and provide an incorrect answer (e.g., Tim Cook), misleading the user. This highlights the importance of securing the knowledge source itself, as the entire system can be compromised by poisoning the external database.

sleeepeer / PoisonedRAG

How exactly is this an attack? #11