microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.83k stars 574 forks

Support LLM based de-identification #1234

Open omri374 opened 11 months ago

omri374 commented 11 months ago

Is your feature request related to a problem? Please describe. LLMs usually do well in PII detection and de-identification. Using LLMs to identify PII in text could allow users to easily expand Presidio's capabilities with arbitrary PII entities and PII which is a characteristic of a person rather than an identifier (e.g. "He recently got divorced" vs. "His SSN is 1234")

Describe the solution you'd like Presidio currently supports multiple NER and NLP approaches for PII detection. Presidio provides several NlpEngine implementations for transformers, stanza and spacy. Creating one for LLMs would be a simple integration of an LLM into Presidio. One possible way to achieve this is using spacy-llm, which already has integrations with many LLM frameworks and models, and takes care of things like identifying the span of a PII entity discovered by an LLM.
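For illustration, spacy-llm pipelines are typically declared via a spaCy config file. A minimal sketch (the label set and model name here are assumptions for the example, not a Presidio decision):

```ini
[nlp]
lang = "en"
pipeline = ["llm"]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["PERSON", "LOCATION", "CREDIT_CARD"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
```

A Presidio NlpEngine for LLMs could load such a config the same way the existing spacy-based engines load their models.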

Describe alternatives you've considered We can use LLMs in many steps in the de-identification pipeline. We have examples for using LLMs to generate fake data, we can use LLMs to identify PII in text, and we can use LLMs to do the end-to-end de-identification. While we can consider building all three capabilities, we should start with PII detection, in order to conform with the Presidio structure, and be able to leverage existing de-identification operators in presidio-anonymizer.

Additional context

Contributions welcome! There's plenty of docs on how the NlpEngine is structured, and existing code samples for integrating NLP frameworks into Presidio.

VMD7 commented 10 months ago

Hi @omri374, I have gone through spacy-llm. Although it's not stable yet, they are going to include it in spacy's upcoming version. Here are some thoughts around this:

  • Can we provide two options to the user: either spacy-llm, or a custom LLM class where they can customize the prompt, entities, etc. if a new LLM offers the capability?
  • With pretrained models we can get an entity confidence score, but with an LLM I think there is no score available. We could ask for one through the prompt and then use the existing anonymizer engine.

Please let me know your thoughts on this.

omri374 commented 10 months ago

Hi @VMD7, thanks for this review. There are some challenges in this implementation, confidence being one of them. We can definitely evaluate other approaches for an LLM integration, such as https://github.com/vllm-project/vllm

cloudsere commented 8 months ago

Hey I think this is a great idea! Could you provide some demo code on how to integrate spacy-llm?

Do we need to create a customized nlp engine, similar to TransformersNlpEngine?

omri374 commented 8 months ago

I have a draft of this, but it's not ready yet. Yes, my approach is to create a new NlpEngine.

omri374 commented 8 months ago

If you'd like to give it a try and create a PR, we can collaborate on this.

cloudsere commented 8 months ago

Would love to collaborate on this! Could you open a draft pull request for the work you've been focusing on?

Currently I'm doing some initial tests, trying to do NER using an OpenAI GPT model in a customized nlp engine:

import spacy

nlp = spacy.blank("en")  # base pipeline to attach the LLM NER component to
nlp.add_pipe(
    "llm_ner",
    config={
        "model": {
            "@llm_models": "spacy.GPT-4.v1",
            "endpoint": "<open ai endpoint>",
        }
    },
)

But I'm having a problem with how to calculate scores for the retrieved entities.
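One common workaround (a sketch, not part of Presidio's API): since generative models don't expose a per-entity confidence, assign a fixed default score, similar to how Presidio's pattern recognizers attach a static score to regex matches. The helper and the 0.85 default below are hypothetical:

```python
# Hypothetical helper: attach a fixed confidence score to LLM-detected
# entities, mirroring the fields of presidio's RecognizerResult.
DEFAULT_LLM_SCORE = 0.85  # arbitrary default; would need tuning per model


def to_results(llm_entities, default_score=DEFAULT_LLM_SCORE):
    """Convert {'label', 'start', 'end'} dicts into score-bearing results."""
    return [
        {
            "entity_type": ent["label"],
            "start": ent["start"],
            "end": ent["end"],
            "score": default_score,
        }
        for ent in llm_entities
    ]


print(to_results([{"label": "PERSON", "start": 18, "end": 31}]))
```

The downstream anonymizer only needs entity type, span and score, so a constant score is enough to plug LLM output into the existing pipeline.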

omri374 commented 8 months ago

Let me push my branch. It's very initial, and to be honest I'm not sure it's the best way to go, but we can brainstorm on this.

omri374 commented 8 months ago

@cloudsere this is taking me longer than usual to get to, so I'll just write down my initial design/thinking:

Requirements

  • We'd like the LLM to act as a NER model, and not to do E2E de-identification with the LLM, to allow the user to have more flexibility in the process, and to adhere to the existing Presidio Analyzer API.
  • LLMs are great at identifying entities, but not so great at returning the spans (start and end index) for each entity.
  • We'd like the user to be able to use multiple types of LLMs and not just one (like GPT-4).

Alternatives

  • I haven't found a lot of resources focusing on translating a text generation problem into a NER span detection problem. When you prompt an LLM to return entities, it outputs a list of values that you can later parse to identify spans, but that can be challenging (for example, in the sentence "bill paid the bill", if the detected entity is "bill").
  • One solution in this space is spacy-llm (still considered experimental), which has a post-processing phase to extract spans from detected entities. The specific logic can be found here: https://github.com/explosion/spacy-llm/blob/c87d5a6373485c7510d92b5fed08770f82c96c1a/spacy_llm/tasks/util/parsing.py#L15
  • If we prefer not to proceed with spacy-llm as it is still experimental, we can use something like litellm as a provider-agnostic API to LLMs, and implement the post-processing within Presidio.
  • Another alternative is to ask the LLM to return a list of predictions per word/token. This would be easier to parse and make fewer assumptions, but could break if the LLM's tokenization isn't perfect (and it isn't). This is something we can experiment with. It looks something like this:

Prompt (simple just for illustration)

Here is a paragraph. Please return a list of JSONs. Each JSON represents an input word and its label (entity).  Entities could be things like places, names, organizations and personally identifiable information such as credit cards, bank account numbers, IBAN etc.
First, identify the named entities in the text. Then, map each word to its entity. Make sure you return every word even if it is not a detected entity. In this case, return "O". Don't add prefixes to the entity. Return "PERSON" instead of "B-PERSON"

Input text:
Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

Output:

[
  {"word": "Hello", "label": "O"},
  {"word": ",", "label": "O"},
  {"word": "my", "label": "O"},
  {"word": "name", "label": "O"},
  {"word": "is", "label": "O"},
  {"word": "David", "label": "PERSON"},
  {"word": "Johnson", "label": "PERSON"},
  {"word": "and", "label": "O"},
  {"word": "I", "label": "O"},
  {"word": "live", "label": "O"},
  {"word": "in", "label": "O"},
  {"word": "Maine", "label": "LOCATION"},
  {"word": ".", "label": "O"},
  {"word": "My", "label": "O"},
  {"word": "credit", "label": "O"},
  {"word": "card", "label": "O"},
  {"word": "number", "label": "O"},
  {"word": "is", "label": "O"},
  {"word": "4095-2609-9393-4932", "label": "CREDIT_CARD"},
  {"word": "and", "label": "O"},
  {"word": "my", "label": "O"},
  {"word": "crypto", "label": "O"},
  {"word": "wallet", "label": "O"},
  {"word": "id", "label": "O"},
  {"word": "is", "label": "O"},
  {"word": "16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ", "label": "CRYPTO_KEY"},
  {"word": ".", "label": "O"}
]

If anyone has a better idea than what I described here (I'm sure there is!) please respond to this thread.
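As a sketch of the post-processing this approach would need, the per-word output above can be mapped back to character spans by scanning the original text left to right. The greedy search from the previous match's end is one way to disambiguate repeated words (the "bill paid the bill" problem); the helper name is hypothetical:

```python
def labels_to_spans(text, word_labels):
    """Map (word, label) pairs back to character spans in the original text.

    Assumes the LLM returns words in order; searches for each word starting
    from the end of the previous match, so repeated words resolve to the
    next occurrence rather than always the first.
    """
    spans = []
    cursor = 0
    for item in word_labels:
        word, label = item["word"], item["label"]
        start = text.find(word, cursor)
        if start == -1:
            continue  # word not found verbatim; skip rather than guess
        cursor = start + len(word)
        if label != "O":
            spans.append({"start": start, "end": cursor, "label": label})
    return spans


text = "Hello, my name is David Johnson and I live in Maine."
labels = [
    {"word": "Hello", "label": "O"},
    {"word": "David", "label": "PERSON"},
    {"word": "Johnson", "label": "PERSON"},
    {"word": "Maine", "label": "LOCATION"},
]
print(labels_to_spans(text, labels))
```

This breaks down when the LLM rewrites or re-tokenizes a word instead of echoing it verbatim, which is exactly the failure mode discussed above.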

omri374 commented 7 months ago

Initial (and experimental) code using spacy-llm can be found here: https://github.com/microsoft/presidio/pull/1340

magedhelmy1 commented 5 months ago

@omri374 this is quite a cool idea, is it now abandoned?

omri374 commented 5 months ago

@magedhelmy1 if you (or anyone else) would be interested in taking this initial approach and continuing developing it, that would be great!

dkarthicks27 commented 1 month ago

> @cloudsere this is taking me longer than usual to get to, so I'll just write down my initial design/thinking:
>
> Requirements
>
> • We'd like the LLM to act as a NER model, and not to do E2E de-identification with the LLM, to allow the user to have more flexibility in the process, and to adhere to the existing Presidio Analyzer API.
> • LLMs are great at identifying entities, but not so great at returning the spans (start and end index) for each entity.
> • We'd like the user to be able to use multiple types of LLMs and not just one (like GPT-4).
>
> Alternatives
>
> • I haven't found a lot of resources focusing on translating a text generation problem into a NER span detection problem. When you prompt an LLM to return entities, it outputs a list of values that you can later parse to identify spans, but that can be challenging (for example, in the sentence "bill paid the bill", if the detected entity is "bill").
> • One solution in this space is spacy-llm (still considered experimental), which has a post-processing phase to extract spans from detected entities. The specific logic can be found here: https://github.com/explosion/spacy-llm/blob/c87d5a6373485c7510d92b5fed08770f82c96c1a/spacy_llm/tasks/util/parsing.py#L15
> • If we prefer not to proceed with spacy-llm as it is still experimental, we can use something like litellm as a provider-agnostic API to LLMs, and implement the post-processing within Presidio.
> • Another alternative is to ask the LLM to return a list of predictions per word/token. This would be easier to parse and make fewer assumptions, but could break if the LLM's tokenization isn't perfect (and it isn't). This is something we can experiment with. It looks something like this:
>
> [...]

Great point, but when I tried implementing something similar to this, I ran into tokenization compatibility issues, especially with special characters and longer texts.

So I passed in a list of tokens and asked the LLM to return a list of labels, which can then be checked against the original token list so that both are the same length.
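A minimal sketch of that alignment check (the helper name and padding/truncation policy are assumptions; one could also reject misaligned responses and re-prompt instead):

```python
def align_labels(tokens, labels, fill="O"):
    """Force the LLM-returned label list to align 1:1 with the input tokens.

    Hypothetical post-processing step: if the model returns too few labels,
    pad with the 'outside' label; if too many, truncate the extras.
    """
    if len(labels) < len(tokens):
        labels = labels + [fill] * (len(tokens) - len(labels))
    return labels[: len(tokens)]


tokens = ["My", "name", "is", "David"]
print(align_labels(tokens, ["O", "O", "O"]))  # → ['O', 'O', 'O', 'O']
print(align_labels(tokens, ["O", "O", "O", "PERSON", "O"]))  # → ['O', 'O', 'O', 'PERSON']
```

Because the tokens are supplied by the caller rather than generated by the model, span recovery is exact; the trade-off is that padding or truncating silently can hide cases where the model mislabeled everything after a misalignment.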

omri374 commented 1 month ago

Nice approach. Does it affect the detection accuracy in any way?