Allow PresidioAnonymizer() to redact data instead of anonymizing.

Siddhartha90 commented 9 months ago

Feature request

It seems the LangChain library only supports anonymizing. The native Microsoft Presidio library can also redact data:

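from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# text_to_anonymize and ad_hoc_recognizers are defined elsewhere
analyzer_results = analyzer.analyze(
    text=text_to_anonymize,
    entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "URL", "LOCATION"],
    ad_hoc_recognizers=ad_hoc_recognizers,
    language="en",
)

anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("redact", {})},
)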

I don't see a way to do this via langchain_experimental.data_anonymizer
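
For reference, the "DEFAULT" key in the operators dict above acts as a fallback for any entity type that has no entry of its own. The lookup can be sketched in plain Python roughly as follows. This is an illustration only, not Presidio's code: pick_operator is a hypothetical helper, operators are represented by their names as strings, and the final "replace" fallback is an assumption about what happens when nothing is configured.

```python
from typing import Dict


def pick_operator(entity_type: str, operators: Dict[str, str]) -> str:
    """Return the operator name to use for one entity type (hypothetical helper)."""
    # Prefer an entity-specific entry, then "DEFAULT".
    # Falling back to "replace" when neither exists is an assumption.
    return operators.get(entity_type) or operators.get("DEFAULT") or "replace"


only_default = {"DEFAULT": "redact"}
mixed = {"PERSON": "replace", "DEFAULT": "redact"}

person_default = pick_operator("PERSON", only_default)  # uses the fallback
person_mixed = pick_operator("PERSON", mixed)           # uses its own entry
url_mixed = pick_operator("URL", mixed)                 # uses the fallback
```

With only a "DEFAULT" entry, every detected entity is redacted; adding an entity-specific entry overrides the fallback for that type alone.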

Motivation

Redaction is cleaner than anonymizing. For my use case, it's better than replacing first names with gibberish.

Your contribution

I can help provide use cases, like I did above.

dosubot[bot] commented 9 months ago

🤖

Hello @Siddhartha90! I'm Dosu, a bot designed to assist you with your queries, solve bugs, and guide you in becoming a contributor while we wait for a human maintainer. Please feel free to ask me anything related to the LangChain repository.

Thank you for your feature request, Siddhartha90. I understand that you would like to have a redaction feature in the LangChain library, similar to the one in the Microsoft library.

Based on the current implementation of the PresidioAnonymizer class in LangChain, it does not redact by default. Out of the box it anonymizes data, which means replacing identified sensitive information in a given text with fake values generated by Faker.

However, it is possible to add a "redact" operator to the PresidioAnonymizer class. This would involve adding a new operator to the operators dictionary in the __init__ method of the PresidioAnonymizerBase class. This operator would be an instance of OperatorConfig with the operator_name set to "redact". Here is an example of how you could add this:

from typing import Dict, List, Optional

from presidio_anonymizer.entities import OperatorConfig

class PresidioAnonymizerBase(AnonymizerBase):
    def __init__(
        self,
        analyzed_fields: Optional[List[str]] = None,
        operators: Optional[Dict[str, OperatorConfig]] = None,
        languages_config: Dict = DEFAULT_LANGUAGES_CONFIG,
        add_default_faker_operators: bool = True,
        faker_seed: Optional[int] = None,
    ):
        # ...
        if add_default_faker_operators:
            self.operators = {
                field: OperatorConfig(
                    operator_name="custom", params={"lambda": faker_function}
                )
                for field, faker_function in get_pseudoanonymizer_mapping(
                    faker_seed
                ).items()
            }
        else:
            self.operators = {}

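        # Redact any entity type that has no operator of its own.
        # Operators are looked up by entity type, with "DEFAULT" as the fallback.
        self.operators["DEFAULT"] = OperatorConfig("redact", {})

        if operators:
            self.add_operators(operators)
        # ...

Please note that "redact" is a built-in operator in the Presidio library, so you do not need to implement it yourself. The operators dictionary is keyed by entity type, however, so the entry has to be registered under an entity name such as "PERSON" or under "DEFAULT", not under "redact". With add_default_faker_operators=True, entity types that already have a Faker operator will still be replaced with fake values unless you override them.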

I hope this helps! If you have any further questions or need more clarification, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Siddhartha90 commented 9 months ago

I was actually able to do this more simply:

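from langchain_experimental.data_anonymizer import PresidioAnonymizer
from presidio_anonymizer.entities import OperatorConfig

anonymizer = PresidioAnonymizer(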
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "LOCATION"],
    operators={
        "PERSON": OperatorConfig("redact", {}),
        "PHONE_NUMBER": OperatorConfig("redact", {}),
        "EMAIL_ADDRESS": OperatorConfig("redact", {}),
        "LOCATION": OperatorConfig("redact", {}),
    }
)

stephenleo commented 8 months ago

To add to @Siddhartha90's excellent answer: you can find more details on how to configure OperatorConfig here: https://microsoft.github.io/presidio/anonymizer/

Example:

anonymizer = PresidioAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "LOCATION"],
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED NAME]"}),
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[REDACTED PHONE NUMBER]"}),
        "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[REDACTED EMAIL]"}),
        "LOCATION": OperatorConfig("replace", {"new_value": "[REDACTED PLACE]"}),
    },
)
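
For anyone comparing the two approaches in this thread: "redact" removes the matched text entirely, while "replace" swaps in the new_value you supply. A rough plain-Python sketch of that difference is below. It is not Presidio's implementation; Span and apply_operators are hypothetical stand-ins, and the character offsets are hard-coded rather than produced by an analyzer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    """Hypothetical stand-in for an analyzer result: entity type plus offsets."""
    entity_type: str
    start: int  # inclusive
    end: int    # exclusive


def apply_operators(text: str, spans: List[Span], new_values: Dict[str, str]) -> str:
    """Substitute each span with new_values[entity_type]; a missing key means redact."""
    # Work from the end of the string so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        replacement = new_values.get(span.entity_type, "")
        text = text[:span.start] + replacement + text[span.end:]
    return text


text = "Call Jane Doe at 555-0100."
spans = [Span("PERSON", 5, 13), Span("PHONE_NUMBER", 17, 25)]

# Redact: the matched text is simply removed.
redacted = apply_operators(text, spans, {})

# Replace: the matched text is swapped for a fixed placeholder.
replaced = apply_operators(
    text,
    spans,
    {"PERSON": "[REDACTED NAME]", "PHONE_NUMBER": "[REDACTED PHONE NUMBER]"},
)
```

Redaction leaves gaps in the sentence (here, a double space where the name was), so placeholders may be the better choice when the output still needs to read naturally.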
