Closed (Siddhartha90 closed this issue 9 months ago)
🤖
Hello @Siddhartha90! I'm Dosu, a bot designed to assist you with your queries, solve bugs, and guide you in becoming a contributor while we wait for a human maintainer. Please feel free to ask me anything related to the LangChain repository.
Thank you for your feature request, Siddhartha90. I understand that you would like to have a redaction feature in the LangChain library, similar to the one in the Microsoft library.
Based on the current implementation of the PresidioAnonymizer class in LangChain, it seems that it has no dedicated "redact" option similar to the one in the Microsoft library. By default, the PresidioAnonymizer class performs data anonymization, which replaces identified sensitive information in a given text with anonymized (fake) data.
However, it is possible to add a "redact" operator to the PresidioAnonymizer class. This would involve adding a new entry to the operators dictionary in the __init__ method of the PresidioAnonymizerBase class. This entry would be an instance of OperatorConfig with the operator_name set to "redact". Here is an example of how you could add this:
from typing import Dict, List, Optional

from presidio_anonymizer.entities import OperatorConfig


class PresidioAnonymizerBase(AnonymizerBase):
    def __init__(
        self,
        analyzed_fields: Optional[List[str]] = None,
        operators: Optional[Dict[str, OperatorConfig]] = None,
        languages_config: Dict = DEFAULT_LANGUAGES_CONFIG,
        add_default_faker_operators: bool = True,
        faker_seed: Optional[int] = None,
    ):
        # ...
        if add_default_faker_operators:
            # Default behavior: one Faker-based "custom" operator per entity type
            self.operators = {
                field: OperatorConfig(
                    operator_name="custom", params={"lambda": faker_function}
                )
                for field, faker_function in get_pseudoanonymizer_mapping(
                    faker_seed
                ).items()
            }
        else:
            self.operators = {}
        # Add a redact operator. The dictionary is keyed by entity type, so
        # "DEFAULT" is used here as the fallback for entity types that have
        # no operator of their own.
        self.operators["DEFAULT"] = OperatorConfig("redact", {})
        if operators:
            self.add_operators(operators)
        # ...
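Please note that "redact" is one of the built-in operators in the Presidio anonymizer (alongside others such as "replace", "mask", "hash", "encrypt", and "custom"), so you do not need to implement it yourself. Keep in mind that the operators dictionary is keyed by entity type (such as PERSON), not by operator name: an operator only takes effect for the entity type it is registered under, or for every otherwise unconfigured entity type if it is registered under the "DEFAULT" key.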
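To make the keying rule concrete, here is a minimal pure-Python sketch of how an operator could be looked up for a detected entity. It is a simplified model, not Presidio's actual code: resolve_operator and the (operator_name, params) tuples are hypothetical stand-ins for the engine's internals and for OperatorConfig. It assumes the lookup order is the entity type first, then the "DEFAULT" key, then a fallback that replaces the value with a placeholder such as <PERSON>.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for OperatorConfig: (operator_name, params)
Operator = Tuple[str, Dict[str, str]]


def resolve_operator(entity_type: str, operators: Dict[str, Operator]) -> Operator:
    """Pick the operator for one detected entity (simplified model)."""
    # 1. An operator registered for this exact entity type wins.
    if entity_type in operators:
        return operators[entity_type]
    # 2. Otherwise fall back to the catch-all "DEFAULT" entry, if present.
    if "DEFAULT" in operators:
        return operators["DEFAULT"]
    # 3. Otherwise model the built-in fallback: replace with "<ENTITY_TYPE>".
    return ("replace", {"new_value": f"<{entity_type}>"})


by_entity = {"PERSON": ("redact", {})}
by_default = {"DEFAULT": ("redact", {})}
# Keyed by operator name instead of entity type: never matched for PERSON.
by_operator_name = {"redact": ("redact", {})}

for config in (by_entity, by_default, by_operator_name):
    print(resolve_operator("PERSON", config))
```

Under these assumptions, the first two configurations resolve to the redact operator for PERSON, while the third falls through to the placeholder replacement because no detected entity is ever of type "redact".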
I hope this helps! If you have any further questions or need more clarification, please let me know.
I was actually able to do this more simply:
from langchain_experimental.data_anonymizer import PresidioAnonymizer
from presidio_anonymizer.entities import OperatorConfig

anonymizer = PresidioAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "LOCATION"],
    operators={
        "PERSON": OperatorConfig("redact", {}),
        "PHONE_NUMBER": OperatorConfig("redact", {}),
        "EMAIL_ADDRESS": OperatorConfig("redact", {}),
        "LOCATION": OperatorConfig("redact", {}),
    },
)
To add to @Siddhartha90's excellent answer: you can find more details on how to configure OperatorConfig here: https://microsoft.github.io/presidio/anonymizer/
Example:
anonymizer = PresidioAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "LOCATION"],
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED NAME]"}),
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[REDACTED PHONE NUMBER]"}),
        "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[REDACTED EMAIL]"}),
        "LOCATION": OperatorConfig("replace", {"new_value": "[REDACTED PLACE]"}),
    },
)
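As a rough illustration of how the "redact" and "replace" configurations in this thread differ, the following pure-Python sketch applies both to hand-written entity spans. It does not call Presidio or LangChain: apply_operators, the span tuples, and the sample sentence are all hypothetical, and the only behavior assumed is that "redact" removes the matched text while "replace" substitutes new_value.

```python
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]            # (entity_type, start, end), hypothetical
Operator = Tuple[str, Dict[str, str]]  # (operator_name, params), hypothetical


def apply_operators(text: str, spans: List[Span], operators: Dict[str, Operator]) -> str:
    """Apply one operator per span, working right to left so that
    earlier offsets stay valid while the text changes length."""
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        name, params = operators[entity_type]
        if name == "redact":
            replacement = ""  # redact: drop the matched text entirely
        elif name == "replace":
            replacement = params.get("new_value", f"<{entity_type}>")
        else:
            raise ValueError(f"operator not modeled in this sketch: {name}")
        text = text[:start] + replacement + text[end:]
    return text


sample = "Call Jane at 555-0100."
name_start = sample.index("Jane")
phone_start = sample.index("555-0100")
spans = [
    ("PERSON", name_start, name_start + len("Jane")),
    ("PHONE_NUMBER", phone_start, phone_start + len("555-0100")),
]

redacted = apply_operators(
    sample, spans, {"PERSON": ("redact", {}), "PHONE_NUMBER": ("redact", {})}
)
replaced = apply_operators(
    sample,
    spans,
    {
        "PERSON": ("replace", {"new_value": "[REDACTED NAME]"}),
        "PHONE_NUMBER": ("replace", {"new_value": "[REDACTED PHONE NUMBER]"}),
    },
)
print(repr(redacted))
print(repr(replaced))
```

In this model, redaction leaves gaps where the values used to be (note the doubled space after "Call"), whereas labeled replacements keep the sentence readable and show what kind of data was removed.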
Feature request
The LangChain library seems to support only anonymization. The native Microsoft library can redact data:
I don't see a way to do this via langchain_experimental.data_anonymizer.
Motivation
Redaction is cleaner than anonymizing. It's better than replacing first names with gibberish for my use case.
Your contribution
I can help provide use cases, like I did above.