microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Use group from matched pattern (PatternRecognizer) #1120

Open didmar opened 1 year ago

didmar commented 1 year ago

Is your feature request related to a problem? Please describe.

In PatternRecognizer, it is not possible to use the span of a specific capture group in a pattern instead of the span of the entire match. For example, with the regex "Password: (\w+)", PatternRecognizer anonymizes "Password: 1234", but I would like it to anonymize only "1234".

Describe the solution you'd like

A way to specify which group to use, e.g., start, end = match.span(pattern.use_group) if pattern.use_group else match.span(). I can open a PR if needed.
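
To make the request concrete, here is a minimal standalone sketch using only Python's re module. It does not touch Presidio itself: the helpers match_span and redact and the use_group parameter are hypothetical names taken from or modelled on the proposal above, not part of the Presidio API.

```python
import re
from typing import Optional, Tuple


def match_span(match: re.Match, use_group: Optional[int] = None) -> Tuple[int, int]:
    """Return the span of `use_group` if one is given, else the whole match span.

    Hypothetical helper mirroring the one-liner proposed in the issue.
    """
    return match.span(use_group) if use_group else match.span()


def redact(text: str, regex: str, use_group: Optional[int] = None,
           mask: str = "<REDACTED>") -> str:
    """Replace each matched span (or the chosen group's span) with `mask`."""
    pieces = []
    last = 0
    for m in re.finditer(regex, text):
        start, end = match_span(m, use_group)
        pieces.append(text[last:start])  # keep everything before the span
        pieces.append(mask)              # mask only the selected span
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


text = "Password: 1234"
whole = redact(text, r"Password: (\w+)")               # current behaviour
group = redact(text, r"Password: (\w+)", use_group=1)  # requested behaviour
print(whole)  # <REDACTED>
print(group)  # Password: <REDACTED>
```

With use_group left unset the whole match is masked, which corresponds to the behaviour described in the problem statement; with use_group=1 only the captured value is masked.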

omri374 commented 1 year ago

Hi, yes, this is indeed something that could be improved, and a PR would be great. I believe it is a duplicate of #739.

marcjulianschwarz commented 11 months ago

Hi @didmar, have you started working on this? If not, I'd like to give it a go.

didmar commented 11 months ago

Hi @marcjulianschwarz, I ended up creating my own PatternRecognizer class to address this and to make other tweaks for my use case, so I don't think it would be interesting as a PR.

For reference, I simply changed one line in the __analyze_patterns method, like so:

    ...
    for match in matches:
        # Modified here: use the span of capture group 1 instead of the whole match
        start, end = match.span(1)
        ...

Also check out the duplicate #739, which suggests a more general way to handle this.
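
One caveat with hard-coding match.span(1): in Python's re module it raises IndexError when the pattern has no capture group, and returns (-1, -1) when group 1 exists but did not take part in the match. The sketch below shows one possible fallback. The function name span_with_fallback is hypothetical, and this is plain re code rather than anything from Presidio.

```python
import re
from typing import Tuple


def span_with_fallback(match: re.Match, group: int = 1) -> Tuple[int, int]:
    """Return the span of `group`, or the whole match span if it is unusable.

    Hypothetical helper; falls back when the pattern defines fewer groups
    than `group`, or when the group did not participate in the match.
    """
    if group > match.re.groups:   # pattern has no such capture group
        return match.span()
    start, end = match.span(group)
    if start == -1:               # group exists but matched nothing
        return match.span()
    return start, end


# Group 1 captured: only the value is selected.
m1 = re.search(r"Password: (\w+)", "Password: 1234")
print(span_with_fallback(m1))  # (10, 14)

# No capture group in the pattern: whole match is used.
m2 = re.search(r"\d{4}", "PIN 1234")
print(span_with_fallback(m2))  # (4, 8)

# Group 1 is in the branch that was not taken: whole match is used.
m3 = re.search(r"id=(\d+)|anonymous", "user anonymous")
print(span_with_fallback(m3))  # (5, 14)
```

Whether falling back to the full match is the right default depends on the recognizer; raising an error for a misconfigured pattern may be preferable in some setups.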