microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Please add batching support on HTTP APIs #1045

Open paulo-raca opened 1 year ago

paulo-raca commented 1 year ago

Presidio currently exposes an HTTP (OpenAPI) interface, which is quite good and makes it easy to use Presidio from other programming languages.

However, it only allows processing one string at a time.

My use case involves processing structured data (very similar to Presidio's structured data example), and each input results in several calls to the service.

Describe the solution you'd like

I'd like to have /bulk_analyze, /bulk_anonymize and /bulk_deanonymize. Those would be exactly like their existing counterparts, but receive an array-of-inputs and return an array-of-outputs.

Extra configuration (anonymizers, ad_hoc_recognizers, language, etc.) can probably be specified only once and applied to every input.

Example:

POST /bulk_analyze
{
    "text": [
        "John Smith drivers license is AC432223 and the zip code is 12345",
        "Hello, I'm John Smith"
    ],
    "language": "en",
    "score_threshold": 0.6,
    "entities": [ ... ],
    "ad_hoc_recognizers": [ ... ],
    "context": [ ... ]
}

200 OK
[
    // First input: "John Smith drivers license is AC432223 and the zip code is 12345"
    [
        { "entity_type": "PERSON", [...] },
        { "entity_type": "US_DRIVER_LICENSE", [...] },
        { "entity_type": "ZIP", [...] }
    ],
    // Second input: "Hello, I'm John Smith"
    [
        { "entity_type": "PERSON", [...] }
    ]
]
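
To make the proposed contract concrete, here is a minimal sketch of how a bulk handler could fan an array of texts out to an existing single-text analyzer while applying the shared options once. The names `bulk_analyze` and `analyze_one` are hypothetical and are not part of Presidio; `analyze_one` stands in for whatever currently serves /analyze.

```python
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]


def bulk_analyze(
    request: Dict[str, Any],
    analyze_one: Callable[..., List[Result]],
) -> List[List[Result]]:
    """Return one list of results per input text, in input order.

    `analyze_one` is a hypothetical stand-in for the existing single-text
    analysis; it is called as analyze_one(text=..., **shared_options).
    """
    texts = request["text"]
    if not isinstance(texts, list):
        raise ValueError('"text" must be an array of strings')

    # Everything except "text" (language, score_threshold, entities, ...)
    # is specified once and reused for every input.
    shared = {key: value for key, value in request.items() if key != "text"}

    return [analyze_one(text=text, **shared) for text in texts]
```

The response keeps the same order as the request, so the caller can match each result list to its input by position.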

Describe alternatives you've considered

I'm currently making N calls (concurrently, to reduce latency).
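
For reference, the workaround looks roughly like the sketch below: one POST to /analyze per text, issued from a thread pool, with results returned in input order. `ANALYZE_URL` (host and port), `analyze_http` and `analyze_many` are assumptions made for illustration; adjust the URL to wherever the analyzer service is actually reachable.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

# Assumption: the analyzer service is reachable at this address.
ANALYZE_URL = "http://localhost:5002/analyze"


def analyze_http(text: str, **options: Any) -> List[Dict[str, Any]]:
    """Send a single text to /analyze and return the parsed JSON response."""
    payload = json.dumps({"text": text, **options}).encode("utf-8")
    request = urllib.request.Request(
        ANALYZE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def analyze_many(
    texts: List[str],
    analyze_one: Callable[..., List[Dict[str, Any]]] = analyze_http,
    max_workers: int = 8,
    **options: Any,
) -> List[List[Dict[str, Any]]]:
    """Analyze each text with its own call, concurrently.

    Executor.map yields results in the order of `texts`, so the output
    lines up with the input by position.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: analyze_one(t, **options), texts))
```

This hides most of the latency, but it still costs N requests and repeats the shared options in every call, which is what a bulk endpoint would avoid.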

omri374 commented 1 year ago

Thanks, this is a great suggestion and we'll look into it. As always, we'd be happy to review community contributions.