microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Standalone Analyzer On Spark #242

Closed shayan90 closed 4 years ago

shayan90 commented 5 years ago

Hi,

Have you tried running the analyzer as a Spark job? How would you suggest loading the spaCy model on each worker, and how should serialization be handled?

I would appreciate some suggestions on this. Is there any plan to support this use case?

Thanks
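
For reference, a pattern often used for this kind of per-worker loading question is to build the heavy object lazily inside the worker process and process data one partition at a time, so the model itself is never pickled on the driver. The following is only a minimal sketch of that pattern, not something confirmed in this thread. It assumes a recent `presidio-analyzer` release with a spaCy model installed on every worker, and that these functions live in a module importable on the workers; `get_analyzer`, `analyze_partition`, and the `factory` parameter are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Cached once per Python worker process; it is created on the worker,
# so the engine and its spaCy model are never serialized by the driver.
_ANALYZER = None


def _default_factory():
    # Assumption: presidio-analyzer and a spaCy model are installed on each worker.
    # The import is deferred so that only workers need to load the library.
    from presidio_analyzer import AnalyzerEngine

    return AnalyzerEngine()


def get_analyzer(factory: Callable[[], object] = _default_factory):
    """Build the analyzer on first use in this process, then reuse it."""
    global _ANALYZER
    if _ANALYZER is None:
        _ANALYZER = factory()
    return _ANALYZER


def analyze_partition(
    texts: Iterable[str],
    factory: Callable[[], object] = _default_factory,
) -> Iterator[Tuple[str, List[Tuple[str, int, int]]]]:
    """Analyze every text in one partition with the cached analyzer."""
    analyzer = get_analyzer(factory)
    for text in texts:
        findings = analyzer.analyze(text=text, language="en")
        # Emit plain tuples so the results serialize without library classes.
        yield text, [(f.entity_type, f.start, f.end) for f in findings]


# Hypothetical Spark usage (not executed here), given an RDD of strings:
#   results = rdd.mapPartitions(analyze_partition)
```

Returning plain tuples rather than the library's result objects keeps the data that travels back to the driver simple to serialize.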

omri374 commented 5 years ago

Hi, we haven't experimented with this, and I don't think spaCy can run on top of Spark (although we haven't tested this). Is deploying Presidio on K8s and calling it from a Spark pipeline an option?
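
As a rough illustration of the "deploy as a service and call it from Spark" option, the sketch below sends each text in a partition to an analyzer endpoint over HTTP using only the standard library. Everything specific here is an assumption: the URL is a placeholder, the `/analyze` path and the `{"text", "language"}` request body should be checked against the API of whichever Presidio version is deployed, and `post_json` and `analyze_partition_remote` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical address; replace with wherever the analyzer service is exposed.
ANALYZER_URL = "http://presidio-analyzer.example.internal/analyze"


def post_json(url: str, payload: Dict[str, Any], timeout: float = 10.0) -> Any:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def analyze_partition_remote(
    texts: Iterable[str],
    url: str = ANALYZER_URL,
    send: Callable[[str, Dict[str, Any]], Any] = post_json,
) -> Iterator[Tuple[str, Any]]:
    """Call the analyzer service once per text in a partition."""
    for text in texts:
        # Assumed request shape; verify it against the deployed service.
        yield text, send(url, {"text": text, "language": "en"})


# Hypothetical Spark usage (not executed here), given an RDD of strings:
#   results = rdd.mapPartitions(analyze_partition_remote)
```

With this layout the Spark workers only need network access to the service, so no spaCy model has to be loaded or serialized inside the Spark job. The `send` parameter exists so the HTTP call can be swapped out, for example for a stub in tests or a client with retries.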

CodeRunRepeat commented 4 years ago

@shayan90, were you able to consider @omri374's suggestion?

omri374 commented 4 years ago

Closing for now. Feel free to reopen if you have additional questions or issues.