microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.73k stars 567 forks

Slow Execution Time When Scanning Large Files #1461

Open vinay-cldscle opened 2 weeks ago

vinay-cldscle commented 2 weeks ago

Hey team. When I scan a 7 MB file containing more than 700,000 lines, passing the data in chunks (chunk size 100,000), execution takes about 7 to 10 minutes. Is this normal behavior? Can the execution time be reduced? Does batch analysis support TXT files? I would like execution to complete within 1 minute. Is that possible?

omri374 commented 1 week ago

Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine option?

vinay-cldscle commented 3 days ago

Hi @omri374. Yes, I tried using the BatchAnalyzerEngine for TXT files, but it is not working:

analyzer_engine = AnalyzerEngine()
analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer_engine)

error:

results = analyzer.analyze(texts=text_chunks, language="en", return_decision_process=True)
AttributeError: 'BatchAnalyzerEngine' object has no attribute 'analyze'

Does the batch analyzer only work on lists and dicts?

omri374 commented 3 days ago

Please see the python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator

Your text_chunks should be an iterable (such as List[str]); you can then call batch_analyzer.analyze_iterator(text_chunks, ...)