microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Sample presidio image evaluation notebook generates error #1251

Open mpsampat opened 10 months ago

mpsampat commented 10 months ago

Describe the bug The notebook https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_dicom_redactor_evaluation.ipynb generates an error and does not provide the evaluation results. The error is shown below.

To Reproduce Steps to reproduce the behavior:

  1. Git clone the repo: https://github.com/microsoft/presidio.git
  2. Go to the folder: presidio/docs/samples/python
  3. Run the Jupyter notebook called "example_dicom_redactor_evaluation.ipynb".
  4. The first few cells work fine.
  5. The cell with this code gives the error: `_, eval_results = dicom_engine.eval_dicom_instance(instance, gt_file_of_interest)`
  6. The error I get is shown below:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    Cell In[9], line 1
    ----> 1 _, eval_results = dicom_engine.eval_dicom_instance(instance, gt_file_of_interest)

    File /opt/conda/lib/python3.10/site-packages/presidio_image_redactor/dicom_image_pii_verify_engine.py:175, in DicomImagePiiVerifyEngine.eval_dicom_instance(self, instance, ground_truth, padding_width, tolerance, display_image, use_metadata, ocr_kwargs, ad_hoc_recognizers, text_analyzer_kwargs)
        165 # Verify detected PHI
        166 verify_image, ocr_results, analyzer_results = self.verify_dicom_instance(
        167     instance,
        168     padding_width,
       (...)
        173     text_analyzer_kwargs,
        174 )
    --> 175 formatted_ocr_results = self.bbox_processor.get_bboxes_from_ocr_results(
        176     ocr_results
        177 )
        178 detected_phi = self.bbox_processor.get_bboxes_from_analyzer_results(
        179     analyzer_results
        180 )
        182 # Remove duplicate entities in results

    File /opt/conda/lib/python3.10/site-packages/presidio_image_redactor/bbox.py:18, in BboxProcessor.get_bboxes_from_ocr_results(ocr_results)
         12 """Get bounding boxes on padded image for all detected words from ocr_results.
         13
         14 :param ocr_results: Raw results from OCR.
         15 :return: Bounding box information per word.
         16 """
         17 bboxes = []
    ---> 18 print(ocr_results["text"])
         19 for i in range(len(ocr_results["text"])):
         20     detected_text = ocr_results["text"][i]

    TypeError: list indices must be integers or slices, not str

Expected behavior I expect to get precision and recall as shown in the notebook committed in the repo.

Additional context Could you please help provide a workaround for this issue? Should I use an older tag of presidio?

mpsampat commented 10 months ago

This issue also exists on other pages, such as the "Creating ground truth files" page: https://microsoft.github.io/presidio/image-redactor/evaluating_dicom_redaction/#creating-ground-truth-files. The following lines of code generate the error shown below:

    # Format results for more direct comparison
    ocr_results_formatted = dicom_engine.bbox_processor.get_bboxes_from_ocr_results(ocr_results)
    analyzer_results_formatted = dicom_engine.bbox_processor.get_bboxes_from_analyzer_results(analyzer_results)

Error observed:


    TypeError                                 Traceback (most recent call last)
    Cell In[19], line 1
    ----> 1 ocr_results_formatted = dicom_engine.bbox_processor.get_bboxes_from_ocr_results(ocr_results)
          2 analyzer_results_formatted = dicom_engine.bbox_processor.get_bboxes_from_analyzer_results(analyzer_results)

    File /opt/conda/lib/python3.10/site-packages/presidio_image_redactor/bbox.py:18, in BboxProcessor.get_bboxes_from_ocr_results(ocr_results)
         12 """Get bounding boxes on padded image for all detected words from ocr_results.
         13
         14 :param ocr_results: Raw results from OCR.
         15 :return: Bounding box information per word.
         16 """
         17 bboxes = []
    ---> 18 print(ocr_results["text"])
         19 for i in range(len(ocr_results["text"])):
         20     detected_text = ocr_results["text"][i]

omri374 commented 9 months ago

Thank you @mpsampat, and apologies for the delayed response. We'll look into this.

gianni-di-noia commented 4 weeks ago

The method verify_dicom_instance already returns formatted ocr_results. The variable is called ocr_bboxes in the codebase.
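To illustrate why this produces the TypeError above, here is a minimal, self-contained sketch that does not import presidio. It assumes two data shapes inferred from the traceback and this comment: a raw, column-oriented OCR dict (keys such as "text", "left", "top", "width", "height") and an already formatted list of per-word dicts. The helper `to_bboxes` and the sample values are hypothetical and are not part of the presidio API.

```python
from typing import Any, Dict, List, Union

BBox = Dict[str, Any]


def to_bboxes(ocr_results: Union[Dict[str, list], List[BBox]]) -> List[BBox]:
    """Hypothetical helper: return per-word bboxes for either input shape."""
    # Already formatted (a list of per-word dicts): pass it through unchanged.
    if isinstance(ocr_results, list):
        return ocr_results

    # Raw, column-oriented OCR output: build one dict per non-empty word.
    bboxes = []
    for i, text in enumerate(ocr_results["text"]):
        if text:
            bboxes.append(
                {
                    "left": ocr_results["left"][i],
                    "top": ocr_results["top"][i],
                    "width": ocr_results["width"][i],
                    "height": ocr_results["height"][i],
                    "label": text,
                }
            )
    return bboxes


# Made-up sample data in both assumed shapes (not real OCR output).
raw = {
    "text": ["", "DOE"],
    "left": [0, 10],
    "top": [0, 20],
    "width": [0, 30],
    "height": [0, 12],
}
formatted = [{"left": 10, "top": 20, "width": 30, "height": 12, "label": "DOE"}]

# Indexing the formatted list with a string key fails, as in the traceback.
try:
    formatted["text"]
except TypeError as err:
    error_message = str(err)

# The guarded helper accepts both shapes and yields the same result.
assert to_bboxes(raw) == to_bboxes(formatted)
```

If the comment above holds for the installed version, a possible workaround in the notebook would be to skip the extra `get_bboxes_from_ocr_results` call and use the OCR value returned by `verify_dicom_instance` directly. This has not been verified against a specific presidio release.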