microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

deanonymize(anonymize(text)) != text #1151

Open zizhong opened 1 year ago

zizhong commented 1 year ago

Describe the bug: deanonymize(anonymize(text)) != text

To Reproduce (steps to reproduce the behavior):

  1. Use the transformer model obi/deid_roberta_i2b2 as the analyzer.
  2. The text contains "a medical license number MED-123456".
  3. anonymize() returns "a medical license number <ORGANIZATION><ID><US_DRIVER_LICENSE>", where each <item> stands for the base64-encoded encrypted value of that entity.
  4. The deanonymizer returns "a medical license number MED-123123456".

Expected behavior: deanonymize(anonymize(text)) == text
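
One way to read the symptom, using the offsets from the analyzer results posted later in this thread: ORGANIZATION covers 3245 to 3248 ("MED"), ID covers 3248 to 3252 ("-123"), and US_DRIVER_LICENSE covers 3249 to 3255 ("123456"), so the last two spans share "123". The sketch below is a toy model, not Presidio's implementation. It assumes that each overlapping span is replaced by a token carrying its full original text, and it uses base64 as a stand-in for the encrypt operator. The function names are hypothetical.

```python
import base64


def toy_encrypt(value: str) -> str:
    """Reversible stand-in for an encrypt operator (base64, not real encryption)."""
    return base64.b64encode(value.encode()).decode()


def toy_decrypt(token: str) -> str:
    return base64.b64decode(token.encode()).decode()


def anonymize(text, spans):
    """Replace each (start, end) span with a token built from its FULL text.

    Assumption of this toy model: when a span overlaps the next one, only the
    part before the next span's start is cut from the text, yet the token
    still encodes the whole span.
    """
    ordered = sorted(spans)
    pieces, items, cursor, length = [], [], 0, 0
    for i, (start, end) in enumerate(ordered):
        literal = text[cursor:start]
        token = toy_encrypt(text[start:end])
        token_start = length + len(literal)
        items.append((token_start, token_start + len(token)))
        pieces += [literal, token]
        length = token_start + len(token)
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        cursor = min(end, next_start)
    pieces.append(text[cursor:])
    return "".join(pieces), items


def deanonymize(anonymized, items):
    """Decrypt tokens from right to left so earlier offsets stay valid."""
    for start, end in sorted(items, reverse=True):
        anonymized = anonymized[:start] + toy_decrypt(anonymized[start:end]) + anonymized[end:]
    return anonymized


text = "MED-123456"
overlapping = [(0, 3), (3, 7), (4, 10)]  # same layout as the reported spans
disjoint = [(0, 3), (4, 10)]

print(deanonymize(*anonymize(text, overlapping)))  # MED-123123456
print(deanonymize(*anonymize(text, disjoint)))     # MED-123456
```

Under these assumptions the shared "123" is stored in two tokens and restored twice, which matches the reported MED-123123456. With disjoint spans the round trip is lossless.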

omri374 commented 1 year ago

Hi @zizhong, thanks for reporting this. Would you mind adding the full analyzer and anonymizer results?

zizhong commented 1 year ago

@omri374 My pleasure!

Original text:

May 5, 2023
Name: Carl John Smith
DOB: 04/18/1985
SSN: 999-99-9999
Dear DDS Examiner:
Introduction:
Mr. Carl Smith is a 31-year-old man who has been experiencing homelessness on and off for all
his adult life. Mr. Smith says he is about 5’5" and weighs approximately 129 lbs. He presents as
very thin, typically wearing a clean white undershirt and loose-fitting khaki shorts at interviews.
His brown hair is disheveled and dirty looking, and he constantly fidgets and shakes his hand or
knee during interviews. Despite his best efforts, Carl is a poor historian. In interviews with this
writer, he needed constant redirecting and prompting to provide information about his
personal and psychiatric history. Carl is diagnosed with Major Depressive Disorder; recurrent,
Anxiety Disorder, Attention Deficit Hyperactivity Disorder, Intermittent Explosive Disorder, and
a possible traumatic brain injury. Physically, he has degenerative disc disease, Lumbar
radiculopathy, Allergic Rhinitis, and a history of fainting since childhood. When asked why
working is difficult for him, Carl responded "I have a hard time controlling myself. When I get
stressed out, I immediately shut down."

My name is Gavin and I plan to go to San Francisco later today. While there I want to buy 5 apples for 4 dollars each, and 10 bananas for 3 dollars each. How much will this cost me?

Hi, Gavin,

Zizhong Ye and Gordon Liu are schoolmates at Chadbroune Elementry School.

Here are a few example sentences we currently support:

Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to test@presidio.site,  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.

John Smith called Sarah Jane at 321-456-7098 and told her to meet him at 1112 Market Street

During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided me with his personal details. His email is johndoe@example.com and his contact number is 650-456-7890. He lives in New York City, USA, and belongs to the American nationality with Christian beliefs and a leaning towards the Democratic party. He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111 and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19. Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for which is 123456789. He emphasized not to share his SSN, which is 669-45-6789. Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1 and has a medical license number MED-123456.

key: 16charEncryptKey16charEncryptKey

Analysis results:  [type: DATE_TIME, start: 1, end: 7, score: 1.0, type: PERSON, start: 19, end: 28, score: 1.0, type: PERSON, start: 29, end: 34, score: 1.0, type: PERSON, start: 105, end: 109, score: 1.0, type: PERSON, start: 110, end: 115, score: 1.0, type: AGE, start: 121, end: 123, score: 1.0, type: PERSON, start: 215, end: 220, score: 1.0, type: PERSON, start: 539, end: 543, score: 1.0, type: PERSON, start: 709, end: 713, score: 1.0, type: PERSON, start: 1077, end: 1081, score: 1.0, type: LOCATION, start: 1221, end: 1224, score: 1.0, type: LOCATION, start: 1225, end: 1234, score: 1.0, type: PERSON, start: 1371, end: 1377, score: 1.0, type: PERSON, start: 1387, end: 1389, score: 1.0, type: PERSON, start: 1394, end: 1400, score: 1.0, type: PERSON, start: 1401, end: 1404, score: 1.0, type: PERSON, start: 1528, end: 1533, score: 1.0, type: PERSON, start: 1534, end: 1541, score: 1.0, type: LOCATION, start: 1556, end: 1561, score: 1.0, type: CREDIT_CARD, start: 1588, end: 1607, score: 1.0, type: CRYPTO, start: 1635, end: 1669, score: 1.0, type: DATE_TIME, start: 1675, end: 1684, score: 1.0, type: DATE_TIME, start: 1685, end: 1687, score: 1.0, type: EMAIL_ADDRESS, start: 1733, end: 1751, score: 1.0, type: PHONE_NUMBER, start: 1824, end: 1829, score: 1.0, type: PHONE_NUMBER, start: 1830, end: 1838, score: 1.0, type: IBAN_CODE, start: 1892, end: 1915, score: 1.0, type: PERSON, start: 2066, end: 2070, score: 1.0, type: PERSON, start: 2071, end: 2076, score: 1.0, type: PERSON, start: 2084, end: 2089, score: 1.0, type: PERSON, start: 2090, end: 2094, score: 1.0, type: UK_NHS, start: 2098, end: 2110, score: 1.0, type: PHONE_NUMBER, start: 2098, end: 2101, score: 1.0, type: LOCATION, start: 2151, end: 2157, score: 1.0, type: DATE_TIME, start: 2188, end: 2200, score: 1.0, type: PERSON, start: 2220, end: 2224, score: 1.0, type: PERSON, start: 2225, end: 2228, score: 1.0, type: EMAIL_ADDRESS, start: 2281, end: 2300, score: 1.0, type: PHONE_NUMBER, start: 2327, end: 2330, 
score: 1.0, type: LOCATION, start: 2353, end: 2361, score: 1.0, type: LOCATION, start: 2362, end: 2366, score: 1.0, type: LOCATION, start: 2368, end: 2371, score: 1.0, type: CREDIT_CARD, start: 2551, end: 2570, score: 1.0, type: CRYPTO, start: 2618, end: 2652, score: 1.0, type: IBAN_CODE, start: 2719, end: 2746, score: 1.0, type: PERSON, start: 2819, end: 2823, score: 1.0, type: PHONE_NUMBER, start: 3200, end: 3203, score: 1.0, type: PERSON, start: 1195, end: 1200, score: 0.9900000095367432, type: PHONE_NUMBER, start: 1588, end: 1592, score: 0.9900000095367432, type: PHONE_NUMBER, start: 1766, end: 1769, score: 0.9900000095367432, type: LOCATION, start: 2139, end: 2150, score: 0.9900000095367432, type: DATE_TIME, start: 2201, end: 2205, score: 0.9900000095367432, type: ORGANIZATION, start: 2473, end: 2478, score: 0.9900000095367432, type: LOCATION, start: 2851, end: 2853, score: 0.9900000095367432, type: DATE_TIME, start: 8, end: 12, score: 0.9800000190734863, type: PHONE_NUMBER, start: 1774, end: 1775, score: 0.9800000190734863, type: PHONE_NUMBER, start: 2551, end: 2565, score: 0.9800000190734863, type: DATE_TIME, start: 40, end: 43, score: 0.9700000286102295, type: PHONE_NUMBER, start: 2101, end: 2105, score: 0.9700000286102295, type: ORGANIZATION, start: 1445, end: 1451, score: 0.9599999785423279, type: EMAIL, start: 2281, end: 2282, score: 0.9599999785423279, type: IP_ADDRESS, start: 1766, end: 1777, score: 0.95, type: URL, start: 2789, end: 2817, score: 0.95, type: IP_ADDRESS, start: 3200, end: 3211, score: 0.95, type: PHONE_NUMBER, start: 2974, end: 2977, score: 0.949999988079071, type: PHONE_NUMBER, start: 3108, end: 3116, score: 0.9399999976158142, type: PHONE_NUMBER, start: 2977, end: 2983, score: 0.9300000071525574, type: PHONE_NUMBER, start: 3105, end: 3108, score: 0.9100000262260437, type: ORGANIZATION, start: 2289, end: 2296, score: 0.8999999761581421, type: PHONE_NUMBER, start: 1770, end: 1773, score: 0.8899999856948853, type: PHONE_NUMBER, start: 
2330, end: 2339, score: 0.8799999952316284, type: PHONE_NUMBER, start: 1592, end: 1607, score: 0.8600000143051147, type: ORGANIZATION, start: 2462, end: 2472, score: 0.8600000143051147, type: ORGANIZATION, start: 1424, end: 1434, score: 0.8500000238418579, type: US_SSN, start: 2014, end: 2025, score: 0.85, type: US_ITIN, start: 2974, end: 2985, score: 0.85, type: US_SSN, start: 3105, end: 3116, score: 0.85, type: PHONE_NUMBER, start: 2106, end: 2110, score: 0.8199999928474426, type: ORGANIZATION, start: 3245, end: 3248, score: 0.8100000023841858, type: PHONE_NUMBER, start: 2566, end: 2569, score: 0.7900000214576721, type: PHONE_NUMBER, start: 2729, end: 2738, score: 0.7900000214576721, type: PHONE_NUMBER, start: 2015, end: 2025, score: 0.7699999809265137, type: PERSON, start: 1378, end: 1379, score: 0.7599999904632568, type: PERSON, start: 1377, end: 1378, score: 0.75, type: PHONE_NUMBER, start: 1824, end: 1838, score: 1.0, type: PHONE_NUMBER, start: 2014, end: 2025, score: 0.7699999809265137, type: PHONE_NUMBER, start: 2327, end: 2339, score: 1.0, type: PERSON, start: 2284, end: 2288, score: 0.7400000095367432, type: ORGANIZATION, start: 1435, end: 1444, score: 0.7200000286102295, type: LOCATION, start: 2392, end: 2400, score: 0.7200000286102295, type: DATE_TIME, start: 43, end: 50, score: 0.699999988079071, type: PERSON, start: 2282, end: 2284, score: 0.6899999976158142, type: ORGANIZATION, start: 1698, end: 1703, score: 0.6700000166893005, type: ORGANIZATION, start: 2800, end: 2801, score: 0.6700000166893005, type: US_DRIVER_LICENSE, start: 2054, end: 2062, score: 0.6499999999999999, type: US_DRIVER_LICENSE, start: 2951, end: 2960, score: 0.6499999999999999, type: PERSON, start: 1379, end: 1386, score: 0.6499999761581421, type: DATE_TIME, start: 40, end: 50, score: 0.9700000286102295, type: PHONE_NUMBER, start: 2569, end: 2570, score: 0.5799999833106995, type: OTHERPHI, start: 1703, end: 1707, score: 0.5099999904632568, type: PERSON, start: 1981, end: 1985, 
score: 0.5099999904632568, type: US_ITIN, start: 56, end: 67, score: 0.5, type: URL, start: 1698, end: 1711, score: 0.5, type: URL, start: 1738, end: 1749, score: 0.5, type: URL, start: 2289, end: 2300, score: 0.5, type: ID, start: 2744, end: 2746, score: 0.5, type: PHONE_NUMBER, start: 62, end: 63, score: 0.49000000953674316, type: ID, start: 1793, end: 1799, score: 0.49, type: ID, start: 1892, end: 1909, score: 0.49, type: ID, start: 2907, end: 2920, score: 0.49, type: ID, start: 56, end: 62, score: 0.48, type: ID, start: 2014, end: 2015, score: 0.48, type: ID, start: 1966, end: 1976, score: 0.46, type: ID, start: 3049, end: 3056, score: 0.46, type: ID, start: 2054, end: 2059, score: 0.45, type: ID, start: 1635, end: 1644, score: 0.44, type: ID, start: 2719, end: 2726, score: 0.44, type: ID, start: 2739, end: 2743, score: 0.43, type: US_PASSPORT, start: 1793, end: 1802, score: 0.4, type: US_BANK_NUMBER, start: 1966, end: 1978, score: 0.4, type: PHONE_NUMBER, start: 2098, end: 2110, score: 1.0, type: ID, start: 2727, end: 2728, score: 0.4, type: US_BANK_NUMBER, start: 2907, end: 2923, score: 0.4, type: ID, start: 2951, end: 2955, score: 0.4, type: US_PASSPORT, start: 3049, end: 3058, score: 0.4, type: US_DRIVER_LICENSE, start: 3249, end: 3255, score: 0.4, type: ID, start: 1650, end: 1655, score: 0.39, type: ID, start: 2955, end: 2960, score: 0.39, type: ID, start: 2650, end: 2652, score: 0.38, type: DATE_TIME, start: 2789, end: 2794, score: 0.36000001430511475, type: ID, start: 3056, end: 3058, score: 0.35, type: ID, start: 3248, end: 3252, score: 0.34, type: ID, start: 1913, end: 1915, score: 0.32, type: ID, start: 1976, end: 1978, score: 0.32, type: ID, start: 2626, end: 2628, score: 0.32, type: ID, start: 2983, end: 2985, score: 0.32, type: PERSON, start: 2798, end: 2800, score: 0.3100000023841858, type: ID, start: 2618, end: 2621, score: 0.31, type: ID, start: 2628, end: 2629, score: 0.31, type: ID, start: 2643, end: 2646, score: 0.31, type: PHONE_NUMBER, 
start: 1776, end: 1777, score: 0.30000001192092896, type: ORGANIZATION, start: 2797, end: 2798, score: 0.30000001192092896, type: ID, start: 2629, end: 2631, score: 0.3, type: ID, start: 2726, end: 2727, score: 0.3, type: ID, start: 2640, end: 2641, score: 0.28, type: ID, start: 2641, end: 2643, score: 0.28, type: ID, start: 1799, end: 1802, score: 0.27, type: ID, start: 2632, end: 2638, score: 0.26, type: OTHERPHI, start: 1708, end: 1711, score: 0.23000000417232513, type: ID, start: 2638, end: 2640, score: 0.21, type: ID, start: 63, end: 67, score: 0.2, type: ID, start: 2621, end: 2625, score: 0.19, type: ID, start: 2625, end: 2626, score: 0.16, type: ID, start: 2920, end: 2923, score: 0.16, type: US_PASSPORT, start: 2951, end: 2960, score: 0.1, type: US_BANK_NUMBER, start: 1793, end: 1802, score: 0.05, type: US_SSN, start: 1793, end: 1802, score: 0.05, type: US_BANK_NUMBER, start: 3049, end: 3058, score: 0.05, type: US_DRIVER_LICENSE, start: 1793, end: 1802, score: 0.01, type: US_DRIVER_LICENSE, start: 1966, end: 1978, score: 0.01, type: US_DRIVER_LICENSE, start: 2907, end: 2923, score: 0.01, type: US_DRIVER_LICENSE, start: 3049, end: 3058, score: 0.01]
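
To check a result list like the one above for spans that could trigger this behavior, a small helper can flag pairs that intersect without one containing the other. This is a minimal sketch using plain tuples rather than Presidio's result objects. The helper name and tuple layout are assumptions made for illustration. The sample data are the three results that cover "MED-123456".

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


def partial_overlaps(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return pairs of spans that intersect without one containing the other."""
    ordered = sorted(spans, key=lambda s: (s[1], s[2]))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[1] >= a[2]:
                break  # sorted by start, so no later span can intersect a
            a_contains_b = a[1] <= b[1] and b[2] <= a[2]
            b_contains_a = b[1] <= a[1] and a[2] <= b[2]
            if not (a_contains_b or b_contains_a):
                pairs.append((a, b))
    return pairs


# Offsets copied from the analysis results in this thread.
tail = [
    ("ORGANIZATION", 3245, 3248),
    ("ID", 3248, 3252),
    ("US_DRIVER_LICENSE", 3249, 3255),
]

for left, right in partial_overlaps(tail):
    print(left, "partially overlaps", right)
```

For these three results the helper reports only the ID and US_DRIVER_LICENSE pair. ORGANIZATION ends exactly where ID starts, so those two are adjacent and do not intersect.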

sanitized_results:
text:

/oSOg6iCSSvrWeZlXxu68BOeKmiTzcNzQsnJGhBuE14= BXBJ6eCU59a5nGvzGtXkVd5oOjJWZ3606NWi6vUgna8=
Name: N4/k/tVfrGcIMHuiEeB4tzn1OPnvfqItq2GsaYL6DzE= ph8f+GFGdzb0kJ7jtupBDHhQmRah/peKV/UgXXEJxxQ=
DOB: doSS1fEZlEXjD/4dpBBgX9AfDo1MBQ6a9LIQmuBM/Zs=
SSN: 2HhEMucehDL/N9PB25Give8hbskDdkX6PKRVbbmBy3c=
Dear DDS Examiner:
Introduction:
Mr. xqYsVNVNr18ennd01WUFwd7uN6H2VMU4ciOoEG0WctI= /VUZ38hgaW8oIOqXKO/V5rhRJapPgYksLqPWPYfsabI= is a D/VKKs+lEPKpi0u8sM4GrCgFl5iRa8DYA0X6gj2D5WA=-year-old man who has been experiencing homelessness on and off for all
his adult life. Mr. NliXY4ki34IfIbzZtjE3uNftlnT32WVvoyJNayCdekY= says he is about 5’5" and weighs approximately 129 lbs. He presents as
very thin, typically wearing a clean white undershirt and loose-fitting khaki shorts at interviews.
His brown hair is disheveled and dirty looking, and he constantly fidgets and shakes his hand or
knee during interviews. Despite his best efforts, VsABjcnQqmUm/j03n4MKg2DqpFCr4pqITtmMifENZeE= is a poor historian. In interviews with this
writer, he needed constant redirecting and prompting to provide information about his
personal and psychiatric history. lBruvd9kF+rvdor093uxwhDtSKL/UK55A3DI+oSywtE= is diagnosed with Major Depressive Disorder; recurrent,
Anxiety Disorder, Attention Deficit Hyperactivity Disorder, Intermittent Explosive Disorder, and
a possible traumatic brain injury. Physically, he has degenerative disc disease, Lumbar
radiculopathy, Allergic Rhinitis, and a history of fainting since childhood. When asked why
working is difficult for him, 9ck+Cm18StxyLGQyKNvC2jBXJmrMpWU4sB8ZrFU1kAM= responded "I have a hard time controlling myself. When I get
stressed out, I immediately shut down."

My name is BKa65ekjqE4WQItErmVhMA/2OOOJN22KHfjgvsCa8so= and I plan to go to s4EvJlsKlpLYKD0zpGUdfft9ShuIEhrPzDzH7jSYEts= jCm9dzARnqHI0iJKC5OMieNLge4kdoVGm8grvb3YlAI= later today. While there I want to buy 5 apples for 4 dollars each, and 10 bananas for 3 dollars each. How much will this cost me?

Hi, Sc9T66XdTiYZ67ZsDXtIt61RjH3Ix4bmDrQzlzHrMU0=WyuALmGVkddrBmBg1hT/y2A5j9xhPrNZ1Ej9CLwbIhg=vDR2T7oK/yvou0saRKzPv5lYKzglBLfi6X0eIFYBJJo=6O/l3Kvxs0LqR9MacXjsndvYIwJy0amzv0DXByXElw8= pjeruxF2mmDdAV0TBTRzKVln6mAyJmq0G/WmQCXY5X0= and oRIwSVmhzsIRUUQdiBq9EG1nNY3jBVaF/rzNY3CeohM= MkXL66jTGCSYWjiLw+SmxnwXt0KbnQqtPkEtDeHBaZ8= are schoolmates at IeA7j0GO0zQX8wQmJkzW76yvRON8t3RWOZDO9FAigYs= dcr2T5rcxmVFxJJb27qkCbuBOOOPLGj8okogyjFxvCE= S3gKzkLnRIFWv1KvmwjDjAs578Ss/P46Y5QNqLO+mJI=.

Here are a few example sentences we currently support:

Hello, my name is BaOX4t4zf6ifgm2ynleYWx2zqI4rAqZwRfVfd5mymg8= jL/Ow3JsVqej2de+JruDmXcVImHsw2h8KEXyAwntAaw= and I live in h5T1VzxIeESGFf0Vwb3TNh0+FvuQhurUu9OVzmgfp9M=.
My credit card number is NrC5Fm+X1XsvO4ni9B1efz3eGXBUpGIja5qUJs4eJKIzWUgvexzrLDkdn1c2h8Vq and my crypto wallet id is xkDFwVdeQ0TFnoW/5WVyKnLbYTesROeB/XBYRGyOQ4Mjz7l0qgpr3DNdUB3CNGsNnfSWC2AHJjSuGQ0V21X9zA==.

On tH4kBVVvfUtQX3YMoNzYyBtyBJIK9Sg+iyWs9kg5ogw= v/5HfMn1UGlK/AMUsbUJ70kwGKg4CA+WvT8MVX8p3rI= I visited 6jafevKfBhz1CVvui9Wvk8t3BFyF18TUsSlPaEaM/IU= and sent an email to hcVH55QTg18VacjpzPcpZ0aIPONprLSNhaYmeZ2IbEOP9/mg/vPTgt7/z5v821iV,  from the IP NuUW3IpNC1Sg6HeluMAuVGa6u1Dsvfg0BZRUKm/l0l0=.

My passport: 9I/qaxhEhair6rHqgtFxlXMWz928SATTrdfJPr0fsmg= and my phone number: 9XyKeczSYzOLypCejq4vx2wb3Oac94XTodujyIyTM5E=.

This is a valid International Bank Account Number: dChBF5PuA8kcMX+ad/Hb/E57lFjvSUgvt/LwegwJKNtUxShlWKmp7vXMSVD3Ny2N . Can you please check the status on bank account qSApTPPKEfvyf9ttwBHxvR7Cwus/fnxLOY5okVAhSWg=?

jNTpQsZUHYTaFJsk8OsqQtIVhGkyw3f3IxRwgTabyKE='s social security number is v5b0SKTg7lFN2CC3BU+IDlNIQ6OD1RndYHbD4PkdweI=.  Her driver license? it is z2Gd0dTlduKZgusZZrm+E2wCi6XddWWR96QwgJjr6Pc=.

+EYE4N6zxuhpgT9dAdnoOEo1ck6FKX3u0DjH+axfNvs= KheWSsytZm/hc1MLoumGJNBIpykcegMJy1OzRuo8t0g= called 1AY6x7lB+gEkEOhEDO38qlKA0ZBvjcJeDBEFoXbo5MA= zm6PpJ1hQwPXKL5+kJblxJxUsOxnoDvR5c2bhDIag9M= at xwpVKZjnLWV6hktpTRAiyinDAyRvOXfsW1Tg9mvV7HI= and told her to meet him at tGQgAzg7BsNc04azpaVfL6RBbe0mmcSL9/ThFmXXEi8= 1XuhIBu/IO9l08LiItzv+PweW5qQOfvZZO1iIc5EYpU=

During our recent meeting on hbi15cSCVRclpEAaJw3DLcNokTF3ay1VYCu7ybJOVhE= 8HlY+yBPE8vadGocrI38aGuJFw6FoOoj2QmlRi+3DtQ=, at 10:30 AM, 7tFViRtxe5BchoD4nEIVSpYuM5mU0lJQLzW6QXxyCq0= MWrvdw1m3gbR16/rp0JPHduUB5sOpng9uo2/6n1CuCA= provided me with his personal details. His email is oVi4cdXrSs26rjglrmsEOIILOsCYhAyIapd8By4ZLIuVf2BLazMvLNDVmSWfrjUU and his contact number is OIoGnYJJKjxN8RL6DW7vzc6oKn/X9z6c60iFX87uBaY=. He lives in F1ZO9fpP8Zkclxkriwy1+xDSWlrACdAM8SgvvR2lz8o= enmAbh33dzbXygPTLVWTeTTT0tEDZ6WsIhyzanx/iUs=, EiLAs79xWju6oeJoFKLs2eTbcxSzeVl615wK2sAs/nI=, and belongs to the FuXlfgMW3OMA02MUoP3n+kvSoMxRpq+RleU6+4iBNEA= nationality with Christian beliefs and a leaning towards the YOV6XhhFbx3ZqxVl/4vTUYd/tswrOCwmvxU4pdnSJm8= gUSPxSEM70gTyMgO/Y4BuhCVzmXP70wTHVrZohzIuY4=. He mentioned that he recently made a transaction using his credit card PfDnKDp8kXwQZBDHS9O6zh+JFSCT2lqIAK+H7A6m7q4WS3qZUy0ZfoVjUvQFzj6a and transferred bitcoins to the wallet address fvh22oXJ70PkniEc+lamum+NlRFA9N0sjb4+azxrOLRc2H/ZOCiA97/Uaazz4FOgETZNhe1CKpwQWG5QpgdHxA==. While discussing his European travels, he noted down his IBAN as tfMNvU/H0GNfpg+L4maWqNx4cNEBiUTj6OLneia1DFa/jERz88GZ9Fzx8GIGONLx. Additionally, he provided his website as adBZKm3695s64cOXz+ZTdV/idRt/ag9q323/Os9jRBYvMElZt024Ut8nTReueC2L. 35HBMp9dKm0gOCONK7+d88JBKWwTnNVFy+mJ9ImRKl8= also discussed some of his TpRc7LDRHcwMPRRXm2NNNUNUae1RL7p6vxFqQPrE8Ko=-specific details. He said his bank account number is xUlkzkuVhON7kJGLbXViSpoC9phr41g8tm93l/H1jHMfwp87lubfiz5Yzr7sKyLN and his drivers license is la252+9F+ZmzPPeJO8XHT+tQaiBl9ypeb+b7/qiA4fo=. His ITIN is /2uD/S1LvqhXJkjjieKNbIwcyB6gwRvpbNs1leIh6LI=, and he recently renewed his passport, the number for which is 9HdRadE311qIMDzFAE77QNrt1kZnniYb0NYRHpMoseI=. He emphasized not to share his SSN, which is pIsW+gc75pCKMJOrVK5c4+v0MrOfkjGeYeFQXaJlKto=. 
Furthermore, he mentioned that he accesses his work files remotely through the IP hyR4W4fanUhn3FOZgpWOwHr6EZibPlzU2jeAOesbhyI= and has a medical license number c0Ooguq0cTCKtNJYMe0y5tuU9GW7puSkbxugbu1pvKA=kyLs2EJZA9yV41kqJwUQZj4NcFgXE6SY533sIXlNBJ4=Jc6WwHcM3QUw9ZMPFRv6xae6OvQRoDytLls16zyOvwQ=.

items:

[
    {'start': 6172, 'end': 6216, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'Jc6WwHcM3QUw9ZMPFRv6xae6OvQRoDytLls16zyOvwQ=', 'operator': 'encrypt'},
    {'start': 6128, 'end': 6172, 'entity_type': 'ID', 'text': 'kyLs2EJZA9yV41kqJwUQZj4NcFgXE6SY533sIXlNBJ4=', 'operator': 'encrypt'},
    {'start': 6084, 'end': 6128, 'entity_type': 'ORGANIZATION', 'text': 'c0Ooguq0cTCKtNJYMe0y5tuU9GW7puSkbxugbu1pvKA=', 'operator': 'encrypt'},
    {'start': 6006, 'end': 6050, 'entity_type': 'IP_ADDRESS', 'text': 'hyR4W4fanUhn3FOZgpWOwHr6EZibPlzU2jeAOesbhyI=', 'operator': 'encrypt'},
    {'start': 5878, 'end': 5922, 'entity_type': 'US_SSN', 'text': 'pIsW+gc75pCKMJOrVK5c4+v0MrOfkjGeYeFQXaJlKto=', 'operator': 'encrypt'},
    {'start': 5787, 'end': 5831, 'entity_type': 'US_PASSPORT', 'text': '9HdRadE311qIMDzFAE77QNrt1kZnniYb0NYRHpMoseI=', 'operator': 'encrypt'},
    {'start': 5679, 'end': 5723, 'entity_type': 'US_ITIN', 'text': '/2uD/S1LvqhXJkjjieKNbIwcyB6gwRvpbNs1leIh6LI=', 'operator': 'encrypt'},
    {'start': 5621, 'end': 5665, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'la252+9F+ZmzPPeJO8XHT+tQaiBl9ypeb+b7/qiA4fo=', 'operator': 'encrypt'},
    {'start': 5529, 'end': 5593, 'entity_type': 'US_BANK_NUMBER', 'text': 'xUlkzkuVhON7kJGLbXViSpoC9phr41g8tm93l/H1jHMfwp87lubfiz5Yzr7sKyLN', 'operator': 'encrypt'},
    {'start': 5431, 'end': 5475, 'entity_type': 'LOCATION', 'text': 'TpRc7LDRHcwMPRRXm2NNNUNUae1RL7p6vxFqQPrE8Ko=', 'operator': 'encrypt'},
    {'start': 5359, 'end': 5403, 'entity_type': 'PERSON', 'text': '35HBMp9dKm0gOCONK7+d88JBKWwTnNVFy+mJ9ImRKl8=', 'operator': 'encrypt'},
    {'start': 5293, 'end': 5357, 'entity_type': 'URL', 'text': 'adBZKm3695s64cOXz+ZTdV/idRt/ag9q323/Os9jRBYvMElZt024Ut8nTReueC2L', 'operator': 'encrypt'},
    {'start': 5186, 'end': 5250, 'entity_type': 'IBAN_CODE', 'text': 'tfMNvU/H0GNfpg+L4maWqNx4cNEBiUTj6OLneia1DFa/jERz88GZ9Fzx8GIGONLx', 'operator': 'encrypt'},
    {'start': 5031, 'end': 5119, 'entity_type': 'CRYPTO', 'text': 'fvh22oXJ70PkniEc+lamum+NlRFA9N0sjb4+azxrOLRc2H/ZOCiA97/Uaazz4FOgETZNhe1CKpwQWG5QpgdHxA==', 'operator': 'encrypt'},
    {'start': 4919, 'end': 4983, 'entity_type': 'CREDIT_CARD', 'text': 'PfDnKDp8kXwQZBDHS9O6zh+JFSCT2lqIAK+H7A6m7q4WS3qZUy0ZfoVjUvQFzj6a', 'operator': 'encrypt'},
    {'start': 4802, 'end': 4846, 'entity_type': 'ORGANIZATION', 'text': 'gUSPxSEM70gTyMgO/Y4BuhCVzmXP70wTHVrZohzIuY4=', 'operator': 'encrypt'},
    {'start': 4757, 'end': 4801, 'entity_type': 'ORGANIZATION', 'text': 'YOV6XhhFbx3ZqxVl/4vTUYd/tswrOCwmvxU4pdnSJm8=', 'operator': 'encrypt'},
    {'start': 4651, 'end': 4695, 'entity_type': 'LOCATION', 'text': 'FuXlfgMW3OMA02MUoP3n+kvSoMxRpq+RleU6+4iBNEA=', 'operator': 'encrypt'},
    {'start': 4586, 'end': 4630, 'entity_type': 'LOCATION', 'text': 'EiLAs79xWju6oeJoFKLs2eTbcxSzeVl615wK2sAs/nI=', 'operator': 'encrypt'},
    {'start': 4540, 'end': 4584, 'entity_type': 'LOCATION', 'text': 'enmAbh33dzbXygPTLVWTeTTT0tEDZ6WsIhyzanx/iUs=', 'operator': 'encrypt'},
    {'start': 4495, 'end': 4539, 'entity_type': 'LOCATION', 'text': 'F1ZO9fpP8Zkclxkriwy1+xDSWlrACdAM8SgvvR2lz8o=', 'operator': 'encrypt'},
    {'start': 4437, 'end': 4481, 'entity_type': 'PHONE_NUMBER', 'text': 'OIoGnYJJKjxN8RL6DW7vzc6oKn/X9z6c60iFX87uBaY=', 'operator': 'encrypt'},
    {'start': 4346, 'end': 4410, 'entity_type': 'EMAIL_ADDRESS', 'text': 'oVi4cdXrSs26rjglrmsEOIILOsCYhAyIapd8By4ZLIuVf2BLazMvLNDVmSWfrjUU', 'operator': 'encrypt'},
    {'start': 4249, 'end': 4293, 'entity_type': 'PERSON', 'text': 'MWrvdw1m3gbR16/rp0JPHduUB5sOpng9uo2/6n1CuCA=', 'operator': 'encrypt'},
    {'start': 4204, 'end': 4248, 'entity_type': 'PERSON', 'text': '7tFViRtxe5BchoD4nEIVSpYuM5mU0lJQLzW6QXxyCq0=', 'operator': 'encrypt'},
    {'start': 4145, 'end': 4189, 'entity_type': 'DATE_TIME', 'text': '8HlY+yBPE8vadGocrI38aGuJFw6FoOoj2QmlRi+3DtQ=', 'operator': 'encrypt'},
    {'start': 4100, 'end': 4144, 'entity_type': 'DATE_TIME', 'text': 'hbi15cSCVRclpEAaJw3DLcNokTF3ay1VYCu7ybJOVhE=', 'operator': 'encrypt'},
    {'start': 4025, 'end': 4069, 'entity_type': 'LOCATION', 'text': '1XuhIBu/IO9l08LiItzv+PweW5qQOfvZZO1iIc5EYpU=', 'operator': 'encrypt'},
    {'start': 3980, 'end': 4024, 'entity_type': 'LOCATION', 'text': 'tGQgAzg7BsNc04azpaVfL6RBbe0mmcSL9/ThFmXXEi8=', 'operator': 'encrypt'},
    {'start': 3907, 'end': 3951, 'entity_type': 'PHONE_NUMBER', 'text': 'xwpVKZjnLWV6hktpTRAiyinDAyRvOXfsW1Tg9mvV7HI=', 'operator': 'encrypt'},
    {'start': 3859, 'end': 3903, 'entity_type': 'PERSON', 'text': 'zm6PpJ1hQwPXKL5+kJblxJxUsOxnoDvR5c2bhDIag9M=', 'operator': 'encrypt'},
    {'start': 3814, 'end': 3858, 'entity_type': 'PERSON', 'text': '1AY6x7lB+gEkEOhEDO38qlKA0ZBvjcJeDBEFoXbo5MA=', 'operator': 'encrypt'},
    {'start': 3762, 'end': 3806, 'entity_type': 'PERSON', 'text': 'KheWSsytZm/hc1MLoumGJNBIpykcegMJy1OzRuo8t0g=', 'operator': 'encrypt'},
    {'start': 3717, 'end': 3761, 'entity_type': 'PERSON', 'text': '+EYE4N6zxuhpgT9dAdnoOEo1ck6FKX3u0DjH+axfNvs=', 'operator': 'encrypt'},
    {'start': 3669, 'end': 3713, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'z2Gd0dTlduKZgusZZrm+E2wCi6XddWWR96QwgJjr6Pc=', 'operator': 'encrypt'},
    {'start': 3596, 'end': 3640, 'entity_type': 'US_SSN', 'text': 'v5b0SKTg7lFN2CC3BU+IDlNIQ6OD1RndYHbD4PkdweI=', 'operator': 'encrypt'},
    {'start': 3523, 'end': 3567, 'entity_type': 'PERSON', 'text': 'jNTpQsZUHYTaFJsk8OsqQtIVhGkyw3f3IxRwgTabyKE=', 'operator': 'encrypt'},
    {'start': 3476, 'end': 3520, 'entity_type': 'US_BANK_NUMBER', 'text': 'qSApTPPKEfvyf9ttwBHxvR7Cwus/fnxLOY5okVAhSWg=', 'operator': 'encrypt'},
    {'start': 3361, 'end': 3425, 'entity_type': 'IBAN_CODE', 'text': 'dChBF5PuA8kcMX+ad/Hb/E57lFjvSUgvt/LwegwJKNtUxShlWKmp7vXMSVD3Ny2N', 'operator': 'encrypt'},
    {'start': 3263, 'end': 3307, 'entity_type': 'PHONE_NUMBER', 'text': '9XyKeczSYzOLypCejq4vx2wb3Oac94XTodujyIyTM5E=', 'operator': 'encrypt'},
    {'start': 3197, 'end': 3241, 'entity_type': 'US_PASSPORT', 'text': '9I/qaxhEhair6rHqgtFxlXMWz928SATTrdfJPr0fsmg=', 'operator': 'encrypt'},
    {'start': 3137, 'end': 3181, 'entity_type': 'IP_ADDRESS', 'text': 'NuUW3IpNC1Sg6HeluMAuVGa6u1Dsvfg0BZRUKm/l0l0=', 'operator': 'encrypt'},
    {'start': 3058, 'end': 3122, 'entity_type': 'EMAIL_ADDRESS', 'text': 'hcVH55QTg18VacjpzPcpZ0aIPONprLSNhaYmeZ2IbEOP9/mg/vPTgt7/z5v821iV', 'operator': 'encrypt'},
    {'start': 2992, 'end': 3036, 'entity_type': 'URL', 'text': '6jafevKfBhz1CVvui9Wvk8t3BFyF18TUsSlPaEaM/IU=', 'operator': 'encrypt'},
    {'start': 2937, 'end': 2981, 'entity_type': 'DATE_TIME', 'text': 'v/5HfMn1UGlK/AMUsbUJ70kwGKg4CA+WvT8MVX8p3rI=', 'operator': 'encrypt'},
    {'start': 2892, 'end': 2936, 'entity_type': 'DATE_TIME', 'text': 'tH4kBVVvfUtQX3YMoNzYyBtyBJIK9Sg+iyWs9kg5ogw=', 'operator': 'encrypt'},
    {'start': 2798, 'end': 2886, 'entity_type': 'CRYPTO', 'text': 'xkDFwVdeQ0TFnoW/5WVyKnLbYTesROeB/XBYRGyOQ4Mjz7l0qgpr3DNdUB3CNGsNnfSWC2AHJjSuGQ0V21X9zA==', 'operator': 'encrypt'},
    {'start': 2706, 'end': 2770, 'entity_type': 'CREDIT_CARD', 'text': 'NrC5Fm+X1XsvO4ni9B1efz3eGXBUpGIja5qUJs4eJKIzWUgvexzrLDkdn1c2h8Vq', 'operator': 'encrypt'},
    {'start': 2635, 'end': 2679, 'entity_type': 'LOCATION', 'text': 'h5T1VzxIeESGFf0Vwb3TNh0+FvuQhurUu9OVzmgfp9M=', 'operator': 'encrypt'},
    {'start': 2576, 'end': 2620, 'entity_type': 'PERSON', 'text': 'jL/Ow3JsVqej2de+JruDmXcVImHsw2h8KEXyAwntAaw=', 'operator': 'encrypt'},
    {'start': 2531, 'end': 2575, 'entity_type': 'PERSON', 'text': 'BaOX4t4zf6ifgm2ynleYWx2zqI4rAqZwRfVfd5mymg8=', 'operator': 'encrypt'},
    {'start': 2410, 'end': 2454, 'entity_type': 'ORGANIZATION', 'text': 'S3gKzkLnRIFWv1KvmwjDjAs578Ss/P46Y5QNqLO+mJI=', 'operator': 'encrypt'},
    {'start': 2365, 'end': 2409, 'entity_type': 'ORGANIZATION', 'text': 'dcr2T5rcxmVFxJJb27qkCbuBOOOPLGj8okogyjFxvCE=', 'operator': 'encrypt'},
    {'start': 2320, 'end': 2364, 'entity_type': 'ORGANIZATION', 'text': 'IeA7j0GO0zQX8wQmJkzW76yvRON8t3RWOZDO9FAigYs=', 'operator': 'encrypt'},
    {'start': 2256, 'end': 2300, 'entity_type': 'PERSON', 'text': 'MkXL66jTGCSYWjiLw+SmxnwXt0KbnQqtPkEtDeHBaZ8=', 'operator': 'encrypt'},
    {'start': 2211, 'end': 2255, 'entity_type': 'PERSON', 'text': 'oRIwSVmhzsIRUUQdiBq9EG1nNY3jBVaF/rzNY3CeohM=', 'operator': 'encrypt'},
    {'start': 2162, 'end': 2206, 'entity_type': 'PERSON', 'text': 'pjeruxF2mmDdAV0TBTRzKVln6mAyJmq0G/WmQCXY5X0=', 'operator': 'encrypt'},
    {'start': 2117, 'end': 2161, 'entity_type': 'PERSON', 'text': '6O/l3Kvxs0LqR9MacXjsndvYIwJy0amzv0DXByXElw8=', 'operator': 'encrypt'},
    {'start': 2073, 'end': 2117, 'entity_type': 'PERSON', 'text': 'vDR2T7oK/yvou0saRKzPv5lYKzglBLfi6X0eIFYBJJo=', 'operator': 'encrypt'},
    {'start': 2029, 'end': 2073, 'entity_type': 'PERSON', 'text': 'WyuALmGVkddrBmBg1hT/y2A5j9xhPrNZ1Ej9CLwbIhg=', 'operator': 'encrypt'},
    {'start': 1985, 'end': 2029, 'entity_type': 'PERSON', 'text': 'Sc9T66XdTiYZ67ZsDXtIt61RjH3Ix4bmDrQzlzHrMU0=', 'operator': 'encrypt'},
    {'start': 1804, 'end': 1848, 'entity_type': 'LOCATION', 'text': 'jCm9dzARnqHI0iJKC5OMieNLge4kdoVGm8grvb3YlAI=', 'operator': 'encrypt'},
    {'start': 1759, 'end': 1803, 'entity_type': 'LOCATION', 'text': 's4EvJlsKlpLYKD0zpGUdfft9ShuIEhrPzDzH7jSYEts=', 'operator': 'encrypt'},
    {'start': 1694, 'end': 1738, 'entity_type': 'PERSON', 'text': 'BKa65ekjqE4WQItErmVhMA/2OOOJN22KHfjgvsCa8so=', 'operator': 'encrypt'},
    {'start': 1536, 'end': 1580, 'entity_type': 'PERSON', 'text': '9ck+Cm18StxyLGQyKNvC2jBXJmrMpWU4sB8ZrFU1kAM=', 'operator': 'encrypt'},
    {'start': 1128, 'end': 1172, 'entity_type': 'PERSON', 'text': 'lBruvd9kF+rvdor093uxwhDtSKL/UK55A3DI+oSywtE=', 'operator': 'encrypt'},
    {'start': 918, 'end': 962, 'entity_type': 'PERSON', 'text': 'VsABjcnQqmUm/j03n4MKg2DqpFCr4pqITtmMifENZeE=', 'operator': 'encrypt'},
    {'start': 555, 'end': 599, 'entity_type': 'PERSON', 'text': 'NliXY4ki34IfIbzZtjE3uNftlnT32WVvoyJNayCdekY=', 'operator': 'encrypt'},
    {'start': 419, 'end': 463, 'entity_type': 'AGE', 'text': 'D/VKKs+lEPKpi0u8sM4GrCgFl5iRa8DYA0X6gj2D5WA=', 'operator': 'encrypt'},
    {'start': 369, 'end': 413, 'entity_type': 'PERSON', 'text': '/VUZ38hgaW8oIOqXKO/V5rhRJapPgYksLqPWPYfsabI=', 'operator': 'encrypt'},
    {'start': 324, 'end': 368, 'entity_type': 'PERSON', 'text': 'xqYsVNVNr18ennd01WUFwd7uN6H2VMU4ciOoEG0WctI=', 'operator': 'encrypt'},
    {'start': 242, 'end': 286, 'entity_type': 'US_ITIN', 'text': '2HhEMucehDL/N9PB25Give8hbskDdkX6PKRVbbmBy3c=', 'operator': 'encrypt'},
    {'start': 192, 'end': 236, 'entity_type': 'DATE_TIME', 'text': 'doSS1fEZlEXjD/4dpBBgX9AfDo1MBQ6a9LIQmuBM/Zs=', 'operator': 'encrypt'},
    {'start': 142, 'end': 186, 'entity_type': 'PERSON', 'text': 'ph8f+GFGdzb0kJ7jtupBDHhQmRah/peKV/UgXXEJxxQ=', 'operator': 'encrypt'},
    {'start': 97, 'end': 141, 'entity_type': 'PERSON', 'text': 'N4/k/tVfrGcIMHuiEeB4tzn1OPnvfqItq2GsaYL6DzE=', 'operator': 'encrypt'},
    {'start': 46, 'end': 90, 'entity_type': 'DATE_TIME', 'text': 'BXBJ6eCU59a5nGvzGtXkVd5oOjJWZ3606NWi6vUgna8=', 'operator': 'encrypt'},
    {'start': 1, 'end': 45, 'entity_type': 'DATE_TIME', 'text': '/oSOg6iCSSvrWeZlXxu68BOeKmiTzcNzQsnJGhBuE14=', 'operator': 'encrypt'}
]

desanitized_results:
text:

May 5, 2023
Name: Carl John Smith
DOB: 04/18/1985
SSN: 999-99-9999
Dear DDS Examiner:
Introduction:
Mr. Carl Smith is a 31-year-old man who has been experiencing homelessness on and off for all
his adult life. Mr. Smith says he is about 5’5" and weighs approximately 129 lbs. He presents as
very thin, typically wearing a clean white undershirt and loose-fitting khaki shorts at interviews.
His brown hair is disheveled and dirty looking, and he constantly fidgets and shakes his hand or
knee during interviews. Despite his best efforts, Carl is a poor historian. In interviews with this
writer, he needed constant redirecting and prompting to provide information about his
personal and psychiatric history. Carl is diagnosed with Major Depressive Disorder; recurrent,
Anxiety Disorder, Attention Deficit Hyperactivity Disorder, Intermittent Explosive Disorder, and
a possible traumatic brain injury. Physically, he has degenerative disc disease, Lumbar
radiculopathy, Allergic Rhinitis, and a history of fainting since childhood. When asked why
working is difficult for him, Carl responded "I have a hard time controlling myself. When I get
stressed out, I immediately shut down."

My name is Gavin and I plan to go to San Francisco later today. While there I want to buy 5 apples for 4 dollars each, and 10 bananas for 3 dollars each. How much will this cost me?

Hi, Gavin,

Zizhong Ye and Gordon Liu are schoolmates at Chadbroune Elementry School.

Here are a few example sentences we currently support:

Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to test@presidio.site,  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.

John Smith called Sarah Jane at 321-456-7098 and told her to meet him at 1112 Market Street

During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided me with his personal details. His email is johndoe@example.com and his contact number is 650-456-7890. He lives in New York City, USA, and belongs to the American nationality with Christian beliefs and a leaning towards the Democratic party. He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111 and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19. Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for which is 123456789. He emphasized not to share his SSN, which is 669-45-6789. Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1 and has a medical license number MED-123123456.

items:

[
    {'start': 3252, 'end': 3258, 'entity_type': 'US_DRIVER_LICENSE', 'text': '123456', 'operator': 'decrypt'},
    {'start': 3248, 'end': 3252, 'entity_type': 'ID', 'text': '-123', 'operator': 'decrypt'},
    {'start': 3245, 'end': 3248, 'entity_type': 'ORGANIZATION', 'text': 'MED', 'operator': 'decrypt'},
    {'start': 3200, 'end': 3211, 'entity_type': 'IP_ADDRESS', 'text': '192.168.1.1', 'operator': 'decrypt'},
    {'start': 3105, 'end': 3116, 'entity_type': 'US_SSN', 'text': '669-45-6789', 'operator': 'decrypt'},
    {'start': 3049, 'end': 3058, 'entity_type': 'US_PASSPORT', 'text': '123456789', 'operator': 'decrypt'},
    {'start': 2974, 'end': 2985, 'entity_type': 'US_ITIN', 'text': '987-65-4321', 'operator': 'decrypt'},
    {'start': 2951, 'end': 2960, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'Y12345678', 'operator': 'decrypt'},
    {'start': 2907, 'end': 2923, 'entity_type': 'US_BANK_NUMBER', 'text': '1234567890123456', 'operator': 'decrypt'},
    {'start': 2851, 'end': 2853, 'entity_type': 'LOCATION', 'text': 'US', 'operator': 'decrypt'},
    {'start': 2819, 'end': 2823, 'entity_type': 'PERSON', 'text': 'John', 'operator': 'decrypt'},
    {'start': 2789, 'end': 2817, 'entity_type': 'URL', 'text': 'https://johndoeportfolio.com', 'operator': 'decrypt'},
    {'start': 2719, 'end': 2746, 'entity_type': 'IBAN_CODE', 'text': 'GB29 NWBK 6016 1331 9268 19', 'operator': 'decrypt'},
    {'start': 2618, 'end': 2652, 'entity_type': 'CRYPTO', 'text': '1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa', 'operator': 'decrypt'},
    {'start': 2551, 'end': 2570, 'entity_type': 'CREDIT_CARD', 'text': '4111 1111 1111 1111', 'operator': 'decrypt'},
    {'start': 2473, 'end': 2478, 'entity_type': 'ORGANIZATION', 'text': 'party', 'operator': 'decrypt'},
    {'start': 2462, 'end': 2472, 'entity_type': 'ORGANIZATION', 'text': 'Democratic', 'operator': 'decrypt'},
    {'start': 2392, 'end': 2400, 'entity_type': 'LOCATION', 'text': 'American', 'operator': 'decrypt'},
    {'start': 2368, 'end': 2371, 'entity_type': 'LOCATION', 'text': 'USA', 'operator': 'decrypt'},
    {'start': 2362, 'end': 2366, 'entity_type': 'LOCATION', 'text': 'City', 'operator': 'decrypt'},
    {'start': 2353, 'end': 2361, 'entity_type': 'LOCATION', 'text': 'New York', 'operator': 'decrypt'},
    {'start': 2327, 'end': 2339, 'entity_type': 'PHONE_NUMBER', 'text': '650-456-7890', 'operator': 'decrypt'},
    {'start': 2281, 'end': 2300, 'entity_type': 'EMAIL_ADDRESS', 'text': 'johndoe@example.com', 'operator': 'decrypt'},
    {'start': 2225, 'end': 2228, 'entity_type': 'PERSON', 'text': 'Doe', 'operator': 'decrypt'},
    {'start': 2220, 'end': 2224, 'entity_type': 'PERSON', 'text': 'John', 'operator': 'decrypt'},
    {'start': 2201, 'end': 2205, 'entity_type': 'DATE_TIME', 'text': '2023', 'operator': 'decrypt'},
    {'start': 2188, 'end': 2200, 'entity_type': 'DATE_TIME', 'text': 'February 23,', 'operator': 'decrypt'},
    {'start': 2151, 'end': 2157, 'entity_type': 'LOCATION', 'text': 'Street', 'operator': 'decrypt'},
    {'start': 2139, 'end': 2150, 'entity_type': 'LOCATION', 'text': '1112 Market', 'operator': 'decrypt'},
    {'start': 2098, 'end': 2110, 'entity_type': 'PHONE_NUMBER', 'text': '321-456-7098', 'operator': 'decrypt'},
    {'start': 2090, 'end': 2094, 'entity_type': 'PERSON', 'text': 'Jane', 'operator': 'decrypt'},
    {'start': 2084, 'end': 2089, 'entity_type': 'PERSON', 'text': 'Sarah', 'operator': 'decrypt'},
    {'start': 2071, 'end': 2076, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 2066, 'end': 2070, 'entity_type': 'PERSON', 'text': 'John', 'operator': 'decrypt'},
    {'start': 2054, 'end': 2062, 'entity_type': 'US_DRIVER_LICENSE', 'text': '1234567A', 'operator': 'decrypt'},
    {'start': 2014, 'end': 2025, 'entity_type': 'US_SSN', 'text': '078-05-1126', 'operator': 'decrypt'},
    {'start': 1981, 'end': 1985, 'entity_type': 'PERSON', 'text': 'Kate', 'operator': 'decrypt'},
    {'start': 1966, 'end': 1978, 'entity_type': 'US_BANK_NUMBER', 'text': '954567876544', 'operator': 'decrypt'},
    {'start': 1892, 'end': 1915, 'entity_type': 'IBAN_CODE', 'text': 'IL150120690000003111111', 'operator': 'decrypt'},
    {'start': 1824, 'end': 1838, 'entity_type': 'PHONE_NUMBER', 'text': '(212) 555-1234', 'operator': 'decrypt'},
    {'start': 1793, 'end': 1802, 'entity_type': 'US_PASSPORT', 'text': '191280342', 'operator': 'decrypt'},
    {'start': 1766, 'end': 1777, 'entity_type': 'IP_ADDRESS', 'text': '192.168.0.1', 'operator': 'decrypt'},
    {'start': 1733, 'end': 1751, 'entity_type': 'EMAIL_ADDRESS', 'text': 'test@presidio.site', 'operator': 'decrypt'},
    {'start': 1698, 'end': 1711, 'entity_type': 'URL', 'text': 'microsoft.com', 'operator': 'decrypt'},
    {'start': 1685, 'end': 1687, 'entity_type': 'DATE_TIME', 'text': '18', 'operator': 'decrypt'},
    {'start': 1675, 'end': 1684, 'entity_type': 'DATE_TIME', 'text': 'September', 'operator': 'decrypt'},
    {'start': 1635, 'end': 1669, 'entity_type': 'CRYPTO', 'text': '16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ', 'operator': 'decrypt'},
    {'start': 1588, 'end': 1607, 'entity_type': 'CREDIT_CARD', 'text': '4095-2609-9393-4932', 'operator': 'decrypt'},
    {'start': 1556, 'end': 1561, 'entity_type': 'LOCATION', 'text': 'Maine', 'operator': 'decrypt'},
    {'start': 1534, 'end': 1541, 'entity_type': 'PERSON', 'text': 'Johnson', 'operator': 'decrypt'},
    {'start': 1528, 'end': 1533, 'entity_type': 'PERSON', 'text': 'David', 'operator': 'decrypt'},
    {'start': 1445, 'end': 1451, 'entity_type': 'ORGANIZATION', 'text': 'School', 'operator': 'decrypt'},
    {'start': 1435, 'end': 1444, 'entity_type': 'ORGANIZATION', 'text': 'Elementry', 'operator': 'decrypt'},
    {'start': 1424, 'end': 1434, 'entity_type': 'ORGANIZATION', 'text': 'Chadbroune', 'operator': 'decrypt'},
    {'start': 1401, 'end': 1404, 'entity_type': 'PERSON', 'text': 'Liu', 'operator': 'decrypt'},
    {'start': 1394, 'end': 1400, 'entity_type': 'PERSON', 'text': 'Gordon', 'operator': 'decrypt'},
    {'start': 1387, 'end': 1389, 'entity_type': 'PERSON', 'text': 'Ye', 'operator': 'decrypt'},
    {'start': 1379, 'end': 1386, 'entity_type': 'PERSON', 'text': 'Zizhong', 'operator': 'decrypt'},
    {'start': 1378, 'end': 1379, 'entity_type': 'PERSON', 'text': '\n', 'operator': 'decrypt'},
    {'start': 1377, 'end': 1378, 'entity_type': 'PERSON', 'text': '\n', 'operator': 'decrypt'},
    {'start': 1371, 'end': 1377, 'entity_type': 'PERSON', 'text': 'Gavin,', 'operator': 'decrypt'},
    {'start': 1225, 'end': 1234, 'entity_type': 'LOCATION', 'text': 'Francisco', 'operator': 'decrypt'},
    {'start': 1221, 'end': 1224, 'entity_type': 'LOCATION', 'text': 'San', 'operator': 'decrypt'},
    {'start': 1195, 'end': 1200, 'entity_type': 'PERSON', 'text': 'Gavin', 'operator': 'decrypt'},
    {'start': 1077, 'end': 1081, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 709, 'end': 713, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 539, 'end': 543, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 215, 'end': 220, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 121, 'end': 123, 'entity_type': 'AGE', 'text': '31', 'operator': 'decrypt'},
    {'start': 110, 'end': 115, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 105, 'end': 109, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 56, 'end': 67, 'entity_type': 'US_ITIN', 'text': '999-99-9999', 'operator': 'decrypt'},
    {'start': 40, 'end': 50, 'entity_type': 'DATE_TIME', 'text': '04/18/1985', 'operator': 'decrypt'},
    {'start': 29, 'end': 34, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 19, 'end': 28, 'entity_type': 'PERSON', 'text': 'Carl John', 'operator': 'decrypt'},
    {'start': 8, 'end': 12, 'entity_type': 'DATE_TIME', 'text': '2023', 'operator': 'decrypt'},
    {'start': 1, 'end': 7, 'entity_type': 'DATE_TIME', 'text': 'May 5,', 'operator': 'decrypt'}
]

Result:

Traceback (most recent call last):
  File "/home/zzz/workspace/example/pg.py", line 929, in <module>
    _test()
  File "/home/zzz/workspace/example/pg.py", line 914, in _test
    response_j = sanitize_text(j.encode())
  File "/home/zzz/workspace/example/pg.py", line 597, in sanitize_text
    assert desanitized_results.text == text
AssertionError

octaviansima commented 1 year ago

To add on to this, I'm running into the following error:

  File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/analyzer_engine.py", line 189, in analyze
    nlp_artifacts = self.nlp_engine.process_text(text, language)
  File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/nlp_engine/spacy_nlp_engine.py", line 44, in process_text
    doc = self.nlp[language](text)
  File "/home/os/.venv/lib/python3.10/site-packages/spacy/language.py", line 1047, in __call__
    error_handler(name, proc, [doc], e)
  File "/home/os/.venv/lib/python3.10/site-packages/spacy/util.py", line 1724, in raise_error
    raise e
  File "/home/os/.venv/lib/python3.10/site-packages/spacy/language.py", line 1042, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/nlp_engine/transformers_nlp_engine.py", line 71, in __call__
    doc.ents = ents
  File "spacy/tokens/doc.pyx", line 796, in spacy.tokens.doc.Doc.ents.__set__
  File "spacy/tokens/doc.pyx", line 833, in spacy.tokens.doc.Doc.set_ents
ValueError: [E1010] Unable to set entity information for token 28 which is included in more than one span in entities, blocked, missing or outside.

With the following code sample:

import transformers

from huggingface_hub import snapshot_download

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine

transformers_model = "obi/deid_roberta_i2b2"

snapshot_download(repo_id=transformers_model)

# Instantiate to make sure it's downloaded during installation and not runtime
transformers.AutoTokenizer.from_pretrained(transformers_model)
transformers.AutoModelForTokenClassification.from_pretrained(transformers_model)

# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "en",
            "model_name": {
                "spacy": "en_core_web_sm",
                "transformers": transformers_model,
            },
        }
    ],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])

# Initialize the anonymizer and deanonymizer engines
# Possibly put these into a server to avoid reinitialization
anonymizer = AnonymizerEngine()
deanonymizer = DeanonymizeEngine()

text = """
During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided 
me with his personal details. His email is johndoe@example.com and his contact 
number is 650-456-7890. He lives in New York City, USA, and belongs to the 
American nationality with Christian beliefs and a leaning towards the Democratic party. 
He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111 
and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. 
While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19. 
Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some 
of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license 
is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for 
which is 123456789. He emphasized not to share his SSN, which is 669-45-6789. 
Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1 
and has a medical license number MED-123456.
"""

analysis_results = analyzer.analyze(text=text, language="en")

I believe this is related.

omri374 commented 1 year ago

Hi @octaviansima, we are aware of this issue. Until we fix it (WIP), it is recommended to use the TransformersRecognizer approach and not the TransformerNlpEngine. This should help with your issue, but please let us know if it doesn't.

omri374 commented 1 year ago

Hi @zizhong, I tried to reproduce this but wasn't able to. Steps I've taken:

  1. Create a TransformersRecognizer and configuration using this sample
  2. Call:
    
    from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
    from presidio_analyzer.nlp_engine import NlpEngineProvider
    import spacy

    # TransformersRecognizer and BERT_DEID_CONFIGURATION come from the sample in step 1
    model_path = "obi/deid_roberta_i2b2"
    supported_entities = BERT_DEID_CONFIGURATION.get("PRESIDIO_SUPPORTED_ENTITIES")
    transformers_recognizer = TransformersRecognizer(
        model_path=model_path, supported_entities=supported_entities
    )

    # This would download a large (~500Mb) model on the first run
    transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)

    # Add transformers model to the registry
    registry = RecognizerRegistry()
    registry.add_recognizer(transformers_recognizer)
    registry.remove_recognizer("SpacyRecognizer")

    # Use small spacy model, for faster inference.
    if not spacy.util.is_package("en_core_web_sm"):
        spacy.cli.download("en_core_web_sm")

    nlp_configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }

    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

    analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)
    results = analyzer.analyze(text, language="en", return_decision_process=True)

Where text = the text you provided
3. Encrypt:
```python
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorResult, OperatorConfig
from presidio_anonymizer.operators import Decrypt

key="16charEncryptKey16charEncryptKey"

engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer)
# and an 'encrypt' operator to get an encrypted anonymization output:
anonymize_result = engine.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("encrypt", {"key": key})},
)

# Fetch the anonymized text from the result.
anonymized_text = anonymize_result.text

# Fetch the anonymized entities from the result.
anonymized_entities = anonymize_result.items
```
4. Decrypt:

    # Initialize the engine:
    engine = DeanonymizeEngine()

    # Invoke the deanonymize function with the text, anonymizer results
    # and a 'decrypt' operator to get the original text as output.
    deanonymized_result = engine.deanonymize(
        text=anonymized_text,
        entities=anonymized_entities,
        operators={"DEFAULT": OperatorConfig("decrypt", {"key": key})},
    )

    deanonymized_result.text

5. Compare: `assert text == deanonymized_result.text`

We had a few contributions to the `presidio-anonymizer` package which aren't released to PyPI yet. It could be that one of them (like #1092 or #1078) is the source of the difference.

zizhong commented 1 year ago

Thanks! The issue is with the code from 🤗 presidio-demo. It was caused by chunking overlap. I added a check that filters out the overlaps in predictions. Now the issue is resolved.

omri374 commented 1 year ago

Thanks! If you let us know what the issue was, that would be very helpful!

zizhong commented 1 year ago

@omri374 sure thing. https://huggingface.co/spaces/presidio/presidio_demo/blob/main/transformers_rec/transformers_recognizer.py#L267 Here the predictions can overlap, because there is a text_overlap_length for chunking. https://huggingface.co/spaces/presidio/presidio_demo/blob/main/transformers_rec/transformers_recognizer.py#L248

I think that is intended for the use case where only anonymize() is used. However, it becomes a problem if deanonymize() is applied.
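
For readers hitting the same symptom, here is a minimal sketch of the kind of overlap check described above. It is not the code from the demo or from Presidio: `Span`, `drop_overlapping`, `round_trip` and the scores are hypothetical names and values chosen for illustration. `round_trip` is only a simplified model of encrypt-then-decrypt, in which each entity restores its own full original text. The spans are inferred from the decrypted items reported earlier in this thread ('MED', '-123', '123456').

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one recognizer prediction."""
    start: int
    end: int
    entity_type: str
    score: float


def drop_overlapping(spans: List[Span]) -> List[Span]:
    """Greedily keep the best-scoring spans so that no two kept spans overlap."""
    kept: List[Span] = []
    # Higher score first; on ties prefer the longer span, then the earlier one.
    ordered = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    for span in ordered:
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def round_trip(text: str, spans: List[Span]) -> str:
    """Simplified model: every span puts back its own full original text."""
    pieces: List[str] = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        pieces.append(text[cursor:span.start])    # empty when the span overlaps the previous one
        pieces.append(text[span.start:span.end])  # the span's full text, overlap included
        cursor = max(cursor, span.end)
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "MED-123456"
spans = [
    Span(0, 3, "ORGANIZATION", 0.9),        # "MED"
    Span(3, 7, "ID", 0.6),                  # "-123"
    Span(4, 10, "US_DRIVER_LICENSE", 0.8),  # "123456", overlaps "-123" on "123"
]

print(round_trip(text, spans))                    # MED-123123456
print(round_trip(text, drop_overlapping(spans)))  # MED-123456
```

With the overlapping spans the shared "123" is restored twice, which matches the `MED-123123456` output in the report. Which span to keep on overlap (highest score here) is a design choice; keeping the longest span or merging the two are equally plausible alternatives.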