urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Not able to detect PII properly #168

Open Hir98 opened 1 month ago

Hir98 commented 1 month ago

My text is:

Detect SSN,DOB,CreditCard,CVV,Expiration, and Gender in the following and anonymize them by replacing with fictitious data. Note Do not not mask the data: 

Michael Brown,345-67-8901,07/30/1978,3400 0000 0000 009,789,12/23,Male,Photography, Cooking 
Jessica Davis,456-78-9012,11/05/1982,6011 0000 0000 0004,012,03/26,Female,Painting, Cycling 
Emily Johnson,234-56-7890,03/22/1990,5500 0000 0000 0004,456,08/24,Female,Traveling, Yoga 
John Smith,123-45-6789,01/15/1985,4111 1111 1111 1111,123,11/25,Male,Reading, Hiking 
David Wilson,567-89-0123,05/17/1975,3000 0000 0000 04,345,06/25,Male,Golf, Music

and my labels are:

 ["person","username",
     "email","email address",
     "address",
     "phone number","mobile phone number","landline phone number","mobile_phone_number","phone_number",
     "credit card CVV", "credit card CVC","CVV","credit card cvv"
     "social security number","security code","credit card security number","social_security_number","credit card security code","bank_account_number","bank account number",
     "driver's license number","US_SSN",
     "birth date","birthdate","date","date_of_birth","expiration date","departure date","arrival date",
     "credit card expiration date","passport issue date","card expiration date","passport expiration date","datetime"]

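One thing worth checking in the list above: there is no comma after "credit card cvv", and Python silently joins adjacent string literals. A minimal sketch of the effect, using only the two lines around the gap (the trimmed list is for illustration, not the full label set). Whether this explains the mislabelled SSNs is not established here; it only shows that the plain "social security number" label never reaches the model as written.

```python
# The labels around the missing comma, copied from the list above (trimmed).
labels = [
    "credit card CVV", "credit card CVC", "CVV", "credit card cvv"   # <- no trailing comma
    "social security number", "security code",
]

# Adjacent string literals are concatenated, so two intended labels become one.
print(len(labels))                          # 5 entries instead of the intended 6
print(labels[3])                            # credit card cvvsocial security number
print("social security number" in labels)   # False
```

Adding the comma after "credit card cvv" restores both labels.
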
but in the response I am getting:


[
  {
    'start': 8,
    'end': 11,
    'text': 'SSN',
    'label': 'social_security_number',
    'score': 0.5743955373764038
  },
  {
    'start': 27,
    'end': 30,
    'text': 'CVV',
    'label': 'CVV',
    'score': 0.5180321335792542
  },
  {
    'start': 158,
    'end': 171,
    'text': 'Michael Brown',
    'label': 'person',
    'score': 0.9997666478157043
  },
  {
    'start': 184,
    'end': 194,
    'text': '07/30/1978',
    'label': 'date_of_birth',
    'score': 0.5781332850456238
  },
  {
    'start': 214,
    'end': 217,
    'text': '789',
    'label': 'credit card CVV',
    'score': 0.3783819377422333
  },
  {
    'start': 218,
    'end': 223,
    'text': '12/23',
    'label': 'card expiration date',
    'score': 0.32416781783103943
  },
  {
    'start': 251,
    'end': 264,
    'text': 'Jessica Davis',
    'label': 'person',
    'score': 0.9983078241348267
  },
  {
    'start': 265,
    'end': 276,
    'text': '456-78-9012',
    'label': 'credit card CVV',
    'score': 0.3885277211666107
  },
  {
    'start': 277,
    'end': 287,
    'text': '11/05/1982',
    'label': 'date_of_birth',
    'score': 0.5117868781089783
  },
  {
    'start': 308,
    'end': 311,
    'text': '012',
    'label': 'credit card CVV',
    'score': 0.3981240391731262
  },
  {
    'start': 344,
    'end': 357,
    'text': 'Emily Johnson',
    'label': 'person',
    'score': 0.9977560639381409
  },
  {
    'start': 370,
    'end': 380,
    'text': '03/22/1990',
    'label': 'date_of_birth',
    'score': 0.4856725335121155
  },
  {
    'start': 435,
    'end': 445,
    'text': 'John Smith',
    'label': 'person',
    'score': 0.9992269277572632
  },
  {
    'start': 446,
    'end': 457,
    'text': '123-45-6789',
    'label': 'credit card CVV',
    'score': 0.5499760508537292
  },
  {
    'start': 458,
    'end': 468,
    'text': '01/15/1985',
    'label': 'date_of_birth',
    'score': 0.40841439366340637
  },
  {
    'start': 521,
    'end': 533,
    'text': 'David Wilson',
    'label': 'person',
    'score': 0.9988730549812317
  },
  {
    'start': 546,
    'end': 556,
    'text': '05/17/1975',
    'label': 'date_of_birth',
    'score': 0.3618524372577667
  }
]

Actually, "123-45-6789" is an SSN, but it is detected as a credit card CVV.

@urchade can you please tell me what is wrong in this?

hari-ag00 commented 2 weeks ago

Maybe you could try generating synthetic data, fine-tuning the model, and adding post-processing validators.
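
As a rough idea of what a post-processing validator could look like, here is a minimal sketch in plain Python that overrides the model's label when the matched text has an unambiguous format. The function names (`relabel_by_format`, `luhn_ok`), the regex patterns, and the "credit card number" label are assumptions for illustration, not part of GLiNER. The SSN pattern assumes the US `NNN-NN-NNNN` layout used in the sample text.

```python
import re

# Hypothetical format checks; patterns are assumptions based on the sample data.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")          # e.g. 123-45-6789
EXPIRY_RE = re.compile(r"(0[1-9]|1[0-2])/\d{2}")   # e.g. 11/25 (MM/YY)

def luhn_ok(number: str) -> bool:
    """Return True if the string looks like a card number and passes the Luhn checksum."""
    if not re.fullmatch(r"[\d -]+", number):
        return False
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def relabel_by_format(entity: dict) -> dict:
    """Return a copy of a predicted entity, overriding the label when the format is clear."""
    text = entity["text"].strip()
    fixed = dict(entity)
    if SSN_RE.fullmatch(text):
        fixed["label"] = "social_security_number"
    elif EXPIRY_RE.fullmatch(text):
        fixed["label"] = "card expiration date"
    elif luhn_ok(text):
        fixed["label"] = "credit card number"   # hypothetical label, not in the original list
    return fixed

# One of the mislabelled predictions from the issue.
predicted = {"start": 446, "end": 457, "text": "123-45-6789",
             "label": "credit card CVV", "score": 0.5499760508537292}
print(relabel_by_format(predicted)["label"])   # social_security_number
```

A check like this only corrects spans the model already returned; it does not recover entities that were never detected, such as the card numbers missing from the response above. Three-digit CVVs are left to the model because their format alone is ambiguous.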