tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

SemanticType "STREET_ADDRESS2_EN" labeled records as invalid #48

Closed by fredxing 1 year ago

fredxing commented 1 year ago

The following values, detected as Semantic Type STREET_ADDRESS2_EN, were labeled as invalid, and I am not sure why (only the first one was labeled as valid).

  @Test
  public void testFTA_Addressline2() throws Exception {
    String[] fieldnames = new String[] {"AddressLine2"};
    String[][] values = new String[][] {
      new String[] {"MIDDLEBURY, CT 06762"},
      new String[] {"DANVERS, MA 01923-3782"},
      new String[] {"SAN JOSE, CA 95123-3696"},
      new String[] {"JACKSONVILLE, FL 32202-1031"},
      new String[] {"MORIARTY, NM 87035"},
      new String[] {"ALEXANDRIA, MO 63430-9801"},
      new String[] {"BROOKSHIRE, TX 77423-9440"},
      new String[] {"CARROLL, IA 51401-9167"},
      new String[] {"BUFFALO, NY 14223"},
      new String[] {"HOUSTON, TX 77002-2526"}
    };

    AnalyzerContext context = new AnalyzerContext(null, DateTimeParser.DateResolutionMode.Auto, "ftatest", fieldnames);
    TextAnalyzer textAnalyzer = new TextAnalyzer(context);
    textAnalyzer.setLocale(Locale.getDefault());
    RecordAnalyzer analyzer = new RecordAnalyzer(textAnalyzer);

    for (String[] value : values) {
      analyzer.train(value);
    }

    for (TextAnalysisResult result : analyzer.getResult().getStreamResults()) {
      String json = result.asJSON(true, 1);
      System.out.println(json);
    }
  }

Output:

{
  "fieldName" : "AddressLine2",
  "totalCount" : -1,
  "sampleCount" : 10,
  "matchCount" : 1,
  "nullCount" : 0,
  "blankCount" : 0,
  "distinctCount" : 1,
  "regExp" : ".*",
  "confidence" : 0.9,
  "type" : "String",
  "isSemanticType" : true,
  "semanticType" : "STREET_ADDRESS2_EN",
  "min" : "MIDDLEBURY, CT 06762",
  "max" : "MIDDLEBURY, CT 06762",
  "minLength" : 17,
  "maxLength" : 27,
  "topK" : [ "MIDDLEBURY, CT 06762" ],
  "bottomK" : [ "MIDDLEBURY, CT 06762" ],
  "cardinality" : 1,
  "cardinalityDetail" : [ {
    "key" : "MIDDLEBURY, CT 06762",
    "count" : 1
  } ],
  "outlierCardinality" : 0,
  "invalidCardinality" : 9,
  "invalidDetail" : [ {
    "key" : "ALEXANDRIA, MO 63430-9801",
    "count" : 1
  }, {
    "key" : "BROOKSHIRE, TX 77423-9440",
    "count" : 1
  }, {
    "key" : "BUFFALO, NY 14223",
    "count" : 1
  }, {
    "key" : "CARROLL, IA 51401-9167",
    "count" : 1
  }, {
    "key" : "DANVERS, MA 01923-3782",
    "count" : 1
  }, {
    "key" : "HOUSTON, TX 77002-2526",
    "count" : 1
  }, {
    "key" : "JACKSONVILLE, FL 32202-1031",
    "count" : 1
  }, {
    "key" : "MORIARTY, NM 87035",
    "count" : 1
  }, {
    "key" : "SAN JOSE, CA 95123-3696",
    "count" : 1
  } ],
  "shapesCardinality" : 7,
  "shapesDetail" : [ {
    "key" : "XXXXXXX, XX 99999-9999",
    "count" : 3
  }, {
    "key" : "XXXXXXXXXX, XX 99999-9999",
    "count" : 2
  }, {
    "key" : "XXX XXXX, XX 99999-9999",
    "count" : 1
  }, {
    "key" : "XXXXXXX, XX 99999",
    "count" : 1
  }, {
    "key" : "XXXXXXXX, XX 99999",
    "count" : 1
  }, {
    "key" : "XXXXXXXXXX, XX 99999",
    "count" : 1
  }, {
    "key" : "XXXXXXXXXXXX, XX 99999-9999",
    "count" : 1
  } ],
  "leadingWhiteSpace" : false,
  "trailingWhiteSpace" : false,
  "multiline" : false,
  "keyConfidence" : 0.0,
  "uniqueness" : 1.0,
  "detectionLocale" : "en-US",
  "ftaVersion" : "14.7.2",
  "structureSignature" : "8Yl3CN3MVlSIjCrXBgZCWJskI1Q=",
  "dataSignature" : "NOJxl078OfwSpEwBQ9qQm2tTOIo="
}

tsegall commented 1 year ago

This seems like a strange test. From the web ...

Address Line 2 is a field commonly added to address forms to allow users to enter secondary address unit designators. Valid entry values include address components like the apartment, suite, room, floor, building, unit, and department numbers, along with PO Boxes.

Or from the US Government ...

Address 2: This is the second line, if needed, of an address, typically a building name or post office box number. Post office box numbers should be used only in mailing addresses.

So we are normally expecting 'Apt 168' or 'PO Box 4768' etc.

Is this case from a real customer or is it synthetic?

tsegall commented 1 year ago

So, in direct answer to your question: since the header clearly indicates that the field is 'Address Line 2' (i.e. it is a 100% match for a good header), I trust the header. When I look at the data, almost none of it looks like an Address Line 2, so I reject almost all of the values. The reason 'MIDDLEBURY, CT 06762' is returned as valid is that I see 'CT', which I assume is an abbreviation for Court.
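The behaviour described above can be illustrated with a small, self-contained sketch. This is not FTA's actual implementation: the class `AddressLine2Heuristic`, the marker list, and the method `looksLikeAddressLine2` are all hypothetical, chosen only to show how a token such as 'CT' could be read as 'Court' and let one value through while the other city/state/ZIP values are rejected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AddressLine2Heuristic {
    // Hypothetical secondary-unit markers (illustrative only, not FTA's list).
    // "CT" is included as an abbreviation for Court.
    private static final Set<String> MARKERS = new HashSet<>(Arrays.asList(
        "APT", "SUITE", "STE", "UNIT", "FLOOR", "ROOM", "BLDG", "DEPT", "BOX", "CT"));

    // Returns true if any alphanumeric token of the value is a known marker.
    static boolean looksLikeAddressLine2(String value) {
        for (String token : value.toUpperCase(Locale.ROOT).split("[^A-Z0-9]+")) {
            if (MARKERS.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Apt 168", "PO Box 4768", "MIDDLEBURY, CT 06762", "DANVERS, MA 01923-3782"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + looksLikeAddressLine2(sample));
        }
    }
}
```

Under this assumed marker list, only the Middlebury value from the original test contains a marker token, which mirrors the single valid record in the reported output.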

fredxing commented 1 year ago

Thanks Tim for the explanation. Closed it.

fredxing commented 1 year ago

One more question: is there a Semantic Type for address line 3 as defined in a USPS address?

Ms. Suzanne Smith
123 Main Street, Unit 12
Chicago, IL 12345  <------ this line?

tsegall commented 1 year ago

See https://github.com/tsegall/fta#address-detection. The short answer is: not reliably, but it seems like it should.

For the fta-generated test file below with the header set to 'Name,Addr1,Addr2' it will detect the following:

Name: NAME.FIRST_LAST
Addr1: STREET_ADDRESS_EN
Addr2: STREET_ADDRESS2_EN

However, if the header is 'Name,A1,A2', then A2 is not detected as STREET_ADDRESS2_EN.
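To make the header dependence concrete, here is a minimal sketch of how a header hint might be checked. FTA's real header patterns are not shown in this thread, so the regular expression, the class `HeaderHint`, and the method `suggestsAddressLine2` are assumptions made purely for illustration.

```java
import java.util.regex.Pattern;

public class HeaderHint {
    // Hypothetical header pattern (an assumption, not FTA's configuration).
    // Accepts forms such as "Addr2", "AddressLine2" or "Address Line 2".
    private static final Pattern ADDRESS_LINE_2 = Pattern.compile(
        "(street)?[ _]?addr(ess)?[ _]?(line)?[ _]?2", Pattern.CASE_INSENSITIVE);

    // Returns true if the whole header name matches the assumed pattern.
    static boolean suggestsAddressLine2(String header) {
        return ADDRESS_LINE_2.matcher(header.trim()).matches();
    }

    public static void main(String[] args) {
        for (String header : new String[] {"Addr2", "AddressLine2", "A2"}) {
            System.out.println(header + " -> " + suggestsAddressLine2(header));
        }
    }
}
```

With a pattern like this, 'Addr2' carries a usable hint while 'A2' carries none, so for a column named 'A2' detection would have to rest on the data alone.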

Please also note https://github.com/tsegall/fta#why-is-fta-not-detecting-the-semantic-type-xxx when you are generating synthetic data. For example, if you provide invalid zips, they will not pass zip validation and hence will confuse detection.

Let me review and get back to you.


"KAITLIN CHICK","354 Cedarstone Rd","Suzhou,NV,56228"
"GERALDINE BRIM","896 Dawson Drive","Tasikmalaya,OH,35549"
"GERALDO DELTORO","599 Chaton Avenue","Salta,WA,82081"
"GABRIELE DELUCA","290 Atha Rd","Boston,MA,66027"
"GERALYN BRINK","313 Rosemont Avenue","Jieyang,MT,05489"
"GERARD HANNULA","673 12th Street","Cotonou,LA,17951"
"GUILLERMINA CHICO","123 Eight Mile Rd.","Varanasi,WI,43413"
"GUILLERMO DELUCIA","611 Jefferson Road","Yichun Heilongjiang,NC,82414"
"GABRIELLA CHIEN","688 High Street","Cheboksary,NV,60914"
"GERARDO CHILD","398 Chatham Drive","Jinhua,WY,47740"
"KAITLYN BRINKER","127 Lakeside Terrace","Bhavnagar,PA,05440"
"GABRIELLE HANRAHAN","653 Charack Avenue","Van,MS,59935"
"KAJA DONATO","260 Buchanan St","Hargeysa,DC,28386"
"KALA DONEGAN","634 Buchanan St","Dengzhou,AS,78022"
"KALEO CHILDERS","644 Pratt St","Panjin,NE,77217"
"GERDA BRINKERHOFF","303 Cedarstone Drive","Cape Town,MO,74604"
"GERI BRINKLEY","403 Chatham Drive","Hangzhou,AS,22812"
"GAEL HANS","784 Kenwood Rd","Bengbu,IN,97535"
"ELIOT DONES","427 Randolph Terrace","Antalya,KY,60463"
"GERMAINE DONG","124 School House Rd","Jixi Heilongjiang,MT,08327"
"GAËL CHILDRESS","277 Atha Drive","Guangzhou,PA,98624"
"GUIOMAR HANSEL","629 Chatham Rd","Maputo,SC,57005"
"GERMAN CHILDS","783 Pratt Road","Yongzhou,UT,62205"
"GERONIMO HANSEN","743 Flushing St","Sapporo,WI,79601"
"GUS CHILES","935 Lakeside Drive","Mosul,CA,25090"
"GAETAN CHILTON","234 High Ave","Qingzhou,DE,66091"
"KALEVI CHIN","997 Jefferson Terrace","Haifa,AK,03268"
"KALEY HANSON","593 Overlook Avenue","Jundiai,GU,28012"
"KALI DONLEY","871 Meadow Ave","Malegaon,VI,39740"
"GAËTAN DELUNA","214 Dawson Avenue","Marrakech,KY,54839"
"GERRI HARBAUGH","227 North Point Ave","Recife,AR,31202"
"GUSSIE HARBER","82 Central Ave","Xinmi,LA,45490"
"GUSTAV DONNELL","508 Happy Hollow Rd.","Quetta,CT,74966"
"ELIOTT CHINCHILLA","264 Buchanan St","Mashhad,UT,28125"
"ELISA CHING","480 D St","Shenyang,TX,50129"
"KAM HARBIN","540 Candlelight Drive","Accra,MN,71032"
"GAIL HARBISON","6 12th Street","Nagoya,OR,33564"
"GUSTAVE BRINKMAN","973 Cedarstone Rd","Addis Ababa,CO,54011"
"KAMI HARBOUR","138 12th St","Liuzhou,KS,78370"
"GERRIT DONNELLY","422 Croydon Rd.","Cixi,CT,79786"

tsegall commented 1 year ago

See new enhancement ... Issue

tsegall commented 1 year ago

As of 15.0.1 the above test will return STREET_ADDRESS2_EN.

fredxing commented 1 year ago

It would be greatly appreciated if you could also release 15.0.1 to the Maven repository, so I can give it a try.

tsegall commented 1 year ago

Done.

fredxing commented 1 year ago

Tested, it worked as expected. Thanks!