wbuchanan / StataStringUtilities

Stata plugins to Java libraries that provide utilities for analyzing, parsing, and/or working with String data more generally.
https://wbuchanan.github.io/StataStringUtilities/

phoneticenc - issues #3

Closed sogervais closed 7 years ago

sogervais commented 7 years ago

Hi. I've been working with the strutil package on a dataset of nearly 3500 student records. I'm matching records to my master list of students and comparing names to be sure student results are encoded to the correct students.

Anyway, I'm using the phoneticenc function. The metaphone, double metaphone and nysiis methods are returning only a single value for all 3500 records (metaphone and double metaphone return "NLNL"; nysiis returns "NALNAL"). The beidermorse method hangs and produces the error shown in the screenshot below: phoneticenc_beidermorse_lname_error

Everything else seems to be working as expected.

Stephen

wbuchanan commented 7 years ago

Hi Stephen,

If you run query java you should see a few different things you can set. Try setting the heapmax value higher to avoid the out of memory issue. Are you able to provide a few string values that I can check on my end?

sogervais commented 7 years ago

Hi William. I've left work for the day and will return on Monday. I've got some examples I'll pull together for you on the encoding output.

As for the beidermorse encoding issue:

OK on heapmax - initial setting was 384m. I've raised it a couple of times, each time issuing the command phoneticenc lastname, beidermorse(beidermorse) against the 3500 names in my PSAT dataset. Here's what happens:

I've bumped up the max to 1152m. Heap usage reports 1113m at failure.
I went up to 4608m. Heap usage reports 4541m at failure.

To what value should I set heapmax?

Stephen

sogervais commented 7 years ago

Gremlins! Everything ran today without a hitch.

I reset the heapmax to the initial setting to duplicate my original conditions to provide some output.
Everything ran without a hitch. Problems with metaphone, double metaphone and nysiis producing suspect output did not return. Beider-Morse method works quickly with no memory error.

Sorry to bother you with what now appears to be a false alarm. I look forward to using the complete package.

Stephen

wbuchanan commented 7 years ago

@sogervais Awesome. Glad to hear it is working. One of the things that I ended up learning a bit later was that all HTTPS requests spin up the JVM in Stata, so it may have had something to do with the JVM already being spun up. If there are other things that you think would be useful feel free to let me know.

sogervais commented 7 years ago

Thanks for following up.
I think I have a better handle on what is happening when I run into trouble.

While I get output when I run the phoneticenc command, when I compare the first and last name encodings of the reported students I find that the encodings for very different names are virtually the same. I am not a Java programmer, but this looks to me like some cache is not being cleared between runs of the phoneticenc command.

I am also still having heapmax errors using the beidermorse encoding despite upping the heapmax level. Yesterday I had it working with a heapmax setting of 1536m. Today it is not working. Screenshot below with my output:

phonetic_trouble_ex

The commands below will let you load my same dataset to see if you are getting something different:

/* Duplicate Stephen's Phonetic Error Issues

Connect to my public dropbox, load name match file, and prepare phonetic encodings

My java setup has heapmax set to 1536 (Should this be higher?)

Stata environment Stata IC 14.2 Windows PC, Windows 7, i7 CPU with 16GB memory

*/

// Load Name File
use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear

// Produce phonetic encoding for lastname
phoneticenc lastname, caverphone1(cav1_lastname) caverphone2(cav2_lastname) dms(dms_lastname) dblm(dblm_lastname)

// Produce phonetic encoding for first name
phoneticenc firstname, caverphone1(cav1_firstname) caverphone2(cav2_firstname) dms(dms_firstname) dblm(dblm_firstname)

// Still getting the heapmax error I had previously
// uncomment to check heapmax error
// phoneticenc lastname, beiderm(bm_lastname)

// Check how different these names are and sort by similarity
strdist lastname firstname, levenshtein(lev_distance)
strdist lastname firstname, jarowinklers(jw_similarity)
sort jw_similarity

// Check if these encodings are the same
gen cav1_same = 0
replace cav1_same = 1 if cav1_lastname == cav1_firstname

gen cav2_same = 0
replace cav2_same = 1 if cav2_lastname == cav2_firstname

gen dms_same = 0
replace dms_same = 1 if dms_lastname == dms_firstname

gen dblm_same = 0
replace dblm_same = 1 if dblm_lastname == dblm_firstname

Stephen

wbuchanan commented 7 years ago

@sogervais you should be able to set the heap max fairly high and leave it there. Java should only consume the memory needed to process things. I'll try to check things out in the morning, but it is possible for some of the phonetic encodings to return similar results for seemingly different strings depending on how the encodings work internally.

sogervais commented 7 years ago

OK. I will max out the heapmax setting and see what happens tomorrow. For comparison purposes, I'm going to run the same list of names through one or more Python phonetic encoders to see how the output differs.

See what happens when you switch the order of the names being processed. (Firstname then Lastname).

wbuchanan commented 7 years ago

Can you print out a few of the results for the last name when you run things? Something bizarre seems to be happening with the name field; it might be a string encoding issue where some values are not rendered when viewing the data but still affect the bytes in the data itself. For example, when I ran some of the code you have above I was getting different values for the same last name. When I manually create a data set with those same values with a few different amounts of duplication I get the anticipated behavior:

. clear

. set obs 10
number of observations (_N) was 0, now 10

. g nm = cond(inrange(_n, 1, 5), "ABONCE", cond(inrange(_n, 6, 8), "ABDO", "ACEVEDO"))

. phoneticenc nm, caverphone1(cav1) caverphone2(cav2) dms(dms) dblm(dblm) beiderm(bm)

. li

     +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     |      nm     cav1         cav2      dms   dblm                                                                                                                            bm |
     |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  1. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
  2. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
  3. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
  4. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
  5. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
     |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  6. |    ABDO   APTN11   APTNA11111   073680   APTN                                                               YbdYnul|Ybdonul|abdYnul|abdonul|avdonul|obdYnul|obdonul|ovdonul |
  7. |    ABDO   APTN11   APTNA11111   073680   APTN                                                               YbdYnul|Ybdonul|abdYnul|abdonul|avdonul|obdYnul|obdonul|ovdonul |
  8. |    ABDO   APTN11   APTNA11111   073680   APTN                                                               YbdYnul|Ybdonul|abdYnul|abdonul|avdonul|obdYnul|obdonul|ovdonul |
  9. | ACEVEDO   ASFTN1   ASFTNA1111   047368   ASFT               akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
 10. | ACEVEDO   ASFTN1   ASFTNA1111   047368   ASFT               akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
     +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I think there are likely some string encoding issues that may not have been handled:

. use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear

. desc

Contains data from https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta
  obs:         3,275                          
 vars:             3                          10 Jan 2017 14:49
 size:       114,625                          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
stu_id          double  %9.0g                 
lastname        str15   %15s                  Last Name
firstname       str12   %12s                  First Name
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: lastname  firstname

. count if lastname == "ABONCE"
  0

. li lastname in 1/10

     +-----------------+
     |        lastname |
     |-----------------|
  1. | ABARCA          |
  2. | ABDO            |
  3. | ABDO            |
  4. | ABEKAA          |
  5. | ABEYTA          |
     |-----------------|
  6. | ABONCE          |
  7. | ABONCE          |
  8. | ABREGO          |
  9. | ACEVEDO         |
 10. | ACEVEDO         |
     +-----------------+

. loc x : di lastname[6]

. di `"`x'"'
ABONCE         

. count if lastname == "ABONCE"
  0

. count if lastname == `"`x'"'
  2

It seems entering the value ABONCE from the keyboard is not the same as the value stored in the data itself. If you are starting from a plain text file you might want to check out the unicode commands to see whether or not there are multiple string encodings in the file or if the file has a different encoding than the system was expecting. I ended up winning a bet from someone back in Minneapolis related to string encoding issues and they are usually really difficult to identify unless you've run into them a bunch in the past and are thinking about them as a potential issue all the time.
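
A minimal sketch of how such a mismatch can be made visible, written in Python because it is easy to run outside Stata. The log above prints the stored value followed by trailing spaces, so the padded value below is an assumption chosen to mimic that; in another dataset the hidden difference could be any non-printing character. The helper name `describe` is made up for illustration.

```python
# Hypothetical values: `typed` is what you would enter at the keyboard,
# `stored` mimics a value padded with trailing spaces (an assumption).
typed = "ABONCE"
stored = "ABONCE" + " " * 9


def describe(value: str) -> dict:
    """Return views of a string that expose what a plain listing hides."""
    return {
        "repr": repr(value),                      # quotes make padding visible
        "length": len(value),                     # differs if anything is hidden
        "bytes": value.encode("utf-8").hex(" "),  # raw bytes, space separated
        "non_ascii": [c for c in value if ord(c) > 127],
    }


print(describe(typed))
print(describe(stored))
print(typed == stored)          # False: the padding is part of the value
print(typed == stored.strip())  # True once the padding is removed
```

If the two values only match after stripping, the difference is padding rather than a character encoding problem, and trimming the variable before comparing or encoding it is usually enough.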

sogervais commented 7 years ago

Hi. Thanks for looking into this with me. I had run some values as you had when I thought all was fine. Switching back to my name match list, I got these weird results. I appreciate your assistance in looking into this.

My name table is being created on the fly as an import via ODBC from my MS SQL Server 2012 research database. Name values in the database are stored as varchars and are just basic ASCII on import into Stata. No difference in output whether it is from the original file or the CSV.

Output from Plaintext CSV file

     +---------------------------------------------------------------+
     |        lastname   cav1_l~e   cav2_las~e   dms_la~e   dblm_l~e |
     |---------------------------------------------------------------|
  1. | ABARCA              NLPKKP   NLPKKPTNA1     687945       NLPR |
  2. | ABDO                NLPTSR   NLPTSRNA11     687349       NLPT |
  3. | ABDO                NLPTSN   NLPTSNA111     687345       NLPT |
  4. | ABEKAA              NLPKPM   NLPKPMLNA1     687576       NLPK |
  5. | ABEYTA              NLPTMN   NLPTMNNA11     687366       NLPT |
     |---------------------------------------------------------------|
  6. | ABONCE              NLPNSK   NLPNSKTLPN     687645       NLPN |
  7. | ABONCE              NLPNSM   NLPNSMRNA1     687646       NLPN |
  8. | ABREGO              NLPRKL   NLPRKLNNA1     687958       NLPR |
  9. | ACEVEDO             NLSFTK   NLSFTKRSTF     684734       NLSF |
 10. | ACEVEDO             NLSFTT   NLSFTTNNA1     684733       NLSF |
     +---------------------------------------------------------------+

Output from SQL Server

     +---------------------------------------------------------------+
     |        lastname   cav1_l~e   cav2_las~e   dms_la~e   dblm_l~e |
     |---------------------------------------------------------------|
  1. | ABARCA              NLPKKP   NLPKKPTNA1     687945       NLPR |
  2. | ABDO                NLPTSR   NLPTSRNA11     687349       NLPT |
  3. | ABDO                NLPTSN   NLPTSNA111     687345       NLPT |
  4. | ABEKAA              NLPKPM   NLPKPMLNA1     687576       NLPK |
  5. | ABEYTA              NLPTMN   NLPTMNNA11     687366       NLPT |
     |---------------------------------------------------------------|
  6. | ABONCE              NLPNSK   NLPNSKTLPN     687645       NLPN |
  7. | ABONCE              NLPNSM   NLPNSMRNA1     687646       NLPN |
  8. | ABREGO              NLPRKL   NLPRKLNNA1     687958       NLPR |
  9. | ACEVEDO             NLSFTK   NLSFTKRSTF     684734       NLSF |
 10. | ACEVEDO             NLSFTT   NLSFTTNNA1     684733       NLSF |
     +---------------------------------------------------------------+
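
One quick way to confirm the symptom in these listings is to check that identical names always receive identical codes, which any deterministic phonetic encoder should guarantee. The sketch below is Python and uses the lastname and cav1 values copied from the ten rows above; the function name `inconsistent_codes` is made up for illustration.

```python
from collections import defaultdict

# (lastname, cav1 code) pairs copied from the listing above.
rows = [
    ("ABARCA", "NLPKKP"),
    ("ABDO", "NLPTSR"),
    ("ABDO", "NLPTSN"),
    ("ABEKAA", "NLPKPM"),
    ("ABEYTA", "NLPTMN"),
    ("ABONCE", "NLPNSK"),
    ("ABONCE", "NLPNSM"),
    ("ABREGO", "NLPRKL"),
    ("ACEVEDO", "NLSFTK"),
    ("ACEVEDO", "NLSFTT"),
]


def inconsistent_codes(pairs):
    """Return the names that were given more than one distinct code."""
    seen = defaultdict(set)
    for name, code in pairs:
        seen[name].add(code)
    return {name: sorted(codes) for name, codes in seen.items() if len(codes) > 1}


# Every repeated name in the sample shows up here, so the encoder output
# depends on something other than the name itself.
print(inconsistent_codes(rows))
```

Run on the sample, every duplicated last name (ABDO, ABONCE, ACEVEDO) is flagged, which matches the observation that the same name is getting different values.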

I can duplicate your output including the beidermorse encoding

     +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     |      nm     cav1         cav2      dms   dblm                                                                                                                            bm |
     |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  1. |  ABONCE   APNSN1   APNSNA1111   076468   APNS                       abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul |
  2. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
  3. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
  4. |  ABONCE   APNSN1   APNSNA1111   076468   APNS                       abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul |
  5. |  ABONCE   APNSN1   APNSNA1111   076468   APNS   abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
     |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  6. |    ABDO   APTN11   APTNA11111   073680   APTN                                                                                               abdonul|avdonul|obdonul|ovdonul |
  7. |    ABDO   APTN11   APTNA11111   073680   APTN                                                                                               abdonul|avdonul|obdonul|ovdonul |
  8. |    ABDO   APTN11   APTNA11111   073680   APTN                                                                                               abdonul|avdonul|obdonul|ovdonul |
  9. | ACEVEDO   ASFTN1   ASFTNA1111   047368   ASFT               akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
 10. | ACEVEDO   ASFTN1   ASFTNA1111   047368   ASFT               akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
     +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

However, if I use the same technique to add a second name column g oth = cond(inrange(_n, 1, 5), "ZARATE", cond(inrange(_n, 6, 8), "WEISS", "YANG")) before I run the phoneticenc command, then the reported output differs when I list.

The OTH variable should not influence the encoding of the NM variable, correct?

     +--------------------------------------------------------+
     |      nm      oth     cav1         cav2      dms   dblm |
     |--------------------------------------------------------|
  1. |  ABONCE   ZARATE   APNSSR   APNSSRTNA1   076449   APNS |
  2. |  ABONCE   ZARATE   APNSSR   APNSSRTNA1   076449   APNS |
  3. |  ABONCE   ZARATE   APNSSR   APNSSRTNA1   076449   APNS |
  4. |  ABONCE   ZARATE   APNSSR   APNSSRTNA1   076449   APNS |
  5. |  ABONCE   ZARATE   APNSSR   APNSSRTNA1   076449   APNS |
     |--------------------------------------------------------|
  6. |    ABDO    WEISS   APTWSN   APTWSNA111   073746   APTS |
  7. |    ABDO    WEISS   APTWSN   APTWSNA111   073746   APTS |
  8. |    ABDO    WEISS   APTWSN   APTWSNA111   073746   APTS |
  9. | ACEVEDO     YANG   ASFTYN   ASFTNKNA11   047316   ASFT |
 10. | ACEVEDO     YANG   ASFTYN   ASFTNKNA11   047316   ASFT |
     +--------------------------------------------------------+

As far as I can tell, the difference does not appear to be unicode related. To me it seems that the two strings are bleeding together somehow. I've tested this with a variation on your procedure, using the cologne encoding.

set obs 10

g nm = cond(inrange(_n, 1, 5), "ABONCE", cond(inrange(_n, 6, 8), "ABDO", "ACEVEDO"))
phoneticenc nm, col(nm_col_run1)

g oth = cond(inrange(_n, 1, 5), "ZARATE", cond(inrange(_n, 6, 8), "WEISS", "YANG"))
phoneticenc nm, col(nm_col_run2)

g oth2 = cond(inrange(_n, 1, 5), "MOORE", cond(inrange(_n, 6, 8), "PARKER", "JACKSON"))
phoneticenc nm, col(nm_col_run3)

li

     +-----------------------------------------------------------------+
     |      nm   nm_col~1      oth   nm_col_~2      oth2   nm_col_run3 |
     |-----------------------------------------------------------------|
  1. |  ABONCE     016865   ZARATE   016887265     MOORE   01688726765 |
  2. |  ABONCE     016865   ZARATE   016887265     MOORE   01688726765 |
  3. |  ABONCE     016865   ZARATE   016887265     MOORE   01688726765 |
  4. |  ABONCE     016865   ZARATE   016887265     MOORE   01688726765 |
  5. |  ABONCE     016865   ZARATE   016887265     MOORE   01688726765 |
     |-----------------------------------------------------------------|
  6. |    ABDO      01265    WEISS     0123865    PARKER   01238174765 |
  7. |    ABDO      01265    WEISS     0123865    PARKER   01238174765 |
  8. |    ABDO      01265    WEISS     0123865    PARKER   01238174765 |
  9. | ACEVEDO     083265     YANG    08326465   JACKSON   08326448665 |
 10. | ACEVEDO     083265     YANG    08326465   JACKSON   08326448665 |
     +-----------------------------------------------------------------+

The three runs on the NM variable should give the same result. Instead, the encoding output changes each time another string variable is added. I think that's the place to focus.

wbuchanan commented 7 years ago

@sogervais

Seems like it was a stupid mistake on my part. I wasn't passing the varlist from the ado command to the call to the Java binary. You should be able to reinstall/update things from the github pages location now. I also did a bit of refactoring of some of your code from above to save you a few keystrokes in the example below.

. use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear 

. phoneticenc lastname, caverphone1(cav1_lastname)  caverphone2(cav2_lastname) dms(dms_lastname)  dblm(dblm_lastname)

. phoneticenc firstname, caverphone1(cav1_firstname) caverphone2(cav2_firstname) dms(dms_firstname)  dblm(dblm_firstname)

. g byte cav1_same = cav1_lastname == cav1_firstname

. g byte cav2_same = cav2_lastname == cav2_firstname

. g byte dms_same = dms_lastname == dms_firstname

. g byte dblm_same = dblm_lastname == dblm_firstname

. li lastname *_lastname *_same in 1/10

     +-----------------------------------------------------------------------------------------------------------+
     |        lastname   cav1_l~e   cav2_las~e   dms_la~e   dblm_l~e   cav1_s~e   cav2_s~e   dms_same   dblm_s~e |
     |-----------------------------------------------------------------------------------------------------------|
  1. | ABARCA              APK111   APKA111111     079400       APRK          0          0          0          0 |
  2. | ABDO                APT111   APTA111111     073000        APT          0          0          0          0 |
  3. | ABDO                APT111   APTA111111     073000        APT          0          0          0          0 |
  4. | ABEKAA              APK111   APKA111111     075000        APK          0          0          0          0 |
  5. | ABEYTA              APT111   APTA111111     073000        APT          0          0          0          0 |
     |-----------------------------------------------------------------------------------------------------------|
  6. | ABONCE              APNS11   APNK111111     076400       APNS          0          0          0          0 |
  7. | ABONCE              APNS11   APNK111111     076400       APNS          0          0          0          0 |
  8. | ABREGO              APRK11   APRKA11111     079500       APRK          0          0          0          0 |
  9. | ACEVEDO             ASFT11   ASFTA11111     047300       ASFT          0          0          0          0 |
 10. | ACEVEDO             ASFT11   ASFTA11111     047300       ASFT          0          0          0          0 |
     +-----------------------------------------------------------------------------------------------------------+
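
For anyone reading later, here is a rough illustration of why a dropped varlist would produce the "bleeding" seen earlier in the thread. This is a Python sketch under loud assumptions: the real plugin is Java, `toy_code` is a made-up stand-in for a phonetic encoder, and the fallback of reading every variable when no varlist arrives is only a guess at the behavior, not the actual implementation.

```python
def toy_code(text: str) -> str:
    """Hypothetical stand-in for a phonetic encoder: keep consonants only."""
    return "".join(c for c in text.upper() if c.isalpha() and c not in "AEIOU")


def encode(rows, varlist=None):
    """Encode one value per observation.

    rows is a list of dicts, one per observation. When varlist is not
    forwarded (None), every column is joined together before encoding
    (assumed fallback), so unrelated variables leak into the result.
    """
    out = []
    for row in rows:
        names = varlist if varlist is not None else list(row)
        out.append(toy_code("".join(str(row[n]) for n in names)))
    return out


data = [{"nm": "ABONCE"}, {"nm": "ABDO"}]
before = encode(data)                  # varlist dropped, one column exists
for row, oth in zip(data, ["ZARATE", "WEISS"]):
    row["oth"] = oth                   # add an unrelated column
after = encode(data)                   # varlist dropped, two columns exist
fixed = encode(data, varlist=["nm"])   # varlist forwarded

print(before)  # ['BNC', 'BD']
print(after)   # ['BNCZRT', 'BDWSS']
print(fixed)   # ['BNC', 'BD']
```

With the varlist forwarded, adding other variables to the dataset no longer changes the codes for nm, which is the behavior the three cologne runs were expected to show.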

sogervais commented 7 years ago

Thanks. Mistakes happen and I am glad that you tracked this down.

My Stata skills need work so I appreciate the refactoring example. I can make things do what I want - just not so elegantly.

I'll re-install and move forward with incorporating your ADO package into my project.

wbuchanan commented 7 years ago

Awesome. Thanks for finding the bug for me and letting me know about it.