Closed sogervais closed 7 years ago
Hi Stephen,
If you run `query java` you should see a few different things you can set. Try setting the `heapmax` value higher to avoid the out of memory issue. Are you able to provide a few string values that I can check on my end?
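A minimal sketch of the suggestion above, assuming the Stata 14/15 setting name `java_heapmax` as listed by `query java` (newer Stata releases may expose this setting under a different command). The value 2048m is an arbitrary illustration, not a recommendation from this thread.

```stata
* List the current Java settings, including the heap ceiling
query java

* Raise the maximum JVM heap size (2048m is an illustrative value only)
set java_heapmax 2048m

* Confirm the new value took effect
query java
```

Whether the new limit applies to a JVM that is already running in the current session is not confirmed here; restarting Stata after changing it is the cautious option.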
Hi William. I've left work for the day and will return on Monday. I've got some examples I'll pull together for you on the encoding output.
As for the beidermorse encoding issue:
OK on heapmax. The initial setting was 384m. I've raised it a couple of times while issuing the command `phoneticenc lastname, beidermorse(beidermorse)` against the 3500 names in my PSAT dataset. Here's what happens:
I've bumped up the max to 1152m. Heap usage reports 1113m at failure.
I went up to 4608m. Heap usage reports 4541m at failure.
To what value should I set heapmax?
Stephen
Gremlins! Everything ran today without a hitch.
I reset the heapmax to the initial setting to duplicate my original conditions to provide some output.
Everything ran without a hitch. Problems with metaphone, double metaphone and nysiis producing suspect output did not return. Beider-Morse method works quickly with no memory error.
Sorry to bother you with what now appears to be a false alarm. I look forward to using the complete package.
Stephen
@sogervais Awesome. Glad to hear it is working. One of the things that I ended up learning a bit later was that all HTTPS requests spin up the JVM in Stata, so it may have had something to do with the JVM already being spun up. If there are other things that you think would be useful feel free to let me know.
Thanks for following up.
I think I have a better handle on what is happening when I run into trouble.
While I get output when I run the phoneticenc command, when comparing the first/last name encodings of reported students I find that the encodings for highly different names are virtually the same. I am not a Java programmer, but this looks to me like some cache is not being cleared between runs of the phoneticenc command.
I am also still having heapmax errors using the beidermorse encoding despite upping the heapmax level. Yesterday I had it working with a heapmax setting of 1536m. Today, not working. Screenshot below with my output:
The commands below will let you load my same dataset to see if you are getting something different:
/* Duplicate Stephen's Phonetic Error Issues
Connect to my public dropbox, load name match file, and prepare phonetic encodings
My java setup has heapmax set to 1536 (Should this be higher?)
Stata environment Stata IC 14.2 Windows PC, Windows 7, i7 CPU with 16GB memory
*/
// Load Name File
use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear
// Produce phonetic encoding for lastname
phoneticenc lastname, caverphone1(cav1_lastname) caverphone2(cav2_lastname) dms(dms_lastname) dblm(dblm_lastname)
// Produce phonetic encoding for first name
phoneticenc firstname, caverphone1(cav1_firstname) caverphone2(cav2_firstname) dms(dms_firstname) dblm(dblm_firstname)
// Still getting the heapmax error I had previously
// uncomment to check heapmax error
// phoneticenc lastname, beiderm(bm_lastname)
// Check how different these names are and sort by similarity
strdist lastname firstname, levenshtein(lev_distance)
strdist lastname firstname, jarowinklers(jw_similarity)
sort jw_similarity
// Check if these encodings are the same
gen cav1_same = 0
replace cav1_same = 1 if cav1_lastname == cav1_firstname
gen cav2_same = 0
replace cav2_same = 1 if cav2_lastname == cav2_firstname
gen dms_same = 0
replace dms_same = 1 if dms_lastname == dms_firstname
gen dblm_same = 0
replace dblm_same = 1 if dblm_lastname == dblm_firstname
Stephen
@sogervais you should be able to set the heap max fairly high and leave it there. Java should only consume the memory needed to process things. I'll try to check things out in the morning, but it is possible for some of the phonetic encodings to return similar results for seemingly different strings depending on how the encodings work internally.
OK. I will max out the heapmax setting and see what happens tomorrow. For comparison purposes, I'm going to run the same list of names against a python version(s) of a phonetic encoder to see how the output differs.
See what happens when you switch the order of the names being processed. (Firstname then Lastname).
Can you print out a few of the results for the last name when you run things? Something bizarre seems to be happening with the name field; it might be a string encoding issue where some characters are not visible when viewing the data but still affect the bytes in the data itself. For example, when I ran some of the code you have above I was getting different values for the same last name. When I manually create a data set with those same values and a few different amounts of duplication, I get the anticipated behavior:
clear
. set obs 10
number of observations (_N) was 0, now 10
. g nm = cond(inrange(_n, 1, 5), "ABONCE", cond(inrange(_n, 6, 8), "ABDO", "ACEVEDO"))
. phoneticenc nm, caverphone1(cav1) caverphone2(cav2) dms(dms) dblm(dblm) beiderm(bm)
. li
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nm cav1 cav2 dms dblm bm |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
1. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
2. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
3. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
4. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
5. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
6. | ABDO APTN11 APTNA11111 073680 APTN YbdYnul|Ybdonul|abdYnul|abdonul|avdonul|obdYnul|obdonul|ovdonul |
7. | ABDO APTN11 APTNA11111 073680 APTN YbdYnul|Ybdonul|abdYnul|abdonul|avdonul|obdYnul|obdonul|ovdonul |
8. | ABDO APTN11 APTNA11111 073680 APTN YbdYnul|Ybdonul|abdYnul|abdonul|avdonul|obdYnul|obdonul|ovdonul |
9. | ACEVEDO ASFTN1 ASFTNA1111 047368 ASFT akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
10. | ACEVEDO ASFTN1 ASFTNA1111 047368 ASFT akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I think there are likely some string encoding issues that may not have been handled:
. use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear
. desc
Contains data from https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta
obs: 3,275
vars: 3 10 Jan 2017 14:49
size: 114,625
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
stu_id double %9.0g
lastname str15 %15s Last Name
firstname str12 %12s First Name
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: lastname firstname
. count if lastname == "ABONCE"
0
. li lastname in 1/10
+-----------------+
| lastname |
|-----------------|
1. | ABARCA |
2. | ABDO |
3. | ABDO |
4. | ABEKAA |
5. | ABEYTA |
|-----------------|
6. | ABONCE |
7. | ABONCE |
8. | ABREGO |
9. | ACEVEDO |
10. | ACEVEDO |
+-----------------+
. loc x : di lastname[6]
. di `"`x'"'
ABONCE
. count if lastname == "ABONCE"
0
. count if lastname == `"`x'"'
2
It seems entering the value ABONCE from the keyboard is not the same as the value stored in the data itself. If you are starting from a plain text file you might want to check out the `unicode` commands to see whether there are multiple string encodings in the file or whether the file has a different encoding than the system was expecting. I ended up winning a bet from someone back in Minneapolis related to string encoding issues, and they are usually really difficult to identify unless you've run into them a bunch in the past and are thinking about them as a potential issue all the time.
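One way to probe the kind of mismatch described above is to compare byte length with character count and to look for stray padding. This is only a diagnostic sketch using built-in Stata 14 string functions; the variable names `has_multibyte` and `has_padding` are made up for illustration, and nothing here is confirmed to be the cause of the `count` result above.

```stata
use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear

* strlen() counts bytes, ustrlen() counts Unicode characters;
* they differ only when a value contains multibyte (non-ASCII) characters
gen byte has_multibyte = strlen(lastname) != ustrlen(lastname)

* Flag values carrying leading or trailing blanks
gen byte has_padding = lastname != strtrim(lastname)

count if has_multibyte
count if has_padding

* Repeat the literal comparison after trimming blanks
count if strtrim(lastname) == "ABONCE"
```

If both flags come back zero and the trimmed comparison still finds no match, the difference lies somewhere these checks do not cover.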
Hi. Thanks for looking into this with me. I had run some values as you had when I thought all was fine. Switching back to my name match list, I got these weird results. I appreciate your assistance in looking into this.
My name table is being created on the fly as an import via ODBC from my MS SQL Server 2012 research database. Name values in the database are stored as varchars and are just basic ASCII on import into Stata. No difference in output whether it is from the original file or the CSV.
Output from Plaintext CSV file
+---------------------------------------------------------------+
| lastname cav1_l~e cav2_las~e dms_la~e dblm_l~e |
|---------------------------------------------------------------|
1. | ABARCA NLPKKP NLPKKPTNA1 687945 NLPR |
2. | ABDO NLPTSR NLPTSRNA11 687349 NLPT |
3. | ABDO NLPTSN NLPTSNA111 687345 NLPT |
4. | ABEKAA NLPKPM NLPKPMLNA1 687576 NLPK |
5. | ABEYTA NLPTMN NLPTMNNA11 687366 NLPT |
|---------------------------------------------------------------|
6. | ABONCE NLPNSK NLPNSKTLPN 687645 NLPN |
7. | ABONCE NLPNSM NLPNSMRNA1 687646 NLPN |
8. | ABREGO NLPRKL NLPRKLNNA1 687958 NLPR |
9. | ACEVEDO NLSFTK NLSFTKRSTF 684734 NLSF |
10. | ACEVEDO NLSFTT NLSFTTNNA1 684733 NLSF |
+---------------------------------------------------------------+
Output from SQL Server
+---------------------------------------------------------------+
| lastname cav1_l~e cav2_las~e dms_la~e dblm_l~e |
|---------------------------------------------------------------|
1. | ABARCA NLPKKP NLPKKPTNA1 687945 NLPR |
2. | ABDO NLPTSR NLPTSRNA11 687349 NLPT |
3. | ABDO NLPTSN NLPTSNA111 687345 NLPT |
4. | ABEKAA NLPKPM NLPKPMLNA1 687576 NLPK |
5. | ABEYTA NLPTMN NLPTMNNA11 687366 NLPT |
|---------------------------------------------------------------|
6. | ABONCE NLPNSK NLPNSKTLPN 687645 NLPN |
7. | ABONCE NLPNSM NLPNSMRNA1 687646 NLPN |
8. | ABREGO NLPRKL NLPRKLNNA1 687958 NLPR |
9. | ACEVEDO NLSFTK NLSFTKRSTF 684734 NLSF |
10. | ACEVEDO NLSFTT NLSFTTNNA1 684733 NLSF |
+---------------------------------------------------------------+
I can duplicate your output, including the beidermorse encoding:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nm cav1 cav2 dms dblm bm |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
1. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul |
2. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
3. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
4. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul |
5. | ABONCE APNSN1 APNSNA1111 076468 APNS abonkinul|abontsinul|abonzinul|abuntsinul|abunzinul|avonzinul|obonkinul|obontsinul|obonzinul|obuntsinul|obunzinul|ovonzinul |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
6. | ABDO APTN11 APTNA11111 073680 APTN abdonul|avdonul|obdonul|ovdonul |
7. | ABDO APTN11 APTNA11111 073680 APTN abdonul|avdonul|obdonul|ovdonul |
8. | ABDO APTN11 APTNA11111 073680 APTN abdonul|avdonul|obdonul|ovdonul |
9. | ACEVEDO ASFTN1 ASFTNA1111 047368 ASFT akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
10. | ACEVEDO ASFTN1 ASFTNA1111 047368 ASFT akividonul|asibidonul|asividonul|atsividonul|azividonul|okividonul|osibidonul|osividonul|otsividonul|ozividonul |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
However, if I use the same technique to add a second name column, `g oth = cond(inrange(_n, 1, 5), "ZARATE", cond(inrange(_n, 6, 8), "WEISS", "YANG"))`, before I run the phoneticenc command, then the reported output differs when I list.
The OTH variable should not influence the encoding of the NM variable, correct?
+--------------------------------------------------------+
| nm oth cav1 cav2 dms dblm |
|--------------------------------------------------------|
1. | ABONCE ZARATE APNSSR APNSSRTNA1 076449 APNS |
2. | ABONCE ZARATE APNSSR APNSSRTNA1 076449 APNS |
3. | ABONCE ZARATE APNSSR APNSSRTNA1 076449 APNS |
4. | ABONCE ZARATE APNSSR APNSSRTNA1 076449 APNS |
5. | ABONCE ZARATE APNSSR APNSSRTNA1 076449 APNS |
|--------------------------------------------------------|
6. | ABDO WEISS APTWSN APTWSNA111 073746 APTS |
7. | ABDO WEISS APTWSN APTWSNA111 073746 APTS |
8. | ABDO WEISS APTWSN APTWSNA111 073746 APTS |
9. | ACEVEDO YANG ASFTYN ASFTNKNA11 047316 ASFT |
10. | ACEVEDO YANG ASFTYN ASFTNKNA11 047316 ASFT |
+--------------------------------------------------------+
As far as I can tell, the difference does not appear to be unicode related. To me it seems that the two strings are bleeding together somehow. I've tested this using a variation on your procedure using the cologne encoding.
set obs 10
g nm = cond(inrange(_n, 1, 5), "ABONCE", cond(inrange(_n, 6, 8), "ABDO", "ACEVEDO"))
phoneticenc nm, col(nm_col_run1)
g oth = cond(inrange(_n, 1, 5), "ZARATE", cond(inrange(_n, 6, 8), "WEISS", "YANG"))
phoneticenc nm, col(nm_col_run2)
g oth2 = cond(inrange(_n, 1, 5), "MOORE", cond(inrange(_n, 6, 8), "PARKER", "JACKSON"))
phoneticenc nm, col(nm_col_run3)
li
+-----------------------------------------------------------------+
| nm nm_col~1 oth nm_col_~2 oth2 nm_col_run3 |
|-----------------------------------------------------------------|
1. | ABONCE 016865 ZARATE 016887265 MOORE 01688726765 |
2. | ABONCE 016865 ZARATE 016887265 MOORE 01688726765 |
3. | ABONCE 016865 ZARATE 016887265 MOORE 01688726765 |
4. | ABONCE 016865 ZARATE 016887265 MOORE 01688726765 |
5. | ABONCE 016865 ZARATE 016887265 MOORE 01688726765 |
|-----------------------------------------------------------------|
6. | ABDO 01265 WEISS 0123865 PARKER 01238174765 |
7. | ABDO 01265 WEISS 0123865 PARKER 01238174765 |
8. | ABDO 01265 WEISS 0123865 PARKER 01238174765 |
9. | ACEVEDO 083265 YANG 08326465 JACKSON 08326448665 |
10. | ACEVEDO 083265 YANG 08326465 JACKSON 08326448665 |
+-----------------------------------------------------------------+
The three runs on the NM variable should be the same. Output of the encoding after each string addition changes. I think that's the place to focus.
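The expectation stated above can be turned into an automated check. This sketch reuses the `phoneticenc` syntax from this thread and adds only the built-in `assert` command; it assumes the strutil package is installed, and on the version discussed here the assertion would be expected to fail.

```stata
clear
set obs 10
g nm = cond(inrange(_n, 1, 5), "ABONCE", cond(inrange(_n, 6, 8), "ABDO", "ACEVEDO"))

* Encode nm while it is the only string variable in memory
phoneticenc nm, col(nm_col_run1)

* Add an unrelated string variable, then encode nm again
g oth = cond(inrange(_n, 1, 5), "ZARATE", cond(inrange(_n, 6, 8), "WEISS", "YANG"))
phoneticenc nm, col(nm_col_run2)

* The encoding of nm must not depend on other variables in the dataset;
* assert stops with an error if any observation differs
assert nm_col_run1 == nm_col_run2
```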
@sogervais
Seems like it was a stupid mistake on my part. I wasn't passing the `varlist` from the ado command to the call to the Java binary. You should be able to reinstall/update things from the github pages location now. I also did a bit of refactoring of some of your code from above to save you a few keystrokes in the example below.
. use https://dl.dropboxusercontent.com/u/6790118/stata_phonetic/Name_Match.dta, clear
. phoneticenc lastname, caverphone1(cav1_lastname) caverphone2(cav2_lastname) dms(dms_lastname) dblm(dblm_lastname)
. phoneticenc firstname, caverphone1(cav1_firstname) caverphone2(cav2_firstname) dms(dms_firstname) dblm(dblm_firstname)
. g byte cav1_same = cav1_lastname == cav1_firstname
. g byte cav2_same = cav2_lastname == cav2_firstname
. g byte dms_same = dms_lastname == dms_firstname
. g byte dblm_same = dblm_lastname == dblm_firstname
. li lastname *_lastname *_same in 1/10
+-----------------------------------------------------------------------------------------------------------+
| lastname cav1_l~e cav2_las~e dms_la~e dblm_l~e cav1_s~e cav2_s~e dms_same dblm_s~e |
|-----------------------------------------------------------------------------------------------------------|
1. | ABARCA APK111 APKA111111 079400 APRK 0 0 0 0 |
2. | ABDO APT111 APTA111111 073000 APT 0 0 0 0 |
3. | ABDO APT111 APTA111111 073000 APT 0 0 0 0 |
4. | ABEKAA APK111 APKA111111 075000 APK 0 0 0 0 |
5. | ABEYTA APT111 APTA111111 073000 APT 0 0 0 0 |
|-----------------------------------------------------------------------------------------------------------|
6. | ABONCE APNS11 APNK111111 076400 APNS 0 0 0 0 |
7. | ABONCE APNS11 APNK111111 076400 APNS 0 0 0 0 |
8. | ABREGO APRK11 APRKA11111 079500 APRK 0 0 0 0 |
9. | ACEVEDO ASFT11 ASFTA11111 047300 ASFT 0 0 0 0 |
10. | ACEVEDO ASFT11 ASFTA11111 047300 ASFT 0 0 0 0 |
+-----------------------------------------------------------------------------------------------------------+
Thanks. Mistakes happen and I am glad that you tracked this down.
My Stata skills need work so I appreciate the refactoring example. I can make things do what I want - just not so elegantly.
I'll re-install and move forward with incorporating your ADO package into my project.
Awesome. Thanks for finding the bug for me and letting me know about it.
Hi. I've been working with the strutil package on a dataset of nearly 3500 student records. I'm matching records to my master list of students and comparing names to be sure student results are encoded to the correct students.
Anyway, I'm using the phoneticenc function. The metaphone, double metaphone and nysiis methods are returning only a single value for all 3500 records. (metaphone & double metaphone return "NLNL"; nysiis returns "NALNAL"). The beidermorse method hangs up and returns the screenshot below:
Everything else seems to be working as expected.
Stephen