k3jph / phonics-in-r

Phonetic Spelling Algorithms in R
https://jameshoward.us/phonics-in-r
Other
28 stars 7 forks source link

NYSIIS encoding of 'CHRISTINA' #13

Closed ahood closed 5 years ago

ahood commented 5 years ago

Noticed phonics::nysiis('CHRISTINA') outputs 'CHRASTAN' (for maxCodeLen >= 8) whereas it should be 'CRASTAN' as per original algorithm (see https://naldc.nal.usda.gov/download/27833/PDF or https://www.springer.com/us/book/9780387695020 and the somewhat more vague https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System; can't find original report by Taft). Steps worked through here: christina.txt

Looks like discrepancy is due to the omission of the first letter of the name in nysiis.R line 107, i.e. word <- substr(word, 2, nchar(word)) before the application of the 'H' rule (Step 4.5).

howardjp commented 5 years ago

@ahood, thank you for posting this! I think there may be a discrepancy in expectations. There are two different algorithms that use the name NYSIIS and we use the "modified" flag to differentiate them. "Christina" is one of the cases with different outputs. Observe:

> nysiis("CHRISTINA", maxCodeLen = 10)
[1] "CHRASTAN"
> nysiis("CHRISTINA", maxCodeLen = 10, modified = TRUE)
[1] "CRASTAN"

You will find that different algorithms using the name nysiis may produce either output, so this is our solution.

ahood commented 5 years ago

What source did you use for your implementation of nysiis_original()?

Asking because the original source by Taft is hard to come by, and the steps written on Wikipedia (which may or may not be the same as Taft's) are awfully vague to the point where I've seen three different interpretations by three different implementers. The steps in Appendix B of this USDA report are much less ambiguous (although I don't know if they are different from Taft's), and include both an "original" and "modified" version; the "original" version is the one I used to arrive at "CRASTAN" as the encoding for Christina.

chrislit commented 5 years ago

Here's the relevant section of Taft's monograph: taft_nysiis.pdf. It's pretty much word-for-word identical to that USDA report version.

I'm also of the opinion that Christina ought to be encoded as CRASTAN, but more importantly that the NYSIIS implementation at http://www.dropby.com/NYSIISTextStrings.html is riddled with bugs. Taft's Step 6 makes it explicit that there shouldn't be sequences of repeated codes, but that page has outputs like NNAT, FFALAPSAN, & SSANAFT. I think the major error there is in Step 3 when the first character is preserved, after which it is never considered as part of the context of the second character of the name. And then it just gets slapped on the front of the resulting code, without considering the following context.

ahood commented 5 years ago

@chrislit Yes, agreed, I don't see where the idea of chopping off the first letter before processing the rest of the word and then gluing it back on came from, and it leads to contradictions, as you have pointed out.

howardjp commented 5 years ago

Okay, I am convinced. I'll get a patch together. Can you send a few test cases I can add to the test white to exercise it?

ahood commented 5 years ago

Just created a pull request with test cases and modifications to nysiis_original() (didn't touch nysiis_modified()). Verified that corrected and new test cases agree with output of abydos.phonetic.nysiis(), authored by @chrislit who also seemed to be working from the original source by Taft.

howardjp commented 5 years ago

Closing due to resolution with the pull request.

ahood commented 5 years ago

When will the change be reflected on CRAN? Thanks.

howardjp commented 5 years ago

Good question. CRAN is closed until after the New Year. I will put together the update in a day or so, but there's no rush since it won't update until January, regardless.

BTW, you seem to be pushing NYSIIS quite hard. Is there anything I can do to assist?

ahood commented 5 years ago

Great, just wondering when I should look out for it and update the packages at work.

I was translating some patient-matching code from R to Python, where some of the features involve comparisons between NYSIIS encodings, and I noticed that phonics::nysiis() and jellyfish.nysiis() gave me different results on some of my test cases. So then I saw those vague rules on Wikipedia that everyone was interpreting differently and all the different implementations and I just couldn't let it go :)

howardjp commented 5 years ago

Oh, yeah, I know the feeling! Since CRAN reopens today, I will be submitting this at some point today. I have no idea what kind of backlog on submissions they are expecting, so maybe 24-48 hours from now?

Cool, let me know if I can help any further!

On Tue, Jan 1, 2019 at 10:42 PM ahood notifications@github.com wrote:

Great, just wondering when I should look out for it and update the packages at work.

I was translating some patient-matching code from R to Python, where some of the features involve comparisons between NYSIIS encodings, and I noticed that phonics::nysiis() and jellyfish.nysiis() gave me different results on some of my test cases. So then I saw those vague rules on Wikipedia that everyone was interpreting differently and all the different implementations and I just couldn't let it go :)

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/howardjp/phonics/issues/13#issuecomment-450782607, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJjahjFqi3B52UMpnyQhZOz0ag8di7-ks5u_CqggaJpZM4ZR44t .

howardjp commented 5 years ago

@ahood, phonics 1.2.3 has been approved for distribution and is going to CRAN now