lcnetdev / scriptshifter

Chinese: MARC field option is ignored #133

Open scossu opened 3 weeks ago

scossu commented 3 weeks ago

Source string: 欒保羣

Result: Luan bao qun

Expected result: Luan, Baoqun

Other examples involving the MARC field option are behaving similarly. This may be a regression from the DB migration.

@tventimi FYI

scossu commented 2 weeks ago

On second look, it seems I never implemented this functionality. I used the MARC field only for numeral parsing, but some tests in https://github.com/lcnetdev/scriptshifter/blob/main/tests/data/script_samples/chinese.csv indicate that fields 100 and 700 should be handled as names, with a comma added. @tventimi can you provide more details on this logic?

tventimi commented 2 weeks ago

See the following code snippet from Parallelogram:

https://github.com/pulibrary/parallelogram/blob/main/cloudapp/src/app/pinyin.service.ts#L135-L149

This code is run on the romanized version of the name. It is assumed that the name consists of two or three separate "words". The first word is capitalized and followed by a comma. The second word is also capitalized. If there is a third word, it is appended to the second one with no capitalization and no space in between. However, if the third word begins with a vowel, then an apostrophe is placed between the second and third words. Thus,

Luan bao quan --> Luan, Baoquan
Wen dao an --> Wen, Dao'an
Xia jing --> Xia, Jing
Sima qian --> Sima, Qian

Note that in the last example, the surname is multisyllabic and corresponds to two Chinese characters (司马). However, the code doesn't need to know this, because these characters have already been romanized and written as a single word by the time they reach this point in the code.

Also note that the code snippet above applies this logic to subfield r of any MARC field, but in such cases, the comma after the first word is omitted.
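
For illustration, the rules described above could be sketched in Python roughly as follows. This is neither the Parallelogram code nor the ScriptShifter implementation; the function name `format_personal_name`, the `add_comma` flag (standing in for the subfield r case), and the plain `aeiou` vowel check are all assumptions made for this example.

```python
# Hypothetical sketch of the name-formatting rules described in this thread.
# All names here are illustrative, not taken from either project.

VOWELS = frozenset("aeiou")  # assumption: unaccented romanization only


def format_personal_name(romanized: str, add_comma: bool = True) -> str:
    """Format a romanized Chinese personal name of two or three words.

    add_comma=False models the subfield r case, where the comma after
    the first word is omitted.
    """
    words = romanized.split()
    if len(words) not in (2, 3):
        # Outside the stated two-or-three-word assumption: leave unchanged.
        return romanized

    surname = words[0].capitalize()
    given = words[1].capitalize()

    if len(words) == 3:
        third = words[2].lower()
        # An apostrophe separates the syllables when the third word
        # begins with a vowel (e.g. "dao" + "an" -> "Dao'an").
        separator = "'" if third[0] in VOWELS else ""
        given = f"{given}{separator}{third}"

    joiner = ", " if add_comma else " "
    return f"{surname}{joiner}{given}"


if __name__ == "__main__":
    for sample in ("Luan bao quan", "Wen dao an", "Xia jing", "Sima qian"):
        print(sample, "-->", format_personal_name(sample))
```

Input with any other number of words is returned unchanged here; that fallback is a choice made for the sketch, not behavior confirmed by either code base.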

scossu commented 1 week ago

In the code snippet you linked to, what do the tag, ind1, and code variables represent?

scossu commented 1 week ago

Fixed in #139.

scossu commented 1 week ago

@tventimi can you please test? I only had Sima, Qian in Chinese script to test with.

I might have to adjust the code that selects the MARC field. At the moment it applies only to fields 100, 600, 700, and 800.

tventimi commented 1 week ago

I tested some more examples and confirmed that the name formatting is correct.