Traditional and simplified Chinese searching not behaving interchangeably

ruthtillman commented 2 years ago

Just received a report that traditional and simplified Chinese characters are not treated interchangeably in our search results. If materials are described with simplified Chinese characters but the person searches using traditional Chinese, they will not find the materials, and vice versa. The CAT handled this successfully, as does WorldCat.

Examples provided were:

梵蒂冈图书馆所藏汉籍目录  / 梵蒂岡圖書館所藏漢籍目錄

林则徐全集 / 林則徐全集
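To make the mismatch concrete, here is a minimal Python sketch (not part of the catalog code; the helper name `differing_positions` is made up for illustration). It shows that the two spellings of the second example are different code point sequences, so an index that compares tokens character by character will not match one against the other unless it folds them to a common form first.

```python
def differing_positions(a: str, b: str) -> list[int]:
    """Return the indexes at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("expected strings of equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


simplified = "林则徐全集"
traditional = "林則徐全集"

# The strings are not equal, and they differ only where the script differs.
print(simplified == traditional)
for i in differing_positions(simplified, traditional):
    print(i, hex(ord(simplified[i])), hex(ord(traditional[i])))
```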

I can work with our Chinese-language catalogers to get some more examples.

It's a high priority because it affects access to all our Chinese materials. Unfortunately, this was only just now reported because the reporter had not been using the Catalog until the CAT went away.

CBeer shared in Blacklight Slack that they handle it this way:

https://github.com/sul-dlss/SearchWorks/blob/master/config/solr_configs/schema.xml#L481

and with https://github.com/sul-dlss/CJKFilterUtils

JRochkind suggested we look for a Solr analyzer. We might also be able to work with Radu on it?

ruthtillman commented 2 years ago

Michael Gibney at Penn adds:

You want solr.ICUTransformFilterFactory: https://solr.apache.org/guide/8_11/filter-descriptions.html#icu-transform-filter. The Stanford and Princeton examples are great; I'll add this (adapted/redacted) from Penn:

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/> <!-- no case folding initially -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-cjkMarcCompatibility.txt"/>
        <charFilter class="solr.ICUTransformCharFilterFactory" id="Traditional-Simplified"/>
        <charFilter class="solr.ICUTransformCharFilterFactory" id="Katakana-Hiragana" />
        <tokenizer class="solr.ICUTokenizerFactory" />
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>

But stock Solr does not have ICUTransform as a charFilter, only as a post-tokenization filter, so you'd need:

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/> <!-- no case folding initially -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-cjkMarcCompatibility.txt"/>
        <tokenizer class="solr.ICUTokenizerFactory" />
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana" />
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>
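Why the position of the transliteration step can matter: the following is a toy Python sketch, not Solr or ICU code. The one-entry `T2S` table and the two-word `DICTIONARY` are hypothetical stand-ins, and the greedy tokenizer only imitates the general idea of dictionary-based segmentation. Under those assumptions, transliterating before tokenization (charFilter style) gives the same tokens for both scripts, while transliterating after tokenization (filter style) can leave the traditional input segmented differently.

```python
# Hypothetical one-entry Traditional -> Simplified table (illustration only).
T2S = str.maketrans({"則": "则"})

# Hypothetical dictionary that only knows the simplified spellings.
DICTIONARY = {"林则徐", "全集"}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in DICTIONARY:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def char_filter_chain(text: str) -> list[str]:
    """Transliterate the raw text first, then tokenize."""
    return tokenize(text.translate(T2S))


def token_filter_chain(text: str) -> list[str]:
    """Tokenize first, then transliterate each token."""
    return [token.translate(T2S) for token in tokenize(text)]


print(tokenize("林则徐全集"))
print(char_filter_chain("林則徐全集"))
print(token_filter_chain("林則徐全集"))
```

Whether the real ICU tokenizer segments a given pair of titles differently depends on its dictionary, so this is only meant to show the mechanism, not to predict results for any specific record.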

ruthtillman commented 2 years ago

Gibney, continued (the above config refers to this file):

# the context of CJK vernacular scripts in bibliographic metadata.
# The list aims to unify searching for equivalent CJK characters that
# may have different Unicode encodings and/or presentation forms based
# on their input methods. It is derived from Princeton University
# Library's "IME variants not present in the EACC/MARC21 character
# sets" (https://library.princeton.edu/projects/eacc/oldindex.htm)
# and the Library of Congress Cataloging Policy and Support Office's
# CJK Compatibility Database (https://www.loc.gov/ils/cjk_search/cjk_cpso.html),
# both of which were created to support the work of catalogers and
# users of CJK scripts in MARC-21 records.

# ‐ => -
"\u2010" => "\u002D"

# ｢ => 「
"\uFF62" => "\u300C"

# ｣ => 」
"\uFF63" => "\u300D"

# 『 => 「
"\u300E" => "\u300C"

# 』 => 」
"\u300F" => "\u300D"

# ＜ => <
"\uFF1C" => "\u003C"

# ０ => 〇
"\uFF10" => "\u3007"

# 不 => 不
"\uF967" => "\u4E0D"

# 串 => 串
"\uF905" => "\u4E32"

# 丹 => 丹
"\uF95E" => "\u4E39"

# 亂 => 亂
"\uF91B" => "\u4E82"

# 了 => 了
"\uF9BA" => "\u4E86"

# 亮 => 亮
"\uF977" => "\u4EAE"

# 亷 => 廉
"\u4EB7" => "\u5EC9"

# 什 => 什
"\uF9FD" => "\u4EC0"

# 令 => 令
"\uF9A8" => "\u4EE4"

# 來 => 來
"\uF92D" => "\u4F86"

# 例 => 例
"\uF9B5" => "\u4F8B"

# 便 => 便
"\uF965" => "\u4FBF"

# 倫 => 倫
"\uF9D4" => "\u502B"

# 倶 => 俱
"\u5036" => "\u4FF1"

# 僚 => 僚
"\uF9BB" => "\u50DA"

# 兀 => 兀
"\uFA0C" => "\u5140"

# 兩 => 兩
"\uF978" => "\u5169"

# 兪 => 俞
"\u516A" => "\u4FDE"

# 六 => 六
"\uF9D1" => "\u516D"

# 冷 => 冷
"\uF92E" => "\u51B7"

# 凉 => 凉
"\uF979" => "\u51C9"

# 凌 => 凌
"\uF955" => "\u51CC"

# 凜 => 凜
"\uF954" => "\u51DC"

# 切 => 切
"\uFA00" => "\u5207"

# 列 => 列
"\uF99C" => "\u5217"

# 利 => 利
"\uF9DD" => "\u5229"

# 刪 => 删
"\u522A" => "\u5220"

# 刺 => 刺
"\uF9FF" => "\u523A"

# 劉 => 劉
"\uF9C7" => "\u5289"

# 力 => 力
"\uF98A" => "\u529B"

# 劣 => 劣
"\uF99D" => "\u52A3"

# 勒 => 勒
"\uF952" => "\u52D2"

# 勞 => 勞
"\uF92F" => "\u52DE"

# 勵 => 勵
"\uF97F" => "\u52F5"

# 匀 => 勻
"\u5300" => "\u52FB"

# 北 => 北
"\uF963" => "\u5317"

# 匿 => 匿
"\uF9EB" => "\u533F"

# 卵 => 卵
"\uF91C" => "\u5375"

# 參 => 參
"\uF96B" => "\u53C3"

# 句 => 句
"\uF906" => "\u53E5"

# 吏 => 吏
"\uF9DE" => "\u540F"

# 吝 => 吝
"\uF9ED" => "\u541D"

# 呂 => 呂
"\uF980" => "\u5442"

# 咽 => 咽
"\uF99E" => "\u54BD"

# 喇 => 喇
"\uF90B" => "\u5587"

# 囹 => 囹
"\uF9A9" => "\u56F9"

# 圏 => 圈
"\u570F" => "\u5708"

# 塚 => 塚
"\uFA10" => "\u585A"

# 塞 => 塞
"\uF96C" => "\u585E"

# 壘 => 壘
"\uF94A" => "\u58D8"

# 壟 => 壟
"\uF942" => "\u58DF"

# 奈 => 奈
"\uF90C" => "\u5948"

# 契 => 契
"\uF909" => "\u5951"

# 女 => 女
"\uF981" => "\u5973"

# 姫 => 姬
"\u59EB" => "\u59EC"

# 宅 => 宅
"\uFA04" => "\u5B85"

# 宫 => 宮
"\u5BAB" => "\u5BAE"

# 寛 => 寬
"\u5BDB" => "\u5BEC"

# 寧 => 寧
"\uF95F" => "\u5BE7"

# 寧 => 寧
"\uF9AA" => "\u5BE7"

# 寮 => 寮
"\uF9BC" => "\u5BEE"

# 尙 => 尚
"\u5C19" => "\u5C1A"

# 尭 => 堯
"\u5C2D" => "\u582F"

# 尿 => 尿
"\uF9BD" => "\u5C3F"

# 屛 => 摒
"\u5C5B" => "\u6452"

# 屢 => 屢
"\uF94B" => "\u5C62"

# 履 => 履
"\uF9DF" => "\u5C65"

# 崙 => 崙
"\uF9D5" => "\u5D19"

# 嵐 => 嵐
"\uF921" => "\u5D50"

# 嶺 => 嶺
"\uF9AB" => "\u5DBA"

# 巗 => 巖
"\u5DD7" => "\u5DD6"

# 巻 => 卷
"\u5BFD" => "\u5377"

# 巻 => 卷
"\u5DFB" => "\u5377"

# 帲 => 帡
"\u5E32" => "\u5E21"

# 年 => 年
"\uF98E" => "\u5E74"

# 度 => 度
"\uFA01" => "\u5EA6"

# 廉 => 廉
"\uF9A2" => "\u5EC9"

# 廊 => 廊
"\uF928" => "\u5ECA"

# 廓 => 廓
"\uFA0B" => "\u5ED3"

# 廬 => 廬
"\uF982" => "\u5EEC"

# 弄 => 弄
"\uF943" => "\u5F04"

# 彚 => 彙
"\u5F5A" => "\u5F59"

# 律 => 律
"\uF9D8" => "\u5F8B"

# 復 => 復
"\uF966" => "\u5FA9"

# 念 => 念
"\uF9A3" => "\u5FF5"

# 怒 => 怒
"\uF960" => "\u6012"

# 怜 => 怜
"\uF9AC" => "\u601C"

# 惡 => 惡
"\uF9B9" => "\u60E1"

# 慄 => 慄
"\uF9D9" => "\u6144"

# 憐 => 憐
"\uF98F" => "\u6190"

# 懶 => 懶
"\uF90D" => "\u61F6"

# 戀 => 戀
"\uF990" => "\u6200"

# 戮 => 戮
"\uF9D2" => "\u622E"

# 戱 => 戯
"\u6231" => "\u622F"

# 户 => 戶
"\u6237" => "\u6236"

# 戸 => 戶
"\u6238" => "\u6236"

# 拉 => 拉
"\uF925" => "\u62C9"

# 拏 => 拏
"\uF95B" => "\u62CF"

# 拓 => 拓
"\uFA02" => "\u62D3"

# 拾 => 拾
"\uF973" => "\u62FE"

# 捻 => 捻
"\uF9A4" => "\u637B"

# 掠 => 掠
"\uF975" => "\u63A0"

# 揅 => 研
"\u63C5" => "\u7814"

# 揑 => 捏
"\u63D1" => "\u634F"

# 揷 => 挿
"\u63F7" => "\u633F"

# 揺 => 摇
"\u63FA" => "\u6447"

# 撚 => 撚
"\uF991" => "\u649A"

# 敍 => 敘
"\u654D" => "\u6558"

# 數 => 數
"\uF969" => "\u6578"

# 料 => 料
"\uF9BE" => "\u6599"

# 旅 => 旅
"\uF983" => "\u65C5"

# 易 => 易
"\uF9E0" => "\u6613"

# 昻 => 昂
"\u663B" => "\u6602"

# 晩 => 晚
"\u6669" => "\u665A"

# 晴 => 晴
"\uFA12" => "\u6674"

# 暈 => 暈
"\uF9C5" => "\u6688"

# 暴 => 暴
"\uFA06" => "\u66B4"

# 曆 => 曆
"\uF98B" => "\u66C6"

# 更 => 更
"\uF901" => "\u66F4"

# 朗 => 朗
"\uF929" => "\u6717"

# 李 => 李
"\uF9E1" => "\u674E"

# 杻 => 杻
"\uF9C8" => "\u677B"

# 林 => 林
"\uF9F4" => "\u6797"

# 柳 => 柳
"\uF9C9" => "\u67F3"

# 査 => 查
"\u67FB" => "\u67E5"

# 栗 => 栗
"\uF9DA" => "\u6817"

# 梁 => 梁
"\uF97A" => "\u6881"

# 梨 => 梨
"\uF9E2" => "\u68A8"

# 樂 => 樂
"\uF914" => "\u6A02"

# 樂 => 樂
"\uF95C" => "\u6A02"

# 樂 => 樂
"\uF9BF" => "\u6A02"

# 樓 => 樓
"\uF94C" => "\u6A13"

# 櫓 => 櫓
"\uF931" => "\u6AD3"

# 欄 => 欄
"\uF91D" => "\u6B04"

# 歩 => 步
"\u6B69" => "\u6B65"

# 歳 => 歲
"\u6B73" => "\u6B72"

# 歷 => 歷
"\uF98C" => "\u6B77"

# 殮 => 殮
"\uF9A5" => "\u6BAE"

# 殺 => 殺
"\uF970" => "\u6BBA"

# 毎 => 每
"\u6BCE" => "\u6BCF"

# 汚 => 污
"\u6C5A" => "\u6C61"

# 沈 => 沈
"\uF972" => "\u6C88"

# 沨 => 渢
"\u6CA8" => "\u6E22"

# 泌 => 泌
"\uF968" => "\u6CCC"

# 泥 => 泥
"\uF9E3" => "\u6CE5"

# 洛 => 洛
"\uF915" => "\u6D1B"

# 洞 => 洞
"\uFA05" => "\u6D1E"

# 流 => 流
"\uF9CA" => "\u6D41"

# 浪 => 浪
"\uF92A" => "\u6D6A"

# 淋 => 淋
"\uF9F5" => "\u6DCB"

# 淚 => 淚
"\uF94D" => "\u6DDA"

# 淪 => 淪
"\uF9D6" => "\u6DEA"

# 渉 => 涉
"\u6E09" => "\u6D89"

# 溜 => 溜
"\uF9CB" => "\u6E9C"

# 溺 => 溺
"\uF9EC" => "\u6EBA"

# 滑 => 滑
"\uF904" => "\u6ED1"

# 漏 => 漏
"\uF94E" => "\u6F0F"

# 漣 => 漣
"\uF992" => "\u6F23"

# 潊 => 潊
"\u6F4A" => "\u6F35"

# 濫 => 濫
"\uF922" => "\u6FEB"

# 濵 => 濱
"\u6FF5" => "\u6FF1"

# 濾 => 濾
"\uF984" => "\u6FFE"

# 炙 => 炙
"\uF9FB" => "\u7099"

# 烈 => 烈
"\uF99F" => "\u70C8"

# 烙 => 烙
"\uF916" => "\u70D9"

# 煉 => 煉
"\uF993" => "\u7149"

# 煕 => 熙
"\u7155" => "\u7199"

# 燎 => 燎
"\uF9C0" => "\u71CE"

# 燐 => 燐
"\uF9EE" => "\u71D0"

# 爐 => 爐
"\uF932" => "\u7210"

# 爛 => 爛
"\uF91E" => "\u721B"

# 爲 => 為
"\u7232" => "\u70BA"

# 牢 => 牢
"\uF946" => "\u7262"

# 狀 => 狀
"\uF9FA" => "\u72C0"

# 狼 => 狼
"\uF92B" => "\u72FC"

# 猪 => 猪
"\uFA16" => "\u732A"

# 獵 => 獵
"\uF9A7" => "\u7375"

# 率 => 率
"\uF961" => "\u7387"

# 率 => 率
"\uF9DB" => "\u7387"

# 玲 => 玲
"\uF9AD" => "\u73B2"

# 珞 => 珞
"\uF917" => "\u73DE"

# 理 => 理
"\uF9E4" => "\u7406"

# 琉 => 琉
"\uF9CC" => "\u7409"

# 瑩 => 瑩
"\uF9AE" => "\u7469"

# 瑶 => 瑤
"\u7476" => "\u7464"

# 璉 => 璉
"\uF994" => "\u7489"

# 璘 => 璘
"\uF9EF" => "\u7498"

# 甁 => 瓶
"\u7501" => "\u74F6"

# 留 => 留
"\uF9CD" => "\u7559"

# 略 => 略
"\uF976" => "\u7565"

# 異 => 異
"\uF962" => "\u7570"

# 痢 => 痢
"\uF9E5" => "\u75E2"

# 療 => 療
"\uF9C1" => "\u7642"

# 癩 => 癩
"\uF90E" => "\u7669"

# 益 => 益
"\uFA17" => "\u76CA"

# 盧 => 盧
"\uF933" => "\u76E7"

# 省 => 省
"\uF96D" => "\u7701"

# 硫 => 硫
"\uF9CE" => "\u786B"

# 碌 => 碌
"\uF93B" => "\u788C"

# 磊 => 磊
"\uF947" => "\u78CA"

# 磻 => 磻
"\uF964" => "\u78FB"

# 礪 => 礪
"\uF985" => "\u792A"

# 礼 => 礼
"\uFA18" => "\u793C"

# 神 => 神
"\uFA19" => "\u795E"

# 祥 => 祥
"\uFA1A" => "\u7965"

# 祿 => 祿
"\uF93C" => "\u797F"

# 福 => 福
"\uFA1B" => "\u798F"

# 禮 => 禮
"\uF9B6" => "\u79AE"

# 秊 => 秊
"\uF995" => "\u79CA"

# 税 => 稅
"\u7A0E" => "\u7A05"

# 稜 => 稜
"\uF956" => "\u7A1C"

# 立 => 立
"\uF9F7" => "\u7ACB"

# 笠 => 笠
"\uF9F8" => "\u7B20"

# 簾 => 簾
"\uF9A6" => "\u7C3E"

# 籠 => 籠
"\uF944" => "\u7C60"

# 粒 => 粒
"\uF9F9" => "\u7C92"

# 粵 => 粤
"\u7CB5" => "\u7CA4"

# 精 => 精
"\uFA1D" => "\u7CBE"

# 糖 => 糖
"\uFA03" => "\u7CD6"

# 糧 => 糧
"\uF97B" => "\u7CE7"

# 紐 => 紐
"\uF9CF" => "\u7D10"

# 索 => 索
"\uF96A" => "\u7D22"

# 累 => 累
"\uF94F" => "\u7D2F"

# 綠 => 綠
"\uF93D" => "\u7DA0"

# 綾 => 綾
"\uF957" => "\u7DBE"

# 緃 => 縱
"\u7DC3" => "\u7E31"

# 緒 => 緖
"\u7DD2" => "\u7DD6"

# 練 => 練
"\uF996" => "\u7DF4"

# 縷 => 縷
"\uF950" => "\u7E37"

# 繋 => 繫
"\u7E4B" => "\u7E6B"

# 繍 => 繡
"\u7E4D" => "\u7E61"

# 罹 => 罹
"\uF9E6" => "\u7F79"

# 羅 => 羅
"\uF90F" => "\u7F85"

# 羚 => 羚
"\uF9AF" => "\u7F9A"

# 羽 => 羽
"\uFA1E" => "\u7FBD"

# 老 => 老
"\uF934" => "\u8001"

# 聆 => 聆
"\uF9B0" => "\u8046"

# 聯 => 聯
"\uF997" => "\u806F"

# 聾 => 聾
"\uF945" => "\u807E"

# 肋 => 肋
"\uF953" => "\u808B"

# 脱 => 脫
"\u8131" => "\u812B"

# 臘 => 臘
"\uF926" => "\u81D8"

# 臨 => 臨
"\uF9F6" => "\u81E8"

# 良 => 良
"\uF97C" => "\u826F"

# 若 => 若
"\uF974" => "\u82E5"

# 茶 => 茶
"\uF9FE" => "\u8336"

# 荆 => 荊
"\u8346" => "\u834A"

# 菉 => 菉
"\uF93E" => "\u83C9"

# 菱 => 菱
"\uF958" => "\u83F1"

# 落 => 落
"\uF918" => "\u843D"

# 葉 => 葉
"\uF96E" => "\u8449"

# 蓮 => 蓮
"\uF999" => "\u84EE"

# 蓼 => 蓼
"\uF9C2" => "\u84FC"

# 薫 => 薰
"\u85AB" => "\u85B0"

# 藍 => 藍
"\uF923" => "\u85CD"

# 藺 => 藺
"\uF9F0" => "\u85FA"

# 蘆 => 蘆
"\uF935" => "\u8606"

# 蘭 => 蘭
"\uF91F" => "\u862D"

# 蘿 => 蘿
"\uF910" => "\u863F"

# 虚 => 虛
"\u865A" => "\u865B"

# 虜 => 虜
"\uF936" => "\u865C"

# 螺 => 螺
"\uF911" => "\u87BA"

# 﨑 => 崎
"\uFA11" => "\u5D0E"

# 蠟 => 蠟
"\uF927" => "\u881F"

# 行 => 行
"\uFA08" => "\u884C"

# 裂 => 裂
"\uF9A0" => "\u88C2"

# 裏 => 裏
"\uF9E7" => "\u88CF"

# 裡 => 裡
"\uF9E8" => "\u88E1"

# 裵 => 裴
"\u88F5" => "\u88F4"

# 裸 => 裸
"\uF912" => "\u88F8"

# 襤 => 襤
"\uF924" => "\u8964"

# 見 => 見
"\uFA0A" => "\u898B"

# 說 => 說
"\uF96F" => "\u8AAA"

# 說 => 說
"\uF9A1" => "\u8AAA"

# 説 => 說
"\u8AAC" => "\u8AAA"

# 諒 => 諒
"\uF97D" => "\u8AD2"

# 論 => 論
"\uF941" => "\u8AD6"

# 諸 => 諸
"\uFA22" => "\u8AF8"

# 諾 => 諾
"\uF95D" => "\u8AFE"

# 識 => 識
"\uF9FC" => "\u8B58"

# 讀 => 讀
"\uF95A" => "\u8B80"

# 豈 => 豈
"\uF900" => "\u8C48"

# 賂 => 賂
"\uF948" => "\u8CC2"

# 賈 => 賈
"\uF903" => "\u8CC8"

# 路 => 路
"\uF937" => "\u8DEF"

# 車 => 車
"\uF902" => "\u8ECA"

# 輦 => 輦
"\uF998" => "\u8F26"

# 輪 => 輪
"\uF9D7" => "\u8F2A"

# 輻 => 輻
"\uFA07" => "\u8F3B"

# 轢 => 轢
"\uF98D" => "\u8F62"

# 辰 => 辰
"\uF971" => "\u8FB0"

# 連 => 連
"\uF99A" => "\u9023"

# 逸 => 逸
"\uFA25" => "\u9038"

# 遼 => 遼
"\uF9C3" => "\u907C"

# 邏 => 邏
"\uF913" => "\u908F"

# 郎 => 郎
"\uF92C" => "\u90CE"

# 郞 => 郎
"\u90DE" => "\u90CE"

# 郷 => 鄉
"\u90F7" => "\u9109"

# 都 => 都
"\uFA26" => "\u90FD"

# 鄕 => 鄉
"\u9115" => "\u9109"

# 酪 => 酪
"\uF919" => "\u916A"

# 醴 => 醴
"\uF9B7" => "\u91B4"

# 里 => 里
"\uF9E9" => "\u91CC"

# 量 => 量
"\uF97E" => "\u91CF"

# 金 => 金
"\uF90A" => "\u91D1"

# 鈴 => 鈴
"\uF9B1" => "\u9234"

# 鋭 => 銳
"\u92ED" => "\u92B3"

# 錄 => 錄
"\uF93F" => "\u9304"

# 録 => 錄
"\u9332" => "\u9304"

# 鍊 => 鍊
"\uF99B" => "\u934A"

# 鎸 => 鐫
"\u93B8" => "\u942B"

# 鐠 => 镨
"\u9420" => "\u9568"

# 閭 => 閭
"\uF986" => "\u95AD"

# 閲 => 閱
"\u95B2" => "\u95B1"

# 闗 => 關
"\u95D7" => "\u95DC"

# 阮 => 阮
"\uF9C6" => "\u962E"

# 陃 => 隣
"\u9643" => "\u96A3"

# 陋 => 陋
"\uF951" => "\u964B"

# 降 => 降
"\uFA09" => "\u964D"

# 陵 => 陵
"\uF959" => "\u9675"

# 陸 => 陸
"\uF9D3" => "\u9678"

# 隆 => 隆
"\uF9DC" => "\u9686"

# 鄰 => 隣
"\u9130" => "\u96A3"

# 隣 => 隣
"\uF9F1" => "\u96A3"

# 隸 => 隸
"\uF9B8" => "\u96B8"

# 離 => 離
"\uF9EA" => "\u96E2"

# 零 => 零
"\uF9B2" => "\u96F6"

# 雷 => 雷
"\uF949" => "\u96F7"

# 露 => 露
"\uF938" => "\u9732"

# 靈 => 靈
"\uF9B3" => "\u9748"

# 靖 => 靖
"\uFA1C" => "\u9756"

# 領 => 領
"\uF9B4" => "\u9818"

# 頻 => 頻
"\uFA6A" => "\u983B"

# 類 => 類
"\uF9D0" => "\u985E"

# 飯 => 飯
"\uFA2A" => "\u98EF"

# 飲 => 飮
"\u98F2" => "\u98EE"

# 飼 => 飼
"\uFA2B" => "\u98FC"

# 館 => 館
"\uFA2C" => "\u9928"

# 駱 => 駱
"\uF91A" => "\u99F1"

# 騨 => 驒
"\u9A28" => "\u9A52"

# 驪 => 驪
"\uF987" => "\u9A6A"

# 髙 => 高
"\u9AD9" => "\u9AD8"

# 魯 => 魯
"\uF939" => "\u9B6F"

# 鱗 => 鱗
"\uF9F2" => "\u9C57"

# 鶴 => 鶴
"\uFA2D" => "\u9DB4"

# 鷺 => 鷺
"\uF93A" => "\u9DFA"

# 鸞 => 鸞
"\uF920" => "\u9E1E"

# 鹿 => 鹿
"\uF940" => "\u9E7F"

# 麗 => 麗
"\uF988" => "\u9E97"

# 麟 => 麟
"\uF9F3" => "\u9E9F"

# 麹 => 麴
"\u9EB9" => "\u9EB4"

# 黎 => 黎
"\uF989" => "\u9ECE"

# 龍 => 龍
"\uF9C4" => "\u9F8D"

# 龜 => 龜
"\uF907" => "\u9F9C"

# 龜 => 龜
"\uF908" => "\u9F9C"

And there's also this: https://issues.apache.org/jira/browse/LUCENE-8972

The issue description should clarify the distinction between charFilters and filters, and why it matters here. Note that the config Chris posted also acknowledges this distinction (it uses a charFilter transliterator that is custom to Stanford). In the issue above I'm working toward implementing this functionality in a charFilter that can be merged into the upstream Lucene project.

He also mentions discussion here: https://github.com/apache/lucene/pull/15
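Since the mapping file above is long and hand-maintained, a quick sanity check may be useful. The sketch below is an assumption-laden helper, not something from Penn or Stanford: it parses rules of the form `"\uXXXX" => "\uYYYY"` and, for sources that Unicode itself decomposes (the CJK compatibility ideographs), compares the target with Python's `unicodedata.normalize("NFC", ...)`. Rules whose source has no Unicode decomposition (the variant-to-variant mappings) are skipped, because normalization cannot judge them. The function name `suspicious_rules` and the sample lines are made up for illustration, and the third sample line is deliberately wrong.

```python
import re
import unicodedata

# Matches one mapping rule such as: "\uF967" => "\u4E0D"
RULE = re.compile(r'"\\u([0-9A-Fa-f]{4})"\s*=>\s*"\\u([0-9A-Fa-f]{4})"')


def suspicious_rules(lines):
    """Return (source, target, expected) for rules that disagree with NFC."""
    problems = []
    for line in lines:
        match = RULE.match(line.strip())
        if not match:
            continue  # comment or blank line
        source = chr(int(match.group(1), 16))
        target = chr(int(match.group(2), 16))
        expected = unicodedata.normalize("NFC", source)
        # Only judge sources that Unicode normalization actually changes.
        if expected != source and expected != target:
            problems.append((source, target, expected))
    return problems


sample = [
    r'# comment lines are ignored',
    r'"\uF967" => "\u4E0D"',
    r'"\uF9E5" => "\u7500"',  # deliberately wrong target for the demo
    r'"\u5C19" => "\u5C1A"',  # no Unicode decomposition, so it is skipped
]

for source, target, expected in suspicious_rules(sample):
    print(f"U+{ord(source):04X} maps to U+{ord(target):04X}, NFC gives U+{ord(expected):04X}")
```

To run it against a real file, pass an open file handle (for example `open("mapping-cjkMarcCompatibility.txt", encoding="utf-8")`) instead of `sample`.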

ruthtillman commented 2 years ago

This is a 2013 analysis from Stanford. Solr has changed a good bit since then, but it might still have some relevant aspects: http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

ruthtillman commented 2 years ago

Sample data: searches for both titles should find the same record (among others, but these are titles, so it should be near the top). The record URL is in each example:

I selected 5 titles in simplified Chinese and 5 in traditional Chinese from our catalog. Then I matched the original titles with different Chinese writing versions. Please see the examples below:

Simplified 1 (RKT checked)

https://catalog.libraries.psu.edu/catalog/2358757

current title in simplified Chinese: 中国文学史
traditional Chinese: 中國文學史

Simplified 2 (RKT checked)

https://catalog.libraries.psu.edu/catalog/3145253

current title in simplified Chinese: 香港制造 : 香港电视剧黃金20年珍藏版 = Hongkongmake
traditional Chinese: 香港製造 : 香港電視劇黃金20年珍藏版 = Hongkongmake

Simplified 3 (2024-04 -- this one isn't working right but the others are)

https://catalog.libraries.psu.edu/catalog/12348543

current title in simplified Chinese: 广西壮族自治区地图
traditional Chinese: 廣西壯族自治區地圖

Simplified 4 (RKT checked)

https://catalog.libraries.psu.edu/catalog/25052079

current title in simplified Chinese: 清代广州十三行的兴衰 : 白银供应的角度 = Qingdai Guangzhou shisanhang de xingshuai : baiyin gongying de jiaodu
traditional Chinese: 清代廣州十三行的興衰 : 白銀供應的角度 = Qingdai Guangzhou shisanhang de xingshuai : baiyin gongying de jiaodu

Simplified 5 (RKT checked)

https://catalog.libraries.psu.edu/catalog/32864310

current title in simplified Chinese: 清代八旗驻防与东北社会变迁
traditional Chinese: 清代八旗駐防與東北社會變遷

Traditional 1 (this one failed for me - RKT)

https://catalog.libraries.psu.edu/catalog/8162802

current title in traditional Chinese: 憲政・中國 : 從現代化及文化轉變看中國憲政發展
simplified Chinese: 宪政・中国 : 从现代化及文化转变看中国宪政发展

Traditional 2 (RKT checked)

https://catalog.libraries.psu.edu/catalog/25057151

current title in traditional Chinese: 中國-台灣問題 : (涉台幹部讀本) 配套資料 : 從對岸看台灣, 國台辦權威文件
simplified Chinese: 中国-台湾问题 : (涉台干部读本) 配套资料 : 从对岸看台湾, 国台办权威文件

Traditional 3 (2024-04 -- this one isn't working right but the others are)

https://catalog.libraries.psu.edu/catalog/6193504

current title in traditional Chinese: 近代東亞鴻儒書法展: 中國, 日本,韓國
simplified Chinese: 近代东亚鸿儒书法展: 中国, 日本,韩国

Traditional 4 (RKT checked)

https://catalog.libraries.psu.edu/catalog/20040575

current title in traditional Chinese: 認識香港南亞少數族裔
simplified Chinese: 认识香港南亚少数族裔

Traditional 5 (RKT checked)

https://catalog.libraries.psu.edu/catalog/13801257

current title in traditional Chinese: 語言、社會與族群意識 : 台灣語言社會學的硏究
simplified Chinese: 语言、社会与族群意识 : 台湾语言社会学的硏究

ruthtillman commented 2 years ago

So there's still part of one field that wasn't working right for me, but the basic title was working OK. I think it may be some other kind of error, since the others are working. I think we deploy and see if we get more feedback.

ajkiessl commented 2 years ago

If you strip the last character off the search that isn't working like this: 宪政・中国 : 从现代化及文化转变看中国宪政发, it works. Which is frustrating and confusing since that last character 展 is the same in traditional and simplified.

ruthtillman commented 2 years ago

Wow, that is indeed confusing.
