psu-libraries / psulib_blacklight

Penn State University Libraries' Blacklight Catalog
Apache License 2.0

Traditional and simplified Chinese searching not behaving interchangeably #950

Closed ruthtillman closed 2 years ago

ruthtillman commented 2 years ago

Just received a report that our traditional and simplified Chinese character results are not behaving interchangeably. This means that if materials are described with simplified Chinese characters but the person searches using traditional Chinese, they will not find the materials, and vice versa. The CAT handled this successfully, as does WorldCat.

Examples provided were:

梵蒂冈图书馆所藏汉籍目录  / 梵蒂岡圖書館所藏漢籍目錄

林则徐全集 / 林則徐全集

I can work with our Chinese-language catalogers to get some more examples.
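
To make the mismatch concrete, here is a minimal Python sketch. It assumes a tiny hand-built traditional-to-simplified table that covers only the characters differing in the two example titles; the names `T2S` and `to_simplified` are illustrative and are not part of Blacklight or Solr. A real fix would rely on a full transliteration table such as ICU's, not on this one.

```python
# Hand-built table for illustration only: it covers just the characters
# that differ between the two example titles in this issue.
T2S = str.maketrans({
    "岡": "冈", "圖": "图", "書": "书", "館": "馆",
    "漢": "汉", "錄": "录", "則": "则",
})


def to_simplified(text: str) -> str:
    """Fold traditional characters to simplified using the toy table."""
    return text.translate(T2S)


pairs = [
    ("梵蒂冈图书馆所藏汉籍目录", "梵蒂岡圖書館所藏漢籍目錄"),
    ("林则徐全集", "林則徐全集"),
]

for simplified, traditional in pairs:
    # First value: raw string equality. Second: equality after folding.
    print(simplified == traditional, to_simplified(traditional) == simplified)
```

Unless both the indexed text and the query are folded to one script, the index treats the two spellings as unrelated strings, which matches the behavior reported here.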

It's a high priority because it affects access to all our Chinese materials. Unfortunately, this was only just now reported because the reporter had not been using the Catalog until the CAT went away.

CBeer shared in Blacklight Slack that they handle it this way:

https://github.com/sul-dlss/SearchWorks/blob/master/config/solr_configs/schema.xml#L481

and with https://github.com/sul-dlss/CJKFilterUtils

JRochkind suggested we look for a Solr analyzer. We might also be able to work with Radu on it?

ruthtillman commented 2 years ago

Michael Gibney at Penn adds:

You want solr.ICUTransformFilterFactory: https://solr.apache.org/guide/8_11/filter-descriptions.html#icu-transform-filter. The Stanford and Princeton examples are great; I'll add this (adapted/redacted) from Penn:

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/> <!-- no case folding initially -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-cjkMarcCompatibility.txt"/>
        <charFilter class="solr.ICUTransformCharFilterFactory" id="Traditional-Simplified"/>
        <charFilter class="solr.ICUTransformCharFilterFactory" id="Katakana-Hiragana" />
        <tokenizer class="solr.ICUTokenizerFactory" />
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>

But stock Solr does not have ICUTransform as a charFilter, only as a post-tokenization filter, so you'd need:

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/> <!-- no case folding initially -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-cjkMarcCompatibility.txt"/>
        <tokenizer class="solr.ICUTokenizerFactory" />
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana" />
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>

ruthtillman commented 2 years ago

Gibney, continued (the above config refers to this file):

# the context of CJK vernacular scripts in bibliographic metadata.
# The list aims to unify searching for equivalent CJK characters that
# may have different Unicode encodings and/or presentation forms based
# on their input methods. It is derived from Princeton University
# Library's "IME variants not present in the EACC/MARC21 character
# sets" (https://library.princeton.edu/projects/eacc/oldindex.htm)
# and the Library of Congress Cataloging Policy and Support Office's
# CJK Compatibility Database (https://www.loc.gov/ils/cjk_search/cjk_cpso.html),
# both of which were created to support the work of catalogers and
# users of CJK scripts in MARC-21 records.

# ‐ => -
"\u2010" => "\u002D"

# ｢ => 「
"\uFF62" => "\u300C"

# ｣ => 」
"\uFF63" => "\u300D"

# 『 => 「
"\u300E" => "\u300C"

# 』 => 」
"\u300F" => "\u300D"

# ＜ => <
"\uFF1C" => "\u003C"

# ０ => 〇
"\uFF10" => "\u3007"

# 不 => 不
"\uF967" => "\u4E0D"

# 串 => 串
"\uF905" => "\u4E32"

# 丹 => 丹
"\uF95E" => "\u4E39"

# 亂 => 亂
"\uF91B" => "\u4E82"

# 了 => 了
"\uF9BA" => "\u4E86"

# 亮 => 亮
"\uF977" => "\u4EAE"

# 亷 => 廉
"\u4EB7" => "\u5EC9"

# 什 => 什
"\uF9FD" => "\u4EC0"

# 令 => 令
"\uF9A8" => "\u4EE4"

# 來 => 來
"\uF92D" => "\u4F86"

# 例 => 例
"\uF9B5" => "\u4F8B"

# 便 => 便
"\uF965" => "\u4FBF"

# 倫 => 倫
"\uF9D4" => "\u502B"

# 倶 => 俱
"\u5036" => "\u4FF1"

# 僚 => 僚
"\uF9BB" => "\u50DA"

# 兀 => 兀
"\uFA0C" => "\u5140"

# 兩 => 兩
"\uF978" => "\u5169"

# 兪 => 俞
"\u516A" => "\u4FDE"

# 六 => 六
"\uF9D1" => "\u516D"

# 冷 => 冷
"\uF92E" => "\u51B7"

# 凉 => 凉
"\uF979" => "\u51C9"

# 凌 => 凌
"\uF955" => "\u51CC"

# 凜 => 凜
"\uF954" => "\u51DC"

# 切 => 切
"\uFA00" => "\u5207"

# 列 => 列
"\uF99C" => "\u5217"

# 利 => 利
"\uF9DD" => "\u5229"

# 刪 => 删
"\u522A" => "\u5220"

# 刺 => 刺
"\uF9FF" => "\u523A"

# 劉 => 劉
"\uF9C7" => "\u5289"

# 力 => 力
"\uF98A" => "\u529B"

# 劣 => 劣
"\uF99D" => "\u52A3"

# 勒 => 勒
"\uF952" => "\u52D2"

# 勞 => 勞
"\uF92F" => "\u52DE"

# 勵 => 勵
"\uF97F" => "\u52F5"

# 匀 => 勻
"\u5300" => "\u52FB"

# 北 => 北
"\uF963" => "\u5317"

# 匿 => 匿
"\uF9EB" => "\u533F"

# 卵 => 卵
"\uF91C" => "\u5375"

# 參 => 參
"\uF96B" => "\u53C3"

# 句 => 句
"\uF906" => "\u53E5"

# 吏 => 吏
"\uF9DE" => "\u540F"

# 吝 => 吝
"\uF9ED" => "\u541D"

# 呂 => 呂
"\uF980" => "\u5442"

# 咽 => 咽
"\uF99E" => "\u54BD"

# 喇 => 喇
"\uF90B" => "\u5587"

# 囹 => 囹
"\uF9A9" => "\u56F9"

# 圏 => 圈
"\u570F" => "\u5708"

# 塚 => 塚
"\uFA10" => "\u585A"

# 塞 => 塞
"\uF96C" => "\u585E"

# 壘 => 壘
"\uF94A" => "\u58D8"

# 壟 => 壟
"\uF942" => "\u58DF"

# 奈 => 奈
"\uF90C" => "\u5948"

# 契 => 契
"\uF909" => "\u5951"

# 女 => 女
"\uF981" => "\u5973"

# 姫 => 姬
"\u59EB" => "\u59EC"

# 宅 => 宅
"\uFA04" => "\u5B85"

# 宫 => 宮
"\u5BAB" => "\u5BAE"

# 寛 => 寬
"\u5BDB" => "\u5BEC"

# 寧 => 寧
"\uF95F" => "\u5BE7"

# 寧 => 寧
"\uF9AA" => "\u5BE7"

# 寮 => 寮
"\uF9BC" => "\u5BEE"

# 尙 => 尚
"\u5C19" => "\u5C1A"

# 尭 => 堯
"\u5C2D" => "\u582F"

# 尿 => 尿
"\uF9BD" => "\u5C3F"

# 屛 => 摒
"\u5C5B" => "\u6452"

# 屢 => 屢
"\uF94B" => "\u5C62"

# 履 => 履
"\uF9DF" => "\u5C65"

# 崙 => 崙
"\uF9D5" => "\u5D19"

# 嵐 => 嵐
"\uF921" => "\u5D50"

# 嶺 => 嶺
"\uF9AB" => "\u5DBA"

# 巗 => 巖
"\u5DD7" => "\u5DD6"

# 巻 => 卷
"\u5BFD" => "\u5377"

# 巻 => 卷
"\u5DFB" => "\u5377"

# 帲 => 帡
"\u5E32" => "\u5E21"

# 年 => 年
"\uF98E" => "\u5E74"

# 度 => 度
"\uFA01" => "\u5EA6"

# 廉 => 廉
"\uF9A2" => "\u5EC9"

# 廊 => 廊
"\uF928" => "\u5ECA"

# 廓 => 廓
"\uFA0B" => "\u5ED3"

# 廬 => 廬
"\uF982" => "\u5EEC"

# 弄 => 弄
"\uF943" => "\u5F04"

# 彚 => 彙
"\u5F5A" => "\u5F59"

# 律 => 律
"\uF9D8" => "\u5F8B"

# 復 => 復
"\uF966" => "\u5FA9"

# 念 => 念
"\uF9A3" => "\u5FF5"

# 怒 => 怒
"\uF960" => "\u6012"

# 怜 => 怜
"\uF9AC" => "\u601C"

# 惡 => 惡
"\uF9B9" => "\u60E1"

# 慄 => 慄
"\uF9D9" => "\u6144"

# 憐 => 憐
"\uF98F" => "\u6190"

# 懶 => 懶
"\uF90D" => "\u61F6"

# 戀 => 戀
"\uF990" => "\u6200"

# 戮 => 戮
"\uF9D2" => "\u622E"

# 戱 => 戯
"\u6231" => "\u622F"

# 户 => 戶
"\u6237" => "\u6236"

# 戸 => 戶
"\u6238" => "\u6236"

# 拉 => 拉
"\uF925" => "\u62C9"

# 拏 => 拏
"\uF95B" => "\u62CF"

# 拓 => 拓
"\uFA02" => "\u62D3"

# 拾 => 拾
"\uF973" => "\u62FE"

# 捻 => 捻
"\uF9A4" => "\u637B"

# 掠 => 掠
"\uF975" => "\u63A0"

# 揅 => 研
"\u63C5" => "\u7814"

# 揑 => 捏
"\u63D1" => "\u634F"

# 揷 => 挿
"\u63F7" => "\u633F"

# 揺 => 摇
"\u63FA" => "\u6447"

# 撚 => 撚
"\uF991" => "\u649A"

# 敍 => 敘
"\u654D" => "\u6558"

# 數 => 數
"\uF969" => "\u6578"

# 料 => 料
"\uF9BE" => "\u6599"

# 旅 => 旅
"\uF983" => "\u65C5"

# 易 => 易
"\uF9E0" => "\u6613"

# 昻 => 昂
"\u663B" => "\u6602"

# 晩 => 晚
"\u6669" => "\u665A"

# 晴 => 晴
"\uFA12" => "\u6674"

# 暈 => 暈
"\uF9C5" => "\u6688"

# 暴 => 暴
"\uFA06" => "\u66B4"

# 曆 => 曆
"\uF98B" => "\u66C6"

# 更 => 更
"\uF901" => "\u66F4"

# 朗 => 朗
"\uF929" => "\u6717"

# 李 => 李
"\uF9E1" => "\u674E"

# 杻 => 杻
"\uF9C8" => "\u677B"

# 林 => 林
"\uF9F4" => "\u6797"

# 柳 => 柳
"\uF9C9" => "\u67F3"

# 査 => 查
"\u67FB" => "\u67E5"

# 栗 => 栗
"\uF9DA" => "\u6817"

# 梁 => 梁
"\uF97A" => "\u6881"

# 梨 => 梨
"\uF9E2" => "\u68A8"

# 樂 => 樂
"\uF914" => "\u6A02"

# 樂 => 樂
"\uF95C" => "\u6A02"

# 樂 => 樂
"\uF9BF" => "\u6A02"

# 樓 => 樓
"\uF94C" => "\u6A13"

# 櫓 => 櫓
"\uF931" => "\u6AD3"

# 欄 => 欄
"\uF91D" => "\u6B04"

# 歩 => 步
"\u6B69" => "\u6B65"

# 歳 => 歲
"\u6B73" => "\u6B72"

# 歷 => 歷
"\uF98C" => "\u6B77"

# 殮 => 殮
"\uF9A5" => "\u6BAE"

# 殺 => 殺
"\uF970" => "\u6BBA"

# 毎 => 每
"\u6BCE" => "\u6BCF"

# 汚 => 污
"\u6C5A" => "\u6C61"

# 沈 => 沈
"\uF972" => "\u6C88"

# 沨 => 渢
"\u6CA8" => "\u6E22"

# 泌 => 泌
"\uF968" => "\u6CCC"

# 泥 => 泥
"\uF9E3" => "\u6CE5"

# 洛 => 洛
"\uF915" => "\u6D1B"

# 洞 => 洞
"\uFA05" => "\u6D1E"

# 流 => 流
"\uF9CA" => "\u6D41"

# 浪 => 浪
"\uF92A" => "\u6D6A"

# 淋 => 淋
"\uF9F5" => "\u6DCB"

# 淚 => 淚
"\uF94D" => "\u6DDA"

# 淪 => 淪
"\uF9D6" => "\u6DEA"

# 渉 => 涉
"\u6E09" => "\u6D89"

# 溜 => 溜
"\uF9CB" => "\u6E9C"

# 溺 => 溺
"\uF9EC" => "\u6EBA"

# 滑 => 滑
"\uF904" => "\u6ED1"

# 漏 => 漏
"\uF94E" => "\u6F0F"

# 漣 => 漣
"\uF992" => "\u6F23"

# 潊 => 漵
"\u6F4A" => "\u6F35"

# 濫 => 濫
"\uF922" => "\u6FEB"

# 濵 => 濱
"\u6FF5" => "\u6FF1"

# 濾 => 濾
"\uF984" => "\u6FFE"

# 炙 => 炙
"\uF9FB" => "\u7099"

# 烈 => 烈
"\uF99F" => "\u70C8"

# 烙 => 烙
"\uF916" => "\u70D9"

# 煉 => 煉
"\uF993" => "\u7149"

# 煕 => 熙
"\u7155" => "\u7199"

# 燎 => 燎
"\uF9C0" => "\u71CE"

# 燐 => 燐
"\uF9EE" => "\u71D0"

# 爐 => 爐
"\uF932" => "\u7210"

# 爛 => 爛
"\uF91E" => "\u721B"

# 爲 => 為
"\u7232" => "\u70BA"

# 牢 => 牢
"\uF946" => "\u7262"

# 狀 => 狀
"\uF9FA" => "\u72C0"

# 狼 => 狼
"\uF92B" => "\u72FC"

# 猪 => 猪
"\uFA16" => "\u732A"

# 獵 => 獵
"\uF9A7" => "\u7375"

# 率 => 率
"\uF961" => "\u7387"

# 率 => 率
"\uF9DB" => "\u7387"

# 玲 => 玲
"\uF9AD" => "\u73B2"

# 珞 => 珞
"\uF917" => "\u73DE"

# 理 => 理
"\uF9E4" => "\u7406"

# 琉 => 琉
"\uF9CC" => "\u7409"

# 瑩 => 瑩
"\uF9AE" => "\u7469"

# 瑶 => 瑤
"\u7476" => "\u7464"

# 璉 => 璉
"\uF994" => "\u7489"

# 璘 => 璘
"\uF9EF" => "\u7498"

# 甁 => 瓶
"\u7501" => "\u74F6"

# 留 => 留
"\uF9CD" => "\u7559"

# 略 => 略
"\uF976" => "\u7565"

# 異 => 異
"\uF962" => "\u7570"

# 痢 => 痢
"\uF9E5" => "\u75E2"

# 療 => 療
"\uF9C1" => "\u7642"

# 癩 => 癩
"\uF90E" => "\u7669"

# 益 => 益
"\uFA17" => "\u76CA"

# 盧 => 盧
"\uF933" => "\u76E7"

# 省 => 省
"\uF96D" => "\u7701"

# 硫 => 硫
"\uF9CE" => "\u786B"

# 碌 => 碌
"\uF93B" => "\u788C"

# 磊 => 磊
"\uF947" => "\u78CA"

# 磻 => 磻
"\uF964" => "\u78FB"

# 礪 => 礪
"\uF985" => "\u792A"

# 礼 => 礼
"\uFA18" => "\u793C"

# 神 => 神
"\uFA19" => "\u795E"

# 祥 => 祥
"\uFA1A" => "\u7965"

# 祿 => 祿
"\uF93C" => "\u797F"

# 福 => 福
"\uFA1B" => "\u798F"

# 禮 => 禮
"\uF9B6" => "\u79AE"

# 秊 => 秊
"\uF995" => "\u79CA"

# 税 => 稅
"\u7A0E" => "\u7A05"

# 稜 => 稜
"\uF956" => "\u7A1C"

# 立 => 立
"\uF9F7" => "\u7ACB"

# 笠 => 笠
"\uF9F8" => "\u7B20"

# 簾 => 簾
"\uF9A6" => "\u7C3E"

# 籠 => 籠
"\uF944" => "\u7C60"

# 粒 => 粒
"\uF9F9" => "\u7C92"

# 粵 => 粤
"\u7CB5" => "\u7CA4"

# 精 => 精
"\uFA1D" => "\u7CBE"

# 糖 => 糖
"\uFA03" => "\u7CD6"

# 糧 => 糧
"\uF97B" => "\u7CE7"

# 紐 => 紐
"\uF9CF" => "\u7D10"

# 索 => 索
"\uF96A" => "\u7D22"

# 累 => 累
"\uF94F" => "\u7D2F"

# 綠 => 綠
"\uF93D" => "\u7DA0"

# 綾 => 綾
"\uF957" => "\u7DBE"

# 緃 => 縱
"\u7DC3" => "\u7E31"

# 緒 => 緖
"\u7DD2" => "\u7DD6"

# 練 => 練
"\uF996" => "\u7DF4"

# 縷 => 縷
"\uF950" => "\u7E37"

# 繋 => 繫
"\u7E4B" => "\u7E6B"

# 繍 => 繡
"\u7E4D" => "\u7E61"

# 罹 => 罹
"\uF9E6" => "\u7F79"

# 羅 => 羅
"\uF90F" => "\u7F85"

# 羚 => 羚
"\uF9AF" => "\u7F9A"

# 羽 => 羽
"\uFA1E" => "\u7FBD"

# 老 => 老
"\uF934" => "\u8001"

# 聆 => 聆
"\uF9B0" => "\u8046"

# 聯 => 聯
"\uF997" => "\u806F"

# 聾 => 聾
"\uF945" => "\u807E"

# 肋 => 肋
"\uF953" => "\u808B"

# 脱 => 脫
"\u8131" => "\u812B"

# 臘 => 臘
"\uF926" => "\u81D8"

# 臨 => 臨
"\uF9F6" => "\u81E8"

# 良 => 良
"\uF97C" => "\u826F"

# 若 => 若
"\uF974" => "\u82E5"

# 茶 => 茶
"\uF9FE" => "\u8336"

# 荆 => 荊
"\u8346" => "\u834A"

# 菉 => 菉
"\uF93E" => "\u83C9"

# 菱 => 菱
"\uF958" => "\u83F1"

# 落 => 落
"\uF918" => "\u843D"

# 葉 => 葉
"\uF96E" => "\u8449"

# 蓮 => 蓮
"\uF999" => "\u84EE"

# 蓼 => 蓼
"\uF9C2" => "\u84FC"

# 薫 => 薰
"\u85AB" => "\u85B0"

# 藍 => 藍
"\uF923" => "\u85CD"

# 藺 => 藺
"\uF9F0" => "\u85FA"

# 蘆 => 蘆
"\uF935" => "\u8606"

# 蘭 => 蘭
"\uF91F" => "\u862D"

# 蘿 => 蘿
"\uF910" => "\u863F"

# 虚 => 虛
"\u865A" => "\u865B"

# 虜 => 虜
"\uF936" => "\u865C"

# 螺 => 螺
"\uF911" => "\u87BA"

# 﨑 => 崎
"\uFA11" => "\u5D0E"

# 蠟 => 蠟
"\uF927" => "\u881F"

# 行 => 行
"\uFA08" => "\u884C"

# 裂 => 裂
"\uF9A0" => "\u88C2"

# 裏 => 裏
"\uF9E7" => "\u88CF"

# 裡 => 裡
"\uF9E8" => "\u88E1"

# 裵 => 裴
"\u88F5" => "\u88F4"

# 裸 => 裸
"\uF912" => "\u88F8"

# 襤 => 襤
"\uF924" => "\u8964"

# 見 => 見
"\uFA0A" => "\u898B"

# 說 => 說
"\uF96F" => "\u8AAA"

# 說 => 說
"\uF9A1" => "\u8AAA"

# 説 => 說
"\u8AAC" => "\u8AAA"

# 諒 => 諒
"\uF97D" => "\u8AD2"

# 論 => 論
"\uF941" => "\u8AD6"

# 諸 => 諸
"\uFA22" => "\u8AF8"

# 諾 => 諾
"\uF95D" => "\u8AFE"

# 識 => 識
"\uF9FC" => "\u8B58"

# 讀 => 讀
"\uF95A" => "\u8B80"

# 豈 => 豈
"\uF900" => "\u8C48"

# 賂 => 賂
"\uF948" => "\u8CC2"

# 賈 => 賈
"\uF903" => "\u8CC8"

# 路 => 路
"\uF937" => "\u8DEF"

# 車 => 車
"\uF902" => "\u8ECA"

# 輦 => 輦
"\uF998" => "\u8F26"

# 輪 => 輪
"\uF9D7" => "\u8F2A"

# 輻 => 輻
"\uFA07" => "\u8F3B"

# 轢 => 轢
"\uF98D" => "\u8F62"

# 辰 => 辰
"\uF971" => "\u8FB0"

# 連 => 連
"\uF99A" => "\u9023"

# 逸 => 逸
"\uFA25" => "\u9038"

# 遼 => 遼
"\uF9C3" => "\u907C"

# 邏 => 邏
"\uF913" => "\u908F"

# 郎 => 郎
"\uF92C" => "\u90CE"

# 郞 => 郎
"\u90DE" => "\u90CE"

# 郷 => 鄉
"\u90F7" => "\u9109"

# 都 => 都
"\uFA26" => "\u90FD"

# 鄕 => 鄉
"\u9115" => "\u9109"

# 酪 => 酪
"\uF919" => "\u916A"

# 醴 => 醴
"\uF9B7" => "\u91B4"

# 里 => 里
"\uF9E9" => "\u91CC"

# 量 => 量
"\uF97E" => "\u91CF"

# 金 => 金
"\uF90A" => "\u91D1"

# 鈴 => 鈴
"\uF9B1" => "\u9234"

# 鋭 => 銳
"\u92ED" => "\u92B3"

# 錄 => 錄
"\uF93F" => "\u9304"

# 録 => 錄
"\u9332" => "\u9304"

# 鍊 => 鍊
"\uF99B" => "\u934A"

# 鎸 => 鐫
"\u93B8" => "\u942B"

# 鐠 => 镨
"\u9420" => "\u9568"

# 閭 => 閭
"\uF986" => "\u95AD"

# 閲 => 閱
"\u95B2" => "\u95B1"

# 闗 => 關
"\u95D7" => "\u95DC"

# 阮 => 阮
"\uF9C6" => "\u962E"

# 陃 => 隣
"\u9643" => "\u96A3"

# 陋 => 陋
"\uF951" => "\u964B"

# 降 => 降
"\uFA09" => "\u964D"

# 陵 => 陵
"\uF959" => "\u9675"

# 陸 => 陸
"\uF9D3" => "\u9678"

# 隆 => 隆
"\uF9DC" => "\u9686"

# 鄰 => 隣
"\u9130" => "\u96A3"

# 隣 => 隣
"\uF9F1" => "\u96A3"

# 隸 => 隸
"\uF9B8" => "\u96B8"

# 離 => 離
"\uF9EA" => "\u96E2"

# 零 => 零
"\uF9B2" => "\u96F6"

# 雷 => 雷
"\uF949" => "\u96F7"

# 露 => 露
"\uF938" => "\u9732"

# 靈 => 靈
"\uF9B3" => "\u9748"

# 靖 => 靖
"\uFA1C" => "\u9756"

# 領 => 領
"\uF9B4" => "\u9818"

# 頻 => 頻
"\uFA6A" => "\u983B"

# 類 => 類
"\uF9D0" => "\u985E"

# 飯 => 飯
"\uFA2A" => "\u98EF"

# 飲 => 飮
"\u98F2" => "\u98EE"

# 飼 => 飼
"\uFA2B" => "\u98FC"

# 館 => 館
"\uFA2C" => "\u9928"

# 駱 => 駱
"\uF91A" => "\u99F1"

# 騨 => 驒
"\u9A28" => "\u9A52"

# 驪 => 驪
"\uF987" => "\u9A6A"

# 髙 => 高
"\u9AD9" => "\u9AD8"

# 魯 => 魯
"\uF939" => "\u9B6F"

# 鱗 => 鱗
"\uF9F2" => "\u9C57"

# 鶴 => 鶴
"\uFA2D" => "\u9DB4"

# 鷺 => 鷺
"\uF93A" => "\u9DFA"

# 鸞 => 鸞
"\uF920" => "\u9E1E"

# 鹿 => 鹿
"\uF940" => "\u9E7F"

# 麗 => 麗
"\uF988" => "\u9E97"

# 麟 => 麟
"\uF9F3" => "\u9E9F"

# 麹 => 麴
"\u9EB9" => "\u9EB4"

# 黎 => 黎
"\uF989" => "\u9ECE"

# 龍 => 龍
"\uF9C4" => "\u9F8D"

# 龜 => 龜
"\uF907" => "\u9F9C"

# 龜 => 龜
"\uF908" => "\u9F9C"
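
As a rough illustration of how rules in this format could be read and sanity-checked outside Solr, the Python sketch below parses `"\uXXXX" => "\uYYYY"` lines and compares each rule against Unicode canonical normalization. The function names are hypothetical, and the parser handles only the single-escape form used in this file, not the full syntax that `MappingCharFilterFactory` accepts. The check is meaningful only for sources that have a canonical decomposition, such as the CJK compatibility ideographs; the second sample rule is deliberately wrong to show what gets reported.

```python
import re
import unicodedata

# One escaped source character, "=>", and zero or more escaped target characters.
RULE = re.compile(r'^"(\\u[0-9A-Fa-f]{4})"\s*=>\s*"((?:\\u[0-9A-Fa-f]{4})*)"$')
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Turn literal \\uXXXX escapes into the characters they name."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_rules(lines):
    """Build a {source: target} dict, skipping comments and blank lines."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            rules[unescape(match.group(1))] = unescape(match.group(2))
    return rules


def apply_rules(text, rules):
    """Apply single-character rules, as a char filter would before tokenizing."""
    return text.translate(str.maketrans(rules))


def suspicious_rules(rules):
    """Rules whose target disagrees with the NFC form of the source."""
    found = []
    for source, target in rules.items():
        canonical = unicodedata.normalize("NFC", source)
        if canonical != source and canonical != target:
            found.append((source, target, canonical))
    return found


sample = [
    "# compatibility ideograph U+F967 and its unified form",
    '"\\uF967" => "\\u4E0D"',
    "# deliberately wrong target, to show what the check reports",
    '"\\uF905" => "\\u4E0D"',
]
rules = parse_rules(sample)
print(suspicious_rules(rules))
```

Rules whose source has no canonical decomposition (for example the variant-to-standard pairs such as `"\u67FB" => "\u67E5"`) are skipped by this check and would still need review by someone who reads the script.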

And there's also this: https://issues.apache.org/jira/browse/LUCENE-8972

The issue description should clarify the distinction between charFilters and filters, and why it matters here. Note that the config Chris posted also acknowledges this distinction (it uses a charFilter transliterator that's custom to Stanford), and in the issue above I'm working toward implementing this functionality in a charFilter that can be merged into the upstream Lucene project.

He also mentions discussion here: https://github.com/apache/lucene/pull/15
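
A toy Python sketch of why the charFilter/filter ordering can matter, under loud assumptions: the greedy dictionary tokenizer below is a stand-in and is not how `ICUTokenizerFactory` works internally, the one-word dictionary and the four-character table `T2S` are made up for illustration, and all function names are hypothetical. It only shows the general effect: when segmentation depends on a dictionary, converting characters before tokenizing (charFilter) and after tokenizing (filter) can produce different tokens.

```python
# Toy dictionary holding a single simplified-Chinese word ("library").
DICTIONARY = {"图书馆"}

# Toy traditional-to-simplified table, for illustration only.
T2S = str.maketrans({"圖": "图", "書": "书", "館": "馆", "錄": "录"})


def tokenize(text):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    max_len = max(len(word) for word in DICTIONARY)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += size
                break
    return tokens


def analyze_with_char_filter(text):
    """Convert characters first, then tokenize (charFilter position)."""
    return tokenize(text.translate(T2S))


def analyze_with_token_filter(text):
    """Tokenize first, then convert each token (filter position)."""
    return [token.translate(T2S) for token in tokenize(text)]


traditional, simplified = "圖書館目錄", "图书馆目录"
print(analyze_with_char_filter(traditional) == analyze_with_char_filter(simplified))
print(analyze_with_token_filter(traditional) == analyze_with_token_filter(simplified))
```

In this toy setup the charFilter ordering gives identical tokens for both scripts, while the filter ordering splits the traditional input into single characters before conversion, so the token streams differ. Whether and how often the real tokenizer behaves this way is something to confirm against the linked Lucene issue and with test queries.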

ruthtillman commented 2 years ago

This is a 2013 analysis from Stanford. Solr has changed a good bit since then, but it might still have some relevant aspects: http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

ruthtillman commented 2 years ago

Sample data - searches for both titles should find the same record (among others, but these are titles so near the top) -- record url is in the example:

I selected 5 titles in simplified Chinese and 5 in traditional Chinese from our catalog. Then I matched the original titles with different Chinese writing versions. Please see the examples below:

Simplified 1 (RKT checked)

https://catalog.libraries.psu.edu/catalog/2358757

Simplified 2 (RKT checked)

https://catalog.libraries.psu.edu/catalog/3145253

Simplified 3 (2024-04 -- this one isn't working right but the others are)

https://catalog.libraries.psu.edu/catalog/12348543

Simplified 4 (RKT checked)

https://catalog.libraries.psu.edu/catalog/25052079

Simplified 5 (RKT checked)

https://catalog.libraries.psu.edu/catalog/32864310

Traditional 1 (this one failed for me - RKT)

https://catalog.libraries.psu.edu/catalog/8162802

Traditional 2 (RKT checked)

https://catalog.libraries.psu.edu/catalog/25057151

Traditional 3 (2024-04 -- this one isn't working right but the others are)

https://catalog.libraries.psu.edu/catalog/6193504

Traditional 4 (RKT checked)

https://catalog.libraries.psu.edu/catalog/20040575

Traditional 5 (RKT checked)

https://catalog.libraries.psu.edu/catalog/13801257
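
The check described above (both script variants of a title should find the same record) could be automated along these lines. This is a sketch, not the project's test suite: `check_variant_pairs` is a hypothetical helper, `search` is any callable that returns record IDs for a query, and the in-memory `fake_search` with the record ID `"example-1"` is a placeholder so the snippet runs without a catalog.

```python
def check_variant_pairs(search, cases):
    """Return the cases where the two script variants do not both find the record."""
    failures = []
    for simplified, traditional, record_id in cases:
        found_simplified = record_id in search(simplified)
        found_traditional = record_id in search(traditional)
        if not (found_simplified and found_traditional):
            failures.append((record_id, found_simplified, found_traditional))
    return failures


# Placeholder index: one record, indexed only under its simplified title.
FAKE_INDEX = {"林则徐全集": ["example-1"]}


def fake_search(query):
    """Exact-match lookup standing in for a real catalog search."""
    return FAKE_INDEX.get(query, [])


cases = [("林则徐全集", "林則徐全集", "example-1")]
print(check_variant_pairs(fake_search, cases))
```

Against a real catalog, `search` would wrap whatever query mechanism is available, and `cases` would hold the ten title pairs together with the record IDs from the URLs listed here.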

ruthtillman commented 2 years ago

So there's still part of one field that wasn't working right for me but the basic title was working ok. I think maybe it's some other kind of error since the others are working ok? I think we deploy and maybe see if we get more feedback.

ajkiessl commented 2 years ago

If you strip the last character off the search that isn't working like this: 宪政・中国 : 从现代化及文化转变看中国宪政发, it works. Which is frustrating and confusing since that last character is the same in traditional and simplified.
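
One way to investigate a case like this is to look at the exact code points in the query and in the indexed title, since characters that render alike can still differ. The sketch below uses only Python's standard `unicodedata` module; `describe` and `first_difference` are hypothetical helper names, and the sample strings are illustrative rather than taken from the record.

```python
import unicodedata


def describe(text):
    """List each character's code point and Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


def first_difference(a, b):
    """Index of the first differing position, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


typed = "\u30fb"    # KATAKANA MIDDLE DOT
indexed = "\uff65"  # HALFWIDTH KATAKANA MIDDLE DOT

print(describe(typed), describe(indexed))
print(first_difference(typed, indexed))
# Compatibility normalization folds the halfwidth form to the regular one.
print(unicodedata.normalize("NFKC", indexed) == typed)
```

Running `describe` on the failing query and on the title as stored in the record would show whether the two really contain the same characters.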

ruthtillman commented 2 years ago

Wow, that is indeed confusing.