Closed ruthtillman closed 2 years ago
Michael Gibney at Penn adds:
You want solr.ICUTransformFilterFactory: https://solr.apache.org/guide/8_11/filter-descriptions.html#icu-transform-filter. The Stanford and Princeton examples are great; I'll add this (adapted/redacted) from Penn:
<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/> <!-- no case folding initially -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-cjkMarcCompatibility.txt"/>
<charFilter class="solr.ICUTransformCharFilterFactory" id="Traditional-Simplified"/>
<charFilter class="solr.ICUTransformCharFilterFactory" id="Katakana-Hiragana" />
<tokenizer class="solr.ICUTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>
But stock Solr does not have ICUTransform as a charFilter, only as a post-tokenization filter, so you'd need:
<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/> <!-- no case folding initially -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-cjkMarcCompatibility.txt"/>
<tokenizer class="solr.ICUTokenizerFactory" />
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
<filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana" />
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>
Gibney, continued (the above config refers to this file):
# the context of CJK vernacular scripts in bibliographic metadata.
# The list aims to unify searching for equivalent CJK characters that
# may have different Unicode encodings and/or presentation forms based
# on their input methods. It is derived from Princeton University
# Library's "IME variants not present in the EACC/MARC21 character
# sets" (https://library.princeton.edu/projects/eacc/oldindex.htm)
# and the Library of Congress Cataloging Policy and Support Office's
# CJK Compatibility Database (https://www.loc.gov/ils/cjk_search/cjk_cpso.html),
# both of which were created to support the work of catalogers and
# users of CJK scripts in MARC-21 records.
# ‐ => -
"\u2010" => "\u002D"
# 「 => 「
"\uFF62" => "\u300C"
# 」 => 」
"\uFF63" => "\u300D"
# 『 => 「
"\u300E" => "\u300C"
# 』 => 」
"\u300F" => "\u300D"
# < => <
"\uFF1C" => "\u003C"
# 0 => 〇
"\uFF10" => "\u3007"
# 不 => 不
"\uF967" => "\u4E0D"
# 串 => 串
"\uF905" => "\u4E32"
# 丹 => 丹
"\uF95E" => "\u4E39"
# 亂 => 亂
"\uF91B" => "\u4E82"
# 了 => 了
"\uF9BA" => "\u4E86"
# 亮 => 亮
"\uF977" => "\u4EAE"
# 亷 => 廉
"\u4EB7" => "\u5EC9"
# 什 => 什
"\uF9FD" => "\u4EC0"
# 令 => 令
"\uF9A8" => "\u4EE4"
# 來 => 來
"\uF92D" => "\u4F86"
# 例 => 例
"\uF9B5" => "\u4F8B"
# 便 => 便
"\uF965" => "\u4FBF"
# 倫 => 倫
"\uF9D4" => "\u502B"
# 倶 => 俱
"\u5036" => "\u4FF1"
# 僚 => 僚
"\uF9BB" => "\u50DA"
# 兀 => 兀
"\uFA0C" => "\u5140"
# 兩 => 兩
"\uF978" => "\u5169"
# 兪 => 俞
"\u516A" => "\u4FDE"
# 六 => 六
"\uF9D1" => "\u516D"
# 冷 => 冷
"\uF92E" => "\u51B7"
# 凉 => 凉
"\uF979" => "\u51C9"
# 凌 => 凌
"\uF955" => "\u51CC"
# 凜 => 凜
"\uF954" => "\u51DC"
# 切 => 切
"\uFA00" => "\u5207"
# 列 => 列
"\uF99C" => "\u5217"
# 利 => 利
"\uF9DD" => "\u5229"
# 刪 => 删
"\u522A" => "\u5220"
# 刺 => 刺
"\uF9FF" => "\u523A"
# 劉 => 劉
"\uF9C7" => "\u5289"
# 力 => 力
"\uF98A" => "\u529B"
# 劣 => 劣
"\uF99D" => "\u52A3"
# 勒 => 勒
"\uF952" => "\u52D2"
# 勞 => 勞
"\uF92F" => "\u52DE"
# 勵 => 勵
"\uF97F" => "\u52F5"
# 匀 => 勻
"\u5300" => "\u52FB"
# 北 => 北
"\uF963" => "\u5317"
# 匿 => 匿
"\uF9EB" => "\u533F"
# 卵 => 卵
"\uF91C" => "\u5375"
# 參 => 參
"\uF96B" => "\u53C3"
# 句 => 句
"\uF906" => "\u53E5"
# 吏 => 吏
"\uF9DE" => "\u540F"
# 吝 => 吝
"\uF9ED" => "\u541D"
# 呂 => 呂
"\uF980" => "\u5442"
# 咽 => 咽
"\uF99E" => "\u54BD"
# 喇 => 喇
"\uF90B" => "\u5587"
# 囹 => 囹
"\uF9A9" => "\u56F9"
# 圏 => 圈
"\u570F" => "\u5708"
# 塚 => 塚
"\uFA10" => "\u585A"
# 塞 => 塞
"\uF96C" => "\u585E"
# 壘 => 壘
"\uF94A" => "\u58D8"
# 壟 => 壟
"\uF942" => "\u58DF"
# 奈 => 奈
"\uF90C" => "\u5948"
# 契 => 契
"\uF909" => "\u5951"
# 女 => 女
"\uF981" => "\u5973"
# 姫 => 姬
"\u59EB" => "\u59EC"
# 宅 => 宅
"\uFA04" => "\u5B85"
# 宫 => 宮
"\u5BAB" => "\u5BAE"
# 寛 => 寬
"\u5BDB" => "\u5BEC"
# 寧 => 寧
"\uF95F" => "\u5BE7"
# 寧 => 寧
"\uF9AA" => "\u5BE7"
# 寮 => 寮
"\uF9BC" => "\u5BEE"
# 尙 => 尚
"\u5C19" => "\u5C1A"
# 尭 => 堯
"\u5C2D" => "\u582F"
# 尿 => 尿
"\uF9BD" => "\u5C3F"
# 屛 => 摒
"\u5C5B" => "\u6452"
# 屢 => 屢
"\uF94B" => "\u5C62"
# 履 => 履
"\uF9DF" => "\u5C65"
# 崙 => 崙
"\uF9D5" => "\u5D19"
# 嵐 => 嵐
"\uF921" => "\u5D50"
# 嶺 => 嶺
"\uF9AB" => "\u5DBA"
# 巗 => 巖
"\u5DD7" => "\u5DD6"
# 巻 => 卷
"\u5DFB" => "\u5377"
# 帲 => 帡
"\u5E32" => "\u5E21"
# 年 => 年
"\uF98E" => "\u5E74"
# 度 => 度
"\uFA01" => "\u5EA6"
# 廉 => 廉
"\uF9A2" => "\u5EC9"
# 廊 => 廊
"\uF928" => "\u5ECA"
# 廓 => 廓
"\uFA0B" => "\u5ED3"
# 廬 => 廬
"\uF982" => "\u5EEC"
# 弄 => 弄
"\uF943" => "\u5F04"
# 彚 => 彙
"\u5F5A" => "\u5F59"
# 律 => 律
"\uF9D8" => "\u5F8B"
# 復 => 復
"\uF966" => "\u5FA9"
# 念 => 念
"\uF9A3" => "\u5FF5"
# 怒 => 怒
"\uF960" => "\u6012"
# 怜 => 怜
"\uF9AC" => "\u601C"
# 惡 => 惡
"\uF9B9" => "\u60E1"
# 慄 => 慄
"\uF9D9" => "\u6144"
# 憐 => 憐
"\uF98F" => "\u6190"
# 懶 => 懶
"\uF90D" => "\u61F6"
# 戀 => 戀
"\uF990" => "\u6200"
# 戮 => 戮
"\uF9D2" => "\u622E"
# 戱 => 戯
"\u6231" => "\u622F"
# 户 => 戶
"\u6237" => "\u6236"
# 戸 => 戶
"\u6238" => "\u6236"
# 拉 => 拉
"\uF925" => "\u62C9"
# 拏 => 拏
"\uF95B" => "\u62CF"
# 拓 => 拓
"\uFA02" => "\u62D3"
# 拾 => 拾
"\uF973" => "\u62FE"
# 捻 => 捻
"\uF9A4" => "\u637B"
# 掠 => 掠
"\uF975" => "\u63A0"
# 揅 => 研
"\u63C5" => "\u7814"
# 揑 => 捏
"\u63D1" => "\u634F"
# 揷 => 挿
"\u63F7" => "\u633F"
# 揺 => 摇
"\u63FA" => "\u6447"
# 撚 => 撚
"\uF991" => "\u649A"
# 敍 => 敘
"\u654D" => "\u6558"
# 數 => 數
"\uF969" => "\u6578"
# 料 => 料
"\uF9BE" => "\u6599"
# 旅 => 旅
"\uF983" => "\u65C5"
# 易 => 易
"\uF9E0" => "\u6613"
# 昻 => 昂
"\u663B" => "\u6602"
# 晩 => 晚
"\u6669" => "\u665A"
# 晴 => 晴
"\uFA12" => "\u6674"
# 暈 => 暈
"\uF9C5" => "\u6688"
# 暴 => 暴
"\uFA06" => "\u66B4"
# 曆 => 曆
"\uF98B" => "\u66C6"
# 更 => 更
"\uF901" => "\u66F4"
# 朗 => 朗
"\uF929" => "\u6717"
# 李 => 李
"\uF9E1" => "\u674E"
# 杻 => 杻
"\uF9C8" => "\u677B"
# 林 => 林
"\uF9F4" => "\u6797"
# 柳 => 柳
"\uF9C9" => "\u67F3"
# 査 => 查
"\u67FB" => "\u67E5"
# 栗 => 栗
"\uF9DA" => "\u6817"
# 梁 => 梁
"\uF97A" => "\u6881"
# 梨 => 梨
"\uF9E2" => "\u68A8"
# 樂 => 樂
"\uF914" => "\u6A02"
# 樂 => 樂
"\uF95C" => "\u6A02"
# 樂 => 樂
"\uF9BF" => "\u6A02"
# 樓 => 樓
"\uF94C" => "\u6A13"
# 櫓 => 櫓
"\uF931" => "\u6AD3"
# 欄 => 欄
"\uF91D" => "\u6B04"
# 歩 => 步
"\u6B69" => "\u6B65"
# 歳 => 歲
"\u6B73" => "\u6B72"
# 歷 => 歷
"\uF98C" => "\u6B77"
# 殮 => 殮
"\uF9A5" => "\u6BAE"
# 殺 => 殺
"\uF970" => "\u6BBA"
# 毎 => 每
"\u6BCE" => "\u6BCF"
# 汚 => 污
"\u6C5A" => "\u6C61"
# 沈 => 沈
"\uF972" => "\u6C88"
# 沨 => 渢
"\u6CA8" => "\u6E22"
# 泌 => 泌
"\uF968" => "\u6CCC"
# 泥 => 泥
"\uF9E3" => "\u6CE5"
# 洛 => 洛
"\uF915" => "\u6D1B"
# 洞 => 洞
"\uFA05" => "\u6D1E"
# 流 => 流
"\uF9CA" => "\u6D41"
# 浪 => 浪
"\uF92A" => "\u6D6A"
# 淋 => 淋
"\uF9F5" => "\u6DCB"
# 淚 => 淚
"\uF94D" => "\u6DDA"
# 淪 => 淪
"\uF9D6" => "\u6DEA"
# 渉 => 涉
"\u6E09" => "\u6D89"
# 溜 => 溜
"\uF9CB" => "\u6E9C"
# 溺 => 溺
"\uF9EC" => "\u6EBA"
# 滑 => 滑
"\uF904" => "\u6ED1"
# 漏 => 漏
"\uF94E" => "\u6F0F"
# 漣 => 漣
"\uF992" => "\u6F23"
# 潊 => 漵
"\u6F4A" => "\u6F35"
# 濫 => 濫
"\uF922" => "\u6FEB"
# 濵 => 濱
"\u6FF5" => "\u6FF1"
# 濾 => 濾
"\uF984" => "\u6FFE"
# 炙 => 炙
"\uF9FB" => "\u7099"
# 烈 => 烈
"\uF99F" => "\u70C8"
# 烙 => 烙
"\uF916" => "\u70D9"
# 煉 => 煉
"\uF993" => "\u7149"
# 煕 => 熙
"\u7155" => "\u7199"
# 燎 => 燎
"\uF9C0" => "\u71CE"
# 燐 => 燐
"\uF9EE" => "\u71D0"
# 爐 => 爐
"\uF932" => "\u7210"
# 爛 => 爛
"\uF91E" => "\u721B"
# 爲 => 為
"\u7232" => "\u70BA"
# 牢 => 牢
"\uF946" => "\u7262"
# 狀 => 狀
"\uF9FA" => "\u72C0"
# 狼 => 狼
"\uF92B" => "\u72FC"
# 猪 => 猪
"\uFA16" => "\u732A"
# 獵 => 獵
"\uF9A7" => "\u7375"
# 率 => 率
"\uF961" => "\u7387"
# 率 => 率
"\uF9DB" => "\u7387"
# 玲 => 玲
"\uF9AD" => "\u73B2"
# 珞 => 珞
"\uF917" => "\u73DE"
# 理 => 理
"\uF9E4" => "\u7406"
# 琉 => 琉
"\uF9CC" => "\u7409"
# 瑩 => 瑩
"\uF9AE" => "\u7469"
# 瑶 => 瑤
"\u7476" => "\u7464"
# 璉 => 璉
"\uF994" => "\u7489"
# 璘 => 璘
"\uF9EF" => "\u7498"
# 甁 => 瓶
"\u7501" => "\u74F6"
# 留 => 留
"\uF9CD" => "\u7559"
# 略 => 略
"\uF976" => "\u7565"
# 異 => 異
"\uF962" => "\u7570"
# 痢 => 痢
"\uF9E5" => "\u75E2"
# 療 => 療
"\uF9C1" => "\u7642"
# 癩 => 癩
"\uF90E" => "\u7669"
# 益 => 益
"\uFA17" => "\u76CA"
# 盧 => 盧
"\uF933" => "\u76E7"
# 省 => 省
"\uF96D" => "\u7701"
# 硫 => 硫
"\uF9CE" => "\u786B"
# 碌 => 碌
"\uF93B" => "\u788C"
# 磊 => 磊
"\uF947" => "\u78CA"
# 磻 => 磻
"\uF964" => "\u78FB"
# 礪 => 礪
"\uF985" => "\u792A"
# 礼 => 礼
"\uFA18" => "\u793C"
# 神 => 神
"\uFA19" => "\u795E"
# 祥 => 祥
"\uFA1A" => "\u7965"
# 祿 => 祿
"\uF93C" => "\u797F"
# 福 => 福
"\uFA1B" => "\u798F"
# 禮 => 禮
"\uF9B6" => "\u79AE"
# 秊 => 秊
"\uF995" => "\u79CA"
# 税 => 稅
"\u7A0E" => "\u7A05"
# 稜 => 稜
"\uF956" => "\u7A1C"
# 立 => 立
"\uF9F7" => "\u7ACB"
# 笠 => 笠
"\uF9F8" => "\u7B20"
# 簾 => 簾
"\uF9A6" => "\u7C3E"
# 籠 => 籠
"\uF944" => "\u7C60"
# 粒 => 粒
"\uF9F9" => "\u7C92"
# 粵 => 粤
"\u7CB5" => "\u7CA4"
# 精 => 精
"\uFA1D" => "\u7CBE"
# 糖 => 糖
"\uFA03" => "\u7CD6"
# 糧 => 糧
"\uF97B" => "\u7CE7"
# 紐 => 紐
"\uF9CF" => "\u7D10"
# 索 => 索
"\uF96A" => "\u7D22"
# 累 => 累
"\uF94F" => "\u7D2F"
# 綠 => 綠
"\uF93D" => "\u7DA0"
# 綾 => 綾
"\uF957" => "\u7DBE"
# 緃 => 縱
"\u7DC3" => "\u7E31"
# 緒 => 緖
"\u7DD2" => "\u7DD6"
# 練 => 練
"\uF996" => "\u7DF4"
# 縷 => 縷
"\uF950" => "\u7E37"
# 繋 => 繫
"\u7E4B" => "\u7E6B"
# 繍 => 繡
"\u7E4D" => "\u7E61"
# 罹 => 罹
"\uF9E6" => "\u7F79"
# 羅 => 羅
"\uF90F" => "\u7F85"
# 羚 => 羚
"\uF9AF" => "\u7F9A"
# 羽 => 羽
"\uFA1E" => "\u7FBD"
# 老 => 老
"\uF934" => "\u8001"
# 聆 => 聆
"\uF9B0" => "\u8046"
# 聯 => 聯
"\uF997" => "\u806F"
# 聾 => 聾
"\uF945" => "\u807E"
# 肋 => 肋
"\uF953" => "\u808B"
# 脱 => 脫
"\u8131" => "\u812B"
# 臘 => 臘
"\uF926" => "\u81D8"
# 臨 => 臨
"\uF9F6" => "\u81E8"
# 良 => 良
"\uF97C" => "\u826F"
# 若 => 若
"\uF974" => "\u82E5"
# 茶 => 茶
"\uF9FE" => "\u8336"
# 荆 => 荊
"\u8346" => "\u834A"
# 菉 => 菉
"\uF93E" => "\u83C9"
# 菱 => 菱
"\uF958" => "\u83F1"
# 落 => 落
"\uF918" => "\u843D"
# 葉 => 葉
"\uF96E" => "\u8449"
# 蓮 => 蓮
"\uF999" => "\u84EE"
# 蓼 => 蓼
"\uF9C2" => "\u84FC"
# 薫 => 薰
"\u85AB" => "\u85B0"
# 藍 => 藍
"\uF923" => "\u85CD"
# 藺 => 藺
"\uF9F0" => "\u85FA"
# 蘆 => 蘆
"\uF935" => "\u8606"
# 蘭 => 蘭
"\uF91F" => "\u862D"
# 蘿 => 蘿
"\uF910" => "\u863F"
# 虚 => 虛
"\u865A" => "\u865B"
# 虜 => 虜
"\uF936" => "\u865C"
# 螺 => 螺
"\uF911" => "\u87BA"
# 﨑 => 崎
"\uFA11" => "\u5D0E"
# 蠟 => 蠟
"\uF927" => "\u881F"
# 行 => 行
"\uFA08" => "\u884C"
# 裂 => 裂
"\uF9A0" => "\u88C2"
# 裏 => 裏
"\uF9E7" => "\u88CF"
# 裡 => 裡
"\uF9E8" => "\u88E1"
# 裵 => 裴
"\u88F5" => "\u88F4"
# 裸 => 裸
"\uF912" => "\u88F8"
# 襤 => 襤
"\uF924" => "\u8964"
# 見 => 見
"\uFA0A" => "\u898B"
# 說 => 說
"\uF96F" => "\u8AAA"
# 說 => 說
"\uF9A1" => "\u8AAA"
# 説 => 說
"\u8AAC" => "\u8AAA"
# 諒 => 諒
"\uF97D" => "\u8AD2"
# 論 => 論
"\uF941" => "\u8AD6"
# 諸 => 諸
"\uFA22" => "\u8AF8"
# 諾 => 諾
"\uF95D" => "\u8AFE"
# 識 => 識
"\uF9FC" => "\u8B58"
# 讀 => 讀
"\uF95A" => "\u8B80"
# 豈 => 豈
"\uF900" => "\u8C48"
# 賂 => 賂
"\uF948" => "\u8CC2"
# 賈 => 賈
"\uF903" => "\u8CC8"
# 路 => 路
"\uF937" => "\u8DEF"
# 車 => 車
"\uF902" => "\u8ECA"
# 輦 => 輦
"\uF998" => "\u8F26"
# 輪 => 輪
"\uF9D7" => "\u8F2A"
# 輻 => 輻
"\uFA07" => "\u8F3B"
# 轢 => 轢
"\uF98D" => "\u8F62"
# 辰 => 辰
"\uF971" => "\u8FB0"
# 連 => 連
"\uF99A" => "\u9023"
# 逸 => 逸
"\uFA25" => "\u9038"
# 遼 => 遼
"\uF9C3" => "\u907C"
# 邏 => 邏
"\uF913" => "\u908F"
# 郎 => 郎
"\uF92C" => "\u90CE"
# 郞 => 郎
"\u90DE" => "\u90CE"
# 郷 => 鄉
"\u90F7" => "\u9109"
# 都 => 都
"\uFA26" => "\u90FD"
# 鄕 => 鄉
"\u9115" => "\u9109"
# 酪 => 酪
"\uF919" => "\u916A"
# 醴 => 醴
"\uF9B7" => "\u91B4"
# 里 => 里
"\uF9E9" => "\u91CC"
# 量 => 量
"\uF97E" => "\u91CF"
# 金 => 金
"\uF90A" => "\u91D1"
# 鈴 => 鈴
"\uF9B1" => "\u9234"
# 鋭 => 銳
"\u92ED" => "\u92B3"
# 錄 => 錄
"\uF93F" => "\u9304"
# 録 => 錄
"\u9332" => "\u9304"
# 鍊 => 鍊
"\uF99B" => "\u934A"
# 鎸 => 鐫
"\u93B8" => "\u942B"
# 鐠 => 镨
"\u9420" => "\u9568"
# 閭 => 閭
"\uF986" => "\u95AD"
# 閲 => 閱
"\u95B2" => "\u95B1"
# 闗 => 關
"\u95D7" => "\u95DC"
# 阮 => 阮
"\uF9C6" => "\u962E"
# 陃 => 隣
"\u9643" => "\u96A3"
# 陋 => 陋
"\uF951" => "\u964B"
# 降 => 降
"\uFA09" => "\u964D"
# 陵 => 陵
"\uF959" => "\u9675"
# 陸 => 陸
"\uF9D3" => "\u9678"
# 隆 => 隆
"\uF9DC" => "\u9686"
# 鄰 => 隣
"\u9130" => "\u96A3"
# 隣 => 隣
"\uF9F1" => "\u96A3"
# 隸 => 隸
"\uF9B8" => "\u96B8"
# 離 => 離
"\uF9EA" => "\u96E2"
# 零 => 零
"\uF9B2" => "\u96F6"
# 雷 => 雷
"\uF949" => "\u96F7"
# 露 => 露
"\uF938" => "\u9732"
# 靈 => 靈
"\uF9B3" => "\u9748"
# 靖 => 靖
"\uFA1C" => "\u9756"
# 領 => 領
"\uF9B4" => "\u9818"
# 頻 => 頻
"\uFA6A" => "\u983B"
# 類 => 類
"\uF9D0" => "\u985E"
# 飯 => 飯
"\uFA2A" => "\u98EF"
# 飲 => 飮
"\u98F2" => "\u98EE"
# 飼 => 飼
"\uFA2B" => "\u98FC"
# 館 => 館
"\uFA2C" => "\u9928"
# 駱 => 駱
"\uF91A" => "\u99F1"
# 騨 => 驒
"\u9A28" => "\u9A52"
# 驪 => 驪
"\uF987" => "\u9A6A"
# 髙 => 高
"\u9AD9" => "\u9AD8"
# 魯 => 魯
"\uF939" => "\u9B6F"
# 鱗 => 鱗
"\uF9F2" => "\u9C57"
# 鶴 => 鶴
"\uFA2D" => "\u9DB4"
# 鷺 => 鷺
"\uF93A" => "\u9DFA"
# 鸞 => 鸞
"\uF920" => "\u9E1E"
# 鹿 => 鹿
"\uF940" => "\u9E7F"
# 麗 => 麗
"\uF988" => "\u9E97"
# 麟 => 麟
"\uF9F3" => "\u9E9F"
# 麹 => 麴
"\u9EB9" => "\u9EB4"
# 黎 => 黎
"\uF989" => "\u9ECE"
# 龍 => 龍
"\uF9C4" => "\u9F8D"
# 龜 => 龜
"\uF907" => "\u9F9C"
# 龜 => 龜
"\uF908" => "\u9F9C"
And there's also this: https://issues.apache.org/jira/browse/LUCENE-8972
The issue description should clarify the distinction between charFilters and filters, and why it matters here. Note that the config Chris posted also acknowledges this distinction (it uses a charFilter transliterator that's custom to Stanford). In the issue above I'm working toward implementing this functionality in a charFilter that can be merged into the upstream Lucene project.
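To make the charFilter/filter distinction concrete, here is a toy Python illustration of one way the ordering can matter. It assumes a dictionary-based tokenizer that segments text differently depending on which script variant it sees. Nothing here is Solr or ICU code: the two-entry conversion table, the dictionary, and the function names are all invented for the example.

```python
# Toy stand-ins only: this is not ICU and not Solr.
T2S = str.maketrans({"憲": "宪", "發": "发"})  # tiny traditional -> simplified table
DICTIONARY = {"宪政", "发展"}  # toy dictionary that only knows simplified words


def tokenize(text):
    """Greedy longest-match segmentation.

    Any character that does not start a dictionary word becomes its own token.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in DICTIONARY or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def transform_then_tokenize(text):
    """charFilter-style: convert characters first, then tokenize."""
    return tokenize(text.translate(T2S))


def tokenize_then_transform(text):
    """Token-filter-style: tokenize first, then convert each token."""
    return [token.translate(T2S) for token in tokenize(text)]


if __name__ == "__main__":
    simplified, traditional = "宪政发展", "憲政發展"
    print(tokenize(simplified))
    print(transform_then_tokenize(traditional))
    print(tokenize_then_transform(traditional))
```

In this toy, converting before tokenizing gives the traditional input the same two tokens as the simplified input, while converting after tokenizing leaves it as four single-character tokens. Whether and how much the real ICUTokenizer behaves this way is what the linked issue discusses.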
He also mentions discussion here: https://github.com/apache/lucene/pull/15
This 2013 analysis from Stanford predates a good many changes to Solr, but parts of it may still be relevant: http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
Sample data - searches for both titles should find the same record (among others, but these are titles so near the top) -- record url is in the example:
I selected 5 titles in simplified Chinese and 5 in traditional Chinese from our catalog. Then I matched the original titles with different Chinese writing versions. Please see the examples below:
https://catalog.libraries.psu.edu/catalog/2358757
https://catalog.libraries.psu.edu/catalog/3145253
https://catalog.libraries.psu.edu/catalog/12348543
https://catalog.libraries.psu.edu/catalog/25052079
https://catalog.libraries.psu.edu/catalog/32864310
https://catalog.libraries.psu.edu/catalog/8162802
https://catalog.libraries.psu.edu/catalog/25057151
https://catalog.libraries.psu.edu/catalog/6193504
https://catalog.libraries.psu.edu/catalog/20040575
https://catalog.libraries.psu.edu/catalog/13801257
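If we want to rerun this check after each config change, it could be scripted along these lines. This is only a sketch: `search` is a hypothetical callable that takes a query string and returns record IDs in ranked order (however we end up querying the catalog), and the cutoff of 10 is an arbitrary stand-in for "near the top".

```python
def check_pairs(pairs, search, top_n=10):
    """Return (query, record_id) for every query that misses its expected record.

    pairs:  iterable of (simplified_title, traditional_title, record_id)
    search: hypothetical callable, query string -> ranked list of record IDs
    top_n:  how many leading results count as "near the top" (arbitrary default)
    """
    failures = []
    for simplified, traditional, record_id in pairs:
        for query in (simplified, traditional):
            if record_id not in search(query)[:top_n]:
                failures.append((query, record_id))
    return failures
```

An empty list means every title found its record in both scripts; anything else lists the exact queries to look at.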
So there's still part of one field that wasn't working right for me but the basic title was working ok. I think maybe it's some other kind of error since the others are working ok? I think we deploy and maybe see if we get more feedback.
If you strip the last character off the search that isn't working, like this: 宪政・中国 : 从现代化及文化转变看中国宪政发, it works. Which is frustrating and confusing, since that last character, 展, is the same in traditional and simplified.
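When two strings look identical but behave differently, dumping the code points of what is typed and what is stored can rule encoding differences in or out. A small Python sketch (the helper names are made up):

```python
import unicodedata


def describe(text):
    """List each character as 'U+XXXX NAME' so look-alikes can be told apart."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


def first_difference(a, b):
    """Return the index of the first position where a and b differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Running `describe` on the query as typed and on the title as stored in the record would show whether punctuation such as the middle dot or the colon, or any of the ideographs, differs at the code point level.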
Wow, that is indeed confusing.
Just received a report that our traditional and simplified Chinese character results are not behaving interchangeably. This means that if materials are described with simplified Chinese characters but the person searches using traditional Chinese, they will not find the materials. And vice versa. The CAT handled this successfully, as does WorldCat.
Examples provided were:
I can work with our Chinese-language catalogers to get some more examples.
It's a high priority because it affects access to all our Chinese materials. Unfortunately, this was only just now reported because the reporter had not been using the Catalog until the CAT went away.
CBeer shared in Blacklight Slack that they handle it this way:
https://github.com/sul-dlss/SearchWorks/blob/master/config/solr_configs/schema.xml#L481
and with https://github.com/sul-dlss/CJKFilterUtils
JRochkind suggested we look for a Solr analyzer. We might also be able to work with Radu on it?