Open YikSanChan opened 3 years ago
Hi @YikSanChan - thanks for writing in, and sorry you're not seeing success with Stork. This sounds like a bug to me, and I'll look into what's going on!
To set expectations: my personal life is oddly busy these days, so I might not be able to provide help immediately, but I'll get around to it once things start to settle.
James
Hi @jameslittle230, that's totally fine, thanks for the info, and take your time.
@YikSanChan - I'm truly sorry for the delay.
I've taken a look at the index file you provided (thank you so much for providing it!) and have determined that this is entirely Stork's fault.
The available search terms -- the contents you'd have to type in to get any results -- are longer than I expect.
[
"大都,元代以金的离宫今北海公园为中心重建",
"南京,辽太宗会同元年(938年),将原来的幽州升为幽都府,建号南京,又称燕京,作为辽的陪都。当时辽的首",
"京兆,民国废顺天府,置京兆地方,直隶中",
"北平,明代洪武元年(1368年),朱元璋灭掉元朝后,为了记载平定北方的功绩,将元大都改称",
"燕都,据史书记载,公元前1122年,周武王灭商以后,在燕封召",
"北京是一",
"京兆,民国废顺天府,置京兆地方,直隶中央,其范围包括北京大部分地区,民国十七年(1928年)废京兆地方,改北京为",
"大都,",
"北京是一座有着三千多年历史的古都,在不同的朝代有着不同的",
"北京,明永乐元年(1403年),朱棣取得皇位后,将他做燕王时的封地北平府改为顺天府,建北京城",
"北京,明永乐元年(1403年),朱棣取得皇位后,将",
"北京,明永乐元年(1403年),",
"幽州,远古时代的九州之一。幽州之名,最早见于《尚书·",
"京兆,民国废顺天府,置京兆地方,直隶中央,",
"北平,明代洪武元年(1368年)",
"幽州,远古时代的九州之一。幽州之名,最早见于《尚书·舜典》:“燕曰幽州。”两汉、魏、晋、唐代都曾设",
"燕都,据史书记载,公元前1122年,周武王灭商以后,在燕封召公。燕都因古时为燕国都城而得名。战",
"北京,明永乐元年",
"燕都,据史书记载,公元前1122年,周武王灭商以后,在燕封召公。燕都因古时为燕国都城而得名。战国七雄中有燕国,据说是因临近燕山而得",
"北",
"京师,永乐十八",
"北京,明永乐元年(1403年),朱棣取得皇位后,将他做燕王时的封地北平府改为顺天府,建北京城,并准备迁都城于此,这是正式命名为北京的开始,",
"幽州,远古时代的九州之一。幽州之名,最早见于",
... (there are 515 total entries in the list, all of which are rather long)
]
This would result in your not seeing any search results - when your index is being built, Stork doesn't know how to correctly parse the text.
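The effect described above can be reproduced with a small sketch. It assumes (this is an assumption for illustration, not Stork's actual code) an indexer that splits text on whitespace only and stores every prefix of each resulting token as a searchable term. Because the Chinese text contains no spaces, the whole sentence becomes one token, and a word in the middle of it is never a prefix. The names `whitespace_tokens` and `prefixes` are hypothetical.

```python
def whitespace_tokens(text):
    # Assumption: the indexer only treats whitespace as a word boundary.
    return text.split()

def prefixes(token):
    # Every leading slice of the token becomes a searchable term.
    return {token[:i] for i in range(1, len(token) + 1)}

text = "北平,明代洪武元年(1368年),朱元璋灭掉元朝后"

index = set()
for token in whitespace_tokens(text):
    index |= prefixes(token)

print(len(whitespace_tokens(text)))  # the whole sentence is a single token
print("北平" in index)               # a prefix of the sentence is searchable
print("朱元璋" in index)             # a word in the middle is not
```

Under these assumptions, a query only matches when it happens to be the start of a whitespace-delimited run, which is consistent with the long entries in the list above.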
I also apologize that I do not know Chinese, so I'm not entirely sure how Stork is failing here, but I am noticing a few things:
I'd like Stork to get better, so I might have to ask for your help. These are a few questions that I have -- if you could find time to provide answers, it would help me understand how Stork can better support Chinese.
If given the text 京兆,民国废顺天府, would you expect to get a result by searching for 兆民? (I'm guessing no, but I just want to make extra sure that my assumptions are correct)
@jameslittle230 Hi James, I am so glad you are taking this seriously, and I am more than happy to provide as much info as I can.
Is your expectation that you search for a single character and see search results with surrounding context?
That would be great, because a single Chinese character is sometimes a word on its own. That is probably why both Algolia and Meilisearch provide such an experience. That said, there are many more words with more than one character than there are single-character words.
Would you then add another character and expect to see the search results filtered down to only those results that contain those two characters in sequence? Is that sequence RTL or LTR?
Yes, I would. It is left to right. To search for Beijing (北京), I type 北 first, then append 京 to its right, and now I have 北京.
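One way to get this "type one character, then narrow with the next" behaviour without any word segmentation is a character n-gram index. The sketch below is only an illustration under that assumption; it is not how Stork, Algolia, or Meilisearch are known to work, and `build_ngram_index` and `search` are hypothetical names.

```python
from collections import defaultdict

def build_ngram_index(docs, max_n=2):
    """Map every run of 1..max_n consecutive characters to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                index[text[i:i + n]].add(doc_id)
    return index

def search(index, docs, query, max_n=2):
    """Look up candidates by the first n-gram, then confirm the full left-to-right sequence."""
    if not query:
        return set()
    candidates = index.get(query[:max_n], set())
    return {doc_id for doc_id in candidates if query in docs[doc_id]}

docs = {
    1: "北京是一座古都",
    2: "北平,明代洪武元年",
    3: "南京,又称燕京",
}
index = build_ngram_index(docs)

print(sorted(search(index, docs, "北")))    # documents containing 北
print(sorted(search(index, docs, "北京")))  # narrowed to documents containing 北京 in sequence
```

The trade-off is index size: every character and every adjacent pair becomes a key, which grows quickly on large corpora.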
When do you use spaces in written text?
Almost never. Imagine I am saying "I come from Beijing". If I say that in Chinese, it's almost like "icomefrombeijing". Chinese speakers (and, unfortunately, code that handles Chinese 😮‍💨) need to parse the text into "I come from Beijing" and start from there.
Then how do you parse Chinese programmatically?
To parse a Chinese sentence, the most popular library is called jieba (which means "stutter" in English). It has been ported to many programming languages, and here's a Rust port. Try it out on this website: https://app.gumble.pw/jiebademo/.
I pasted this paragraph there, and here's the result:
北京 / 是 / 一座 / 有着 / 三千多年 / 历史 / 的 / 古都 / , / 在 / 不同 / 的 / 朝代 / 有着 / 不同 / 的 / 称谓 / , / 大致 / 算 / 起来 / 有 / 二十多个 / 别称 / 。 / / 燕都 / , / 据 / 史书 / 记载 / , / 公元前 / 1122 / 年 / , / 周武王 / 灭商 / 以后 / , / 在 / 燕 / 封召公 / 。 / 燕都 / 因 / 古时 / 为 / 燕国 / 都城 / 而 / 得名 / 。 / 战国七雄 / 中有 / 燕国 / , / 据说 / 是 / 因 / 临近 / 燕山 / 而 / 得 / 国名 / , / 其国 / 都 / 称为 / “ / 燕都 / ” / 。 / / 幽州 / , / 远古时代 / 的 / 九州 / 之一 / 。 / 幽州 / 之名 / , / 最早 / 见于 / 《 / 尚书 / · / 舜典 / 》 / : / “ / 燕 / 曰 / 幽州 / 。 / ” / 两汉 / 、 / 魏 / 、 / 晋 / 、 / 唐代 / 都 / 曾 / 设置 / 过 / 幽州 / , / 所治均 / 在 / 北京 / 一带 / 。 / / 京城 / , / 京城 / 泛指 / 国都 / , / 北京 / 成为 / 国都 / 后 / , / 也 / 多 / 将 / 其 / 称为 / 京城 / 。 / / 南京 / , / 辽 / 太宗 / 会同 / 元年 / ( / 938 / 年 / ) / , / 将 / 原来 / 的 / 幽州 / 升为 / 幽都府 / , / 建号 / 南京 / , / 又称 / 燕京 / , / 作为 / 辽 / 的 / 陪都 / 。 / 当时 / 辽 / 的 / 首都 / 在 / 上京 / 。 / / 大都 / , / 元代 / 以金 / 的 / 离宫 / 今 / 北海公园 / 为 / 中心 / 重建 / 新城 / , / 忽必烈 / 至元 / 九年 / ( / 1272 / 年 / ) / 改称 / 大都 / , / 俗称 / 元大都 / 。 / / 北平 / , / 明代 / 洪武 / 元年 / ( / 1368 / 年 / ) / , / 朱元璋 / 灭掉 / 元朝 / 后 / , / 为了 / 记载 / 平定 / 北方 / 的 / 功绩 / , / 将 / 元大都 / 改称 / 北平 / 。 / / 北京 / , / 明永乐 / 元年 / ( / 1403 / 年 / ) / , / 朱棣 / 取得 / 皇位 / 后 / , / 将 / 他 / 做燕 / 王时 / 的 / 封地 / 北平 / 府 / 改为 / 顺 / 天府 / , / 建 / 北京城 / , / 并 / 准备 / 迁都 / 城于 / 此 / , / 这是 / 正式 / 命名 / 为 / 北京 / 的 / 开始 / , / 今 / 已有 / 600 / 余年 / 的 / 历史 / 。 / / 京师 / , / 永乐 / 十八年 / ( / 1420 / 年 / ) / 迁都 / 北京 / , / 改称 / 京师 / , / 直至 / 清代 / 。 / / 京兆 / , / 民国 / 废顺 / 天府 / , / 置 / 京兆 / 地方 / , / 直隶 / 中央 / , / 其 / 范围 / 包括 / 北京 / 大部分 / 地区 / , / 民国 / 十七年 / ( / 1928 / 年 / ) / 废 / 京兆 / 地方 / , / 改 / 北京 / 为 / 北平 / 。
From what I can tell, the parsing is good enough. Some of the mistakes are probably because the paragraph itself is written in an older style of Chinese (imagine an undergrad studying Shakespeare and writing in that tone).
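To give a feel for what dictionary-based segmentation does, here is a minimal sketch of forward maximum matching: at each position, take the longest dictionary word that fits, and fall back to a single character otherwise. This is a simpler stand-in technique, not jieba's algorithm, and the tiny `vocab` set is a made-up example rather than a real dictionary.

```python
def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching over a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                words.append(piece)
                i += n
                break
    return words

# Hypothetical mini dictionary, for illustration only.
vocab = {"北京", "一座", "古都", "历史", "有着"}

print(" / ".join(segment("北京是一座古都", vocab)))
```

With this vocabulary the sentence is split into 北京, 是, 一座, and 古都; characters not covered by the dictionary (是) fall out as single-character tokens.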
How much does punctuation matter? If given the text 京兆,民国废顺天府, would you expect to get a result by searching for 兆民? (I'm guessing no, but I just want to make extra sure that my assumptions are correct)
兆民 should not get any search results, as this is not a word in Chinese.
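A simple way to guarantee that a query never matches across punctuation is to break the text into runs at punctuation marks before indexing. The sketch below assumes plain substring matching inside each run and uses the Unicode general category to detect punctuation, which covers full-width marks such as "," as well as ASCII ones. `split_on_punctuation` and `matches` are hypothetical helper names.

```python
import unicodedata

def split_on_punctuation(text):
    """Split text into runs of characters separated by punctuation or whitespace."""
    runs, current = [], []
    for ch in text:
        # Unicode categories starting with "P" are punctuation (Po, Ps, Pe, ...).
        if unicodedata.category(ch).startswith("P") or ch.isspace():
            if current:
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs

def matches(text, query):
    # A query matches only if it lies entirely inside one run.
    return any(query in run for run in split_on_punctuation(text))

text = "京兆,民国废顺天府"
print(split_on_punctuation(text))
print(matches(text, "京兆"))  # inside the first run
print(matches(text, "兆民"))  # would have to span the comma
```

This only handles the punctuation boundary; rejecting character sequences that are not words within a run would still need a segmenter.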
I would suggest checking out the Meilisearch docs (and even PRs) to see how they implement CJK tokenization.
Other than tokenization, Chinese also has its own set of stop words, as expected.
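As a rough illustration of stop-word handling, the sketch below drops a few very common function words from an already segmented token list. The `STOP_WORDS` set here is a tiny hypothetical sample, not a complete or authoritative Chinese stop-word list.

```python
# Hypothetical sample; real stop-word lists are much longer.
STOP_WORDS = {"的", "是", "在", "有", "也"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["北京", "是", "一座", "有着", "三千多年", "历史", "的", "古都"]
print(remove_stop_words(tokens))
```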
Let me know if you have any other questions! Highly appreciate it!
To reproduce:
~/softwares/stork-macos-10-15 build --input ch.toml --output ch.st
~/softwares/stork-macos-10-15 search --index ch.st --query "朱元璋"
There's no result. I tried many other Chinese words that appear in the zh-cn/beijing.md file, but had no luck. I am not able to upload the index file on GitHub, so I sent you an email titled "Issue #191 index file" from evan.chanyiksan@gmail.com.
Thanks!