Closed ZnqbuZ closed 1 year ago
By the way, it seems that the segmentation function does not handle Traditional Chinese properly either. For example, an item with the title "改革歷程" (Traditional Chinese) is translated to "GaigeLiCheng", whose capitalization is wrong; but if I change the title to "改革历程", the Simplified version of "改革歷程", it is translated to "GaigeLicheng", which is expected.
I don't think this is a bug in jieba, since I have tested the string in the jieba demo and it gives the correct segmentation.
The debug log ID related to this problem is 7XIF3YI8-refs-apse
The function also fails when the title contains English words. For example, "Windows 内核安全与驱动开发" is translated to "WindowS", which is obviously wrong, but if I change the title to "Windows内核安全与驱动开发" (deleting the space after "Windows"), then BBT works fine and gives "WindowsNeihe", which is expected.
As another example, "Unreal Engine 4 蓝图完全学习教程" is translated to "UnreaEngin", while "UnrealEngine4蓝图完全学习教程" is translated to "UnrealEngine4Lantu".
This is not a bug of BBT: the demo of jieba-js converts "Windows 内核安全与驱动开发" to "window- 内核 安全 驱动 开发" ("window- Neihe Anquan Qudong Kaifa" in pinyin). But BBT could circumvent this problem by replacing all non-CJK words with spaces before sending the string to jieba, and inserting them back after processing, so that jieba won't touch those English words.
Maybe a carefully designed formula could achieve this. I'm not sure.
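For what it's worth, a minimal sketch of that masking idea (the `segment` parameter stands in for whatever jieba call ends up being used; none of this is BBT's actual code):

```typescript
// Split the title on runs of CJK Unified Ideographs (U+4E00-U+9FFF); only the
// CJK runs are handed to the segmenter, Latin runs pass through untouched.
function cutMixedTitle(title: string, segment: (cjk: string) => string[]): string[] {
  return title
    .split(/([\u4e00-\u9fff]+)/)               // the capture group keeps the CJK runs
    .map(run => run.trim())
    .filter(run => run.length > 0)
    .flatMap(run => /[\u4e00-\u9fff]/.test(run) ? segment(run) : [run])
}

// cutMixedTitle('Windows 内核安全与驱动开发', s => jieba.cut(s)) would then give
// something like ['Windows', '内核', '安全', '与', '驱动', '开发'], so "Windows"
// never reaches jieba.
```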
That seems like something better addressed in jieba-js?
Another problem, still with the example "Windows 内核安全与驱动开发": the character "与" means "and" in English, which is a function word and hence should not be in the cite key. Actually, jieba automatically drops "与" ("Yu" in pinyin) in its demo, but the citation key still contains it.
In brief, the jieba demo gives "内核安全与驱动开发" -> "内核 安全 驱动 开发" -> "Neihe Anquan Qudong Kaifa", but BBT gives "内核安全与驱动开发" -> "Neihe Anquan Yu Qudong Kaifa".
That seems like something better addressed in jieba-js?
Indeed. I doubt they have overlooked this situation. I'll look through their documentation to see if there's a solution.
:robot: this is your friendly neighborhood build bot announcing test build 6.7.53.3694 ("fixes #2391, part 1")
Install in Zotero by downloading test build 6.7.53.3694, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".
Confirmed this works. Thanks a lot.
I'm using this jieba lib btw: https://www.npmjs.com/package/ooooevan-jieba
There are javascript jieba libs with more recent updates, but I'm restricted to pure-javascript libs, and a number of the more recent ones use a C library under the hood, which I can't use.
By the way, it seems that the segmentation function does not handle Traditional Chinese properly either. For example, an item with the title "改革歷程" (Traditional Chinese) is translated to "GaigeLiCheng", whose capitalization is wrong; but if I change the title to "改革历程", the Simplified version of "改革歷程", it is translated to "GaigeLicheng", which is expected.
I'm still looking into this.
Another problem, still with the example "Windows 内核安全与驱动开发": the character "与" means "and" in English, which is a function word and hence should not be in the cite key. Actually, jieba automatically drops "与" ("Yu" in pinyin) in its demo, but the citation key still contains it.
In brief, the jieba demo gives "内核安全与驱动开发" -> "内核 安全 驱动 开发" -> "Neihe Anquan Qudong Kaifa", but BBT gives "内核安全与驱动开发" -> "Neihe Anquan Yu Qudong Kaifa".
You can address this by adding either the Chinese character or its pinyin translation to the hidden pref skipWords.
This library also works in Zotero and has seen more recent updates, but it's a fair bit slower, so it would have to have some quantifiable quality benefits: https://www.npmjs.com/package/js-jieba
edit: it cuts Unreal Engine 4 蓝图完全学习教程 to Unreal-Engine-4-蓝-图-完全-学-习-教程 (tw mode) or Unreal-Engine-4-蓝图-完全-学习-教程 (cn mode). But it is slow.
:robot: this is your friendly neighborhood build bot announcing test build 6.7.53.3698 ("fixes #2391, part 2")
Install in Zotero by downloading test build 6.7.53.3698, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".
Interesting -- ooooevan-jieba cuts 改革歷程 to [ '改革', '歷', '程' ], where js-jieba cuts to [ '改革', '歷程' ]. I don't know which of these is right.
js-jieba is also just incredibly slow to start up -- once it's loaded it's actually significantly faster than ooooevan-jieba.
[ '改革', '歷程' ] is correct. The segmentation of a Traditional word should be exactly the same as the segmentation of its Simplified version.
Actually, you could convert all Chinese strings to Simplified ones before sending them to jieba, and then convert them back. This would be more accurate and not very time-consuming.
Some reasons:
1. jieba was developed for Simplified Chinese at first, so I think its dict only contains words in Simplified form...
2. Simplification just converts Traditional characters to Simplified ones. For example, "歷" goes to "历"; both "繞" and "遶" go to "绕". The simplification is therefore basically a surjection from the set of Traditional characters onto the set of Simplified ones. Admittedly, some Traditional characters can be converted to multiple Simplified ones - for example, "乾" goes to both "乾" and "干" - but that is a very rare case.
This library also works in Zotero and has seen more recent updates, but it's a fair bit slower, so it would have to have some quantifiable quality benefits: https://www.npmjs.com/package/js-jieba
edit: it cuts Unreal Engine 4 蓝图完全学习教程 to Unreal-Engine-4-蓝-图-完全-学-习-教程 (tw mode) or Unreal-Engine-4-蓝图-完全-学习-教程 (cn mode). But it is slow.
Well, in the demo of jieba-js, as long as I disable Porter Stemmer, it'll just ignore English words.
you could convert all chinese strings to simplified ones before sending them to jieba
I wouldn't know how to.
but it's a very rare case
Which is why I want to leave this to libraries created by people who actually read and write Chinese (which is not me).
in the demo of jieba-js,
which I can't use, because it uses a C library under the hood
as long as I disable Porter Stemmer
I don't know what this means.
If jieba-js does the right thing, I'll use that. If another pure-javascript library does better, I'll test that.
There is a trad->simp converter named opencc-js. You're right - I agree that this should be implemented in jieba instead of in BBT.
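For reference, this is roughly what the opencc-js step could look like; I'm going from the Converter({ from, to }) factory shown in its README and haven't verified the option names against the current release:

```typescript
import * as OpenCC from 'opencc-js'

// Traditional (Taiwan standard) -> Simplified, so jieba's Simplified dictionary applies
const toSimplified = OpenCC.Converter({ from: 'tw', to: 'cn' })

toSimplified('改革歷程')  // should give '改革历程', which should then cut to ['改革', '历程']
```

Converting back afterwards would be the lossy direction, per the surjection point above, which is another reason to leave this to a dedicated library.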
Porter Stemmer is a feature of the demo of jieba-js, which can be disabled in "Configuration".
Honestly, I think it's better to send only CJK characters to jieba, namely 4E00 ~ 9FFF, and it seems that ooooevan-jieba has implemented this at least in cutHMM, by using a regex. Have you tried cut(str, true)? It gives the correct segmentation in the readme of ooooevan-jieba.
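In case it helps, the range mentioned above as a regex (this only pulls out the CJK runs, so the Latin parts would still need to be merged back in, as in the earlier sketch):

```typescript
// Extract only runs of CJK Unified Ideographs (U+4E00-U+9FFF) from a title
const cjkRuns = (s: string): string[] => s.match(/[\u4e00-\u9fff]+/g) ?? []

cjkRuns('Unreal Engine 4 蓝图完全学习教程')  // ['蓝图完全学习教程']
```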
You can address this by adding either the Chinese character or its pinyin translation to the hidden pref skipWords.
Shouldn't skipWords be automatically applied to veryshorttitle (I'm using it)? I've tried test build 6.7.53.3698 ("fixes https://github.com/retorquere/zotero-better-bibtex/issues/2391, part 2"), but it doesn't seem to fix anything?
I'm also running into this puzzle. All my citation keys have changed from lowercase to capitalized, e.g. from xiaofeng2017 to XiaoFeng2017, which means all my notes have to be updated to match.
Do you mean that your author names are mistakenly capitalized? Currently, just changing the language to "zh" will solve this, and the plugin Jasminum provides a batch function for this. If you would prefer not to change the language to "zh", you could use author.transliterate.clean.lower.
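For example, a key formula along those lines (untested; check the BBT formula docs for the exact function name on your version, the filter chain is just the one suggested above):

```
author.transliterate.clean.lower + year
```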
Have you tried cut(str, true)? It gives correct segmentation in the readme of ooooevan-jieba
I'll try that
OK, I'll try.
Porter Stemmer is a feature of the demo of jieba-js, which can be disabled in "Configuration".
Which, again, I cannot use, because it uses a C library under the hood. We need to stop discussing the issue in terms of what this specific library can or cannot do. There are other libraries I can actually use, which we're also looking at.
Shouldn't skipWords be automatically applied to veryshorttitle (I'm using it)? I've tried test build 6.7.53.3698 ("fixes #2391, part 2"), but it doesn't seem to fix anything?
It is applied, but you must still add yu or 与 to the skipWords.
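If it helps, the hidden preference can also be set directly in Zotero's config editor; assuming I remember the prefix correctly, it lives under something like:

```
extensions.zotero.translators.better-bibtex.skipWords = <existing list>,yu,与
```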
Have you tried cut(str, true)? It gives correct segmentation in the readme of ooooevan-jieba
I have tried it, and it cuts to [ '改革', '歷', '程' ]
It is applied, but you must still add yu or 与 to the skipWords.
I see. Adding 与 to the json file works. Thanks.
I have tried it, and it cuts to [ '改革', '歷', '程' ].
I meant that cut(str, true) should cut English words correctly, as the doc shows. At least cutHMM should cut English words correctly, since ooooevan-jieba says it uses a regex to ignore English.
I see. Adding 与 to the json file works. Thanks.
To what json file?
I meant that cut(str, true) should cut English words correctly, as the doc shows.
But that wouldn't really be a solution if it then still cuts Chinese wrongly
To what json file?
I exported the preferences to a json file, edited skipWords there, and imported the file back, since I did not find the right place to edit it in the GUI...
But that wouldn't really be a solution if it then still cuts Chinese wrongly
Did it? I mean the result it gives does not differ from the one the original method gives, so at least it did not become worse.
For the problem that jieba cuts Traditional Chinese wrongly, it's mainly a problem of jieba rather than BBT, but BBT can still do something to avoid it - it's up to you.
I think one of these should be enough:
1. pre-conversion before jieba, by using maybe opencc-js as I said;
2. add a Traditional Chinese user dict to jieba - you could let users do this themselves.
🤖 this is your friendly neighborhood build bot announcing test build 6.7.53.3703 ("js-jieba cuts differently")
Install in Zotero by downloading test build 6.7.53.3703, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".
Confirmed this works for those Chinese-English mixed titles; it fixes the "Unreal Engine 4 蓝图完全学习教程" problem.
I exported the preferences to a json file, edited skipWords there, and imported the file back, since I did not find the right place to edit it in the GUI...
Ah I see, you can edit them in the hidden preferences.
Did it? I mean the result it gives does not differ from the one the original method gives, so at least it did not become worse.
It was mentioned earlier that 改革歷程 should be cut to [ '改革', '歷程' ] (which js-jieba does), not [ '改革', '歷', '程' ] (which is what ooooevan-jieba does, regardless of the 2nd parameter to cut).
For the problem that jieba cuts Traditional Chinese wrongly, it's mainly a problem of jieba rather than BBT, but still BBT can do something to avoid this - it's up to you:
I think one of these should be enough:
1. pre-conversion before jieba, by using maybe opencc-js as I said;
I don't see why I should prefer this over using js-jieba, which apparently does the right thing by default.
2. add a Traditional Chinese user dict to jieba - you could let users do this themselves.
I prefer solutions that do not add new preferences to the already overwhelming set that BBT has.
It was mentioned earlier that 改革歷程 should be cut to [ '改革', '歷程' ] (which js-jieba does), not [ '改革', '歷', '程' ] (which is what ooooevan-jieba does, regardless of the 2nd parameter to cut).
Well, certainly you could use js-jieba. It's better. I just meant that in case you would like to stick with ooooevan-jieba, the problem can be solved too.
Just note that js-jieba cannot cut Simplified Chinese strings correctly in tw mode, as this example shows:
edit: it cuts Unreal Engine 4 蓝图完全学习教程 to Unreal-Engine-4-蓝-图-完全-学-习-教程 (tw mode) or Unreal-Engine-4-蓝图-完全-学习-教程 (cn mode). But it is slow.
Maybe BBT should use tw mode only when the language is set to zh-TW.
I'm currently just not using tw mode.
I think tw mode may be better at processing zh-TW, since after all the library provides this mode separately... but as a user I'm OK with either, as long as it works well.
Language handling in BBT revolves around the languages supported by babel, and I don't see anything about tw there. Changing language handling would be too big a change to take on right now.
Actually, what I meant was BBT could refer to the language field in Zotero's items.
You could use tw for all Traditional Chinese locales, namely zh-Hant*, chinese-hant* and Chinese-traditional*. There are 9 such files in the zh folder.
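A hypothetical sketch of that mapping (the names are illustrative, not BBT's internals):

```typescript
type CutMode = 'cn' | 'tw'

// Traditional-Chinese locales (zh-Hant*, chinese-hant*, Chinese-traditional*, zh-TW)
// get tw mode; everything else falls back to cn mode.
function cutModeFor(language: string): CutMode {
  const lang = language.toLowerCase()
  return /hant|traditional|^zh[-_]?tw\b/.test(lang) ? 'tw' : 'cn'
}

cutModeFor('zh-Hant-TW')  // 'tw'
cutModeFor('zh-CN')       // 'cn'
```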
Actually, what I meant was BBT could refer to the language field in Zotero's items.
I understand that, but from the myriad of values that could be in the language field, I need to bring it back to a domain of languages that I know how to deal with. I take that info from the ini files in the babel repo. Adding zh-TW would be a special case. I might consider it, but I generally don't like special-case exceptions.
You could use tw for all Traditional Chinese locales, namely zh-Hant*, chinese-hant* and Chinese-traditional*. There are 9 such files in the zh folder.
I'm not even sure what tw means in the context of the current discussion. I'd strongly prefer it if such information could be added to the babel administration.
tw means Taiwan. Therefore, zh-TW is exactly zh-Hant or chinese-traditional - just like zh-CN is zh-Hans or chinese-simplified, and zh is just an abbreviation of zh-CN, since mainland China uses Simplified Chinese.
Babel in turn derives from https://cldr.unicode.org/, and there's nothing about tw there either. Locale handling is surprisingly complicated, and if CLDR hasn't gotten to it, I'll wait it out too rather than trying my own. If babel updates, BBT automatically follows suit, as I import their config during the build.
If you do a search in the cldr repo, you'll find that zh-TW is implicitly zh-Hant, such as in this file.
I notice that BBT converts en-US to american and en-CA to canadian, so why not do the same for zh-*? Actually, zh-CN should be converted to chinese, zh-TW should be converted to chinese-hant, zh-HK should be converted to chinese-hant-hk, etc. It is not precise to convert all zh-* to just chinese.
Here is a table of the correspondence between the codes of all Chinese variants.

| language - country/region (Zotero) | language - script (Babel) |
|---|---|
| zh, zh-CN, zh-Hans-CN | chinese |
| zh-HK, zh-Hant-HK | chinese-hant-hk |
| zh-Hans-HK | chinese-hans-hk |
| zh-MO, zh-Hant-MO | chinese-hant-mo |
| zh-Hans-MO | chinese-hans-mo |
| zh-SG, zh-Hans-SG | chinese-hans-sg |
| zh-Hant-SG | chinese-hant-sg |
| zh-TW, zh-Hant-TW | chinese-hant |
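The same correspondence as a lookup, in case a direct Zotero-code-to-babel-name mapping is ever useful (the babel-side names are the ones proposed in the table above, which I haven't checked against babel's ini files):

```typescript
const zoteroToBabel: Record<string, string> = {
  'zh': 'chinese', 'zh-CN': 'chinese', 'zh-Hans-CN': 'chinese',
  'zh-HK': 'chinese-hant-hk', 'zh-Hant-HK': 'chinese-hant-hk',
  'zh-Hans-HK': 'chinese-hans-hk',
  'zh-MO': 'chinese-hant-mo', 'zh-Hant-MO': 'chinese-hant-mo',
  'zh-Hans-MO': 'chinese-hans-mo',
  'zh-SG': 'chinese-hans-sg', 'zh-Hans-SG': 'chinese-hans-sg',
  'zh-Hant-SG': 'chinese-hant-sg',
  'zh-TW': 'chinese-hant', 'zh-Hant-TW': 'chinese-hant',
}
```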
If you do a search in the cldr repo, you'll find that zh-TW is implicitly zh-Hant, such as in this file.
Babel doesn't seem to follow suit though.
I notice that BBT convert en-US to american and en-CA to candian, so why not do the same to zh-*?
Because those are the primary names babel uses for en-US and en-CA, respectively. If Babel adds an ini file for zh-TW, I will absolutely do the same for zh-* -- in fact, it is automated.
I see. Thanks for your patience. Confirmed that test build 6.7.53.3703 works well with "改革歷程", so all problems are fixed, and thus I will close this issue.
Thanks for the feedback; there's no way you could have known, but @retorquere prefers to keep bugreports/enhancements open as a reminder to merge the changes into a new release.
A new build is incoming that adds two things:
- if the language on the item is zh-hant-TW or zh-TW, it will apply tw mode when cut is called implicitly. I hate that this is a special case though, so I'd love it if you could submit a request to babel.
- the .jieba filter can now be called as .jieba(cn) (same as .jieba) or as .jieba(tw).
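So presumably something like this would force the Traditional-Chinese behaviour in a key formula (untested, and I'm not sure at which point in the filter chain BBT expects it; see the BBT filter docs):

```
title.jieba(tw)
```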
if the language on the item is zh-hant-TW or zh-TW, it will apply tw mode when cut is called implicitly. I hate that this is a special case though, so I'd love it if you could submit a request to babel.
Sorry, I did not realize that babel does not support zh-TW, so now I think it's better to apply tw to, and only to, zh-Hant (which means exactly zh-Hant-TW), and to do nothing for zh-TW (just like zh-HK, zh-MO, etc). This way, it won't be a special case.
There is some intricate political stuff behind the region TW, so I'm afraid babel would not change it for at least several years...
New build incoming.
There is some intricate political stuff behind the region TW, so I'm afraid babel would not change it for at least several years...
If babel does not want to touch it, I probably do not either.
I agree, so I think applying tw to zh-Hant, at most, should be enough.
That sounds like something else than cutting though. Can you elaborate?
By the way, isn't zh-CN also a special case? It seems that babel does not use this code either.
Debug log ID: SDZWJFW5-refs-apse
What happened?
This is a problem with the Chinese word segmentation function in cite key generation. My formula is veryshorttitle(2,2). It seems that jieba is not applied to items with language "zh-CN", but only to those with language "zh".
For example, an item with the title "法医学铁道损伤图谱", whose pinyin is "FaYiXueTieDaoSunShangTuPu", is translated to "FayixueTiedao" when the language is set to "zh", but to "Fayixuetiedaosunshangtupu" when the language is set to "zh-CN".
However, Zotero recommends storing language as two-letter ISO language codes followed by two-letter ISO country codes (e.g., en-US for American English, or de-DE for German), so "zh-CN" should be the "standard" language code, rather than just "zh".
Maybe BBT should regard all languages whose code contains "zh" as Chinese.
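A minimal sketch of that check (hypothetical, not BBT's actual implementation; it tests for a leading "zh" rather than "contains zh" to avoid false positives):

```typescript
// Treat any language whose code starts with "zh" (zh, zh-CN, zh-Hant-TW, ...) as Chinese
const isChinese = (language: string): boolean => /^zh([-_]|$)/i.test(language.trim())

isChinese('zh')      // true
isChinese('zh-CN')   // true
isChinese('en-US')   // false
```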