retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License
5.2k stars 285 forks

[Bug]: BBT does not recognize zh-CN #2391

Closed ZnqbuZ closed 1 year ago

ZnqbuZ commented 1 year ago

Debug log ID

SDZWJFW5-refs-apse

What happened?

This is a problem with the Chinese word segmentation function used in citation key generation. My formula is veryshorttitle(2,2). It seems that jieba is only applied to items with language "zh", not to items with language "zh-CN".

For example, an item with the title "法医学铁道损伤图谱", whose pinyin is "FaYiXueTieDaoSunShangTuPu", is translated to "FayixueTiedao" when the language is set to "zh", but to "Fayixuetiedaosunshangtupu" when the language is set to "zh-CN".

However, Zotero recommends storing the language as a two-letter ISO language code followed by a two-letter ISO country code (e.g. en-US for American English, or de-DE for German), so "zh-CN" should be the "standard" language code, rather than just "zh".

Maybe BBT should treat all languages whose code contains "zh" as Chinese.
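For what it's worth, a check on the BCP-47 primary language subtag would cover "zh", "zh-CN", "zh-TW", "zh-Hans-CN", etc. all at once. A minimal sketch (this is my illustration, not BBT's actual code; the function name is made up):

```javascript
// Sketch: treat any BCP-47 tag whose primary language subtag is "zh"
// as Chinese, so the segmenter is triggered for all zh-* variants.
function isChinese(language) {
  if (!language) return false;
  // the primary language subtag is everything before the first hyphen
  const primary = language.toLowerCase().split('-')[0];
  return primary === 'zh';
}

console.log(isChinese('zh-CN')); // true
console.log(isChinese('zh'));    // true
console.log(isChinese('en-US')); // false
```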

ZnqbuZ commented 1 year ago

By the way, it seems that the segmentation function does not work properly on Traditional Chinese either. For example, an item with the title "改革歷程" (which is Traditional Chinese) is translated to "GaigeLiCheng", whose capitalization is wrong; but if I change the title to "改革历程", the simplified version of "改革歷程", it is translated to "GaigeLicheng", as expected.

I don't think this is a bug in jieba, since I have tested the string in the jieba demo and it gives the correct segmentation.

The debug log ID for this problem is 7XIF3YI8-refs-apse.

ZnqbuZ commented 1 year ago

The function also fails when the title contains English words. For example, "Windows 内核安全与驱动开发" is translated to "WindowS", which is obviously wrong, but if I change the title to "Windows内核安全与驱动开发" (deleting the space after "Windows"), BBT works fine and gives "WindowsNeihe", as expected.

As another example, "Unreal Engine 4 蓝图完全学习教程" is translated to "UnreaEngin", while "UnrealEngine4蓝图完全学习教程" is translated to "UnrealEngine4Lantu".

github-actions[bot] commented 1 year ago

:robot: this is your friendly neighborhood build bot announcing test build 6.7.53.3694 ("fixes #2391, part 1")

Install in Zotero by downloading test build 6.7.53.3694, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

ZnqbuZ commented 1 year ago

The function also fails when the title contains English words. For example, "Windows 内核安全与驱动开发" is translated to "WindowS", which is obviously wrong, but if I change the title to "Windows内核安全与驱动开发" (deleting the space after "Windows"), BBT works fine and gives "WindowsNeihe", as expected.

As another example, "Unreal Engine 4 蓝图完全学习教程" is translated to "UnreaEngin", while "UnrealEngine4蓝图完全学习教程" is translated to "UnrealEngine4Lantu".

This is not a bug in BBT: the jieba-js demo converts "Windows 内核安全与驱动开发" to "window- 内核 安全 驱动 开发" ("window- Neihe Anquan Qudong Kaifa" in pinyin). But BBT could work around this by replacing all non-CJK words with spaces before sending the string to jieba, and re-inserting them after processing, so that jieba never touches the English words.

Maybe a carefully designed formula could achieve this. I'm not sure.
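The protect-and-reinsert idea could be sketched roughly like this (illustrative only; `segmentCJK` and `fakeCut` are stand-ins I made up for a real jieba `cut` call, not BBT or jieba APIs):

```javascript
// Illustrative sketch of the workaround proposed above -- not BBT's actual
// code. Non-CJK runs are split out before segmentation and kept as-is, so
// the segmenter only ever sees CJK text.
function protectNonCJK(title, segmentCJK) {
  return title
    .split(/([\u4e00-\u9fff]+)/)   // capturing group keeps the CJK runs
    .map(run => run.trim())
    .filter(run => run !== '')
    .flatMap(run => /^[\u4e00-\u9fff]/.test(run) ? segmentCJK(run) : [run]);
}

// stand-in segmenter with one hard-coded cut, for demonstration only
const fakeCut = s =>
  s === '内核安全与驱动开发' ? ['内核', '安全', '与', '驱动', '开发'] : [s];

console.log(protectNonCJK('Windows 内核安全与驱动开发', fakeCut));
// → [ 'Windows', '内核', '安全', '与', '驱动', '开发' ]
```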

retorquere commented 1 year ago

That seems like something better addressed in jieba-js?

ZnqbuZ commented 1 year ago

Another problem, still with the example "Windows 内核安全与驱动开发": the character "与" means "and" in English, i.e. it is a function word, and hence should not appear in the citation key. Actually, jieba automatically deletes "与" ("Yu" in pinyin) in its demo, but the citation key still contains it.

In brief, the jieba demo gives "内核安全与驱动开发" -> "内核 安全 驱动 开发" -> "Neihe Anquan Qudong Kaifa", but BBT gives "内核安全与驱动开发" -> "Neihe Anquan Yu Qudong Kaifa".

ZnqbuZ commented 1 year ago

That seems like something better addressed in jieba-js?

Indeed. I don't think they should have overlooked this situation. I'll look into their documentation to see if there's a solution.

ZnqbuZ commented 1 year ago

:robot: this is your friendly neighborhood build bot announcing test build 6.7.53.3694 ("fixes #2391, part 1")

Install in Zotero by downloading test build 6.7.53.3694, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

Confirmed this works. Thanks a lot.

retorquere commented 1 year ago

I'm using this jieba lib btw: https://www.npmjs.com/package/ooooevan-jieba

There are javascript jieba libs with more recent updates, but I'm restricted to pure-javascript libs, and a number of the more recent ones use a C library under the hood, which I can't use.

retorquere commented 1 year ago

By the way, it seems that the segmentation function does not work properly on Traditional Chinese either. For example, an item with the title "改革歷程" (which is Traditional Chinese) is translated to "GaigeLiCheng", whose capitalization is wrong; but if I change the title to "改革历程", the simplified version of "改革歷程", it is translated to "GaigeLicheng", as expected.

I'm still looking into this.

retorquere commented 1 year ago

Another problem, still with the example "Windows 内核安全与驱动开发": the character "与" means "and" in English, i.e. it is a function word, and hence should not appear in the citation key. Actually, jieba automatically deletes "与" ("Yu" in pinyin) in its demo, but the citation key still contains it.

In brief, the jieba demo gives "内核安全与驱动开发" -> "内核 安全 驱动 开发" -> "Neihe Anquan Qudong Kaifa", but BBT gives "内核安全与驱动开发" -> "Neihe Anquan Yu Qudong Kaifa".

You can address this by adding either the Chinese character or its pinyin translation to the hidden pref skipWords.
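For illustration, the effect of skipWords on a segmented title might look like this (the pref name comes from this thread, but the filtering code is a sketch of mine, not BBT's implementation):

```javascript
// Sketch of the effect the skipWords pref has on a segmented title.
const skipWords = new Set(['yu']); // or the character 与 itself

function dropSkipWords(words) {
  // drop any segment listed in skipWords, case-insensitively
  return words.filter(w => !skipWords.has(w.toLowerCase()));
}

console.log(dropSkipWords(['Neihe', 'Anquan', 'Yu', 'Qudong', 'Kaifa']));
// → [ 'Neihe', 'Anquan', 'Qudong', 'Kaifa' ]
```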

retorquere commented 1 year ago

This library also works in Zotero and has seen more recent updates, but it's a fair bit slower, so it would have to have some quantifiable quality benefits: https://www.npmjs.com/package/js-jieba

edit: it cuts Unreal Engine 4 蓝图完全学习教程 to Unreal-Engine-4-蓝-图-完全-学-习-教程 (tw mode) or Unreal-Engine-4-蓝图-完全-学习-教程 (cn mode). But it is slow.

github-actions[bot] commented 1 year ago

:robot: this is your friendly neighborhood build bot announcing test build 6.7.53.3698 ("fixes #2391, part 2")

Install in Zotero by downloading test build 6.7.53.3698, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

retorquere commented 1 year ago

Interesting -- ooooevan-jieba cuts 改革歷程 to [ '改革', '歷', '程' ], whereas js-jieba cuts it to [ '改革', '歷程' ]. I don't know which of these is right.

js-jieba is also just incredibly slow to start up -- once it's loaded it's actually significantly faster than ooooevan-jieba.

ZnqbuZ commented 1 year ago

Interesting -- ooooevan-jieba cuts 改革歷程 to [ '改革', '歷', '程' ], whereas js-jieba cuts it to [ '改革', '歷程' ]. I don't know which of these is right.

js-jieba is also just incredibly slow to start up -- once it's loaded, it's actually significantly faster than ooooevan-jieba.

[ '改革', '歷程' ] is correct. The segmentation of a Traditional Chinese word should be exactly the same as the segmentation of its Simplified version.

Actually, you could convert all Chinese strings to Simplified ones before sending them to jieba, and then convert them back afterwards. This would be more accurate and not very time-consuming.

Some reasons:

  1. jieba was initially developed for Simplified Chinese, so I think its dictionary only contains words in Simplified form...

  2. Simplification just converts Traditional characters to Simplified ones. For example, "歷" goes to "历"; both "繞" and "遶" go to "绕". Therefore, simplification is basically a surjection from the set of Traditional characters onto the set of Simplified ones. Admittedly, some Traditional characters convert to multiple Simplified ones (for example, "乾" goes to both "乾" and "干"), but that is a very rare case.
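The pre-conversion idea can be illustrated with a toy character table. A real implementation would use a full converter such as opencc-js; the table below is made up and covers only the characters mentioned above:

```javascript
// Toy illustration of the trad->simp pre-conversion idea.
const TRAD_TO_SIMP = { '歷': '历', '繞': '绕', '遶': '绕' };

function toSimplified(text) {
  // characters missing from the table pass through unchanged
  return [...text].map(ch => TRAD_TO_SIMP[ch] ?? ch).join('');
}

console.log(toSimplified('改革歷程')); // → 改革历程
```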

ZnqbuZ commented 1 year ago

This library also works in Zotero and has seen more recent updates, but it's a fair bit slower, so it would have to have some quantifiable quality benefits: https://www.npmjs.com/package/js-jieba

edit: it cuts Unreal Engine 4 蓝图完全学习教程 to Unreal-Engine-4-蓝-图-完全-学-习-教程 (tw mode) or Unreal-Engine-4-蓝图-完全-学习-教程 (cn mode). But it is slow.

Well, in the jieba-js demo, as long as I disable the Porter Stemmer, it just ignores English words.

retorquere commented 1 year ago

you could convert all chinese strings to simplified ones before sending them to jieba

I wouldn't know how to.

but it's a very rare case

Which is why I want to leave this to libraries created by people who actually read and write Chinese (which is not me).

in the demo of jieba-js,

which I can't use, because it uses a C library under the hood

as long as I disable Porter Stemmer

I don't know what this means.

If jieba-js does the right thing, I'll use that. If another pure-javascript library does better, I'll test that.

ZnqbuZ commented 1 year ago

you could convert all chinese strings to simplified ones before sending them to jieba

I wouldn't know how to.

but it's a very rare case

Which is why I want to leave this to libraries created by people who actually read and write Chinese (which is not me).

in the demo of jieba-js,

which I can't use, because it uses a C library under the hood

as long as I disable Porter Stemmer

I don't know what this means.

If jieba-js does the right thing, I'll use that. If another pure-javascript library does better, I'll test that.

There is a Traditional-to-Simplified converter named opencc-js. You're right, though - I agree that this should be implemented in jieba rather than in BBT.

The Porter Stemmer is a feature of the jieba-js demo, which can be disabled under "Configuration".

Honestly, I think it's better to send only CJK characters (namely U+4E00 to U+9FFF) to jieba, and it seems that ooooevan-jieba has implemented this, at least in cutHMM, using a regex. Have you tried cut(str, true)? It gives the correct segmentation in the ooooevan-jieba readme.
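The CJK-range extraction mentioned here is easy to sketch (illustrative only, not ooooevan-jieba's actual code):

```javascript
// Sketch of the "send only CJK to jieba" idea: extract just the runs in
// the CJK Unified Ideographs block (U+4E00..U+9FFF).
function cjkRuns(text) {
  return text.match(/[\u4e00-\u9fff]+/g) ?? [];
}

console.log(cjkRuns('Unreal Engine 4 蓝图完全学习教程'));
// → [ '蓝图完全学习教程' ]
```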

ZnqbuZ commented 1 year ago

Another problem: still the example "Windows 内核安全与驱动开发". The character "与" means "and" in English, which is an empty word, and hence should not be in the cite key. Actually, jieba automatically delete "与" ("Yu" in pinyin) in its demo, but the citation key still contains it. In brief, jieba demo: "内核安全与驱动开发" -> "内核 安全 驱动 开发" -> "Neihe Anquan Qudong Kaifa" but BBT: "内核安全与驱动开发" -> "Neihe Anquan Yu Qudong Kaifa"

You can address this by adding either the Chinese character or its pinyin translation to the hidden pref skipWords.

Shouldn't skipWords be applied automatically to veryshorttitle (which I'm using)? I've tried test build 6.7.53.3698 ("fixes https://github.com/retorquere/zotero-better-bibtex/issues/2391, part 2"), but it doesn't seem to fix anything.

EroyalBoy commented 1 year ago

I am also affected by this issue. All my citation keys changed from lowercase to capitalized, e.g. xiaofeng2017 to XiaoFeng2017, which means all my notes have to be updated to match.

ZnqbuZ commented 1 year ago

I am also affected by this issue. All my citation keys changed from lowercase to capitalized, e.g. xiaofeng2017 to XiaoFeng2017, which means all my notes have to be updated to match.

Do you mean that your author names are mistakenly capitalized? Currently, just changing the language to "zh" will solve this, and the Jasminum plugin provides a batch function for this. If you would prefer not to change the language to "zh", you could use author.transliterate.clean.lower.

retorquere commented 1 year ago

Have you tried cut(str, true)? It gives correct segmentation in the readme of ooooevan-jieba

I'll try that

EroyalBoy commented 1 year ago

OK, I'll try.

retorquere commented 1 year ago

Porter Stemmer is a feature of the demo of jieba-js, which can be disabled in "Configuration".

Which, again, I cannot use, because it uses a C library under the hood. We need to stop discussing the issue in terms of what this specific library can or cannot do. There are other libraries I can actually use, which we're also looking at.

Shouldn't skipWords be automatically applied to veryshorttitle (I'm using it)? I've tried test build 6.7.53.3698 ("fixes #2391, part 2"), but it doesn't seem to fix anything?

It is applied, but you must still add yu or 与 to the skipWords.

Have you tried cut(str, true)? It gives correct segmentation in the readme of ooooevan-jieba

I have tried it, and it cuts to [ '改革', '歷', '程' ]

ZnqbuZ commented 1 year ago

It is applied, but you must still add yu or 与 to the skipWords.

I see. Adding to the json file works. Thanks.

I have tried it, and it cuts to [ '改革', '歷', '程' ]

I meant that cut(str, true) should cut English words correctly, as the docs show. At the very least, cutHMM should cut English words correctly, since ooooevan-jieba says it uses a regex to ignore English.

retorquere commented 1 year ago

I see. Adding to the json file works. Thanks.

To what json file?

I meant that cut(str, true) should cut English words correctly, as the docs show.

But that wouldn't really be a solution if it still cuts Chinese wrongly.

github-actions[bot] commented 1 year ago

:robot: this is your friendly neighborhood build bot announcing test build 6.7.53.3703 ("js-jieba cuts differently")

Install in Zotero by downloading test build 6.7.53.3703, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

ZnqbuZ commented 1 year ago

To what json file?

I exported my preferences to a JSON file, edited skipWords there, and imported the file back, since I could not find the right place to edit it in the GUI...

But that wouldn't really be a solution if it then still cuts Chinese wrongly

Did it? I mean, the result it gives does not differ from what the original method gives, so at least it did not get worse.

As for jieba cutting Traditional Chinese wrongly, that is mainly a problem with jieba rather than with BBT, but BBT could still do something to avoid it - it's up to you:

I think one of these should be enough:

  1. pre-conversion before jieba, e.g. using opencc-js as I mentioned;
  2. add a Traditional Chinese user dictionary to jieba - you could let users do this themselves.

ZnqbuZ commented 1 year ago

🤖 this is your friendly neighborhood build bot announcing test build 6.7.53.3703 ("js-jieba cuts differently")

Install in Zotero by downloading test build 6.7.53.3703, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

Confirmed this works for Chinese-English mixed titles; it fixes the "Unreal Engine 4 蓝图完全学习教程" problem.

retorquere commented 1 year ago

I exported preferences to a json file, edited skipWords there, and imported the file back, since I did not find the right place for editing it in GUI...

Ah I see, you can edit them in the hidden preferences.

Did it? I mean the result it gives does not differ from the one the original method gives, so at least it did not become worse.

It was mentioned earlier that 改革歷程 should be cut to [ '改革', '歷程' ] (which js-jieba does), not [ '改革', '歷', '程' ] (which is what ooooevan-jieba does, regardless of the 2nd parameter to cut).

For the problem that jieba cuts Traditional Chinese wrongly, it's mainly a problem of jieba rather than BBT, but still BBT can do something to avoid this - it's up to you:

I think one of these should be enough:

1. pre-conversion before jieba, e.g. using opencc-js as I mentioned;

I don't see why I should prefer this over using js-jieba, which apparently does the right thing by default.

2. add a Traditional Chinese user dictionary to jieba - you could let users do this themselves.

I prefer solutions that do not add new preferences to the already overwhelming set that BBT has.

ZnqbuZ commented 1 year ago

It was mentioned earlier that 改革歷程 should be cut to [ '改革', '歷程' ] (which js-jieba does), not [ '改革', '歷', '程' ] (which is what ooooevan-jieba does, regardless of the 2nd parameter to cut).

Well, certainly you could use js-jieba; it's better. I just meant that, in case you would like to stick with ooooevan-jieba, the problem can be solved too.

Just note that js-jieba cannot cut Simplified Chinese strings correctly in tw mode, as this example shows:

edit: it cuts Unreal Engine 4 蓝图完全学习教程 to Unreal-Engine-4-蓝-图-完全-学-习-教程 (tw mode) or Unreal-Engine-4-蓝图-完全-学习-教程 (cn mode). But it is slow.

Maybe BBT should use tw mode only when the language is set to zh-TW.

retorquere commented 1 year ago

I'm currently just not using tw mode.

ZnqbuZ commented 1 year ago

I'm currently just not using tw mode.

I think tw mode may handle zh-TW better, since after all the library provides this mode separately... but as a user I'm OK with either, as long as it works well.

retorquere commented 1 year ago

Language handling in BBT revolves around the languages supported by babel; I don't see anything about tw there. Changing language handling would be too big a change to take on right now.

ZnqbuZ commented 1 year ago

Language handling in BBT revolves around the languages supported by babel, I don't see anything about tw there. Changing language handling would be too big a change to take on right now.

Actually, what I meant was BBT could refer to the language field in Zotero's items.

You could use tw for all Traditional Chinese locales, namely zh-Hant*, chinese-hant* and Chinese-traditional*. There are 9 such files in the zh folder.

retorquere commented 1 year ago

Actually, what I meant was BBT could refer to the language field in Zotero's items.

I understand that, but from the myriad values that could be in the language field, I need to bring it back to a domain of languages that I know how to deal with. I take that info from the ini files in the babel repo. Adding zh-TW would be a special case. I might consider it, but I generally don't like special-case exceptions.

You could use tw for all Traditional Chinese locales, namely zh-Hant*, chinese-hant* and Chinese-traditional*. There are 9 such files in the zh folder.

I'm not even sure what tw means in the context of the current discussion. I'd strongly prefer it if such information could be added to the babel administration.

ZnqbuZ commented 1 year ago

tw means Taiwan. Therefore, zh-TW is exactly zh-Hant or chinese-traditional - just as zh-CN is zh-Hans or chinese-simplified, and zh is just an abbreviation for zh-CN, since mainland China uses Simplified Chinese.

retorquere commented 1 year ago

Babel in turn derives from https://cldr.unicode.org/, and there's nothing about tw there either. Locale-handling is surprisingly complicated, and if CLDR hasn't gotten to it, I'll wait it out too rather than trying my own. If babel updates, BBT automatically follows suit, as I import their config during the build.

ZnqbuZ commented 1 year ago

Babel in turn derives from https://cldr.unicode.org/, and there's nothing about tw there either. Locale-handling is surprisingly complicated, and if CLDR hasn't gotten to it, I'll wait it out too rather than trying my own. If babel updates, BBT automatically follows suit, as I import their config during the build.

If you do a search in the cldr repo, you'll find that zh-TW is implicitly zh-Hant, such as in this file.

I notice that BBT converts en-US to american and en-CA to canadian, so why not do the same for zh-*? Actually, zh-CN should be converted to chinese, zh-TW to chinese-hant, zh-HK to chinese-hant-hk, etc. It is not precise to convert all zh-* to just chinese.

ZnqbuZ commented 1 year ago

Here is a table of the correspondence between the codes of all the Chinese variants:

| language - country/region (Zotero) | language - script (Babel) |
| --- | --- |
| zh, zh-CN, zh-Hans-CN | chinese |
| zh-HK, zh-Hant-HK | chinese-hant-hk |
| zh-Hans-HK | chinese-hans-hk |
| zh-MO, zh-Hant-MO | chinese-hant-mo |
| zh-Hans-MO | chinese-hans-mo |
| zh-SG, zh-Hans-SG | chinese-hans-sg |
| zh-Hant-SG | chinese-hant-sg |
| zh-TW, zh-Hant-TW | chinese-hant |
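
The table above could be expressed as a simple lookup. This is a sketch of the issue author's proposed mapping, not what BBT currently ships:

```javascript
// Sketch of mapping Zotero language codes to babel names, per the table.
const ZH_TO_BABEL = {
  'zh': 'chinese', 'zh-cn': 'chinese', 'zh-hans-cn': 'chinese',
  'zh-hk': 'chinese-hant-hk', 'zh-hant-hk': 'chinese-hant-hk',
  'zh-hans-hk': 'chinese-hans-hk',
  'zh-mo': 'chinese-hant-mo', 'zh-hant-mo': 'chinese-hant-mo',
  'zh-hans-mo': 'chinese-hans-mo',
  'zh-sg': 'chinese-hans-sg', 'zh-hans-sg': 'chinese-hans-sg',
  'zh-hant-sg': 'chinese-hant-sg',
  'zh-tw': 'chinese-hant', 'zh-hant-tw': 'chinese-hant',
};

function babelName(zoteroLanguage) {
  // case-insensitive lookup; unknown codes fall through to null
  return ZH_TO_BABEL[zoteroLanguage.toLowerCase()] ?? null;
}

console.log(babelName('zh-TW')); // → chinese-hant
```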
retorquere commented 1 year ago

If you do a search in the cldr repo, you'll find that zh-TW is implicitly zh-Hant, such as in this file.

Babel doesn't seem to follow suit though.

I notice that BBT converts en-US to american and en-CA to canadian, so why not do the same for zh-*?

Because those are the primary names babel uses for en-US and en-CA, respectively. If babel adds an ini file for zh-TW, I will absolutely do the same for zh-* -- in fact, it's automated.

ZnqbuZ commented 1 year ago

I see. Thanks for your patience. Confirmed that test build 6.7.53.3703 works well with "改革歷程", so all the problems are fixed and I will close this issue.

github-actions[bot] commented 1 year ago

Thanks for the feedback; there's no way you could have known, but @retorquere prefers to keep bugreports/enhancements open as a reminder to merge the changes into a new release.

retorquere commented 1 year ago

A new build is incoming that adds two things:

  • if the language on the item is zh-hant-TW or zh-TW, it will apply tw mode when cut is called implicitly. I hate that this is a special case though, so I'd love it if you could submit a request to babel.

ZnqbuZ commented 1 year ago

Sorry, I did not realize that babel does not support zh-TW, so now I think it's better to apply tw mode to, and only to, zh-Hant (which means exactly zh-Hant-TW), and to do nothing for zh-TW (just like zh-HK, zh-MO, etc.). That way, it won't be a special case.

There is some intricate political context behind the region TW, so I'm afraid babel won't change this for at least several years...

retorquere commented 1 year ago

New build incoming.

retorquere commented 1 year ago

There is some intricate political context behind the region TW, so I'm afraid babel won't change this for at least several years...

If babel does not want to touch it, I probably do not either.

ZnqbuZ commented 1 year ago

If babel does not want to touch it, I probably do not either.

I agree, so I think applying tw mode to zh-Hant at most is enough.

retorquere commented 1 year ago

I agree, so I think applying tw mode to zh-Hant at most is enough.

That sounds like something other than cutting, though. Can you elaborate?

ZnqbuZ commented 1 year ago

By the way, isn't zh-CN also a special case? It seems that babel does not use that code either.