retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

[Bug]: BBT does not recognize zh-CN #2391

Closed ZnqbuZ closed 1 year ago

ZnqbuZ commented 1 year ago

Debug log ID

SDZWJFW5-refs-apse

What happened?

This is a problem with Chinese word segmentation in cite key generation. My formula is veryshorttitle(2,2). It seems that jieba is not applied to items whose language is "zh-CN", only to those whose language is "zh".

For example, an item titled "法医学铁道损伤图谱", whose pinyin is "FaYiXueTieDaoSunShangTuPu", gets the key "FayixueTiedao" when the language is set to "zh", but "Fayixuetiedaosunshangtupu" when the language is set to "zh-CN".
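A minimal sketch of why segmentation changes the key. It assumes veryshorttitle(n, m) keeps the first n words of the title and capitalizes the first m of them; the function name `very_short_title` and the pre-segmented pinyin lists are illustrative, not BBT's actual code:

```python
def very_short_title(words, n=2, m=2):
    """Join the first n words, capitalizing the first m of them.

    `words` is the already transliterated title, one pinyin string per
    segmented word (hypothetical input; BBT does this internally).
    """
    picked = words[:n]
    return "".join(
        word.capitalize() if index < m else word.lower()
        for index, word in enumerate(picked)
    )


# With segmentation (language "zh"): four words, so the first two are kept.
segmented = ["fayixue", "tiedao", "sunshang", "tupu"]
# Without segmentation (language "zh-CN" in this report): one long word.
unsegmented = ["fayixuetiedaosunshangtupu"]

print(very_short_title(segmented))    # FayixueTiedao
print(very_short_title(unsegmented))  # Fayixuetiedaosunshangtupu
```

When the title is not segmented, the whole title counts as a single word, so taking "the first two words" returns everything.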

However, Zotero recommends storing the language as a two-letter ISO language code followed by a two-letter ISO country code (e.g., en-US for American English, or de-DE for German), so "zh-CN" should be the "standard" language code, rather than just "zh".

Maybe BBT should regard all languages whose code contains "zh" as Chinese.
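One way to express that suggestion, as a sketch only: rather than testing whether the code contains "zh" anywhere, compare the primary subtag of the language tag. The helper name `is_chinese` is hypothetical and this is not how BBT is known to implement it:

```python
def is_chinese(language):
    """Return True when the primary subtag of a language tag is "zh".

    Accepts both "-" and "_" as separators and ignores case, so
    "zh", "zh-CN", "zh_TW" and "ZH-Hant-HK" all count as Chinese.
    """
    if not language:
        return False
    primary = language.strip().replace("_", "-").split("-")[0]
    return primary.lower() == "zh"


for tag in ["zh", "zh-CN", "zh-Hant-HK", "en-US", ""]:
    print(repr(tag), is_chinese(tag))
```

Matching on the primary subtag avoids false positives from unrelated tags that merely contain the letters "zh".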

ZnqbuZ commented 1 year ago

> That sounds like something else than cutting though. Can you elaborate?

I noticed that babel has babel-zh-Hant.ini, which is in fact zh-TW, so maybe BBT can use tw when an item's language is zh-Hant.

github-actions[bot] commented 1 year ago

🤖 this is your friendly neighborhood build bot announcing test build 6.7.53.3709 ("cut tw")

Install in Zotero by downloading test build 6.7.53.3709, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

ZnqbuZ commented 1 year ago

I expected cn to be applied to all zh-* tags, with tw applied only to zh-Hant. It is hard to tell whether it really works this way, but I'm satisfied with the result it gives.

retorquere commented 1 year ago

Oh wait you mean for the language field. I'll have to look into that.

ZnqbuZ commented 1 year ago

Let me summarize BBT's current behaviour for the language field, so that it may help others in future:

  1. when the language field is zh-Hans, zh-Hans-HK, zh-Hans-MO or zh-Hans-SG, BBT treats it as chinese-simplified or chinese-simplified-%region%, where %region% is Hong Kong SAR, Macau SAR or Singapore respectively;
  2. when it is zh-Hant, zh-Hant-HK or zh-Hant-MO, BBT treats it as chinese-traditional or chinese-traditional-%region%;
  3. when it is zh or any other zh-* tag, BBT treats it as chinese (Simplified Chinese);
  4. for word segmentation, BBT uses tw when the field is zh-Hant; otherwise it uses zh.

I'm satisfied with this behaviour.
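The four rules above can be sketched as a small lookup. This is an illustration of the summary, not BBT's code: the function name `describe_language` is hypothetical, the region is returned as the raw subtag because the exact langid suffix is not given in this thread, and treating zh-Hant-HK and zh-Hant-MO like zh-Hant for the tw dictionary is an assumption.

```python
def describe_language(language):
    """Map a language field to (langid, region, dictionary).

    Returns None for tags whose primary subtag is not "zh".
    `region` is the raw region subtag or None; `dictionary` is the
    segmentation dictionary named in rule 4 ("tw" or "zh").
    """
    parts = language.strip().split("-")
    if parts[0].lower() != "zh":
        return None

    script = parts[1].lower() if len(parts) > 1 else ""
    if script == "hans":
        langid = "chinese-simplified"    # rule 1
    elif script == "hant":
        langid = "chinese-traditional"   # rule 2
    else:
        langid = "chinese"               # rule 3

    has_region = script in ("hans", "hant") and len(parts) > 2
    region = parts[2].upper() if has_region else None

    # Rule 4; applying "tw" to zh-Hant-* as well is an assumption.
    dictionary = "tw" if script == "hant" else "zh"
    return langid, region, dictionary


for tag in ["zh", "zh-CN", "zh-Hans-SG", "zh-Hant", "en-US"]:
    print(tag, describe_language(tag))
```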

github-actions[bot] commented 1 year ago

🤖 this is your friendly neighborhood build bot announcing test build 6.7.53.3712 ("add tw")

Install in Zotero by downloading test build 6.7.53.3712, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

ZnqbuZ commented 1 year ago

Great. Previously "新竹的交通大學要在2021年2月1日與台北的陽明大學合併" was cut to "新竹/的/交通/大學要/在/2/0/2/1/年/2/月/1/日/與/台北/的/陽明/大學合/併", which is wrong. (According to the js-jieba demo, it should be cut to "新竹/的/交通/大學/要/在/2/0/2/1/年/2/月/1/日/與/台北/的/陽明/大學/合併".)

Now changing the language to zh-Hant fixes this problem.
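To pinpoint where the two segmentations disagree, the cut positions can be compared directly. The sketch below uses only the two strings quoted above; the helper names `cut_offsets` and `compare_cuts` are illustrative and have nothing to do with jieba's API:

```python
def cut_offsets(segmented, sep="/"):
    """Return the set of character offsets at which a word ends."""
    offsets, position = set(), 0
    for token in segmented.split(sep):
        position += len(token)
        offsets.add(position)
    return offsets


def compare_cuts(actual, expected, sep="/"):
    """Return (missing, extra) cut offsets of `actual` versus `expected`."""
    if actual.replace(sep, "") != expected.replace(sep, ""):
        raise ValueError("segmentations cover different text")
    a, e = cut_offsets(actual, sep), cut_offsets(expected, sep)
    return sorted(e - a), sorted(a - e)


old = "新竹/的/交通/大學要/在/2/0/2/1/年/2/月/1/日/與/台北/的/陽明/大學合/併"
demo = "新竹/的/交通/大學/要/在/2/0/2/1/年/2/月/1/日/與/台北/的/陽明/大學/合併"

missing, extra = compare_cuts(old, demo)
print(missing)  # [7, 26]
print(extra)    # [27]
```

The old result misses the cuts after each "大學" (offsets 7 and 26) and adds a wrong cut inside "合併" (offset 27).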

retorquere commented 1 year ago

Can you submit the sample where you set the language to zh-Hant (right-click and send a debug log)? I want to add that to my test suite to prevent regressions.

ZnqbuZ commented 1 year ago

I've uploaded logs for 3 items: IZSN8DRM-refs-apse, G84S5263-refs-apse, 8E4QSEHL-refs-apse. The titles of these items are strings picked from the js-jieba demo. All of them are segmented correctly only when the language is set to zh-Hant.

ZnqbuZ commented 1 year ago

I also found that the pinyin of "於" is wrong. It should be "Yu", but BBT gives "Wu" when the language is set to "zh-*" and "Yu" otherwise. I've sent another log about this issue: RN8XNYVE-refs-apse. I think this is a bug in the transliteration function, since disabling jieba did not help, and an older version (6.7.50) also has this bug.

ZnqbuZ commented 1 year ago

I cannot open the BBT preferences via Tools -> Better BibTeX after installing 3712; nothing happens when I press the button. Can you reproduce this?

Edit: Even release 6.7.53 has this problem, but 6.7.50 seems fine.

retorquere commented 1 year ago

Given the citekey formula auth.fold.lower +"_"+ veryshorttitle(2,2) +"_"+ year, IZSN8DRM-refs-apse, G84S5263-refs-apse and 8E4QSEHL-refs-apse export to:

@book{_MeizhuJinbiao_,
  title = {梅竹錦標對抗賽},
  langid = {chinese-traditional}
}

@book{_XiaomingBiye_,
  title = {小明畢業於國立交通大學資訊科學與工程研究所},
  langid = {chinese-traditional}
}

@book{_XinzhuJiaotong_,
  title = {新竹的交通大學要在2021年2月1日與台北的陽明大學合併},
  langid = {chinese-traditional}
}

ZnqbuZ commented 1 year ago

Wait, I thought my formula was title.transliterate.capitalize. If you change the formula to this, you will see the difference in cite key between zh and zh-Hant.

ZnqbuZ commented 1 year ago

I've uploaded a new log, FPPGY6P5-refs-apse, containing these items. Their languages are set to zh-Hant, so the segmentations are correct. If you change them to zh, jieba gives different (hence wrong) segmentations.

retorquere commented 1 year ago

> I cannot open the BBT preferences via Tools -> Better BibTeX after installing 3712; nothing happens when I press the button. Can you reproduce this?
>
> Edit: Even release 6.7.53 has this problem, but 6.7.50 seems fine.

Please reproduce and send a debug log from the Help menu. I cannot replicate this.

retorquere commented 1 year ago

> Wait, I thought my formula was title.transliterate.capitalize. If you change the formula to this, you will see the difference in cite key between zh and zh-Hant.

Then I get

@book{FayixueTiedaoSunshangTupu,
  title = {法医学铁道损伤图谱},
  author = {肖, 发民},
  date = {2003},
  eprint = {gzA4AAAACAAJ},
  eprinttype = {googlebooks},
  publisher = {{郑州大学出版社}},
  abstract = {本书共收集图片400余幅并附以文字说明,以铁路上常见的各种伤亡为主,内容分为:辗轧伤、撞击、拖擦伤等共9章。},
  isbn = {978-7-81048-761-0},
  langid = {chinese},
  pagetotal = {153}
}

@book{FayixueTiedaoSunshangTupua,
  title = {法医学铁道损伤图谱},
  author = {肖, 发民},
  date = {2013},
  eprint = {gzA4AAAACAAJ},
  eprinttype = {googlebooks},
  publisher = {{郑州大学出版社}},
  abstract = {本书共收集图片400余幅并附以文字说明,以铁路上常见的各种伤亡为主,内容分为:辗轧伤、撞击、拖擦伤等共9章。},
  isbn = {978-7-81048-761-0},
  langid = {chinese},
  pagetotal = {153}
}

@book{GaigeLicheng,
  title = {改革歷程},
  author = {趙, 紫陽},
  date = {2009},
  eprint = {FVaOQQAACAAJ},
  eprinttype = {googlebooks},
  publisher = {{新世紀出版社}},
  isbn = {978-988-17202-7-6},
  langid = {chinese},
  pagetotal = {370}
}

@book{MeizhuJinbiaoDuikangsai,
  title = {梅竹錦標對抗賽},
  langid = {chinese-traditional}
}

@book{XiaomingBiyeWuGuoliJiaotongDaxueZixunKexueYuGongchengYanjiusuo,
  title = {小明畢業於國立交通大學資訊科學與工程研究所},
  langid = {chinese-traditional}
}

@book{XinzhuDeJiaotongDaxueYaoZai2021Nian2Yue1RiYuTaibeiDeYangmingDaxueHebing,
  title = {新竹的交通大學要在2021年2月1日與台北的陽明大學合併},
  langid = {chinese-traditional}
}

retorquere commented 1 year ago

We need to focus on one problem at a time. The conversation is getting fragmented.

ZnqbuZ commented 1 year ago

> Then I get

These are the correct segmentations and should be the same as in FPPGY6P5-refs-apse. The only difference is that I set the authors to "0" so that the items appear at the top of my library.

retorquere commented 1 year ago

So the remaining issues are then:

correct?

ZnqbuZ commented 1 year ago

Yes

retorquere commented 1 year ago

Let's look at the prefs window first. Please enable debug logging in the Help menu, open the prefs to replicate the problem, and then send a BBT debug from the Help menu.

ZnqbuZ commented 1 year ago

I submitted 2 logs: 2DHAVQ4U-apse with version 6.7.50, where the window shows, and 2NFZWHBX-apse with build 3712, where the problem occurs.

ZnqbuZ commented 1 year ago

I installed Zotero and build 3712 on a fresh virtual machine, and the problem still occurs. The debug log was submitted as 67ZW7L7V-apse.

retorquere commented 1 year ago

I don't see any activity indicating the prefs are opened in 67ZW7L7V-apse and 2NFZWHBX-apse. Is that what you're seeing? The prefs window does not open at all?

retorquere commented 1 year ago

Oh wait, forget Tools->Better BibTeX, just open the Zotero prefs. I'll remove that item under the Tools menu, that's not supposed to be there yet.

ZnqbuZ commented 1 year ago

> Oh wait, forget Tools->Better BibTeX, just open the Zotero prefs. I'll remove that item under the Tools menu, that's not supposed to be there yet.

Ah, I see. Then the only remaining problem is the pinyin of "於": as you can see, BBT converts it to Wu when the language is set to zh, which is wrong, but if I change the language to anything else, BBT converts it correctly to Yu. Quite strange.

github-actions[bot] commented 1 year ago

🤖 this is your friendly neighborhood build bot announcing test build 6.7.53.3717 ("upgrade pinyin lib")

Install in Zotero by downloading test build 6.7.53.3717, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

ZnqbuZ commented 1 year ago

Tested 3717 and the pinyin of "於" is correct now. Thank you.