retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License
5k stars 277 forks

Treat ideographs as individual words for key generation #1353

Closed · forallsunday closed this issue 4 years ago

forallsunday commented 4 years ago

Hi, I love Better BibTeX in Zotero for citing references; however, if the title is in Chinese, the citation key becomes far too long.

For example, if the format is set to [auth:lower]_[year]_[shorttitle3_3], I get:

@phdthesis{liwenhui_2014_HangPaiShiPinZhongYunDongMuBiaoDeJianCeYuGenZongSuanFaYanJiu,
  author = {{李文辉}},
  title = {航拍视频中运动目标的检测与跟踪算法研究},
  type = {{{PhD Thesis}}},
  school = {西安: 西安电子科技大学},
  year = {2014}
}

The spelling is correct, but the citation key contains the whole title. Is there any way to customize the length of the citation key? Something like "liwenhui_2014_HangPaiShiPin", i.e. keeping only the first 4 capitalized words when exporting to .bib?

retorquere commented 4 years ago

Is every character a word in Chinese?

In the mail I received when the issue was opened I also see something about a postscript, which I don't see here. If that's still relevant, open a separate issue for that please. I don't want to be handling multiple concerns in one issue.

forallsunday commented 4 years ago

> Is every character a word in Chinese?
>
> In the mail I received when the issue was opened I also see something about a postscript, which I don't see here. If that's still relevant, open a separate issue for that please. I don't want to be handling multiple concerns in one issue.

In Chinese, a sentence consists of characters without spaces. Zotero regards such a sentence as a single word, which is actually quite common in software. Some applications, like MS Word and Google Chrome, have some kind of function for segmenting a sentence into words. I think the easier way to keep the citation key of a Chinese title slim in the .bib file is to keep only the first N words when exporting: find the first N capital characters and keep the corresponding words. Maybe this would work with a postscript, but sadly I have no experience with JavaScript.

And the postscript issue was a stupid mistake I made: I exported BibTeX instead of Better BibTeX. So I edited the issue and deleted that part.

retorquere commented 4 years ago

I have a fix in mind, will take a few days.

retorquere commented 4 years ago

@duncdrum does something similar hold for Hiragana/Katakana/Kanji?

retorquere commented 4 years ago

@forallsunday can you right-click that reference and send a BBT debug report from the menu that pops up?

retorquere commented 4 years ago

I have a sample reference with the title 日本型排外主義: 在特会・外国人参政権・東アジア地政学.

With the pattern [auth.etal][veryshorttitle][year] (first author, first word of title, year), it would usually generate the citekey higuchinippon2014; if I treat each character as a separate word, I get higuchinichi2014. I have no idea which of these is preferable, or sensible. I have zero knowledge of Chinese.

duncdrum commented 4 years ago

I wouldn't really activate this for Japanese. The transliteration of individual characters varies quite a bit more depending on whether they appear as a 1-gram or as part of a 2-gram. higuchinippon2014 is perfect in my view. higuchinichi2014 is technically correct but a bit funny and rather misleading.

In Chinese you might cut a 2-character word in two, but the transliteration would mostly be the same, so Zhongwen (one word) becomes ZhongWen, or just Zhong if it sits at the end of the 4-character string. All still perfectly usable, and a solution to the OP's problem.

retorquere commented 4 years ago

It would solve the OP's problem, but you say that the result is funny and misleading? higuchinippon2014 is what BBT currently generates. higuchinichi2014 is what the proposed change would generate.

retorquere commented 4 years ago

Wait, is 日本型排外主義: 在特会・外国人参政権・東アジア地政学 Japanese, not Chinese? Well, that's not great. According to xregexp, "Han" (which is what I took to be Chinese) is covered by [\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9], but that causes the change from higuchinippon2014 to higuchinichi2014.

Which of https://github.com/slevithan/xregexp/blob/master/tools/output/scripts.js constitutes Chinese?

retorquere commented 4 years ago

Man I feel so culturally limited right now. I swear I've traveled. Just not East.

duncdrum commented 4 years ago

At first look I'd say their Han is the CJK supergroup. 日本型排外主義: 在特会・外国人参政権・東アジア地政学 is definitely Japanese, though. Both citekeys pick this up and use Japanese transliteration.

retorquere commented 4 years ago

Supergroup?

Dammit. This:

console.log("日本型排外主義: 在特会・外国人参政権・東アジア地政学".replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9])/g, '--$1--'))

outputs

--日----本----型----排----外----主----義--: --在----特----会--・--外----国----人----参----政----権--・--東--アジア--地----政----学--

Which means the Han range according to xregexp picks up Japanese.

The transliteration is done by a different library (or two, if you have Kuroshiro enabled). Xregexp has its own ideas about character ranges. Back to the drawing board. Is it unreasonable to expect something called Han to pick out Chinese?

retorquere commented 4 years ago

Damn. As per https://medium.com/the-artificial-impostor/detecting-chinese-characters-in-unicode-strings-4ac839ba313a and https://salesforce.stackexchange.com/questions/127565/regular-expression-to-find-chinese-characters:

The most commonly used CJKV ideographs are found in the Unicode CJK Unified Ideographs Block*. Many of these characters are used by multiple languages, and what will make your regex difficult is that the characters aren't separated by language. There is no sub-block for "Just Chinese" or "Chinese and Japanese". Unicode has the characters ordered by radical and stroke number, which means that they are all interspersed. Your regular expression would have to look for a very large number of small ranges and individual code points.

I don't mind "a very large number of small ranges and individual code points" (although I'd need help identifying them), but if characters can be used in both Chinese and Japanese, then there's no general solution to this that won't piss someone off. If a (perhaps convoluted) list of only-but-all Chinese characters is simply not possible because the languages share logograms, I can add a filter function that lets you do what you want. It'd look something like

[Title:split-logograms:select=1:lower]

I'm open to suggestions for the filter name. split-logograms doesn't really sound great to me.
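
For illustration, here's a minimal sketch of what such a filter could do, assuming it simply pads every character in the xregexp Han range quoted above with spaces, so that downstream filters see each ideograph as its own word (this is only a sketch, not BBT's actual code; splitLogograms is a hypothetical name):

const han = /([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9])/g

// surround every Han character with spaces, then collapse runs of whitespace
function splitLogograms(title) {
  return title.replace(han, ' $1 ').replace(/\s+/g, ' ').trim()
}

console.log(splitLogograms('航拍视频中运动目标的检测与跟踪算法研究'))
// -> 航 拍 视 频 中 运 动 目 标 的 检 测 与 跟 踪 算 法 研 究

A select=1,3 applied afterwards would then keep only the first three ideographs instead of the full title.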

duncdrum commented 4 years ago

I can help you with identifying the ranges, but the question is what the goal is. There is a comparatively small number of codepoints that only appear in a specific language, so if one of them occurs we can be certain we're dealing with that language and not the other, but otherwise we can't.

But take, e.g., 日本: this word exists in all CJK languages, and there is no guarantee that a given string will contain one of the unique markers.

The same result could be achieved by querying the UCD (Unicode Character Database) directly.

duncdrum commented 4 years ago

The maybe obvious question: why not implement a fixed string-length limit for Han citekeys? And remind me again, do we have access to Zotero's and Juris-M's language entries?

retorquere commented 4 years ago

Yeah so "detect Chinese characters" is a non-starter. That's good to know.

I do have access to the language field, but it's a free-form field -- there's no standard way to determine that the field means to express "Chinese". I could detect the literal string "Chinese" of course, but I already do language detection for English, where I have to resort to also detecting "Anglais", and this monstrosity, to semi-reliably detect "I mean to say this item is in English".

Given the fact that I can't even tell apart written Chinese and Japanese (and probably Korean? More?), I wouldn't know where to begin.

forallsunday commented 4 years ago

Thank you for helping me. I clicked the report button in Zotero. And I know it is difficult to determine whether the title is Chinese or Japanese. For now, the transliteration of the title is correct, so I use 'substring' to limit the length of the citekey. The format is: [auth:capitalize][year][veryshorttitle:substring=1,12]

The title in Chinese: 航拍视频中运动目标的检测与跟踪算法研究

Without substring, the citekey is: LiWenHui_2014_HangPaiShiPinZhongYunDongMuBiaoDeJianCeYuGenZongSuanFaYanJiu

With substring: LiWenHui_2014_HangPaiShiPi

However, just cutting off at the 12th letter of the title isn't always right; sometimes the substring is kind of confusing. So I'm wondering whether there's a way to detect the capital letters and keep the corresponding words when exporting; compared to detecting the language of the title and segmenting it, this might be easier. For example, finding the first four capital letters (H, P, S, P) and keeping the corresponding words (HangPaiShiPin) would give the citekey LiWenHui_2014_HangPaiShiPin, which is more accurate and readable.

retorquere commented 4 years ago

What is the report ID? I can't tell which user sent what debug report.

I understand the problem. I just don't know how I'll solve it. I'd rather not resort to re-parsing the citekey for capitals, as capitals can end up in the key for a variety of reasons and are not necessarily word boundaries (e.g. an author named McIntire). A filter like split-logograms (or whatever we end up calling it) will be better.

blip-bloop commented 4 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.1.170.5325 ("split-logograms")

Install in Zotero by downloading test build 5.1.170.5325, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

forallsunday commented 4 years ago

I reported again; the report ID is 264NYYVG-apse. Yes, you are right. And I am wondering: does the citekey have to be English? I tried using Chinese citekeys with vscode-latex-workshop, and it compiled fine. So maybe adding a toggle that sets the citekey to English or to the original language is another way.

retorquere commented 4 years ago

When I import the reference in 264NYYVG-apse and generate a citekey with [auth:lower]_[year]_[shorttitle3_3], I get ribunhui_2014_KoHakuShi, not liwenhui_2014_HangPaiShiPinZhongYunDongMuBiaoDeJianCeYuGenZongSuanFaYanJiu.

Citekeys don't need to be English. If you go into the BBT preferences, turn off Force citation key to plain text, and set the pattern to [Auth]_[year]_[Title:select=1,3] (note the capitals in the pattern; these are explained here), you'll get Chinese keys. I can add a filter function (which I'm calling split-logograms for now, but feel free to suggest something else) so that [Auth]_[year]_[Title:split-logograms:select=1,3] would make separate words from the logograms, and then select the first 3.
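
To illustrate the intended difference with the thesis entry at the top of this issue (these keys are only a sketch of the expected behaviour, not verified output):

[Auth]_[year]_[Title:select=1,3] -> 李文辉_2014_航拍视频中运动目标的检测与跟踪算法研究 (the whole title counts as one word)
[Auth]_[year]_[Title:split-logograms:select=1,3] -> 李文辉_2014_航拍视 (each ideograph counts as a word, and the first three are kept)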

blip-bloop commented 4 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.1.170.5328 ("remove spaces always")

Install in Zotero by downloading test build 5.1.170.5328, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

duncdrum commented 4 years ago

> However, just cutting off at the 12th letter of the title isn't always right; sometimes the substring is kind of confusing.

> In Chinese you might cut a 2-character word in two, but the transliteration would mostly be the same, so Zhongwen (one word) becomes ZhongWen, or just Zhong if it sits at the end of the 4-character string.

@forallsunday You are right that properly detecting n-gram length would be preferable, but in the short term, wouldn't you agree that HangPaiShi is preferable to the full string? It should be Hangpai Shipin, but unlike Japanese, where nippon 日本 [Japan] becomes nichi 日 [day], I feel a broken binom is still serviceable in Chinese.

retorquere commented 4 years ago

But split-logograms would fix that right? 5328 has it.

duncdrum commented 4 years ago

No, split-logograms would, I think, create HangPaiShi if set to 3, which in my view is a great improvement. The title actually has a binom-binom-single structure, so the first three characters are HangPaiShi but the first three words are Hangpai Shipin zhong. Doing the one-character-one-word thing will still solve the much bigger problem of ridonculously long titles. There are NLP libs that would tell us what the structure of a given character sequence is; I just don't think it's worth the effort/overhead. Curious to hear what the OP thinks.

forallsunday commented 4 years ago

@duncdrum Yes, you are right. Theoretically, Hangpai Shipin zhong is more accurate. In practice, HangPaiShi is good enough.

@retorquere I tried the test build 5.1.170.5328 and it works fine. Thank you very much, I appreciate it.

retorquere commented 4 years ago

So I can't distinguish binom from single without NLP? Definitely not doing NLP.

retorquere commented 4 years ago

Any suggestions for an alternate name for the filter?

forallsunday commented 4 years ago

Something like 'character-segmentation'?

retorquere commented 4 years ago

Too generic. It specifically picks out the characters XRegExp deems Han. It won't touch any other character.

retorquere commented 4 years ago

I also prefer verbs for filters

duncdrum commented 4 years ago

How about split-ideographs, as the regex really builds on the ideograph definition of Unicode?

forallsunday commented 4 years ago

I think 'pinyin' is good. It's concise, though difficult for non-Chinese users to understand. However, since the user is citing Chinese literature, I think only Chinese users would use this filter. By the way, pinyin is the romanization of Chinese characters; for example, the pinyin of '这是中文' is 'Zhe Shi ZhongWen'.

retorquere commented 4 years ago

@forallsunday I'm not sure whether you're making a general comment or suggesting something specific.

forallsunday commented 4 years ago

just a suggestion

retorquere commented 4 years ago

It's not clear to me what you're suggesting, sorry.

forallsunday commented 4 years ago

It's okay. Anyway, thanks for helping me. Hope you find a better name.

retorquere commented 4 years ago

I'm not familiar with the field. If nothing else is suggested, I'm going with Duncan's suggestion.

retorquere commented 4 years ago

Just to be clear: I don't have an opinion on the matter. Something-something pinyin would be fine by me, but pinyin is not a verb.

forallsunday commented 4 years ago

Understood. split-ideographs is a good one.

retorquere commented 4 years ago

With the entry you sent earlier in the log, [Auth]_[year]_[Title:split-logograms:select=1,3] generates ribunHui_2014_kohakuShi.

retorquere commented 4 years ago

(that's with "force to plain-text" on)

blip-bloop commented 4 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.1.170.5330 ("test case for #1353, fixes #1353 (#norelease)")

Install in Zotero by downloading test build 5.1.170.5330, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

forallsunday commented 4 years ago

That's weird, I copied and pasted the report ID from Zotero. And I can't understand ribunHui_2014_kohakuShi; it sounds like Japanese.

retorquere commented 4 years ago

I had Kuroshiro on; without it (and with [Auth]_[year]_[Title:split-ideographs:select=1,3]), I get LiWenHui_2014_HangPaiShi.

retorquere commented 4 years ago

Acceptable? Because then I can cut a new release.

forallsunday commented 4 years ago

Yes, thank you very much

retorquere commented 4 years ago

Alright, the new release is building and should drop in 30 minutes or so.

happyTonakai commented 4 years ago

[Title:split-ideographs:select=1,3] works well. @retorquere But how do I get the lowercase form of the title? I use [auth:lower][year][Title:split-ideographs:select=1,1:lower] but get liwenhui2014Hang. The lower filter works on everything else but ideographs.

retorquere commented 4 years ago

You don't have to @tag me, I get notifications for anything that's posted to the issue tracker.

Can you right-click an item where this happens and send a debug-log from the menu that pops up?