Closed forallsunday closed 4 years ago
Is every character a word in Chinese?
In the mail I received when the issue was opened I also see something about a postscript, which I don't see here. If that's still relevant, open a separate issue for that please. I don't want to be handling multiple concerns in one issue.
In Chinese, a sentence consists of characters without spaces. Zotero regards such a sentence as a single word, which is actually quite common in a lot of software. Some software, like MS Word and Google Chrome, has some kind of function for separating the characters of a sentence into words. I think the easier way to keep the citation key of a Chinese title slim in the .bib file is to keep the first N words when exporting. Finding the first N capital characters and keeping the relevant words might work via a postscript. Sadly, I have no experience with JavaScript.
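As an aside, the kind of dictionary-based word segmentation described above (what Word and Chrome do) is nowadays exposed in JavaScript engines through the standard `Intl.Segmenter` API. This is only an illustration of the segmentation idea, not something Better BibTeX uses; the helper `firstNWords` is my own sketch:

```javascript
// Sketch: dictionary-based Chinese word segmentation via the
// standard Intl.Segmenter API (ECMA-402). Illustration only --
// not what Better BibTeX actually does.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });

function firstNWords(sentence, n) {
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(sentence)) {
    if (isWordLike) words.push(segment); // skip punctuation/whitespace
    if (words.length === n) break;
  }
  return words;
}

// The exact segmentation depends on the engine's ICU dictionary.
console.log(firstNWords('航拍视频中运动目标的检测与跟踪算法研究', 3));
```

Segmentation quality varies with the ICU data shipped by the engine, so the exact word boundaries are not guaranteed to match what MS Word or Chrome produce.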
And the postscript issue was a silly mistake I made: I exported BibTeX instead of Better BibTeX. So I've edited the issue and deleted that part.
I have a fix in mind, will take a few days.
@duncdrum does something similar hold for Hiragana/Katakana/Kanji?
@forallsunday can you right-click that reference and send a BBT debug report from the menu that pops up?
I have a sample reference for which the pattern `[auth.etal][veryshorttitle][year]` (first author, first word of title, year) would usually generate the citekey `higuchinippon2014`; if I treat each character as a separate word, I get `higuchinichi2014`. I have no idea which of these is preferable, or sensible. I have zero knowledge of Chinese.
I wouldn't really activate this for Japanese. The transliteration of individual characters varies quite a bit more depending on whether they appear as a 1-gram or as part of a 2-gram. `higuchinippon2014` is perfect in my view; `higuchinichi2014` is technically correct but a bit funny and rather misleading.
In Chinese you might cut a 2-character word in two, but the transliteration would mostly be the same, so `Zhongwen` (one word) becomes `ZhongWen`, or `Zhong` if it falls at the end of the 4-character string. All still perfectly usable, and a solution to the OP's problem.
It would solve the OP's problem, but you say that the result is funny and misleading? `higuchinippon2014` is what BBT currently generates; `higuchinichi2014` is what the proposed change would generate.
Wait, is 日本型排外主義: 在特会・外国人参政権・東アジア地政学 Japanese, not Chinese? Well, that's not great. According to xregexp, "Han" (which is what I took to be Chinese) is covered by `[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]`, but that causes the change from `higuchinippon2014` to `higuchinichi2014`.
Which of https://github.com/slevithan/xregexp/blob/master/tools/output/scripts.js constitutes Chinese?
Man I feel so culturally limited right now. I swear I've traveled. Just not East.
At first look I'd say their `han` is the supergroup CJK.
日本型排外主義: 在特会・外国人参政権・東アジア地政学 is definitely Japanese though. Both citekeys pick this up, though, and use Japanese transliteration.
Supergroup?
Dammit. This:
console.log("日本型排外主義: 在特会・外国人参政権・東アジア地政学".replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9])/g, '--$1--'))
outputs
--日----本----型----排----外----主----義--: --在----特----会--・--外----国----人----参----政----権--・--東--アジア--地----政----学--
Which means the `Han` range according to xregexp picks up Japanese.
The transliteration is done by a different library (or two, if you have Kuroshiro enabled). Xregexp has its own ideas on character ranges. Back to the drawing board. Is it unreasonable to expect something called `Han` to pick out Chinese?
Damn. As per https://medium.com/the-artificial-impostor/detecting-chinese-characters-in-unicode-strings-4ac839ba313a and https://salesforce.stackexchange.com/questions/127565/regular-expression-to-find-chinese-characters:
The most commonly used CJKV ideographs are found in the Unicode CJK Unified Ideographs Block*. Many of these characters are used by multiple languages, and what will make your regex difficult is that the characters aren't separated by language. There is no sub-block for "Just Chinese" or "Chinese and Japanese". Unicode has the characters ordered by radical and stroke number, which means that they are all interspersed. Your regular expression would have to look for a very large number of small ranges and individual code points.
I don't mind "a very large number of small ranges and individual code points" (although I'd need help identifying them), but if characters can be used in both Chinese and Japanese, then there's no general solution to this that won't piss someone off. If a (perhaps convoluted) list of only-but-all Chinese characters is simply not possible because the languages share logograms, I can add a filter function that lets you do what you want. It'd look something like `[Title:split-logograms:select=1:lower]`. I'm open to suggestions for the filter name; `split-logograms` doesn't really sound great to me.
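A minimal sketch of what such a filter could do: insert word boundaries around every Han character and leave everything else alone. The function names and the `\p{Script=Han}` property escape here are my own illustration (BBT's actual implementation builds on xregexp's explicit ranges):

```javascript
// Sketch of a split-logograms-style filter: treat every Han character
// as its own word by surrounding it with spaces, then normalize spacing.
// Non-Han characters (Latin, kana, punctuation) are left untouched.
function splitIdeographs(title) {
  return title
    .replace(/(\p{Script=Han})/gu, ' $1 ')
    .replace(/\s+/g, ' ')
    .trim();
}

// A select=1,3-style step would then just take the first three "words":
function selectWords(title, from, to) {
  return splitIdeographs(title).split(' ').slice(from - 1, to).join('');
}

console.log(splitIdeographs('日本型排外主義')); // "日 本 型 排 外 主 義"
console.log(selectWords('航拍视频中运动目标', 1, 3)); // "航拍视"
```

Note this deliberately treats each logogram as a one-character word; it makes no attempt to find real multi-character word boundaries.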
I can help you with identifying the ranges, but the question is what the goal is. There is a comparatively small number of codepoints that only appear in a specific language. So if one of them is used, we can be certain we're dealing with one language and not the other.
But take, e.g., 日本: this word exists in all CJK languages, and there is no guarantee that a given string will contain one of the unique markers.
The same result could be achieved by querying the UCD directly. The maybe-obvious question: why not implement a fixed string-length limit for `han` citekeys? And remind me again, do we have access to Zotero's and Juris-M's `language` entries?
Yeah so "detect Chinese characters" is a non-starter. That's good to know.
I do have access to the language field, but they're free-form fields -- there's no standard way to determine that the field means to express "Chinese". I could detect the literal string "Chinese" of course, but I do language detection for English, where I have to resort to also detecting "Anglais", and this monstrosity, to semi-reliably detect "I mean to say this item is in English".
Given the fact that I can't even tell apart written Chinese and Japanese (and probably Korean? More?), I wouldn't know where to begin.
Thank you for helping me. I clicked the report button in Zotero. And I know it is difficult to determine whether the title is Chinese or Japanese. For now the transliteration of the title is correct, so I use `substring` to limit the length of the citekey. The format is `[auth:capitalize][year][veryshorttitle:substring=1,12]`. The title in Chinese is 航拍视频中运动目标的检测与跟踪算法研究. Without `substring`, the citekey is `LiWenHui_2014_HangPaiShiPinZhongYunDongMuBiaoDeJianCeYuGenZongSuanFaYanJiu`; after using `substring`, it is `LiWenHui_2014_HangPaiShiPi`.
However, just cutting off at the 12th letter of the title isn't always right; sometimes the substring will be kind of confusing. So I'm thinking: if there were a way to detect the capital letters and keep the relevant words when exporting, that might be easier than detecting the language of the title and segmenting it. For example, finding the first four capital letters, H, P, S, P, and keeping the relevant words `HangPaiShiPin`, the citekey would be `LiWenHui_2014_HangPaiShiPin`. This citekey is more accurate and readable.
What is the report ID? I can't tell which user sent what debug report.
I understand the problem; I just don't know how I'll solve it. I'd rather not resort to re-parsing the citekey for capitals, as capitals can end up in the key for a variety of reasons and are not necessarily word boundaries (e.g. an author named McIntire). A filter like `split-logograms` (or whatever we end up calling it) will be better.
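To make the McIntire point concrete, here is a toy illustration (not BBT code) of why splitting a transliterated key on capital letters gives wrong word boundaries:

```javascript
// Toy illustration: treating every uppercase letter as a word boundary
// works for capitalized pinyin syllables but breaks names like "McIntire".
const splitOnCapitals = (s) => s.split(/(?=[A-Z])/);

console.log(splitOnCapitals('HangPaiShiPin')); // ["Hang","Pai","Shi","Pin"] -- looks right
console.log(splitOnCapitals('McIntire'));      // ["Mc","Intire"] -- wrong: one name, two "words"
```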
:robot: this is your friendly neighborhood build bot announcing test build 5.1.170.5325 ("split-logograms")
Install in Zotero by downloading test build 5.1.170.5325, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".
I reported again, and the report ID is 264NYYVG-apse. Yes, you are right. And I'm wondering: does the citekey have to be English? I tried using Chinese citekeys in vscode-latex-workshop, and it compiled fine. So maybe a toggle that sets the citekey to English or to the original language would be another way.
When I import the reference in 264NYYVG-apse and generate a citekey with `[auth:lower]_[year]_[shorttitle3_3]`, I get `ribunhui_2014_KoHakuShi`, not `liwenhui_2014_HangPaiShiPinZhongYunDongMuBiaoDeJianCeYuGenZongSuanFaYanJiu`.
Citekeys don't need to be English. If you go into the BBT preferences, turn off "Force citation key to plain text", and set the pattern to `[Auth]_[year]_[Title:select=1,3]` (note the capitals in the pattern; these are explained here), you'll get Chinese keys. I can add a filter function (which I'm calling `split-logograms` for now, but feel free to suggest something else) so that `[Auth]_[year]_[Title:split-logograms:select=1,3]` would make separate words from the logograms and then select the first 3.
:robot: this is your friendly neighborhood build bot announcing test build 5.1.170.5328 ("remove spaces always")
Install in Zotero by downloading test build 5.1.170.5328, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".
However, just cutting off at the 12th letter of the title isn't always right; sometimes the substring will be kind of confusing.
In Chinese you might cut a 2-character word in two, but the transliteration would mostly be the same, so `Zhongwen` (one word) becomes `ZhongWen` or `Zhong` if at the end of the 4-character string.
@forallsunday You are right that proper detection of n-gram length is preferable, but in the short term wouldn't you agree that `HangPaiShi` would be preferable to the full string? It should be `Hangpai Shipin`, but unlike Japanese, where `nippon` 日本 [Japan] becomes `nichi` 日 [day], I feel a broken binom is still serviceable in Chinese.
But split-logograms would fix that right? 5328 has it.
No, split-logograms would, I think, create `HangPaiShi` if set to 3, which in my view is a great improvement.
The title actually has the structure binom-binom-single, so the first three characters are `HangPaiShi`, but the first three words are `Hangpai Shipin zhong`. Doing the one-character-one-word thing will still solve the much bigger problem of ridiculously long titles. There are NLP libraries that would tell us what the structure of a given character sequence is; I just don't think it's worth the effort/overhead. Curious to hear what the OP thinks.
@duncdrum Yes, you are right. Theoretically, `Hangpai Shipin zhong` is more accurate. In practice, `HangPaiShi` is good enough.
@retorquere I tried the test build 5.1.170.5328 and it worked fine. Thank you very much, much appreciated.
So I can't distinguish binom from single without NLP? Definitely not doing NLP.
Any suggestions for an alternate name for the filter?
Something like `character-segmentation`?
Too generic. It specifically picks out the characters XRegExp deems `Han`; it won't touch any other character.
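For example, using JavaScript's built-in Unicode property escapes as a stand-in for the xregexp range, Han ideographs match but kana and Latin letters do not:

```javascript
// Only characters in the Han script match; Hiragana, Katakana,
// and Latin characters are left alone.
const isHan = (ch) => /\p{Script=Han}/u.test(ch);

console.log(isHan('日')); // true  -- Han ideograph (used in both Chinese and Japanese)
console.log(isHan('ア')); // false -- Katakana
console.log(isHan('あ')); // false -- Hiragana
console.log(isHan('a'));  // false -- Latin
```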
I also prefer verbs for filters. How about `split-ideographs`, since the regex really builds on Unicode's definition of ideographs?
I think 'pinyin' is good. It's concise, but difficult for non-Chinese users to understand. However, since the user is citing Chinese literature, I think only Chinese users would use this filter. By the way, pinyin is the romanization of Chinese into Latin letters. For example, the pinyin of '这是中文' is 'Zhe Shi ZhongWen'.
@forallsunday I'm not sure whether you're making a general comment or suggesting something specific.
just a suggestion
It's not clear to me what you're suggesting, sorry.
It's okay. Anyway, thanks for helping me. Hope you find a better name.
I'm not familiar in the field. If nothing else is suggested, I'm going with Duncan's suggestion.
Just to be clear: I don't have an opinion on the matter. Something-something pinyin would be fine to me, but pinyin is not a verb.
Understood. `split-ideographs` is a good one.
With the entry you sent earlier in the log, `[Auth]_[year]_[Title:split-logograms:select=1,3]` generates `ribunHui_2014_kohakuShi`.
(that's with "force to plain-text" on)
:robot: this is your friendly neighborhood build bot announcing test build 5.1.170.5330 ("test case for #1353, fixes #1353 (#norelease)")
Install in Zotero by downloading test build 5.1.170.5330, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".
That's weird; I copied and pasted the report ID from Zotero. And I can't understand `ribunHui_2014_kohakuShi`; it sounds like Japanese.
I had Kuroshiro on; without it (and with `[Auth]_[year]_[Title:split-ideographs:select=1,3]`), I get `LiWenHui_2014_HangPaiShi`.
Acceptable? Because then I can cut a new release.
Yes, thank you very much
Alright, the new release is building and should drop in 30 minutes or so.
`[Title:split-ideographs:select=1,3]` works well, @retorquere. But how do I get the lowercase of the title? I use `[auth:lower][year][Title:split-ideographs:select=1,1:lower]` but get `liwenhui2014Hang`. This works for everything else but ideographs.
You don't have to @tag me, I get notifications for anything that's posted to the issue tracker.
Can you right-click an item where this happens and send a debug-log from the menu that pops up?
Hi, I love Better BibTeX in Zotero when citing references. However, if the title is in Chinese, the citation key will be too long. For example, if the format is set to `[auth:lower]_[year]_[shorttitle3_3]`, then the spelling is correct, but the citation key comes out as the whole title. Is there any way to customize the length of the citation key? Something like `liwenhui_2014_HangPaiShiPin`, which keeps the first 4 words having the capital characters when exporting to bib?