retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

[Juris-m] citekeys #483

Closed duncdrum closed 8 years ago

duncdrum commented 8 years ago

Autogenerated cite keys are a pain when used in documents not written in English. They default to __XXXX, where "XXXX" is the year, and non-Latin characters aren't processed.

There are two options that relate to juris-m.

A. When producing a mono-lingual document, e.g. a chinese text citing chinese sources.

  1. with force ascii disabled the cite key should look like this 名字-題名-2001. Just take the unicode from author and title field and put it in the corresponding spots.
  2. Currently it fails to output anything from either author or title field, only date/year works since they use latin numerals (but that will not always be the case).

B. Producing multi-lingual document, e.g. using CJK references in a german text.

  1. BBT would ideally check for primary language of each record, and that of the juris-m primary language setting. If non-latin records appear in a latin primary language the transliteration field with a latin character language id should be used to generate citekey hanzi_timing_2001. (which could be ascii'd or not depending on force ascii setting, so é -> e, ö -> o)
  2. currently it doesn't check the language variant fields for author or title at all, it outputs the language id of each record, but i don't think it is aware of the juris-m primary language setting. resulting in a nondescript string.

I should note that I have no experience with producing primarily Russian, Japanese, … latex documents, but I'm sure that there are more detailed requirements for working biblatex files among users from these languages. I think the trick is to use the language fields to decide what bbt should do. These aren't often used in regular zotero, but are relevant in juris-m. Guessing the script based on unicode-range is another way, but is likely to lead to trouble since script is not enough to determine language.

retorquere commented 8 years ago

A.1/A.2: the [auth] etc patterns follow JabRef, which does folding; in your case, it looks like you'd want to use [Auth], or anything from the table at https://github.com/retorquere/zotero-better-bibtex/wiki/Citation-Keys#configurable-citekey-generator, which will give you what's in the Zotero reference more or less verbatim. Folding will only be applied on the generated key if you have "force..." checked.
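To illustrate why a folding pattern leaves nothing of a CJK author or title, here is a minimal sketch of ASCII folding. It is not BBT's or JabRef's actual implementation: `fold_to_ascii` and `make_key` are hypothetical names, and the folding is approximated with Unicode NFKD decomposition from Python's standard library.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Decompose accented characters, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


def make_key(author: str, year: str, force_ascii: bool) -> str:
    """Hypothetical key builder: author followed by year, optionally folded."""
    key = f"{author}{year}"
    return fold_to_ascii(key) if force_ascii else key


print(make_key("Müller", "2001", force_ascii=True))
print(make_key("楊伯峻", "2000", force_ascii=True))
print(make_key("楊伯峻", "2000", force_ascii=False))
```

Accented Latin letters survive as their base letter, while Han characters have no ASCII decomposition and disappear, leaving only the year.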

B.1: translators don't have access to the language setting of Juris-M, so I can't check for that, but I do have access to the reference language setting. Is there a list of languages considered "latin"?

B.2: that is correct, I currently don't check those, as the recent Juris-M compatibility currently only means "doesn't error out immediately under Juris-M"; the behavior is still very much Zotero-oriented. Issue reports such as these can change that though. I think the sentence ending B.2 is cut off?

Guessing the encoding reliably is tricky, exacerbated by the fact that Javascript unicode handling is atrociously wacky. I'd rather depend on the reference telling me.

duncdrum commented 8 years ago

@retorquere

I have the feeling this won't catch all scenarios, but should cover a lot of ground. Just to be clear the reference language setting covers the language tags for each variant field of a juris-m entry? Or just the contents of the default zotero language field?

retorquere commented 8 years ago

WRT A1/2 I'd rather not change the behavior of an existing (and widely used) pattern field. But wouldn't just using [Auth] always do what you want?

On the 2nd bullet.... would that require a toggle? In Zotero, the proposed algorithm is always going to find the (sole) version, so I'm fine with [auth] doing that, as it doesn't in practice entail a behavior change.

Currently BBT does only one thing with the Zotero language field, which is deciding whether to TitleCase certain fields. BBT never had access to field-specific language settings, so we're free to decide how to deal with those.

duncdrum commented 8 years ago

Yes, you are correct: in the first case [Auth] should work. In the second case we have to keep in mind that while most people have use for romanised transcription in biblatex, not everybody is using latinised transcriptions. Maybe somebody transcribes arabic into hebrew? The toggle just declares "I use the alphabet" (and other things), whereas no toggle says "thank you, no use for the alphabet".

retorquere commented 8 years ago

But in the case you describe, there would be no latin fallback, so it'd just pick the primary language. The toggle wouldn't actually change the behavior in this case.

duncdrum commented 8 years ago

I might have misunderstood you. What would be the default behaviour? In BBT preferences I change the citationkeyformat to [jurism], which uses [Auth], [Title] instead of [auth], [title], resulting in 名字-題名-2001. If I also select "force citation key to ascii", we would use the romanised transcriptions where they are present to get hanzi_timing_2001, and [auth] as it does now for english works?

retorquere commented 8 years ago

But why add a new pattern for that if it just does [Auth][Title]? If you're referring to the [zotero] pattern, that just replicates the Zotero key generation pattern, but unless I'm mistaken, Juris-M uses the Zotero BibTeX translator and so would get the same citation keys. I added it because there was no way to reliably assemble that pattern from more basic BBT patterns, and I wanted to help people who wanted to migrate, but if you're migrating from Juris-M BibTeX, [zotero] should give you exactly the same keys.

But I meant that if the process were to be, for any given author / title:

  1. If there is a latin version, use that
  2. If not, but there is a version that matches the language specified for the reference, use that
  3. If none of the above, use the first version available

this would work without change for Zotero, as 1. and 2. will always fail (no language versions) and it will pick what it always did. Still don't see what a preference would change (and I try to be conservative about adding prefs these days)
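The three-step fallback above can be sketched as follows. This is only an illustration under stated assumptions: `LATIN_TAGS` is a made-up list (the thread never settles how "latin" is decided), `pick_version` is a hypothetical name, and a field is modelled as an ordered list of `(language_tag, value)` pairs in which a plain Zotero field has the tag `None`.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical set of tags treated as Latin-script; not taken from BBT or Juris-M.
LATIN_TAGS = {"en", "de", "fr", "zh-alalc97"}

Version = Tuple[Optional[str], str]


def pick_version(versions: Sequence[Version], reference_language: Optional[str]) -> str:
    """Choose one value for a field following the three-step fallback."""
    # 1. If there is a Latin version, use that.
    for tag, value in versions:
        if tag in LATIN_TAGS:
            return value
    # 2. Otherwise use the version matching the reference's language, if any.
    for tag, value in versions:
        if tag is not None and tag == reference_language:
            return value
    # 3. Otherwise use the first version available.
    return versions[0][1]


print(pick_version([("zh", "楊伯峻"), ("zh-alalc97", "Yang Bojun")], "zh"))
print(pick_version([(None, "Smith")], "en"))
```

For a plain Zotero item with a single untagged version, steps 1 and 2 never match and step 3 returns the value that was always used, so behaviour there is unchanged.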

duncdrum commented 8 years ago

Your process makes sense, this would be a large quality of life improvement for jurism. If there are edge cases where this doesn't work, I'm sure they will make themselves heard.

retorquere commented 8 years ago

Do you know how Frank uses that list to decide what is Latin?

duncdrum commented 8 years ago

I don't think he does.

retorquere commented 8 years ago

Do you have a sample I can work from? Preferably submitted by using right-click and selecting "Report Better BibTeX error"?

duncdrum commented 8 years ago

@retorquere bbt-error: 9I9GTW66 In juris-m Title, Editor, Publisher, and Place all have an additional field for pinyin (zh-alac) transcription, that is not exported into any bbt or zotero format via right-click -> export. See the citation.

Yang Bojun, 楊伯峻, ed. Chunqiu Zuozhuan Zhu, 春秋左傳注. 4 vols. Revised Edition. Beijing, 北京: Zhonghua shuju, 中华书局, 2000.

the citekey should be something to the effects of:

楊伯峻_春秋左傳注_2000

for non latin script users, or

Yang_ChunqiuZuozhuan_2000

for a romanised version

retorquere commented 8 years ago

OK that was dumb on my part -- the current release strips out the multi-lang parts simply to make the tests pass 😒 . Could you try with https://github.com/retorquere/zotero-better-bibtex/releases/download/builds/zotero-better-bibtex-1.6.49-circle-2352.xpi -- that should leave the tests intact.

duncdrum commented 8 years ago

Error-ID: CHNKKKBH has the full monty again.

retorquere commented 8 years ago

I have a version that works, but it relies on specifying the preferred search order for language alternates. Detecting what values are romanized is sort of possible, e.g. by stripping everything that is not in the unicode letter class and seeing which alternate has the most letters left, but it feels a little iffy as an algorithm. The language preference order seems cleaner. What do you think? https://github.com/retorquere/zotero-better-bibtex/releases/download/builds/zotero-better-bibtex-1.6.49-circle-2354.xpi does this, it currently only has zh-alalc97 as a language preference.
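A language preference order of the kind described here might look like the sketch below. The function name and the dictionary model of language alternates are assumptions for illustration, not BBT's code.

```python
from typing import Dict, Optional, Sequence


def pick_by_preference(alternates: Dict[str, str], preference: Sequence[str]) -> Optional[str]:
    """Return the alternate for the first preferred language that is present."""
    for tag in preference:
        if tag in alternates:
            return alternates[tag]
    # Nothing in the preference list is available for this field.
    return None


author = {"zh": "楊伯峻", "zh-alalc97": "Yang Bojun"}
print(pick_by_preference(author, ["zh-alalc97"]))
print(pick_by_preference({"zh": "楊伯峻", "en": "Yang Bojun"}, ["zh-alalc97"]))
```

With a preference list containing only `zh-alalc97`, an alternate tagged with any other language is ignored and the field contributes nothing to the key.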

retorquere commented 8 years ago

There is also a version at https://github.com/retorquere/zotero-better-bibtex/releases/download/builds/zotero-better-bibtex-1.6.49-circle-2359.xpi which uses no language preference order but just cycles through each language present in the reference and picks the citation key that's the longest over those given languages. Which would you prefer?
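The "longest key over all languages" idea can be sketched like this. The reference model, the `author_title_year` key shape, and the function names are hypothetical, and folding is approximated with NFKD decomposition rather than whatever BBT uses internally.

```python
import unicodedata


def fold(text: str) -> str:
    """Keep only ASCII letters and digits after decomposing accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalnum())


def key_for(reference: dict, language: str) -> str:
    """Build one candidate key using only the alternates in `language`."""
    author = fold(reference["author"].get(language, ""))
    title = fold(reference["title"].get(language, ""))
    return "_".join([author, title, reference["year"]])


def longest_key(reference: dict) -> str:
    """Try every language present in the reference and keep the longest key."""
    languages = sorted(set(reference["author"]) | set(reference["title"]))
    return max((key_for(reference, lang) for lang in languages), key=len)


ref = {
    "author": {"zh": "楊伯峻", "zh-alalc97": "Yang Bojun"},
    "title": {"zh": "春秋左傳注", "zh-alalc97": "Chunqiu Zuozhuan Zhu"},
    "year": "2000",
}
print(key_for(ref, "zh"))
print(longest_key(ref))
```

Because one language is chosen for the whole key, a reference whose author and title are tagged with different languages loses one of the two parts.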

duncdrum commented 8 years ago

FF46.0 OSX 10.11.4 JURISM 4.0.29.8m37 @retorquere -2354 didn't do anything: the key remained __2000-4 even after unpinning and resetting cache. -2359, on the other hand, generates yang_bojun_chunqiu_2000, which is wonderful, and a big step up compared to the old way of generating keys. This fits my needs perfectly.

retorquere commented 8 years ago

2354 only picks those languages that are explicitly specified, so in the __2000-4 case it's most likely it didn't use zh-alalc97 for the alternates -- 2354 only has that as a preference and will ignore all the rest.

2359 does the following, for each reference: it generates a key for each language present in the reference and keeps the longest one. This has the downside that if you mix and match languages, the results may sometimes be surprising. 2354 with a language pref order en,zh-alalc97,uzbek can (when carefully tuned) in that particular case yield better results. I have no idea how common such mixing and matching is.

There is one further option -- I could cycle through these languages for each separate part of the key pattern instead of the key pattern as a whole, but that would require deeper changes in BBT. If you reckon mix-n-match is common enough to be an issue, I can give it a stab.

retorquere commented 8 years ago

Errrr... thinking about it a little more, it would require a rewrite of substantial parts of the key generator, as cycling through the languages would have to know the results after any stuff like abbr and nopunct are applied, which I can't know when I'm picking the alternate. So not impossible, but non-trivial.

duncdrum commented 8 years ago

I just double-checked, but all fields were either zh, zh-alalc97, or en. I have no clue why 2354 didn't pick it up.

As for mix n match, I'm sure it's pretty common, but what ultimately decides the most convenient citekey is the language or input method of the tex document. Jurism covers this via language settings -> UI locale. I would argue that for now even funky citekeys are better and more descriptive than __YYYY-(however many sources one's library has from that year). I'd say quick n dirty does the trick for now, until we can actually use zotero preferences.

retorquere commented 8 years ago

2354 didn't pick up 'en', probably. I'll see what I can do with the UI locale.

retorquere commented 8 years ago

New try: https://github.com/retorquere/zotero-better-bibtex/releases/download/builds/zotero-better-bibtex-1.6.50-circle-2365.xpi. This will just cycle through all available languages in the reference for each pattern part, with one limitation: if you have something like auth, it will try each language for all authors at once, so if you have 3 authors in 3 different languages, some will probably be dropped. Not going to try and fix that, it would get too complex.
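A rough sketch of cycling through languages per pattern part, including the stated limitation for multiple authors, is given below. Everything here is assumed for illustration: the thread does not say how the winning language is chosen for each part, so this sketch keeps the longest rendering, and the data model and names are hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Keep only ASCII letters and digits after decomposing accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalnum())


def best_part(render, languages) -> str:
    """Render one pattern part once per language and keep the longest result."""
    return max((render(lang) for lang in languages), key=len, default="")


def build_key(reference: dict) -> str:
    languages = set(reference["title"])
    for author in reference["authors"]:
        languages |= set(author)
    ordered = sorted(languages)

    def authors(lang: str) -> str:
        # All authors are rendered in the same language; any author without
        # an alternate in that language is dropped from this candidate.
        return "".join(fold(a[lang]) for a in reference["authors"] if lang in a)

    def title(lang: str) -> str:
        return fold(reference["title"].get(lang, ""))

    return "_".join([best_part(authors, ordered), best_part(title, ordered), reference["year"]])


ref = {
    "authors": [{"en": "Smith"}, {"zh": "楊伯峻", "zh-alalc97": "Yang"}],
    "title": {"de": "Titel"},
    "year": "2001",
}
print(build_key(ref))
```

Here the author part and the title part can come from different languages, but the second author is lost because no single language covers both authors.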

duncdrum commented 8 years ago

With 2365 I'm back to __2000-4. About multiple authors: I don't think it'll be much of a problem, as the bibliographical item already unifies different naming conventions to fit the primary language of the publication. This is with [zotero].

retorquere commented 8 years ago

Ah, yeah, [zotero] is a little autistic that way -- it does exactly what the stock zotero citekey generator does, every flaw included, by design. Try [auth][year][0].

duncdrum commented 8 years ago

Ahh, now we have YangBojun2000, which is great. I'm a little concerned that since [zotero] is the default, it might be off-putting to users who give bbt a first try with jurism, but if you are happy with the current iteration feel free to close the issue. I'd be happy to write something for the bbt-wiki about jurism and bbt caveats for latex users.

retorquere commented 8 years ago

That is actually a point I hadn't considered. https://github.com/retorquere/zotero-better-bibtex/releases/download/builds/zotero-better-bibtex-1.6.50-circle-2366.xpi should be better.

retorquere commented 8 years ago

Is there documentation on what the Juris-M "language" preference pane does?

retorquere commented 8 years ago

If you could verify 2366, I can merge and close this.

duncdrum commented 8 years ago

2366 works great even with [zotero]. thanks for putting in the effort for us jurism users.