retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Escaping hyphens in the Pages field #943

Closed tom-a-horrocks closed 6 years ago

tom-a-horrocks commented 6 years ago

Hi all,

I'm trying to export to bibtex the citation for a journal article with page numbers "16-1", "16-2", "16-3", and "16-4". I'd like the page range to appear in bibtex as '16-1--16-4'. Unfortunately, if the Pages field in Zotero is '16-1-16-4', then all hyphens are converted to en dashes and the corresponding bibtex field is '16--1--16--4'. Is there any way to escape hyphens here, or alternatively force '16-1' and '16-4' to be interpreted as strings?

tom-a-horrocks commented 6 years ago

After a bit more reading I've discovered this is more of a bibtex issue. I've tried including an @string definition for a hyphen, but unfortunately that is also converted to an en dash. I have found one solution is to include a command in the .bib's preamble:

\documentclass{article}
\begin{filecontents}{test.bib}
@preamble{{\providecommand*\hyphen{-}}}

@article{test,
  author  = "Other, A. N.",
  journal = "J. Irrep. Res.",
  title   = "Some things I did",
  pages   = "081401\hyphen 1--081401\hyphen4",
  year    = "2011"
}
\end{filecontents}
\begin{document}
\nocite{*}
\bibliography{test}
\bibliographystyle{ieeetr}
\end{document}

https://tex.stackexchange.com/questions/21773/hyphenating-a-number-in-the-bibtex-pages-field

Is it at all possible to do this within zotero/better-bibtex? I'd like to avoid editing the .bib directly if possible.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.0.116.6221.issue-943 ("adjust test cases for #943").

retorquere commented 6 years ago

OK so the hyphen issue is partly my fault, as BBT was a little zealous in changing anything dash-y into en-dashes. 6221 changes that. That should make what you want to do easier. Not trivial though.

There are two ways to get \hyphens in that field:

For the preamble you'll have to use a postscript in any case as it stands. I am considering adding a preamble field, but I think I'd have to add two (what works for BibTeX will not necessarily work for BibLaTeX). The postscript would look like

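if (Translator.BetterBibLaTeX) {
  // Write the preamble only once per export, before the first entry.
  if (!Translator.preambleWritten) {
    Zotero.write('@preamble{{\\providecommand*\\hyphen{-}}}\n');
    Translator.preambleWritten = true;
  }

  // Replace a single hyphen between two digits with \hyphen in the exported pages value.
  if (this.has.pages) this.has.pages.bibtex = this.has.pages.bibtex.replace(/([0-9])-([0-9])/g, '$1\\hyphen$2');
}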

which means:
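
The effect of the replace call in the postscript can be checked outside Zotero. This is a minimal sketch: `protectHyphens` is a hypothetical helper name, and only the regular expression is taken from the postscript.

```javascript
// Hypothetical helper: the same replace call as in the postscript, isolated for testing.
function protectHyphens(pages) {
  // A single hyphen directly between two digits becomes \hyphen.
  return pages.replace(/([0-9])-([0-9])/g, '$1\\hyphen$2');
}

console.log(protectHyphens('16-1--16-4')); // 16\hyphen1--16\hyphen4
console.log(protectHyphens('12-17'));      // 12\hyphen17
```

Note that an ordinary range such as 12-17 is rewritten as well, so as written the replace only suits a library in which a single hyphen between digits is never meant as a range separator.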

retorquere commented 6 years ago

really need that feedback.

bothide commented 6 years ago

As far as I recall, a page range in a bib file should always be given as "1-3", i.e., with a single hyphen. Depending on the .bst file, the single hyphen for page range in the .bib file will be expanded to an em-dash or, in some rare cases, to an en-dash.

retorquere commented 6 years ago

I think that's mostly what it does now, right? Have you tested the new behavior?

bothide commented 6 years ago

I have not tested it, but I believe you. My comment was meant as just that. Another comment is that the page range "16-1 -- 16-4" is in many journals written as "16(4)".

tom-a-horrocks commented 6 years ago

Thanks for your work on this. Note that in the meantime I've simply used 16:1-4, which should be fine for me.

The page numbers '16-1',...'16-4' are what are printed on the conference abstract itself. What's happening is that '16' is an electronic article identifier (separate to DOI). What complicated matters is that there's no field for this identifier except perhaps for issue, which isn't available for conference abstracts (@inproceedings) -- and sometimes journal articles have an issue number AND an electronic identifier anyway. I guess writing 16(1-4) in the page field may be a realistic compromise?

Note that these identifiers can change significantly. For example, I have another which is We MIN 06, and I'm yet to settle on a principled way to get these into the bibliography.

retorquere commented 6 years ago

@njbart, is it correct I should use a single hyphen for page ranges? This is mostly related to import, because I'm going to pass on what's in the pages field as-is on output, only translating a Unicode en-dash to --, and Unicode em-dashes to --- for output.

bothide commented 6 years ago

Many BibTeX style files (.bst) will do a search and replace, so that "-" is replaced by "--" in the output (.bbl file). This is certainly true for all the Physics journals that I have published in.

However, some journals use an en-dash in the page range (I seem to recall that I have seen this in some French journals, but don't quote me on that). So always using "--" will call for extra corrective work in the .bbl file for these journals.


retorquere commented 6 years ago

I'm not always using --, I'm just translating U+2013 to -- and U+2014 to ---. Hyphens (regardless of how many you have) will be left untouched.
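
A minimal sketch of that output rule, assuming it amounts to two character substitutions; `dashesToTeX` is a hypothetical name, not a function in BBT.

```javascript
// Hypothetical helper mirroring the rule stated above:
// U+2014 (em-dash) -> ---, U+2013 (en-dash) -> --, ASCII hyphens untouched.
function dashesToTeX(pages) {
  return pages
    .replace(/\u2014/g, '---')
    .replace(/\u2013/g, '--');
}

console.log(dashesToTeX('16-1\u201316-4')); // 16-1--16-4
console.log(dashesToTeX('12-17'));          // 12-17
```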

bothide commented 6 years ago

I was referring to the average user who uses "12--17" instead of the more preferable "12-17" in his/her .bib file.

retorquere commented 6 years ago

If I can be sure that a user never wants a double-dash in the pages field (@njbart?) then perhaps I could replace them, but it seems iffy.

In some cases, I need some work to be left for cleanup by the user; can't algorithmically catch them all ¯\_(ツ)_/¯. A postscript is always an option.

bothide commented 6 years ago

Please visit https://verbosus.com/bibtex-style-examples.html to find examples of how .bib entries are entered in the best way. Notice that page ranges shall be separated by a single "-". This hyphen is not just a character, but rather a page number separator that is to be replaced by a proper dash of the correct type (or something else, depending on the requirements of the actual publisher). Notice also that pure, single numbers, such as in "year", "number", "month", "volume", "series" and so on, are not to be enclosed by brackets or inverted commas.

retorquere commented 6 years ago

I'd really rather hear from @njbart (or @plk); the biblatex processors are insanely lenient, so what works is not always how it's supposed to be. But in the meantime I can change the import back to single hyphen.

But if there's a U+2014 or U+2013 there, by assumption the user who entered this wants an em- or en-dash, so I'd rather stick to that.

njbart commented 6 years ago

From the biblatex 3.11 release notes: “Hyphens and dashes in page ranges will be transformed to \bibrangedash, commas and semi-colons to \bibrangesep.” (https://github.com/plk/biblatex/wiki)

So my understanding is that any number of consecutive hyphens or dashes, including U+2014 or U+2013 will all be transformed to \bibrangedash.

Protecting hyphens and dashes can be achieved by wrapping them in curly braces. So my guess is (untested though) that the OP could get the desired result by using, e.g., pages = {16{-}1--16{-}4} – though pages = {16{-}1-16{-}4} should be expected to work just as well.

As to a suitable heuristic for BBT distinguishing hyphen/dash chars that should not be protected (i.e. those intended to be ultimately mapped to \bibrangedash) from those which should, I guess something like “protect all strings consisting of consecutive hyphens or dashes, except for the longest such string” could do the trick:

BBT would map 16-1--16-4 to pages = {16{-}1--16{-}4}, 16--1---16--4 to pages = {16{--}1---16{--}4}, etc. (The second example would then be rendered as 16–1–16–4, where any visual distinction is lost again, and there’s nothing BBT would be protecting in a string such as 16-1-16-4, but this is the best I can think of.)

retorquere commented 6 years ago

Would a single em-dash (U+2014, usually translated to triple dash in LaTeX) count as longer or shorter than a double hyphen?

bothide commented 6 years ago

In good typography (a definition that varies from language/country to language/country), four different "dashes" are used:

  1. Hyphenation: "Andy Fairweather-Lowe", breaking a multisyllable word at the end of a line. In LaTeX: "-" (single "-").

  2. Range: "The years 1939-1945". In LaTeX: "--" (double "-").

  3. Separation: "Typesetting - a difficult skill". In LaTeX: " --- " or (e.g., in American typography) "---" (triple "-").

  4. Negation: "The temperature is -3 degrees C". In LaTeX: "$-$" (math mode, single "-").


retorquere commented 6 years ago

Except if @njbart's interpretation of the biblatex wiki is correct, any number of non-braced consecutive dash symbols of various kinds would constitute a \bibrangedash. biblatex has its own parsing and interpretation rules, and will output TeX code as a result, but the input isn't necessarily interpreted as (La)TeX itself would.

@njbart offered a heuristic to determine what dash-like things to brace and which not, but "longest" to me is ambiguous on whether it means pre-processing length (in which case double-hyphen would be longer than em-dash) or post-processing length (in which case em-dash is longer than double-hyphen).

Not at all sure I'm going to do this yet, as I'd have to do further parsing of the pages field for multiple ranges, and parsing of Zotero input is brittle. But I'm considering doing it.

njbart commented 6 years ago

What I had in mind was post-processing length, i.e., en-dash=double-hyphen longer than single-hyphen (and em-dash=triple-hyphen longer than double-hyphen – though I’m not sure the latter situation ever occurs in the wild).
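
With that length rule, the heuristic could be sketched roughly as follows. This is an illustration only, not BBT code: the function names and the length table are assumptions, and a pages field holding several ranges would first have to be split at commas and semicolons, which the sketch skips.

```javascript
// Rendered ("post-processing") length of each dash-like character:
// hyphen = 1, en-dash = 2, em-dash = 3 (assumed from the comment above).
const DASH_LENGTH = { '-': 1, '\u2013': 2, '\u2014': 3 };

function runLength(run) {
  return Array.from(run).reduce((sum, ch) => sum + DASH_LENGTH[ch], 0);
}

// Brace every run of dashes that is strictly shorter than the longest run;
// the longest run is left alone to act as the range separator.
function protectShorterRuns(pages) {
  const dashRun = /[-\u2013\u2014]+/g;
  const runs = pages.match(dashRun) || [];
  const longest = Math.max(0, ...runs.map(runLength));
  return pages.replace(dashRun, run =>
    runLength(run) < longest ? `{${run}}` : run);
}

console.log(protectShorterRuns('16-1--16-4'));    // 16{-}1--16{-}4
console.log(protectShorterRuns('16--1---16--4')); // 16{--}1---16{--}4
console.log(protectShorterRuns('16-1-16-4'));     // 16-1-16-4 (nothing to protect)
```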

retorquere commented 6 years ago

Neither have I, but nothing surprises me at this point. The state of references ready-to-import for Zotero is not stellar, and all kinds of stuff ends up in the database.

moewew commented 6 years ago

Hope you don't mind me butting in here. I can only say things with confidence for the biblatex side. BibTeX as you know is an inhomogeneous realm of .bst files that do not always follow the same line.

@bothide is right when they say that the dash can be considered a kind of meta character in the pages field. For the standard BibTeX styles as far as I can see what happens is simply that single -s are doubled up to become -- (this is done using the function substring$ that treats braces and macro constructs simply as ASCII chars, so no amount of brace protection can help here). However, the BibTeX documentation states (http://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf, p. 11):

pages One or more page numbers or range of numbers, such as 42--111 or 7,41,73--97 or 43+ (the ‘+’ in this last example indicates pages following that don’t form a simple range). To make it easier to maintain Scribe-compatible databases, the standard styles convert a single dash (as in 7-33) to the double dash used in TeX to denote number ranges (as in 7--33).

So it seems that back when BibTeX was devised the preferred way was actually a double dash and the single dash was only used for backwards compatibility reasons. I don't know if there are any more authoritative sources nowadays that recommend - over --, but popular use may simply have made - the more prevalent and the de-facto standard: It's simpler to type, after all.
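
The doubling step of the standard styles can be sketched character by character, which also shows why braces offer no protection there. This is a hedged JavaScript illustration of the behaviour described above, not the .bst code itself; `nDashify` is a hypothetical name.

```javascript
// Sketch of the standard-style behaviour: a lone "-" is doubled,
// a run of two or more "-" is copied unchanged, everything else
// (including braces) is copied as an ordinary character.
function nDashify(pages) {
  let out = '';
  let i = 0;
  while (i < pages.length) {
    if (pages[i] !== '-') {
      out += pages[i];
      i += 1;
    } else if (pages[i + 1] !== '-') {
      out += '--'; // single dash: double it
      i += 1;
    } else {
      while (pages[i] === '-') { // existing run: keep as is
        out += '-';
        i += 1;
      }
    }
  }
  return out;
}

console.log(nDashify('7-33'));   // 7--33
console.log(nDashify('7--33'));  // 7--33
console.log(nDashify('16{-}1')); // 16{--}1
```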

For biblatex the 'meta' capacity of - is made clearer by the fact that pages is not a literal field that is largely left as is, but a range field that is parsed by Biber.

I do, however, not agree with the sentiment that numeric fields should always be written without braces. It is a feature of the .bib file syntax that "numerical values" do not need braces (or quotes):

For numerical values, curly braces and double quotes can be omitted.

(Nicolas Markey: Tame the BeaST, p. 20, http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf)

But this is clearly worded as optional here and I haven't seen anyone else endorsing leaving out the braces. In fact pages = 1-45, will fail, so pages = 1, is risky if you want to add something later on. The risk is lower for export tools such as yours here, but I still think it is better to go with the braces. Still the only advice I have seen with regards to number fields and braces is to always write the braces even if they are not required.

biblatex actually has two levels at which it can deal with page ranges: Biber parses page ranges in the pages field, but pages as given in the optional postnote argument to \cite and friends are not passed on to Biber and are parsed by biblatex with (La)TeX code.

  1. Biber parses the pages field as a range field and tries to make sense of it from that perspective using Perl RegEx.

    Roughly, Biber splits the field at , and ; and then treats each bit separately. At first a RegExp that matches "(non-dash chars)(dash chars)(non-dash chars)" tries to read off the start and end of a page range. If that does not match, a fallback pattern "(any char)(at least two dash chars)(any char)" tries to find the start and end of the range. The range is then written to the .bbl as <start>\bibrangedash <end>.

    Note that brace protection does not do anything for Biber. Furthermore, any number and all kinds of dashes are treated equally as long as RegExp recognises the character as dash-like, the only exception being the fallback pattern that specifically needs at least two dash-like characters to match (so pages = {16-1--16-4}, with double ASCII dash works, but pages = {16-1–16-4}, with U+2013 does not; adding braces in the obvious position changes nothing for Biber).

    If all else fails, the field is read as literal and just dumped to the .bbl file without digestion. A warning is issued in that case.

    https://github.com/plk/biber/blob/d88ad8e580cffb1f4dc4a676e9a794a0b9e9b06b/lib/Biber/Input/file/bibtex.pm#L994-L1033

  2. biblatex also parses pages and other fields potentially containing page ranges on a LaTeX level. The passage of the biblatex Wiki @njbart quotes is referring specifically not to the pages field, but rather to postnote and friends that do not get pre-chewed, normalised input from Biber. Ideally the pages field would still be formatted in a way that it can also be parsed by the LaTeX range parser since custom styles may well apply the range parser also for pages. This will prove difficult due to an unforeseen interference in biblatex's macros, so need not be your primary aim at the moment.

    The LaTeX range parser builds on low-level LaTeX and can only deal with Unicode characters if a Unicode engine is used (XeTeX, LuaTeX). With pdfTeX only ASCII chars are gracefully handled. So it is a good idea to only export ASCII chars to the pages field if possible (I believe you are already doing that).

    The range parsing then works similarly to Biber's routine. It splits at ;, , and \bibrangessep. Each chunk is then split up at the first occurrence of \bibrangedash, -- or - (-- is never matched only as -). The command then prints the start and end of the range with \bibrangedash in between.

    Certain characters can be hidden in this step by wrapping them in curly braces. Unfortunately this only works theoretically at the moment, because the \ifpages test can't deal with these hidden characters and the braces surrounding them. This means that presently a hyphen needs to be hidden with a command \newcommand*{\pagehyphen}{-} that can be made invisible itself with \NumCheckSetup{\let\pagehyphen\@empty}: then \cite[16\pagehyphen 1-16\pagehyphen 14]{sigfridsson} gives the expected output. I'll have a look if \cite[6{-}1-6{-}14]{sigfridsson} can be salvaged, but that looks really tough.
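
Biber's handling of the pages field, as outlined in item 1, can be approximated as follows. This is a rough sketch of the description only, not a port of the linked Perl code: the two patterns are simplified assumptions, `parsePages` is a hypothetical name, and single pages or open-ended ranges, which Biber handles, are not modelled and simply fall through as literals.

```javascript
// \p{Pd} is the Unicode "dash punctuation" category; it covers the ASCII
// hyphen as well as U+2013 and U+2014.
const PRIMARY = /^([^\p{Pd}]+)\p{Pd}+([^\p{Pd}]+)$/u; // (non-dash)(dashes)(non-dash)
const FALLBACK = /^(.+)\p{Pd}{2,}(.+)$/u;             // (any)(two or more dashes)(any)

function parsePages(field) {
  // Split at , and ; and treat each chunk separately.
  return field.split(/[,;]/).map(chunk => {
    const part = chunk.trim();
    const m = PRIMARY.exec(part) || FALLBACK.exec(part);
    // No match: the chunk is kept as a literal.
    return m ? { start: m[1], end: m[2] } : { literal: part };
  });
}

console.log(parsePages('16-1--16-4'));      // [ { start: '16-1', end: '16-4' } ]
console.log(parsePages('16-1\u201316-4'));  // [ { literal: '16-1–16-4' } ]
console.log(parsePages('16{-}1--16{-}4'));  // [ { start: '16{-}1', end: '16{-}4' } ]
```

The last call shows braces having no effect on where the range is split, in line with the note that brace protection does nothing for Biber.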

What does that mean for you?

bothide commented 6 years ago

Let me just repeat my comment that a convenient (and, seemingly, nearly a de facto standard) way of writing a page range of the type 6-1 through 6-14 is 6(14). This is used by, e.g., the American Physical Society publications such as the Physical Review journals.

retorquere commented 6 years ago

IOW the current behavior in the regular release is OK as-is?

moewew commented 6 years ago

I don't use BBT (or Zotero for that matter), so verification would have to come from someone who does. But from what I can read here things should be fine if BBT does not change - to -- any more (I think you mentioned that build 6221 does not do this any more, is that part of the regular release now?).

I had a look at normalizeDashes and https://github.com/retorquere/zotero-better-bibtex/blob/753f0cc27750f532cf560f76c5cd2991f3d9f8b1/translators/bibtex/reference.ts#L401 still seems to convert some - to --.

normalizeDashes also seems to replace U+2012 (figure dash) with an em-dash https://github.com/retorquere/zotero-better-bibtex/blob/753f0cc27750f532cf560f76c5cd2991f3d9f8b1/translators/bibtex/reference.ts#L399 I'd probably go for an en-dash or even a hyphen instead.

retorquere commented 6 years ago

I'll get those changed later today.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.0.129.6395.issue-943 ("adjust tests for #943").

tolot27 commented 6 years ago

I still get the following warning in my BBT exported BibLaTeX file: @% ? hyphen found in pages field, did you mean to use an en-dash?

I thought - will now be kept as is and it is not required to put -- between pages. What did I miss?

retorquere commented 6 years ago

Fixed, will be in the next release.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.0.137.6668.master ("re-fixes #943").

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.