retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Capitalization: Don't caps-protect name fields #384

Closed retorquere closed 8 years ago

retorquere commented 8 years ago

@nickbart1980 says:

Protecting two-field names is unnecessary on principle; only single-field names should be enclosed in one pair of braces, as currently [2.3.3].

retorquere commented 8 years ago

I already wrap "and" in braces (don't I?). So if I understand you correctly, names are treated differently from titles, in that the bib processor does not downcase them (otherwise caps protection would still be required to disclose user intent). But then why should one not just always do {Lastname}, {Firstname}? That seems much easier than the current caps preservation. In fact, that might then be the broader solution for preserve caps = all: just wrap the entire field in braces, no?

retorquere commented 8 years ago

And single-field names are also already enclosed in braces, right? So does this question then boil down to "don't caps-protect name fields"?

njbart commented 8 years ago

But then why should one not just always do {Lastname}, {Firstname}?

To reduce clutter …

I already wrap "and" in braces (don't I?).

Not in publisher, location, etc. But if that’s easier, just wrap the entire field in braces.

So does this question then boil down to "don't caps-protect name fields"?

It’s both about “don’t caps-protect anything except titles”, and “and-protect any literal ‘and’ biblatex might otherwise parse as a literal-list separator”. Wrapping entire literal-list fields in braces would caps-protect them, too – which doesn’t hurt, but the aim is only to “and-protect” them.

retorquere commented 8 years ago

Ah, I indeed don't protect publisher, location, etc for ands. I didn't know they required that. But in any case, I'd only wrap the entire field if the user has selected the "all" version of preserve caps, not for inner.

Can you elaborate on the "etc" when it comes to fields that require and-protection?

WRT the preserveCaps, I currently have it on:

Which should go?

njbart commented 8 years ago

Can you elaborate on the "etc" when it comes to fields that require and-protection?

All biblatex literal list fields: institution, organization, publisher, location, origlocation, origpublisher, address and school (see biblatex manual, “2.3.4 Literal Lists”).
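A minimal sketch of what "and-protecting" those fields could look like, in Python. The helper name and the choice to treat only a whitespace-delimited, lowercase "and" as a separator are assumptions made for illustration; the field list is the one given in the comment above.

```python
import re

# Field names as listed in the comment above (biblatex manual, "2.3.4 Literal Lists").
LITERAL_LIST_FIELDS = {
    "institution", "organization", "publisher", "location",
    "origlocation", "origpublisher", "address", "school",
}

# Assumption: only a lowercase "and" with whitespace on both sides
# would be read as a list separator.
_SEPARATOR_AND = re.compile(r"(?<=\s)and(?=\s)")


def and_protect(field: str, value: str) -> str:
    """Hypothetical helper: brace literal "and"s, but only in literal-list fields."""
    if field not in LITERAL_LIST_FIELDS:
        return value
    return _SEPARATOR_AND.sub("{and}", value)


print(and_protect("publisher", "Farrar, Straus and Giroux"))
```

Only the "and" itself is braced, so the rest of the field is neither caps-protected nor otherwise changed.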

njbart commented 8 years ago

WRT the preserveCaps, I currently have it on: … Which should go?

That list is a bit of a mix of CSL and biblatex?! – I’d keep only those fields whose name contains “title”, (except journaltitle and journalsubtitle) plus “series”.

retorquere commented 8 years ago

Sorry, I have Zotero-internal names in the list (place and conferenceName). I've updated the list.

retorquere commented 8 years ago

So journal(sub)title doesn't get protection?

retorquere commented 8 years ago

So to be clear, author = {von Hicks, {III}, Michael}, should always be author = {von Hicks, III, Michael},?

retorquere commented 8 years ago

Instead of me just copy-pasting the lot here, could you look through https://travis-ci.org/ZotPlus/zotero-better-bibtex/jobs/85899312 to see if these are indeed all desired consequences?

retorquere commented 8 years ago

https://travis-ci.org/ZotPlus/zotero-better-bibtex/builds/85905734 has only the diffs with caps protection removed. Could you go through that? It's a pretty big behaviour change.

njbart commented 8 years ago

https://travis-ci.org/ZotPlus/zotero-better-bibtex/builds/85905734: Most of this looks good.

retorquere commented 8 years ago

So I'm getting back to this now that #385 is done. Looking over all these changes, if {X and Y}, {Firstname} is effectively the same as X {and} Y, Firstname, that format would make name formatting dramatically easier. Is it the same? Because if so, that would have my preference at this point.

njbart commented 8 years ago

I’d say these two are equivalent [EDIT: no, they are not, see below] – but note that Firstname never needs curly braces, it’s just literal ands in ‘name lists’ and ‘literal lists’ that need protection – using X {and} Y.

Corporate authors and editors – i.e., Zotero’s single-field names – on the other hand must always be wrapped in an extra pair of curly braces to prevent data parsing from treating them as personal names which are to be dissected into their components.

retorquere commented 8 years ago

But is it safe to wrap firstnames in braces? Looking to simplify the output algorithm, and bracing it would automatically handle edge cases.

njbart commented 8 years ago

Actually, the two forms are not equivalent if two or more first names are wrapped in curly braces: in styles that abbreviate first names, author = {Doe, John Paul} is rendered as “Doe, J. P.” whereas author = {Doe, {John Paul}} is rendered (incorrectly!) as “Doe, J.”.

So first names must not be wrapped in curly braces.

What are the edge cases you are worried about?
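The abbreviation behaviour described above can be illustrated with a toy model in Python. This is not how bibtex or biblatex actually parse names; it only assumes that a braced group counts as a single unit when initials are formed, which is enough to reproduce the two renderings quoted above. Both function names are made up for this sketch.

```python
def split_top_level(text):
    """Split on spaces that are not inside curly braces."""
    parts, current, depth = [], "", 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == " " and depth == 0:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def initials(first_names):
    """Toy model: one initial per top-level token; a braced group is one token."""
    return " ".join(tok.lstrip("{")[0] + "." for tok in split_top_level(first_names))


print(initials("John Paul"))    # two tokens, two initials
print(initials("{John Paul}"))  # one braced token, one initial
```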

retorquere commented 8 years ago

I'm trying to find out whether it is always safe to use author = {{Lastname}, Firstname, suffix and {Other}, Firstname}. I don't know of any sensible samples, but if suffixes or firstnames can plausibly include "and" or ",", this wouldn't work.

Still, this algorithm might work:

  1. Split name into Firstname, Lastname, and suffix (optional)
  2. Convert each to LaTeX
  3. Lastname = {Lastname}
  4. Firstname = Firstname.replace(/\band\b/, '{and}').replace(',', '{,}')
  5. suffix = suffix.replace(/\band\b/, '{and}').replace(',', '{,}')
  6. output {Lastname, Firstname, suffix}

Does that sound reasonable?
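For reference, the six steps above could be sketched in Python roughly as follows. Step 2 (conversion to LaTeX) is assumed to have happened already, and the function names are invented for this sketch. One deliberate difference: a JavaScript .replace with a string pattern, or with a regex lacking the g flag, only replaces the first match, whereas this sketch braces every occurrence.

```python
import re


def protect(part):
    """Brace every standalone "and" and every comma in a name part."""
    return re.sub(r"\band\b", "{and}", part).replace(",", "{,}")


def format_name(last, first, suffix=""):
    # Step 3: the last name is always braced in this first proposal.
    pieces = ["{" + last + "}", protect(first)]
    # Steps 4-5: first name and suffix get "and"/comma protection only.
    if suffix:
        pieces.append(protect(suffix))
    # Step 6: the outer braces are the field delimiters.
    return "{" + ", ".join(pieces) + "}"


print(format_name("Hicks", "Michael", "III"))
print(format_name("Smith", "A, B and C, D"))
```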

retorquere commented 8 years ago

I'm looking at your comment again; if I understand correctly, institution and publisher are name-ish fields, and the comment about title is really for #383.

retorquere commented 8 years ago

Would it be possible for you to assemble new test cases specifically for this issue? It would be cleaner than discussing the impact on existing cases. I'll deal with the existing cases when the cases specifically for this issue (and thus only relating to name-ish fields) pass.

njbart commented 8 years ago

“… name-ish …”

Sort of: From the biblatex manual: “The Biblatex package implements three distinct data types to handle bibliographic data: name lists, literal lists, and fields.” – institution, organization, publisher, location, origlocation, origpublisher, address and school are literal lists, so literal “and”s must be protected as {and}, but biblatex is not trying to parse literal list elements into first, last, etc.

“Would it be possible for you to assemble new test cases specifically for this issue?”

Yes, but I won’t be able to do much before the weekend.

Still, this algorithm might work:

My comments below refer to Zotero two-part name fields; single-part name fields should just be wrapped in extra curly braces, no other parsing required.

  1. Split name into Firstname, Lastname, and suffix (optional)
  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.

For biblatex: If any of the primary creators’ firstname fields in Zotero contains !,, add juniorcomma=true to the biblatex entry’s options field.

  2. Convert each to LaTeX

ok

  3. Lastname = {Lastname}

Why not like 4. firstname and 5. suffix?

+ 3a. Only if in Zotero the lastname is wrapped in double quotes, wrap bib(la)tex lastname in curly braces.

  4. Firstname = Firstname.replace(/\band\b/, '{and}').replace(',', '{,}')
  5. suffix = suffix.replace(/\band\b/, '{and}').replace(',', '{,}')

ok, but do the same for non-dropping particle and dropping particle

  6. output {Lastname, Firstname, suffix}

output {dropping-particle non-dropping-particle Lastname, suffix, Firstname}

For biblatex: If any of the primary creators’ non-dropping particles is non-empty, add useprefix=true to the biblatex entry’s options field.

For bibtex it might make sense to output {dropping-particle {non-dropping-particle Lastname}, suffix, Firstname}.
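A small Python sketch of the biblatex side of these suggestions: assembling the name in the proposed order and deriving the options field. The dictionary keys are CSL-style names chosen for illustration, and both function names are hypothetical; the "!," marker and the two option names are taken from the comment above.

```python
def assemble(last, first="", suffix="", dropping="", non_dropping=""):
    """Output {dropping-particle non-dropping-particle Lastname, suffix, Firstname}."""
    family = " ".join(p for p in (dropping, non_dropping, last) if p)
    return "{" + ", ".join(p for p in (family, suffix, first) if p) + "}"


def biblatex_options(creators):
    """Derive the options field from a list of creator dicts (assumed shape)."""
    options = []
    if any(c.get("non-dropping-particle") for c in creators):
        options.append("useprefix=true")
    if any("!," in c.get("given", "") for c in creators):
        options.append("juniorcomma=true")
    return ",".join(options)


print(assemble("Gogh", "Vincent", non_dropping="van"))
print(biblatex_options([{"family": "Gogh", "given": "Vincent",
                         "non-dropping-particle": "van"}]))
```

Empty parts are simply skipped, so a name without a suffix does not produce a doubled comma.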

retorquere commented 8 years ago

If treating firstname like 4/5, that's perfectly fine by me.

So the algorithm for two-part names then becomes:

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX
  3. On all of those, brace "and" and ","
  4. Do not do any caps protection (OK so that's not a step, but let's just be clear on it not being a step)
  5. output {dropping-particle non-dropping-particle Lastname, suffix, Firstname}
  6. Profit

That is doable. I'll get to work on that.

For institutions etc, I'd prefer to have a separate issue and separate test cases.

retorquere commented 8 years ago

What should the algorithm do if the lastname is quoted? ["retorquere + nickbart1980"] [first, von] converts into

family: "retorquere + nickbart1980"
given: first
particle: von

Where those quotes are to be interpreted as "use literally"

["retorquere + nickbart1980"] [first, von] should output {von {retorquere + nickbart1980}, first}, correct?

retorquere commented 8 years ago

Which would make the updated algorithm for two-part names:

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX
  3. If it was quoted, surround with braces; if not, brace "and" and ","
  4. Do not do any caps protection (OK so that's not a step, but let's just be clear on it not being a step)
  5. output {dropping-particle non-dropping-particle Lastname, suffix, Firstname}
  6. Profit

njbart commented 8 years ago

Which would make the updated algorithm: …

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX
  3. If it was quoted, surround with braces; if not, brace "and" and ","
  4. Do not do any caps protection (OK so that's not a step, but let's just be clear on it not being a step)
  5. output {dropping-particle non-dropping-particle Lastname, suffix, Firstname}
  6. Profit

Looks good. We’ll still need a few minor tweaks, e.g., when a particle ends with ’ or -, we should probably add a \relax, as in author = {d’\relax Ormesson, Jean} (see Tame the BeaST, 13.4, “How to remove space between von and Last?”); and for bibtex only, we might want to protect lowercase elements inside last names (see TTB, 13.3, “How to get lowercase letters in the Last?”; not needed for biblatex, see here and here).

retorquere commented 8 years ago

OK, so:

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX
  3. Protect lowercase inside lastname if we're in BibTeX
  4. If it was quoted, surround with braces; if not, brace "and" and ","
  5. Do not do any caps protection (OK so that's not a step, but let's just be clear on it not being a step)
  6. If non-dropping-particle ends in a space, don't change it; if it ends in a punctuation char, add \relax and a space, otherwise, add a space
  7. output {dropping-particle non-dropping-particleLastname, suffix, Firstname}
  8. Profit

retorquere commented 8 years ago

Before I forget, can you open a new issue for publisher, location etc?

retorquere commented 8 years ago

OK, tests are running on the above algorithm.

njbart commented 8 years ago

  6. If non-dropping-particle ends in a space, don't change it; if it ends in a punctuation char, add \relax and a space, otherwise, add a space

dropping-particle, too!

retorquere commented 8 years ago

Already included.

retorquere commented 8 years ago

What would be the right course of action for a two-part name where only the last name is supplied?

retorquere commented 8 years ago

(I'm going to assume you'd rather see {Lastname} than {Lastname,})

njbart commented 8 years ago

A single word will always be parsed as Last, so no comma is needed. – But two or more words would be parsed as First Last, so the comma might actually not be a bad idea. Need to investigate …

retorquere commented 8 years ago

Cool, easy to change. In the interim, this has a few \relax insertions; I'd be interested to know whether they're OK this way.

njbart commented 8 years ago

No, I’ve been testing this, and the comma does not keep bibtex or biblatex from parsing {Foo Bar,} as Foo=First and Bar=Last. So it seems multipart last names need to be wrapped in braces, just like corporate names.

retorquere commented 8 years ago

What is a "multipart name"? Anything with whitespace? Anything with non-alphabetic characters? And does this affect both last names and first names? Do the particles always go outside the braces?

njbart commented 8 years ago

Anything with whitespace. This affects only lastnames. Particles should always go outside the braces, but OTOH, if there are particles, the braces aren’t even needed.

njbart commented 8 years ago

This looks good – but only for bibtex. Unfortunately, the \relax trick does not seem to work for biblatex. I’ll have a closer look …

retorquere commented 8 years ago

OK, then:

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX:
    1. If quoted, brace entire part
    2. If not, and it's a last or first name, and it contains a space, brace entire part
    3. If not, brace "and" and ","
      1. If in BibTeX, brace <space><lowercase letter><word boundary>
  3. Postfix the particles:
    1. If it ends in a space, leave it
    2. If it ends in a punctuation character
      1. add \relax<space> if we're in BibTeX
      2. add a space otherwise (this still under investigation)
    3. Otherwise, add a space
  4. Output {<dropping-particle><non-dropping-particle><Lastname>, <suffix>, <Firstname>}
  5. Profit

retorquere commented 8 years ago

tests are running on https://github.com/ZotPlus/zotero-better-bibtex/issues/384#issuecomment-152109856

retorquere commented 8 years ago

Changed to:

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX:
    1. If quoted, brace entire part
    2. If not, and it's a last or first name, and it contains a space, brace entire part
    3. If not, brace "and" and ","
      1. If in BibTeX, brace <space><lowercase letter><word boundary>
  3. Postfix the particles:
    1. If it ends in a space, leave it
    2. If it ends in a punctuation character
      1. add \relax<space> if we're in BibTeX
      2. add <space> if the punctuation character is a period
      3. add nothing otherwise.
    3. Otherwise, add a space
  4. Output {<dropping-particle><non-dropping-particle><Lastname>, <suffix>, <Firstname>}
  5. Profit

retorquere commented 8 years ago

https://github.com/ZotPlus/zotero-better-bibtex/issues/384#issuecomment-152125210 passes all current tests, which means it has the same behavior as 1.6.2, except that empty last names don't generate a trailing comma. This probably means we don't have sufficient coverage; it would be a little surprising that the behavior was essentially OK as-is.
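The particle postfix rule (step 3 of the list referenced above) could be sketched in Python like this. Three things are assumptions of the sketch, not statements from the thread: the sub-steps are read in order as if / else-if / else, "punctuation" is taken to mean any character that is neither alphanumeric nor whitespace, and an empty particle yields an empty string. The function name is invented.

```python
def postfix_particle(particle, bibtex):
    """Return the particle with the separator that should precede the last name."""
    if not particle:
        return ""  # assumption: an absent particle contributes nothing
    last = particle[-1]
    if last.isspace():
        return particle  # 3.1: already ends in a space, leave it
    if not last.isalnum():  # 3.2: assumed definition of "punctuation"
        if bibtex:
            return particle + "\\relax "
        if last == ".":
            return particle + " "
        return particle
    return particle + " "  # 3.3: ordinary particle, add a space


for p in ("van", "d'", "v.", "al-"):
    print(repr(postfix_particle(p, bibtex=True)), repr(postfix_particle(p, bibtex=False)))
```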

njbart commented 8 years ago

  2. ii. If not, and it's a last or first name, and it contains a space, brace entire part

Do not brace entire first names, or else abbreviation to initials won’t be correct.

retorquere commented 8 years ago

  1. Split name into non-dropping particle, lastname, firstname, dropping particle, and suffix.
  2. Convert each to LaTeX:
    1. If quoted, brace entire part
    2. If not, and it's a last name, and it contains a space, brace entire part
    3. If not, brace "and" and ","
      1. If in BibTeX, brace <spaces><lowercase letters><word boundary>
  3. Postfix the particles:
    1. If it ends in a space, leave it
    2. If it ends in a punctuation character
      1. add \relax<space> if we're in BibTeX
      2. add <space> if the punctuation character is a period
      3. add nothing otherwise.
    3. Otherwise, add a space
  4. Output {<dropping-particle><non-dropping-particle><Lastname>, <suffix>, <Firstname>}
  5. Profit

tests are running
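Step 2 of this revision, where only last names containing a space are braced as a whole, might look like the following in Python. It is a sketch under stated assumptions: LaTeX escaping is taken to have happened elsewhere, the BibTeX-only lowercase protection (2.3.1) is left out because its exact bracing is not spelled out in the thread, and the function name is hypothetical.

```python
import re


def convert_part(text, is_lastname=False):
    """Apply the bracing rules of step 2 to one name part."""
    # 2.1: a part wrapped in double quotes means "use literally".
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return "{" + text[1:-1] + "}"
    # 2.2: only last names with a space are braced as a whole,
    # so first names can still be abbreviated to initials.
    if is_lastname and " " in text:
        return "{" + text + "}"
    # 2.3: otherwise protect standalone "and" and commas.
    return re.sub(r"\band\b", "{and}", text).replace(",", "{,}")


print(convert_part('"de Gaulle"', is_lastname=True))
print(convert_part("John Paul"))
```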

retorquere commented 8 years ago

Tests have passed without changes to the test cases.

retorquere commented 8 years ago

Are you satisfied with the current implementation?

njbart commented 8 years ago

I haven’t lost sight of this, but am much too busy with other stuff. I’ll try to upload test cases as soon as I can. Anyway, with 1.6.3, I still get:

@book{vangogh,
  author = {van {Gogh}, {Vincent}},
  options = {useprefix}
}

@book{humboldt,
  author = {von {Humboldt}, {Alexander}}
}

@book{beauvoir,
  author = {de {Beauvoir}, {Simone}}
}

@book{degaulle,
  author = {{de Gaulle}, {Charles}}
}

@book{king,
  author = {King, {Jr}., {Martin} {Luther}}
}

I’d say none of the braces around any of the elements are necessary, except those around {de Gaulle} when Zotero’s lastname is protected by quotes: "de Gaulle".

retorquere commented 8 years ago

1.6.3 doesn't have these changes yet. All these changes are on a separate branch which I mean to merge as soon as I have decent confidence that it works as intended -- no rush, but I'm going to hold off until I have the test cases.

njbart commented 8 years ago

A few test cases: 8ZTVU26A

Expected biblatex output:

@book{vangogh,
  author = {van Gogh, Vincent},
  options = {useprefix=true}
}
@book{humboldt,
  author = {von Humboldt, Alexander}
}
@book{beauvoir,
  author = {de Beauvoir, Simone}
}
@book{degaulle,
  author = {{de Gaulle}, Charles}
}
@book{king,
  author = {King, Jr., Martin Luther}
}
@book{stevenson,
  author = {Stevenson, III, Adlai E.},
  options = {juniorcomma=true}
}
@book{nationalaeronauticsandspaceadministration,
  author = {{National Aeronautics and Space Administration}}
}
@book{bovendeert,
  author = {boven d' Eert, Christianus},
  options = {useprefix=true}
}
@book{s-gravesande,
  author = {'s- Gravesande, Goverdus},
  options = {useprefix=true}
}
@book{dequincey,
  author = {De Quincey, Thomas}
}
@book{ortegaygasset,
  author = {Ortega y Gasset, José}
}
@book{damato,
  author = {D’Amato, Alfonse}
}
@book{sadat,
  author = {el- Sadat, Anwar}
}
@book{lafollette,
  author = {La Follette, Sr., Robert M.}
}
@book{delamare,
  author = {de la Mare, Walter},
  options = {useprefix=true}
}
@book{degette,
  author = {DeGette, Diana}
}
@book{saunders,
  author = {Saunders, John Bertrand de Cusance Morant}
}
@book{marcusaurelius,
  author = {{Marcus Aurelius},}
}
@book{dumas,
  author = {Dumas, père, Alexandre}
}
@book{vanrensselaer,
  author = {Van Rensselaer, Stephen}
}
@book{lenfant,
  author = {L’Enfant, Pierre-Charles}
}
@book{vangulik,
  author = {{van Gulik}, Robert}
}
@book{sackville-west,
  author = {Sackville-West, Victoria}
}
@book{vaughanwilliams,
  author = {Vaughan Williams, Ralph}
}
@book{miesvanderrohe,
  author = {Mies van der Rohe, Ludwig}
}
@book{dalembert,
  author = {d’ Alembert, Jean le Rond},
  options = {useprefix=true}
}
@book{tocqueville,
  author = {de Tocqueville, Alexis}
}
@book{lafontaine,
  author = {de La Fontaine, Jean}
}
@book{lasalle,
  author = {de La Salle, René-Robert Cavelier}
}
@book{dupuydeclinchamps,
  author = {du Puy de Clinchamps, Philippe},
  options = {useprefix=true}
}
@book{stein,
  author = {vom und zum Stein, Heinrich Friedrich Karl}
}
@book{silva,
  author = {da Silva, Agostinho}
}
@book{dagama,
  author = {da Gama, Vasco},
  options = {useprefix=true}
}
@book{dannunzio,
  author = {D’Annunzio, Gabriele}
}
@book{daponte,
  author = {Da Ponte, Lorenzo}
}
@book{dellarobbia,
  author = {Della Robbia, Luca}
}
@book{este,
  author = {Este, Beatrice d’}
}
@book{medici,
  author = {Medici, Lorenzo de’}
}
@book{al-hakim,
  author = {al- Hakim, Tawfiq},
  options = {useprefix=true}
}
@book{levayer,
  author = {Le Vayer, François de La Mothe}
}

Expected bibtex output:

@book{vangogh,
  author = {{van Gogh}, Vincent},
}
@book{humboldt,
  author = {von Humboldt, Alexander}
}
@book{beauvoir,
  author = {de Beauvoir, Simone}
}
@book{degaulle,
  author = {{de Gaulle}, Charles}
}
@book{king,
  author = {King, Jr., Martin Luther}
}
@book{stevenson,
  author = {Stevenson, III, Adlai E.},
}
@book{nationalaeronauticsandspaceadministration,
  author = {{National Aeronautics and Space Administration}}
}
@book{bovendeert,
  author = {{boven d'Eert}, Christianus},
}
@book{s-gravesande,
  author = {'s-Gravesande, Goverdus},
}
@book{dequincey,
  author = {De Quincey, Thomas}
}
@book{ortegaygasset,
  author = {Ortega y Gasset, José}
}
@book{damato,
  author = {D'Amato, Alfonse}
}
@book{sadat,
  author = {el-Sadat, Anwar}
}
@book{lafollette,
  author = {La Follette, Sr., Robert M.}
}
@book{delamare,
  author = {{de la Mare}, Walter},
}
@book{degette,
  author = {DeGette, Diana}
}
@book{saunders,
  author = {Saunders, John Bertrand de Cusance Morant}
}
@book{marcusaurelius,
  author = {{Marcus Aurelius},}
}
@book{dumas,
  author = {Dumas, père, Alexandre}
}
@book{vanrensselaer,
  author = {Van Rensselaer, Stephen}
}
@book{lenfant,
  author = {L'Enfant, Pierre-Charles}
}
@book{vangulik,
  author = {{van Gulik}, Robert}
}
@book{sackville-west,
  author = {Sackville-West, Victoria}
}
@book{vaughanwilliams,
  author = {Vaughan Williams, Ralph}
}
@book{miesvanderrohe,
  author = {Mies van der Rohe, Ludwig}
}
@book{dalembert,
  author = {d'Alembert, Jean le Rond},
}
@book{tocqueville,
  author = {de Tocqueville, Alexis}
}
@book{lafontaine,
  author = {de La Fontaine, Jean}
}
@book{lasalle,
  author = {de La Salle, René-Robert Cavelier}
}
@book{dupuydeclinchamps,
  author = {{du Puy de Clinchamps}, Philippe},
}
@book{stein,
  author = {vom und zum Stein, Heinrich Friedrich Karl}
}
@book{silva,
  author = {da Silva, Agostinho}
}
@book{dagama,
  author = {{da Gama}, Vasco},
}
@book{dannunzio,
  author = {D'Annunzio, Gabriele}
}
@book{daponte,
  author = {Da Ponte, Lorenzo}
}
@book{dellarobbia,
  author = {Della Robbia, Luca}
}
@book{este,
  author = {Este, Beatrice d'}
}
@book{medici,
  author = {Medici, Lorenzo de'}
}
@book{al-hakim,
  author = {al-Hakim, Tawfiq},
}
@book{levayer,
  author = {Le Vayer, François de La Mothe}
}

(I’ve used dumb apostrophes for bibtex; not sure whether you also want to asciify Unicode chars such as é and ç.)

retorquere commented 8 years ago

Super. Most things pass, but some do not:

retorquere commented 8 years ago

There is something called a "zero width space" after the apostrophe that currently causes this; what does a ZWS mean there? Should I just ignore it?