jgm / pandoc-citeproc

Library and executable for using citeproc with pandoc
BSD 3-Clause "New" or "Revised" License

Formatting of Hispanic names #50

Closed davepwsmith closed 10 years ago

davepwsmith commented 10 years ago

Sorry to raise another issue that I suspect is a flaw in CSL or biblatex rather than in this library, but it seems to be related to the parsing of names with dropping/non-dropping particles. Here is a biblatex bibliography entry for the poet Carlos Drummond de Andrade:

@article{citekeyblah,
  pages = {1},
  title = {{No meio do caminho}},
  url = {http://www.brasiliana.usp.br},
  volume = {{1}},
  number = {3},
  journaltitle = {{Revista de Antropofagia}},
  author = {Drummond de Andrade, Carlos},
  urldate = {2014-05-19},
  date = {1928-07},
  year = {1928},
  file = {revista-de-antropofagia_1928_3.pdf:/Users/davidsmith/Zotero Library/storage/7ESFHIJE/revista-de-antropofagia_1928_3.pdf:application/pdf}
}

Should ideally display as: Drummond de Andrade, Carlos (1928), “No meio do caminho”, Revista de Antropofagia, vol. 1, no. 3, pp. 1, obtido de http://www.brasiliana.usp.br

but instead displays as: Andrade, Carlos Drummond de (1928), “No meio do caminho”, Revista de Antropofagia, vol. 1, no. 3, pp. 1, obtido de http://www.brasiliana.usp.br

Now I know that I can wrap the family names in braces thus: author = {{Drummond de Andrade}, Carlos},

and it will have the desired effect, but this seems a fairly poor solution for something that is not merely a common but practically a universal way of formatting names from Hispanic/Lusophone countries, especially since the biblatex format at least nominally supports separating family names from given names with a comma to specify this.
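To illustrate why the braces change the result, here is a minimal Python sketch of brace-aware tokenization. It is not pandoc-citeproc's code; the function name `split_top_level` is hypothetical, and the sketch assumes only that a brace-wrapped group is treated as a single word by the name parser.

```python
def split_top_level(text):
    """Split on whitespace, but never inside curly braces (illustrative only)."""
    tokens, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        if ch.isspace() and depth == 0:
            # Whitespace outside braces ends the current token.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(split_top_level("Drummond de Andrade"))    # ['Drummond', 'de', 'Andrade']
print(split_top_level("{Drummond de Andrade}"))  # ['{Drummond de Andrade}']
```

With the braces, the parser sees one word instead of three, so there is no separate lowercase "de" for it to treat as a particle.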

I'm afraid my knowledge of programming, and particularly of Haskell, is pretty limited, but there seems to be a bit of bibtex.hs that takes the comma-separated strings and turns them into a single string. Perhaps a better behaviour would be to keep them separate, so that family names and given names can be distinguished as groups of words?

njbart commented 10 years ago

I'm afraid wrapping such family names in braces is the only option. pandoc-citeproc tries to stick to the published bibtex and biblatex specs as closely as possible, and the relevant details concerning name parsing are as follows:

von Last, First: The idea is similar, but identifying the First is easier: It’s everything after the comma. Before the comma, the last word is put in the Last (even if it starts with a lower case). If any other word begins with a lower case, anything from the first word to the last one starting with a lower case is in the von, and what remains is in the Last. (from http://www.lsv.ens-cachan.fr/~markey/BibTeX/doc/ttb_en.pdf)

The idea behind this rule is that, for example, De la in De la Fontaine, Jean is parsed correctly as a "von" part without having to add braces, but the disadvantage, of course, is that any family names containing lowercase words have to be wrapped in curly braces.
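As a rough illustration of the rule quoted above, here is a minimal Python sketch. It is not the code pandoc-citeproc uses; the function name `parse_bibtex_style` is hypothetical, and the sketch assumes a single comma and ignores brace grouping entirely.

```python
def parse_bibtex_style(name):
    """Split 'von Last, First' following the rule quoted from Tame the BeaST."""
    before, _, first = name.partition(",")
    words = before.split()
    if not words:
        return {"von": "", "last": "", "first": first.strip()}
    # The last word before the comma always goes to Last,
    # so only the words before it can end the von part.
    candidates = words[:-1]
    lower = [i for i, w in enumerate(candidates) if w[0].islower()]
    # von runs from the first word to the last lowercase-initial candidate.
    cut = lower[-1] + 1 if lower else 0
    return {
        "von": " ".join(words[:cut]),
        "last": " ".join(words[cut:]),
        "first": first.strip(),
    }


print(parse_bibtex_style("De la Fontaine, Jean"))
# {'von': 'De la', 'last': 'Fontaine', 'first': 'Jean'}
print(parse_bibtex_style("Drummond de Andrade, Carlos"))
# {'von': 'Drummond de', 'last': 'Andrade', 'first': 'Carlos'}
```

The second call shows the behaviour reported in this issue: because "de" is lowercase, "Drummond de" ends up in the von part and only "Andrade" remains as the family name.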

davepwsmith commented 10 years ago

Not sure that's quite right - in BibTeX that might be the case, but BibLaTeX renders the citations correctly if you run: pandoc -S -s --latex-engine=xelatex --biblatex $file and then compile with latex to a PDF. The biblatex spec seems to offer more fine-grained control over this (see p. 116), so although I can't really work out how it distinguishes the elements, it definitely does so correctly.

So, without knowing how BibLaTeX does this, and having tried pretty unsuccessfully to learn some Haskell to decipher the code, I'm unable to offer a real solution, but I'm afraid the question remains...

njbart commented 10 years ago

Glad you checked, and it would seem that the otherwise reliable Tame the BeaST (http://www.lsv.ens-cachan.fr/~markey/BibTeX/doc/ttb_en.pdf) is wrong here, as is the even more detailed https://www.tug.org/TUGboat/tb27-2/tb87hufflen.pdf.

Though I remember the biblatex authors pointing out that name parsing was not changed in any way from bibtex, for Text::BibTeX (used by biber for parsing) the specs say, contradicting the documents above:

If a name has a single comma, then it is assumed to be in "von last, first" form. A leading sequence of tokens with initial lower-case letters, if any, forms the 'von' part; tokens between the 'von' and the comma form the 'last' part; tokens following the comma form the 'first' part. (http://search.cpan.org/~ambs/Text-BibTeX-0.69/lib/Text/BibTeX/Name.pm)
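A minimal Python sketch of the rule as quoted from the Text::BibTeX documentation might look like this. It is not biber's or Text::BibTeX's actual implementation; the function name `parse_text_bibtex_style` is hypothetical, brace grouping is ignored, and keeping at least one token for the 'last' part is an assumption of the sketch rather than something the quote states.

```python
def parse_text_bibtex_style(name):
    """Split 'von last, first' following the quoted Text::BibTeX description."""
    before, _, first = name.partition(",")
    tokens = before.split()
    cut = 0
    # Only a *leading* run of lowercase-initial tokens forms the von part.
    # Assumption: always leave at least one token for 'last'.
    while cut < len(tokens) - 1 and tokens[cut][0].islower():
        cut += 1
    return {
        "von": " ".join(tokens[:cut]),
        "last": " ".join(tokens[cut:]),
        "first": first.strip(),
    }


print(parse_text_bibtex_style("de la Fontaine, Jean"))
# {'von': 'de la', 'last': 'Fontaine', 'first': 'Jean'}
print(parse_text_bibtex_style("Drummond de Andrade, Carlos"))
# {'von': '', 'last': 'Drummond de Andrade', 'first': 'Carlos'}
```

Under this reading, a lowercase word in the middle of the family name no longer starts or extends a von part, so "Drummond de Andrade" stays together without braces.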

Definitely needs further examination.

njbart commented 10 years ago

Opened an issue on the biblatex tracker: https://github.com/plk/biblatex/issues/236

davepwsmith commented 10 years ago

Well, in that case hopefully common sense prevails - I only wish I could code well enough to contribute a pull-request rather than just complaining about things.

njbart commented 10 years ago

See plk's response on https://github.com/plk/biblatex/issues/236: "… my impression is that it's the "von" handling that constitutes the major difference [between bibtex and biblatex]".

@jgm: My feeling is that pandoc-citeproc should be updated to parse names exactly as biblatex/biber does (following http://search.cpan.org/~ambs/Text-BibTeX-0.69/lib/Text/BibTeX/Name.pm) when the database is in biblatex format whereas it should continue to parse names as it does now if the database format is bibtex (i.e., the database file has a .bibtex extension).

EDIT: … not so fast, maybe. There are other views coming in on https://github.com/plk/biblatex/issues/236, so let's see how this develops before changing anything.

jgm commented 10 years ago

@nickbart1980, the biblatex issue is closed. What was the final upshot?

njbart commented 10 years ago

Not much, really, except that we've learned that a few differences concerning name parsing exist between bibtex and biblatex. I take it that there's no intention of modifying biblatex to bring it in line with the bibtex specs/behaviour. If the goal is to have pandoc-citeproc behave exactly like the programs that define the standards (i.e., latex/bibtex for the bibtex format, and latex/biblatex/biber for the biblatex format), then pandoc-citeproc's parsing of the biblatex (but not the bibtex) format would have to be modified. You'll have to decide.

jgm commented 10 years ago

I've fixed this, I believe.

% pandoc-citeproc -y --format bibtex 50.bib
---
references:
- volume: '1'
  URL: http://www.brasiliana.usp.br
  page: '1'
  container-title: <span class="nocase">Revista de Antropofagia</span>
  author:
  - family: Andrade
    dropping-particle: Drummond de
    given: Carlos
  id: citekeyblah
  accessed:
    date-parts:
    - - 2014
      - 5
      - 19
  issued:
    date-parts:
    - - 1928
      - 7
  title: <span class="nocase">No meio do caminho</span>
  type: article-journal
  issue: '3'
...
%
% pandoc-citeproc -y --format biblatex 50.bib
---
references:
- volume: '1'
  URL: http://www.brasiliana.usp.br
  page: '1'
  container-title: <span class="nocase">Revista de Antropofagia</span>
  author:
  - family: Drummond de Andrade
    given: Carlos
  id: citekeyblah
  accessed:
    date-parts:
    - - 2014
      - 5
      - 19
  issued:
    date-parts:
    - - 1928
      - 7
  title: <span class="nocase">No meio do caminho</span>
  type: article-journal
  issue: '3'
...

gracile-fr commented 9 years ago

I'm a bit surprised by the example (Carlos Drummond de Andrade). I thought that, for Portuguese and Lusophone names, only the last surname had to be used for sorting, bibliography entries, etc. http://www.loc.gov/books/?q=carlos+drummond+de+andrade&all=true http://www.idref.fr/028032470

njbart commented 9 years ago

I don’t know much about Portuguese names, but I’ll note that his Portuguese Wikipedia entry (https://pt.wikipedia.org/wiki/Carlos_Drummond_de_Andrade), like his English and French Wikipedia entries, uses “Drummond” as a short form throughout.

So, though we can’t shorten “Drummond de Andrade” to “Drummond” for in-text citations within the current CSL framework (the CSL schema would have to introduce nothing less than an additional dropping-suffix name part for that, I guess), my impression is that “Drummond de Andrade” is probably acceptable, and certainly better than “Andrade” in this case.

Still, if any native speakers of Portuguese could comment, this would certainly be helpful.

davepwsmith commented 9 years ago

Not a native, but I stand by my original post! This was perhaps a poor choice of example -- many bibliographies would list him as "Andrade, CD de" or something similar, although there is no hard and fast rule. Portuguese names are a bit of a bibliographical nightmare, and people frequently choose which of their (many) surnames to use on an ad-hoc basis -- this is why many alphabetized lists of names in Portuguese are organized by given names first, followed by whatever surnames have been provided. Added to this is the fact that many surnames and given names are incredibly common -- much more so than in English. The Wikipedia article you cite is one example of how people get around the problem -- Drummond, as the least common of his names, is how he would probably have referred to himself and been referred to by his contemporaries. However, the following list, which I put on the bibtex thread, has at least one example that would cause the parser as it was to choke (the third one):

{Carlos} {Drummond de Andrade}
{João Cabral} de {Melo Neto}
{María del Carmen} {Pucci}
{João Guimarães} {Rosa}
{Haroldo} de {Campos}

In spite of all of this, I would say that the main point of my argument is that if I wish to alphabetize my bibliography in a particular manner, and I have clearly indicated this with commas in the biblatex file, then my parser probably shouldn't try to second-guess me!

Anyhow, this bug has been satisfactorily squashed, so I don't really see what the fuss is about.

gracile-fr commented 9 years ago

No "fuss" at all. And I'm not looking to change anything. I was just curious about the example chosen to illustrate "Hispanic names", and since I assume that you obviously know more than I do on this subject, I posted here. Anyway, I think that CSL should be able to handle that in the future (i.e. to let the user choose how to parse these names), but this is not the right place to discuss it.