schemedoc / bibliography

Bibliography of Scheme research (readscheme.org and beyond)
https://research.scheme.org

Empirical solution to name representation #18

Open lassik opened 1 year ago

lassik commented 1 year ago

The command:

grep -h '^(author ' page*.scm | sort | uniq | sed -e 's/^(author //' -e 's/)$//' -e 's/"//g' | grep -v others

gives all the names we have so far:

Adams, Norman
Anderson, Claude W
Anderson, Kenneth R
Ashley, J Michael
Baker, Henry G
Bartlett, Joel F
Bartley, David H
Barzilay, Eli
Bawden, Alan
Başar, R Emre
Benson Jr, Brent W
Bothner, Per
Boucher, Dominique
Boussinot, Frédéric
Bres, Yannis
Bruggeman, Carl
Carlstrom, Brian David
Cejtin, Henry
Chen, Pee-Hong
Ciabrini, Damien
Clements, John
Clinger, Will
Clinger, William D
Clinger, William
Cowley, Anthony
Danvy, Olivier
De Roure, David
DePristo, Mark
DeRoure, David
Derici, Caner
Desbien, Jocelyn
Dionne, Carl
Duba, Bruce F
Dwyer, Rex A
Dybvig, R Kent
Earl, Christopher
Epardaud, Stéphane
Farmer, William M
Feeley, Marc
Felleisen, Matthias
Findler, Robert Bruce
Flanagan, Cormac
Flatt, Matthew
Forin, Alessandro
Foster, Ian
Friedman, Daniel P
Friedman, Daniel P.
Fuchs, Matthew
Gasbichler, Martin
Germain, Guillaume
Ghuloum, Abdulaziz
Grossman, Dan
Guttman, Joshua D
Halstead Jr, Robert H
Hansen, Lars Thomas
Hanson, Chris
Hartheimer, Anne
Haynes, Christopher T
Haynes, Christopher T.
Hickey, Timothy J
Hieb, Robert
Hilsdale, Erik
Hudak, Paul
Jagannathan, Suresh
Jensen, John C
Katz, Morry
Keep, Andrew W
Kelsey, Richard A
Kelsey, Richard
Kimball, Aaron
Kranz, David A
Kranz, David
Krishnamurthi, Shriram
Lang, Kevin J
Lapalme, Guy
Loaiza, Juan R
Loitsch, Florian
Marshall, Joe
Masuhara, Hidehiko
McDermott, Drew
Meunier, Philippe
Might, Matthew
Miller, James S
Miller, James
Miller, Scott G
Mirani, Rajiv
Mohr, Eric
Monk, Leonard G
Monnier, Stefan
Moreau, Luc
Muller, Hans
Nagata, Akihito
Norvig, Peter
Oliva, Dino P
Ost, Eric
Pearlmutter, Barak A
Pettyjohn, Greg
Philbin, James
Philbin, Jim
Piquer, José
Piérard, Adrien
Pleban, Uwe F
Pleban, Uwe F.
Prabhu, Tarun
PreScheme, Multithreaded
Queinnec, Christian
Ramsdell, John D
Rees, Jonathan A
Rees, Jonathan
Ribbens, Daniel
Rose, John R
Rozas, Guillermo J
Rozas, Guillermo
Sabry, Amr Afaf
Sabry, Amr
Sarkar, Dipanwita
Schooler, Richard
Schultz, Ulrik P
Schultz, Ulrik Pagh
Serpette, Bernard P
Serpette, Bernard Paul
Serpette, Bernard
Serrano, Manuel
Shivers, Olin
Sperber, Michael
Stamos, James W
Steele Jr, Guy L
Steele Jr, Guy Lewis
Steele, Guy L
Sumii, Eijiro
Sussman, Gerald Jay
Swarup, Vipin
Tammet, Tanel
Taura, Kenjiro
Taylor, CJ
Teodosiu, Dan
Thanos, Dimitri
Thiemann, Peter
Tinker, Pete
Turcotte, Marcel
Van Horn, David
Vegdahl, Steven R
Vitek, Jan
Waddell, Oscar
Wand, Mitchell
Weeks, Stephen
Weis, Pierre
Weise, Daniel
Wilson, Jason
Wittenberger, J
Yonezawa, Akinori
Şenol, Çağdaş
lassik commented 1 year ago

It seems that most of these are easy European-style names.

These have Jr:

Benson Jr, Brent W
Halstead Jr, Robert H
Steele Jr, Guy L
Steele Jr, Guy Lewis

These seem like Japanese names:

Sumii, Eijiro
Taura, Kenjiro

Some authors' names are spelled differently in different papers. I'm not sure whether we should preserve this.
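
One way to see how widespread the spelling differences are is to group the names mechanically. The following is a minimal sketch, not part of the repository: the function name `variant_candidates` and the grouping rule (same family name plus same first given initial) are assumptions made for illustration. The rule is deliberately loose, so the output is only a list of candidates for a human to review.

```python
from collections import defaultdict


def variant_candidates(names):
    """Group 'Family, Given' strings that may refer to the same person.

    Heuristic (an assumption, not an established rule): two spellings are
    candidates when the family name and the first given initial match.
    """
    groups = defaultdict(set)
    for raw in names:
        family, _, given = raw.partition(",")
        key = (family.strip().lower(), given.strip()[:1].lower())
        groups[key].add(raw)
    # Keep only keys that have more than one distinct spelling.
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# A few names copied from the list above.
names = [
    "Friedman, Daniel P", "Friedman, Daniel P.",
    "Kelsey, Richard A", "Kelsey, Richard",
    "Miller, James S", "Miller, James", "Miller, Scott G",
    "Danvy, Olivier",
]
candidates = variant_candidates(names)
for key, spellings in sorted(candidates.items()):
    print(key, spellings)
```

Because only the first initial is compared, this can also group two different people who share a family name and an initial, so it cannot decide on its own whether spellings should be merged.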

lassik commented 1 year ago

Here's what CSL-JSON expects:

  "definitions": {
    "name-variable": {
      "anyOf": [
        {
          "type": "object",
          "properties": {
            "family": {
              "type": "string"
            },
            "given": {
              "type": "string"
            },
            "dropping-particle": {
              "type": "string"
            },
            "non-dropping-particle": {
              "type": "string"
            },
            "suffix": {
              "type": "string"
            },
            "comma-suffix": {
              "type": ["string", "number", "boolean"]
            },
            "static-ordering": {
              "type": ["string", "number", "boolean"]
            },
            "literal": {
              "type": "string"
            },
            "parse-names": {
              "type": ["string", "number", "boolean"]
            }
          },
          "additionalProperties": false
        }
      ]
    },
lassik commented 1 year ago

And what CFF expects:

"person": {
    "additionalProperties": false,
    "description": "A person.",
    "properties": {
        ...
        "family-names": {
            "description": "The person's family names.",
            "minLength": 1,
            "type": "string"
        },
        ...
        "given-names": {
            "description": "The person's given names.",
            "minLength": 1,
            "type": "string"
        },
        "name-particle": {
            "description": "The person's name particle, e.g., a nobiliary particle or a preposition meaning 'of' or 'from' (for example 'von' in 'Alexander von Humboldt').",
            "examples": [
                "von"
            ],
            "minLength": 1,
            "type": "string"
        },
        "name-suffix": {
            "description": "The person's name-suffix, e.g. 'Jr.' for Sammy Davis Jr. or 'III' for Frank Edwin Wright III.",
            "examples": [
                "Jr.",
                "III"
            ],
            "minLength": 1,
            "type": "string"
        },
        ...

The CFF person record supports other interesting data (e.g. website) that is not strictly related to names.
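
To compare the two schemas side by side, here is a minimal sketch that maps one name onto both. The input record shape (a dict with `family`, a list of `given` parts, and an optional `suffix`) and the function names `to_csl` and `to_cff` are hypothetical, chosen only for this illustration. The output keys are the ones shown in the schema excerpts above.

```python
def to_csl(record):
    """Build a CSL-JSON name-variable object from a hypothetical record."""
    name = {"family": record["family"], "given": " ".join(record["given"])}
    if record.get("suffix"):
        name["suffix"] = record["suffix"]
    return name


def to_cff(record):
    """Build a CFF person object from the same hypothetical record."""
    person = {
        "family-names": record["family"],
        "given-names": " ".join(record["given"]),
    }
    if record.get("suffix"):
        person["name-suffix"] = record["suffix"]
    return person


# Hypothetical record for "Steele Jr, Guy L" from the list above.
record = {"family": "Steele", "given": ["Guy", "L"], "suffix": "Jr"}
print(to_csl(record))
print(to_cff(record))
```

Both schemas store the given names as a single string, so a representation that keeps given names as separate parts has to join them on export.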

lassik commented 1 year ago

@omasanori Would you like to try synthesizing from these schemas and the BibTeX format a name representation that works for the names listed above?

omasanori commented 1 year ago

Yeah, I will try. Thank you so much for your survey, @lassik !

comma-suffix

Wow, CSL could distinguish "John Doe, Jr." and "John Doe Jr."

Some authors' names are spelled differently in different papers. I'm not sure whether we should preserve this.

It is probably fine to unify "Friedman, Daniel P" and "Friedman, Daniel P." into one "Daniel P. Friedman", for instance. The situation would be awful if we had found "J. McCarthy", since that person could be either John McCarthy or Jay A. McCarthy in the context of Lisp dialects. In general, if we are confident that two spellings refer to the same person we can unify them; otherwise we should keep them as-is.

omasanori commented 1 year ago

On (probably) Japanese names, I found five:

They all follow the Family, Given format so the sorting is okay.

lassik commented 1 year ago

I wonder how Van Horn, David is represented. Is "Van Horn" the surname?

lassik commented 1 year ago

How do these look:

"Benson Jr, Brent W"

(family "Benson")
(given "Brent" "W")
(suffix "Jr")

"Halstead Jr, Robert H"

(family "Halstead")
(given "Robert" "H")
(suffix "Jr")

"Steele Jr, Guy L"

(family "Steele")
(given "Guy" "L")
(suffix "Jr")
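
The split shown above could be produced mechanically from the existing "Family, Given" strings. This is a rough sketch under stated assumptions: the function name `parse_name` and the `SUFFIXES` set are hypothetical, and only "Jr" actually occurs in the current list.

```python
# Assumption: the set of recognized suffixes. Only "Jr" appears in the
# bibliography so far; the others are included for illustration.
SUFFIXES = {"Jr", "Sr", "II", "III"}


def parse_name(text):
    """Split 'Family [Suffix], Given ...' into family, given, and suffix."""
    family_part, _, given_part = text.partition(",")
    family_tokens = family_part.split()
    suffix = None
    # Treat the last family token as a suffix only when it is a known one
    # and at least one other token remains as the family name.
    if len(family_tokens) > 1 and family_tokens[-1].rstrip(".") in SUFFIXES:
        suffix = family_tokens.pop().rstrip(".")
    record = {"family": " ".join(family_tokens), "given": given_part.split()}
    if suffix:
        record["suffix"] = suffix
    return record


print(parse_name("Steele Jr, Guy L"))
print(parse_name("Halstead Jr, Robert H"))
print(parse_name("Danvy, Olivier"))
```

Multi-word family names without a suffix are left untouched by this sketch, so everything before the comma stays in the family field.
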
omasanori commented 1 year ago

Van is... difficult. In some countries "Van" is ignored in the sort key, while in others it counts. Whether it is capitalized also depends on the country or language (or on personal usage).

Regarding David Van Horn: David always uses the capitalized form, and BibTeX normally counts a capitalized token as part of the surname, so I guess it is not awfully bad to treat "Van Horn" as the surname.

omasanori commented 1 year ago

And David does not spell it "David V. Horn", so let's keep "Van" as-is. In most cases, a person's own usage is what matters.

lassik commented 1 year ago

Yes, people are the best authority on their own names.

If the default sort key is the family name, then the following would suffice.

(family "Van Horn")
(given "David")

This means that "Horn" never makes sense without the "Van" prefix; the name is always filed under "Van Horn".

There is another schemer, Anton van Straaten, who has at least one paper in the bibliography (not yet converted to S-expression metadata). In his name, the "van" is in lowercase. So I don't know whether it's "Straaten, Anton van" or "van Straaten, Anton" (and in the latter case, it could be alphabetized under "v" or "s" - who knows.)

omasanori commented 1 year ago

In BibTeX, the latter is preferred, since "van" is a prefix of the family name (the von part in BibTeX terminology) and sorting ignores it anyway.

In CSL terminology, that "van" (ignored in sorting) is either a dropping-particle or a non-dropping-particle. "Dropping" refers to whether the particle is dropped when the family name is displayed alone, e.g. "For details, see [Name, 2023]" vs. "For details, see [van Name, 2023]".
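
The difference between the two particle kinds can be illustrated with a small sketch. The functions `cite_label` and `sort_key` are hypothetical helpers, not part of any CSL processor, and the placeholder person "Given van Name" is made up. The sort key that ignores particles follows the description in this thread and is an assumption; real CSL styles can configure this behaviour differently.

```python
def cite_label(name, year):
    """Render a short citation label using only the family name.

    A non-dropping particle stays attached to the family name;
    a dropping particle is omitted.
    """
    particle = name.get("non-dropping-particle")
    family = f"{particle} {name['family']}" if particle else name["family"]
    return f"[{family}, {year}]"


def sort_key(name):
    """Sort by family then given name, ignoring particles (assumption)."""
    return (name["family"].casefold(), name.get("given", "").casefold())


# Two CSL-style records for the same placeholder person.
dropping = {"family": "Name", "given": "Given", "dropping-particle": "van"}
non_dropping = {"family": "Name", "given": "Given", "non-dropping-particle": "van"}

print(cite_label(dropping, 2023))      # [Name, 2023]
print(cite_label(non_dropping, 2023))  # [van Name, 2023]
```

Under this sketch both records sort identically, and only the displayed short form differs.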