inspirehep / invenio

Invenio digital library software, INSPIRE OPS version
http://invenio-software.org/
GNU General Public License v2.0

apostrophes etc. in author part of texkey #352

Closed (tsgit closed 7 years ago)

tsgit commented 7 years ago

texkey generator author part:

https://github.com/inspirehep/invenio/blob/prod/modules/miscutil/lib/sequtils_texkey.py#L180

latest ticket: https://rt.inspirehep.net/Ticket/Display.html?id=615287

see

https://rt.inspirehep.net/Ticket/Display.html?id=193444 https://inspirert.cern.ch/Ticket/Display.html?id=400774

for reference.

you say "alphabetic"; I found "alphanumeric" (in the C locale) as the closest spelled-out recommendation. There is no formal spec; the original bibtex implementation is the reference.

anyhow, currently unicode is transliterated to ascii but no further data cleaning is performed on the author or collaboration name. going forward we could simply strip all non-alphanumeric characters, but I am hesitant to retroactively alter existing texkeys.

T

tsgit commented 7 years ago

what about hyphens in names (there are texkeys with hyphens in Inspire)? what about other punctuation? what restrictions are various new tools imposing which the original bibtex doesn't?

tsgit commented 7 years ago

elsevier uses apostrophes in bibtex keys http://www.sciencedirect.com/science/article/pii/S0550321316300475

@article{'tHooft20164,
title = "Reflections on the renormalization procedure for gauge theories ",
journal = "Nuclear Physics B ",
...

while APS uses (partial) article id

@article{PhysRevD.14.3432,
  title = {Computation of the quantum effects due to a four-dimensional pseudoparticle},
  author = {'t Hooft, G.},
  journal = {Phys. Rev. D},
  volume = {14},
  issue = {12},
  pages = {3432--3450},
...

and World Scientific uses DOI

@article{doi:10.1142/S0217751X16300222,
author = {’t Hooft, Gerard},
title = {Imagining the future, or how the Standard Model may survive the attacks},
...

so clearly the acceptable punctuation includes : / and . and, for some publishers, ' as well.

tsgit commented 7 years ago

@michamos @aw-bib @hoc3426 consensus on the call today seemed to be [-A-Za-z0-9] with at least one [A-Za-z]. I think I will also allow : / and . in the key. Any other thoughts?
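
A minimal sketch of the rule from the call, assuming the check applies to the author part only (the text before the colon). The helper name is hypothetical and not taken from sequtils_texkey.py. The letter test is spelled [A-Za-z] because the shorter range [A-z] would also match the punctuation that sits between Z and a in ASCII.

```python
import re

# Hypothetical helper, not part of sequtils_texkey.py.
ALLOWED = re.compile(r'[-A-Za-z0-9]+\Z')   # only hyphen, letters, digits
HAS_LETTER = re.compile(r'[A-Za-z]')       # at least one ASCII letter


def is_valid_author_part(author_part):
    """Return True if only [-A-Za-z0-9] is used and at least one letter is present."""
    return bool(ALLOWED.match(author_part)) and bool(HAS_LETTER.search(author_part))


for candidate in ["Smith", "Meyer-ter-Vehn", "O'Connell", "2016"]:
    print(candidate, is_valid_author_part(candidate))
```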

michamos commented 7 years ago

Why don't we simply strip all non-alphanumeric characters? After all, we want to generate a compact identifier that is somewhat related to the paper, so I would say the fewer extra characters the better, as long as we can get a non-empty identifier. Does "Smith-O'Neill" really care if it becomes 'SmithONeill' or 'Smith-ONeill'? And do : / or . ever appear in real last/collaboration names?
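
To make the question concrete, here is a small sketch comparing the two candidate rules on the example name. Both substitutions are illustrative only; neither is claimed to be what INSPIRE currently runs.

```python
import re

name = "Smith-O'Neill"

# Strict rule: keep ASCII letters and digits only.
strict = re.sub(r'[^A-Za-z0-9]', '', name)

# Looser rule: additionally keep the hyphen.
with_hyphen = re.sub(r'[^-A-Za-z0-9]', '', name)

print(strict)       # SmithONeill
print(with_hyphen)  # Smith-ONeill
```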

michamos commented 7 years ago

@plk (biber author) might be interested in chiming in.

For context: we are deciding what characters to authorize in the BibTeX keys generated by INSPIRE. Currently, they are based on first author or collaboration names in the database (e.g. Smith:2016bw), without any sanitizing except for conversion to ascii for compatibility with old tools. biber and bibdesk seem to choke on ', our users are complaining, so it might be a good idea to restrict the characters we put there.

plk commented 7 years ago

biber supports UTF-8 in keys but there are certain characters which are forbidden by the btparse C library which biber uses to parse .bib files. You can see the characters allowed by the default btparse (which only supported ASCII) here: http://search.cpan.org/~gward/btparse-0.34/doc/btparse.pod

In addition, biber allows arbitrary UTF-8 in keys, as mentioned, but there are characters not allowed by the library due to parsing restrictions.

michamos commented 7 years ago

Thanks for the input!

From http://search.cpan.org/~gward/btparse-0.34/doc/btparse.pod, in order to be btparse-compatible (which is used both by bibdesk and biber, and maybe other tools we don't know of), our texkeys should be

A string of letters, digits, and the following characters:

      ! $ & * + - . / : ; < > ? [ ] ^ _ ` |
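
A rough sketch of a compatibility check built from the character list quoted above. The function name is hypothetical, "letters" is assumed to mean ASCII letters, and the character set is copied from the documentation as quoted here rather than verified against the library's source.

```python
import string

# Punctuation quoted from the btparse documentation above.
BTPARSE_PUNCTUATION = "!$&*+-./:;<>?[]^_`|"
BTPARSE_ALLOWED = set(string.ascii_letters + string.digits + BTPARSE_PUNCTUATION)


def btparse_offenders(texkey):
    """Return the characters in texkey that fall outside the quoted set, in order."""
    return [ch for ch in texkey if ch not in BTPARSE_ALLOWED]


print(btparse_offenders("Smith:2016bw"))   # []
print(btparse_offenders("'tHooft20164"))   # ["'"]
```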

tsgit commented 7 years ago

Hi Micha @michamos

thanks for pulling in the biber developer to the discussion on github. That provided some good perspective.

It's an odd thing to exclude specifically the apostrophe because of limitations of a third party parsing library used in some popular tools -- but for the sake of interoperability I go along with that.

For most practical purposes simple alphanum appears sufficient at first, and it's easy to implement. I do however think that changing long established practices warrants a somewhat more thorough analysis. As I see it bibtex keys should be

1) mnemonic
2) easy to type and read in (La-)TeX documents
3) interoperable with reference management tools
4) reasonably unique
5) semi-persistent

The mnemonic part is very useful during the authoring of a paper, since it can or should be immediately obvious what work a particular \cite{} refers to while proof-reading, etc. Author name and year serve that purpose well. If different services use different conventions to "normalize" author names, this can become confusing and cumbersome for the user. Was it

Meyer-ter-Vehn or MeyerterVehn

O'Connell or Oconnell or OConnell

etc.?

What about collaboration acronyms like L^3 (L-cube) or LUX-ZEPLIN or similar?

Also, Inspire is setting an example in the HEP community and people develop tools around our services. So if almost all our texkeys are plain alphanum, then that creates an expectation. If we later on decide to use DOIs for texkeys for records where no author or collaboration name is available (like World Scientific does), then suddenly a set of additional symbols must be allowed (DOIs can have all kinds of crazy things in them, they are woefully permissive). This might break those tools. So instead of restricting the character set I prefer to allow the widest one which is interoperable.

Anyhow, I minted 1,189,740 texkeys in HEP on my test server with

re.sub(r'[^-A-Za-z0-9.:/]', '', texkey_first_part)

and then also with the character set you recommended (minus "[" and "]")

re.sub(r'[^-A-Za-z0-9.:/^_;&*<>?|!$+]', '', texkey_first_part)

and looked at differences. This is the complete list and most need metadata correction:

(recid, 'texkey')

(334296, 'OfficeofSmall&DisadvantagedBusinessUtilization:1992djv')
(448769, 'R&DMagazine:1997vgy')
(876377, 'Ferguson*:1999fkm')
(924304, 'MasonJohnston&Assoc.:1990fzj')
(1124896, 'Br?ning:2012mds')
(1125124, 'Br?ning:2012ikp')
(1125152, 'Chanc?:2012vdd')
(1125350, 'Lombra?aGonz?lez:2012kct')
(1125428, 'Chanc?:2012odq')
(1126789, 'H?lsmann:2012mhf')
(1126805, 'M?tral:2012dpm')
(1126834, 'Dom?nguez:2012xsz')
(1126878, 'M?tral:2012lve')
(1182696, 'Gr?wer:2011tju')
(1183027, 'Chanc?:2011qpw')
(1183028, 'Chanc?:2011ize')
(1183045, 'Bra?as:2011ueq')
(1183214, 'Dom?nguez:2011sqk')
(1183215, 'Dom?nguez:2011izm')
(1183278, 'Resta-L?pez:2011unr')
(1183412, 'M?ller:2011xye')
(1183415, 'H?lsmann:2011bry')
(1183509, 'H?fle:2011xiy')
(1183525, 'G?nzel:2011edx')
(1183526, 'G?nzel:2011gon')
(1183527, 'G?nzel:2011mgu')
(1183590, 'Touz?:2011jxj')
(1183731, 'Verd?-Andr?s:2011nay')
(1187061, 'H?fle:2011tms')
(1187123, 'H?fle:2011jjp')
(1187345, 'Garc?a:2011esn')
(1301953, 'H?ivnacova:2014cen')
(1316182, 'Jastrz?bski:2014upp')
(1332251, 'Djuri$c:2014jci')
(1420937, 'Ola!H:2015twe')
(1423142, 'Gurlebeck>:2015ikw')
(1427812, '$AA$berg:2009zwk')
(1431395, 'Kozl;owski:1967wlk')
(1439147, 'Hornsh?j:1971iso')
(1453295, 'Schr?der:1980ppg')
(1481365, 'Xiangdong&nbspZhang:2016ovs')
(1487004, 'Giulia&nbspSchettino:2016cze')
(1487006, 'Eyo&nbspEyo&nbspIta:2016fcc')
(1487007, 'Elias&nbspZafiris:2016nsg')
(1487008, 'Jorge&nbspL.&nbspCervantes-Cota:2016aan')

in some sense "Elias&nbspZafiris:2016nsg" is better than "EliasnbspZafiris:2016mrz" because it shows the reason more clearly: "&nbsp;" shouldn't be in the author field.
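
The comparison described above can be sketched roughly as follows. It assumes a list of (recid, author_part) pairs that have already been transliterated to ASCII. The function name is hypothetical, and the sample input reuses three author parts from the list above rather than the full set of names.

```python
import re

# Narrow set (first run) and wide set (second run) from the comment above.
NARROW = re.compile(r'[^-A-Za-z0-9.:/]')
WIDE = re.compile(r'[^-A-Za-z0-9.:/^_;&*<>?|!$+]')


def differing_keys(records):
    """Yield (recid, narrow, wide) for records where the two character sets disagree."""
    for recid, author_part in records:
        narrow = NARROW.sub('', author_part)
        wide = WIDE.sub('', author_part)
        if narrow != wide:
            yield recid, narrow, wide


sample = [
    (448769, 'R&DMagazine'),
    (876377, 'Ferguson*'),
    (1183278, 'Resta-L?pez'),
]
for row in differing_keys(sample):
    print(row)
```

Names made only of letters, digits, hyphens, and : / . come out the same under both patterns, so they never appear in the output.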

So in the end my pull request has

re.sub(r'[^-A-Za-z0-9.:/^_;&*<>?|!$+]', '', texkey_first_part)

(that's after unicode transliteration)
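
Put together, the sanitizing step might look like the sketch below. The transliteration uses unicodedata from the standard library as a stand-in; that is an assumption, not the transliteration Invenio actually performs, and it simply drops characters that have no ASCII decomposition. The function name is hypothetical.

```python
import re
import unicodedata

# Character set from the pull request; everything else is removed.
DISALLOWED = re.compile(r'[^-A-Za-z0-9.:/^_;&*<>?|!$+]')


def sanitize_author_part(name):
    """Transliterate to ASCII (stand-in method), then strip disallowed characters."""
    decomposed = unicodedata.normalize('NFKD', name)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return DISALLOWED.sub('', ascii_only)


print(sanitize_author_part("'t Hooft"))        # tHooft
print(sanitize_author_part("Meyer-ter-Vehn"))  # Meyer-ter-Vehn
print(sanitize_author_part("O'Connell"))       # OConnell
```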

Cheers T.