gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Why is `’` converted to `'`? #247

Closed dimus closed 1 year ago

dimus commented 1 year ago

From https://github.com/gnames/gnparser/issues/245

Another issue is that "D'Orbigny" in the original is "D’Orbigny" in the gnparser output. Why change UTF-8 27 to e2 80 99?
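For reference, the two byte sequences mentioned in the question can be checked directly. This is a minimal Python sketch, not part of gnparser; `utf8_hex` is a helper name made up for the example.

```python
import unicodedata


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")


# U+0027 is one ASCII byte; U+2019 needs three bytes in UTF-8.
for ch in ("'", "\u2019"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {utf8_hex(ch)}")
```

The first character is the plain ASCII apostrophe, the second is the typographic (curly) one.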

dimus commented 1 year ago

I do try to normalize/simplify characters if it does not change semantic meaning. My impression is that ' and ’ are used interchangeably for authors in scientific names, and I picked ' because it is ASCII, meaning it will cause fewer problems for people with unusual default encodings.
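The idea can be sketched in a few lines of Python. This only illustrates folding the typographic apostrophe into the ASCII one; it is not gnparser's actual implementation, and the function name `normalize_apostrophe` is hypothetical.

```python
def normalize_apostrophe(name: str) -> str:
    """Replace U+2019 (typographic apostrophe) with ASCII U+0027."""
    return name.replace("\u2019", "'")


# Both spellings end up as the same ASCII string.
for spelling in ("D\u2019Orbigny", "D'Orbigny"):
    print(normalize_apostrophe(spelling))
```

A plain character replacement like this leaves every other character untouched, so names that already use the ASCII apostrophe pass through unchanged.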

The original spelling of the authorship is preserved in JSON format in the verbatim field:

"authorship": {
    "verbatim": "B.D’Orbigny",
    "normalized": "B. D' Orbigny",
    "authors": [
      "B. D' Orbigny"
    ],
    "originalAuth": {
      "authors": [
        "B. D' Orbigny"
      ]
    }
  },

It might make sense to keep the verbatim authorship in CSV/TSV output as well; let me think about it a bit.
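To compare the two spellings programmatically, something like the following should work. It is a sketch that assumes the fragment above has been wrapped in braces so that it parses as JSON, and it uses only the `verbatim` and `normalized` keys shown there; `apostrophes_in` is a hypothetical helper.

```python
import json

# The authorship fragment from above, trimmed and wrapped in {} to be valid JSON.
doc = json.loads("""
{
  "authorship": {
    "verbatim": "B.D’Orbigny",
    "normalized": "B. D' Orbigny"
  }
}
""")


def apostrophes_in(text: str) -> list[str]:
    """Return the apostrophe-like characters found in text as code points."""
    return [f"U+{ord(c):04X}" for c in text if c in ("'", "\u2019")]


auth = doc["authorship"]
print("verbatim:  ", auth["verbatim"], apostrophes_in(auth["verbatim"]))
print("normalized:", auth["normalized"], apostrophes_in(auth["normalized"]))
```

Checking the code points rather than eyeballing the glyphs avoids confusing the two characters, which can look nearly identical in some fonts.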

Mesibov commented 1 year ago

@dimus, I've rechecked the original dataset and found that the compilers used both characters:

3 records: Acteocina candei (D’Orbigny, 1841)
37 records: Acteocina candei (D'Orbigny, 1842)

gnparser converted both to an apostrophe in Author, which is OK. I was looking at "D’Orbigny" in the verbatim field and thought I had input "D'Orbigny", so that was my mistake; all is well. In my pseudo-duplicate search the results are fine:

Acteocina candei (D’Orbigny, 1841) [3]
Acteocina candei (D'Orbigny, 1842) [37]
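A recheck of this kind can be sketched as follows. The record list is rebuilt from the counts quoted in this comment, and `apostrophe_kind` is a hypothetical helper; this is not the script that was actually used.

```python
from collections import Counter

TYPOGRAPHIC = "\u2019"  # the curly apostrophe


def apostrophe_kind(name: str) -> str:
    """Classify a name-string by the apostrophe character it contains."""
    if TYPOGRAPHIC in name:
        return "typographic"
    if "'" in name:
        return "ascii"
    return "none"


# Stand-in data: 3 records with the typographic form, 37 with the ASCII form.
records = (
    ["Acteocina candei (D\u2019Orbigny, 1841)"] * 3
    + ["Acteocina candei (D'Orbigny, 1842)"] * 37
)

counts = Counter(apostrophe_kind(r) for r in records)
print(counts)
```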