gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Parsing the dagger symbol? #85

Closed dimus closed 2 years ago

dimus commented 3 years ago

created by @gdower at https://gitlab.com/gogna/gnparser/-/issues/85

Names often include the dagger symbol (†) to indicate that the taxon is extinct. It might be useful to remove the dagger from the name and add an extinct boolean.

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43

1. Henriksenopterix†

2. Henriksenopterix† paucistriata (Henriksen, 1922)

3. Heteralocha acutirostris (Gould, 1837) Huia N E†

4. Oncorhynchus nerka (Walbaum, 1792) Sockeye salmon F A †? 

5. Ostomalynus Kireichuk & Ponomarenko, 1990. Type 
   species: †  Ostomalynus ovalis Kireichuk & 
   Ponomarenko, 1990, by original designation.

Cases 1-3: pos will work fine if the dagger is substituted with a space.

Cases 4-5: These are problematic. I guess what I can do is remember where the daggers occurred and, if all of them were in the unparsed tail, ignore them.
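The "remember where daggers happened" idea could look roughly like the sketch below. It is only an illustration: `daggerOffsets`, `allInTail`, and `tailStart` are hypothetical names, not part of gnparser, and the tail offset is assumed to start right after the authorship.

```go
package main

import (
	"fmt"
	"strings"
)

// daggerOffsets returns the byte offset of every dagger (U+2020) in s.
// Hypothetical helper, not gnparser code.
func daggerOffsets(s string) []int {
	var res []int
	for i, r := range s {
		if r == '†' {
			res = append(res, i)
		}
	}
	return res
}

// allInTail reports whether every offset lies at or after tailStart,
// the assumed byte offset where the unparsed tail begins.
func allInTail(offsets []int, tailStart int) bool {
	for _, o := range offsets {
		if o < tailStart {
			return false
		}
	}
	return true
}

func main() {
	name := "Oncorhynchus nerka (Walbaum, 1792) Sockeye salmon F A †?"
	// Assumption for the example: parsing stopped right after the authorship.
	tailStart := strings.Index(name, ")") + 1
	fmt.Println(allInTail(daggerOffsets(name), tailStart))
}
```

For case 2 the dagger sits inside the parsed part of the name, so such a check would report false and the dagger could not simply be ignored.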

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44

@gdower do you have examples of where you see the dagger symbol in the wild? If it is always at the end, the pos part of the parsed data will not be broken.

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/45

It does make sense. I can imagine 2 ways to solve it.

  1. Add a preprocessing step that detects and removes the dagger symbol. This approach has, in my view, 2 problems:

    • It would run for every string, while the dagger symbol is pretty rare. If it is implemented as a regex, it will cost 1-2% of speed. However, if it is done by scanning every symbol, the slowdown will be negligible.
    • It will modify the name. However, we already change it, for example when we remove HTML tags, and we normalize the name anyway.
  2. If we have an unparsed tail, scan it for the dagger symbol. We keep the dagger in the unparsed tail and set the extinct flag to true. In this case the search for the dagger will run only rarely. Possible problems:

    • The name in this case is marked as quality 3, even though using the dagger symbol is a commonly accepted practice.

I think the first approach is better. After looking at "dagger" names in the wild, the 2nd approach is not going to work at all.
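To make the regex-versus-scan distinction in the first approach concrete, here is a small sketch of both detection styles. The function names are hypothetical, not gnparser's, and the sketch makes no claim about their relative speed; it only shows that a byte scan for the dagger's 3-byte UTF-8 sequence gives the same answer as a regex match.

```go
package main

import (
	"fmt"
	"regexp"
)

// daggerRe matches the dagger symbol (U+2020).
var daggerRe = regexp.MustCompile("†")

// hasDaggerRegex detects a dagger with a regular expression.
func hasDaggerRegex(s string) bool {
	return daggerRe.MatchString(s)
}

// hasDaggerScan detects a dagger by scanning bytes for its
// UTF-8 encoding: 0xE2 0x80 0xA0.
func hasDaggerScan(s string) bool {
	for i := 0; i+2 < len(s); i++ {
		if s[i] == 0xE2 && s[i+1] == 0x80 && s[i+2] == 0xA0 {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{
		"Henriksenopterix†",
		"Heteralocha acutirostris (Gould, 1837)",
	} {
		fmt.Println(name, hasDaggerRegex(name), hasDaggerScan(name))
	}
}
```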

dimus commented 2 years ago

Solution:

  1. The dagger is detected during preprocessing and substituted with 3 spaces (to keep the same number of bytes; the dagger is encoded in UTF-8 as 0xE2 0x80 0xA0 (e280a0))
  2. The HasDagger flag is set to true
  3. Parsing proceeds as usual
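A minimal sketch of steps 1 and 2, assuming the preprocessing works on a byte slice. `replaceDagger` is a hypothetical helper written for this illustration and is not taken from the gnparser source; only the `HasDagger` flag name comes from the description above.

```go
package main

import (
	"bytes"
	"fmt"
)

// dagger is the UTF-8 encoding of U+2020: 0xE2 0x80 0xA0.
var dagger = []byte("†")

// replaceDagger is a hypothetical helper, not gnparser's actual code.
// If the input contains a dagger, it returns a new slice in which every
// dagger is replaced by three spaces, together with true. Otherwise it
// returns the input unchanged, together with false.
func replaceDagger(name []byte) ([]byte, bool) {
	if !bytes.Contains(name, dagger) {
		return name, false
	}
	return bytes.ReplaceAll(name, dagger, []byte("   ")), true
}

func main() {
	in := []byte("Henriksenopterix† paucistriata (Henriksen, 1922)")
	out, hasDagger := replaceDagger(in)
	// The byte length is preserved, so byte offsets into the original
	// string remain valid after the substitution.
	fmt.Printf("%q HasDagger=%v sameLength=%v\n", out, hasDagger, len(in) == len(out))
}
```

Because three single-byte spaces take the place of the three-byte dagger, any byte offsets computed on the preprocessed string still point at the same places in the original input.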

This approach generates a warning about too many spaces, and we cannot tell whether it was triggered by the dagger character or by genuine extra spaces.

Solution: remove the extra spaces silently. I think removal of extra spaces is similar to removal of a comma before the year; it is something that can probably be done without issuing a warning.