Closed dimus closed 2 years ago
created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43
1. Henriksenopterix†
2. Henriksenopterix† paucistriata (Henriksen, 1922)
3. Heteralocha acutirostris (Gould, 1837) Huia N E†
4. Oncorhynchus nerka (Walbaum, 1792) Sockeye salmon F A †?
5. Ostomalynus Kireichuk & Ponomarenko, 1990. Type
species: † Ostomalynus ovalis Kireichuk &
Ponomarenko, 1990, by original designation.
Cases 1-3: pos will work fine if we substitute the dagger with a space.
Cases 4-5: These are problematic. I guess what I can do is remember where the daggers occurred and, if all of them were in the unparsed tail, ignore them.
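The "remember where daggers happened" idea could be sketched roughly as follows. This is a Python illustration only, not gnparser code; `dagger_positions`, `daggers_only_in_tail`, and the `tail_start` offset (where the parser's unparsed tail would begin) are hypothetical names introduced for the example.

```python
DAGGER = "\u2020"  # the dagger character, †


def dagger_positions(name: str) -> list[int]:
    """Return the character offsets of every dagger in the name-string."""
    return [i for i, ch in enumerate(name) if ch == DAGGER]


def daggers_only_in_tail(name: str, tail_start: int) -> bool:
    """True if the name has daggers and all of them sit at or after
    tail_start (a hypothetical offset where the unparsed tail begins)."""
    positions = dagger_positions(name)
    return bool(positions) and all(p >= tail_start for p in positions)


# Case 4 from the list above: the dagger appears only after the parsed part.
name = "Oncorhynchus nerka (Walbaum, 1792) Sockeye salmon F A †?"
tail_start = name.index(" Sockeye")  # assumed start of the unparsed tail
print(daggers_only_in_tail(name, tail_start))
```

In case 2 the dagger sits inside the parsed part of the name, so the same check returns false and the dagger cannot simply be ignored.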
created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44
@gdower do you have examples of where you see the dagger symbol in the wild? If it is always at the end, the pos part of the parsed data will not get broken.
created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/45
It does make sense. I can imagine 2 ways to solve it.
1. Add a preprocessing step that detects and removes the dagger symbol. This approach has, in my view, 2 problems:
2. If we have an unparsed tail, we scan it for the dagger symbol. We keep the dagger in the unparsed tail and set the extinct flag to true. In this case the search for the dagger will usually be rare. Possible problems:
I think the first approach is better. After looking at "dagger" names in the wild, the 2nd approach is not going to work at all.
Solution: detect the dagger by its UTF-8 bytes 0xE2 0x80 0xA0 (e280a0), substitute it with a space, and set the extinct flag to true.
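A minimal sketch of this kind of byte-level preprocessing, written in Python for illustration and not taken from gnparser; the function name `strip_dagger` and its return shape are assumptions made for the example.

```python
DAGGER_UTF8 = b"\xe2\x80\xa0"  # UTF-8 encoding of † (U+2020)


def strip_dagger(raw: bytes) -> tuple[bytes, bool]:
    """Replace every dagger with a single space and report whether any was found.

    The boolean plays the role of a hypothetical 'extinct' flag.
    """
    extinct = DAGGER_UTF8 in raw
    cleaned = raw.replace(DAGGER_UTF8, b" ")
    return cleaned, extinct


raw = "Henriksenopterix† paucistriata (Henriksen, 1922)".encode("utf-8")
cleaned, extinct = strip_dagger(raw)
print(cleaned.decode("utf-8"), extinct)
```

Replacing the three-byte sequence with one space leaves character offsets unchanged but shifts byte offsets, and it can leave two adjacent spaces where the dagger was followed by a space.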
Such an approach generates a warning for too many empty spaces, and we cannot tell whether the warning was caused by the dagger character or by genuine spare empty spaces.
Solution: remove empty spaces silently. I think removal of extra spaces is similar to removal of a comma before the year: it is something that can probably be done without issuing a warning.
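Silent removal of extra spaces could look roughly like this. It is a Python sketch with a hypothetical helper name, `collapse_spaces`, and is not gnparser's implementation; the point is only that no warning is recorded for the cleanup.

```python
def collapse_spaces(name: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends,
    without recording any warning about the change."""
    return " ".join(name.split())


# Two spaces left behind where a dagger was replaced by a space.
print(collapse_spaces("Henriksenopterix  paucistriata (Henriksen, 1922)"))
```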
created by @gdower at https://gitlab.com/gogna/gnparser/-/issues/85
Names often include the dagger symbol (†) to indicate that the taxon is extinct. It might be useful to remove the dagger from the name and add an extinct boolean.