As a user I want atypical hyphens standardized and parsed

gdower commented 2 years ago

Some publishers use non-breaking hyphens (U+2011) instead of the more typical hyphen-minus (U+002D) in author strings in typeset PDFs. People then copy and paste them into their databases, which breaks parsing. For example, compare these two outputs:

https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on

Perhaps atypical hyphens should be standardized to U+002D hyphens prior to parsing?
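
To illustrate the idea, here is a minimal sketch of such a pre-parsing step. The function name `normalizeHyphens` is hypothetical and this is not gnparser's actual code; it only assumes that the name-string is available as a Go string before it reaches the parser.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHyphens is a hypothetical helper (not part of gnparser).
// It replaces every non-breaking hyphen (U+2011) with hyphen-minus (U+002D)
// and leaves all other characters untouched.
func normalizeHyphens(name string) string {
	return strings.ReplaceAll(name, "\u2011", "-")
}

func main() {
	// Author string as it might be copied from a typeset PDF.
	raw := "Passalus (Pertinax) gaboi Jiménez\u2011Ferbans & Reyes\u2011Castillo, 2022"
	fmt.Println(normalizeHyphens(raw))
}
```

Strings that contain no U+2011 come back unchanged, so the step should be safe to run on every input.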

Here's the PDF, although the web version doesn't use the non-breaking hyphens.

Here are some other atypical hyphens that might also occasionally be an issue, introduced by publishers or bad OCR:

https://www.fileformat.info/info/unicode/category/Pd/list.htm
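
If the whole dash-punctuation category were to be handled, one possible approach is to map every rune in Unicode category Pd to U+002D. The sketch below uses Go's standard `unicode.Pd` range table; the helper name `normalizeDashes` is made up for illustration and nothing here reflects how gnparser actually does it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeDashes is a hypothetical helper (not part of gnparser).
// It maps every rune in Unicode category Pd (dash punctuation)
// to hyphen-minus (U+002D) and keeps all other runes as they are.
func normalizeDashes(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Pd, r) {
			return '-'
		}
		return r
	}, s)
}

func main() {
	// U+2011 (non-breaking hyphen) and U+2010 (hyphen) are both in Pd.
	fmt.Println(normalizeDashes("Jiménez\u2011Ferbans & Reyes\u2010Castillo"))
}
```

Note that Pd also contains the en dash and em dash, so a blanket mapping like this could change strings where those dashes are meaningful; that is an extra reason to be selective about which characters get normalized.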

If it hurts performance too much, it's probably okay to not bother with handling it. It's not a frequently encountered problem.

dimus commented 2 years ago

Thank you @gdower, this is a good catch. I think it does make sense to add the non-breaking hyphen, as it is something we know appears 'in the wild', while I would postpone other hyphens until they are encountered for real, to save some CPU cycles.

dimus commented 2 years ago

Hopefully this now parses correctly:

https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on

gnames / gnparser

As a user I want atypical hyphens standardized and parsed #237