Closed gdower closed 2 years ago
Thank you @gdower, this is a good catch. I think it does make sense to add the non-breaking hyphen, as it is something we know appears 'in the wild', while I would postpone other hyphens until they are encountered for real, to save some CPU cycles.
Some publishers use non-breaking hyphens (U+2011) instead of the more common hyphen-minus (U+002D) in author strings in typeset PDFs. People then copy and paste these strings into their databases, which breaks parsing. For example, compare these 2 outputs:
https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on
Perhaps atypical hyphens should be standardized to U+002D hyphens prior to parsing?
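To illustrate what such a pre-parsing step could look like, here is a minimal Python sketch. It is not taken from the parser's code: the names `HYPHEN_MAP` and `normalize_hyphens` are made up for this example, and it only maps the non-breaking hyphen (U+2011) to hyphen-minus (U+002D).

```python
# Hypothetical pre-parsing step; names are illustrative, not the parser's API.
# Translation table: non-breaking hyphen (U+2011) -> hyphen-minus (U+002D).
HYPHEN_MAP = str.maketrans({"\u2011": "-"})


def normalize_hyphens(name: str) -> str:
    """Return name with non-breaking hyphens replaced by hyphen-minus."""
    return name.translate(HYPHEN_MAP)


# The name-string from the example above, with U+2011 in both author names.
raw = "Passalus (Pertinax) gaboi Jim\u00e9nez\u2011Ferbans & Reyes\u2011Castillo, 2022"
clean = normalize_hyphens(raw)
print(clean)
```

Other dash characters could be added to the table later if they turn up in real data; `str.translate` makes a single pass over the string regardless of how many entries the table has.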
Here's the PDF, although they don't use the non-breaking hyphens in the web version.
Here are some other atypical hyphens that might also occasionally be an issue, introduced by publishers or bad OCR:
https://www.fileformat.info/info/unicode/category/Pd/list.htm
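For reference, the same category (Pd, "Dash Punctuation") can be listed locally. This is a rough sketch using Python's standard `unicodedata` module; the helper name `dash_punctuation` is made up for the example, and the exact set depends on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata


def dash_punctuation():
    """Return all characters whose Unicode general category is Pd."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]


# Print each code point with its Unicode name (if the database has one).
for ch in dash_punctuation():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

Both U+002D (hyphen-minus) and U+2011 (non-breaking hyphen) are in this category.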
If it hurts performance too much, it's probably okay to not bother with handling it. It's not a frequently encountered problem.