jacquerie closed 7 years ago
@tsgit at some point produced an exhaustive list of unicode dashes that I guess we should support in general.
So here is the old email:
There are multiple forms of unicode hyphens, e.g.
U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2014 EM DASH
U+2015 HORIZONTAL BAR
and more obscure things like
U+058A ARMENIAN HYPHEN
U+05BE HEBREW PUNCTUATION MAQAF
However, for many practical concerns, like reference linking and citation counting, it is important that only the ASCII hyphen-minus is used. I fixed some 440 records today which had an EN DASH in the page range in 773__c, e.g.
Changed field 773__c from '489–510' to '489-510'
Congratulations if you can spot the difference on your display with your choice of font.
I created a bibcheck rule for replacement of the most frequent offender, the en dash, in page ranges, see https://github.com/inspirehep/inspire/pull/174. However, this problem goes beyond just page ranges. There are other fields in 773 with an en dash in them
https://inspirehep.net/search?p=773%3A*%E2%80%93*
and many other MARC tags where the same applies.
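A minimal sketch of what normalizing such a field could look like, assuming Python 3 and covering only the code points listed in the email above. The names `DASH_CHARS` and `normalize_dashes` are made up for illustration and are not taken from the bibcheck rule or any INSPIRE code.

```python
# Hypothetical helper, not part of the INSPIRE code base.
# The code points are the ones listed in the email above.
DASH_CHARS = (
    "\u2010"  # HYPHEN
    "\u2011"  # NON-BREAKING HYPHEN
    "\u2012"  # FIGURE DASH
    "\u2013"  # EN DASH
    "\u2014"  # EM DASH
    "\u2015"  # HORIZONTAL BAR
    "\u058a"  # ARMENIAN HYPHEN
    "\u05be"  # HEBREW PUNCTUATION MAQAF
)

# str.translate expects a mapping from code point to replacement.
_DASH_TABLE = {ord(ch): "-" for ch in DASH_CHARS}


def normalize_dashes(value):
    """Replace the dash-like characters above with ASCII hyphen-minus."""
    return value.translate(_DASH_TABLE)


# The 773__c example from the email: EN DASH in the page range.
print(normalize_dashes("489\u2013510"))
```

A translation table maps each listed character to exactly one hyphen-minus, so the length of the string does not change and values that are already ASCII pass through untouched.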
What's labs doing about either normalizing such fields or defining character equivalence classes in lookups?
The unicode tables themselves are useful, and so is the link you dug out. I particularly like the "See Also" feature at fileformat.info, e.g.
http://www.fileformat.info/info/unicode/char/2d/index.htm, similarly for apostrophe http://www.fileformat.info/info/unicode/char/0027/index.htm and space http://www.fileformat.info/info/unicode/char/0020/index.htm
There are also categories, http://www.fileformat.info/info/unicode/category/index.htm, e.g. http://www.fileformat.info/info/unicode/category/Pd/list.htm and http://www.fileformat.info/info/unicode/category/Zs/list.htm
Interestingly, the "Hyphen Bullet" http://www.fileformat.info/info/unicode/char/2043/index.htm is in category Punctuation Other http://www.fileformat.info/info/unicode/category/Po/list.htm, not in Punctuation Dash.
In French, lists are traditionally done with dashes instead of bullets, so I guess that's the proper unicode character for it.
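The category lookup can also be done locally with Python's standard `unicodedata` module. This is only a sketch, assuming Python 3; `is_dash_punctuation` is a made-up helper name.

```python
import unicodedata


def is_dash_punctuation(ch):
    """True if the character is in general category Pd (Punctuation, Dash)."""
    return unicodedata.category(ch) == "Pd"


# A few of the code points discussed in this thread.
for cp in (0x002D, 0x2013, 0x05BE, 0x2043):
    ch = chr(cp)
    print("U+%04X %s %s" % (cp, unicodedata.category(ch), unicodedata.name(ch)))
```

A check based on category Pd alone would not catch U+2043, since it is classified as Po, so an equivalence class built this way would probably need that character added explicitly.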
> What's labs doing about either normalizing such fields or defining character equivalence classes in lookups?
On labs, no field should contain dashes as a range separator. Instead, fields have been split into start and end of range (e.g. https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.yml#L1145-L1154 for the publication note). So the handling of dashes has to happen when writing into the record.
What about using unidecode here + post-processing for stripping repeated dashes? artid should be ASCII AFAIK.
In [1]: from unidecode import unidecode
In [2]: dashes = (u'\u002d', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014', u'\u2015', u'\u058a', u'\u05be', u'\u2043')
In [3]: [unidecode(dash) for dash in dashes]
Out[3]: ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']
U+05BE looks like a bug in unidecode. I sent a PR in https://github.com/avian2/unidecode/pull/12.
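The post-processing step for stripping repeated dashes could be as small as one regular expression. A sketch, assuming Python 3 and taking the strings from `Out[3]` above as input so that it runs without unidecode installed; `collapse_dashes` is a hypothetical name.

```python
import re


def collapse_dashes(value):
    """Collapse any run of ASCII hyphens into a single hyphen."""
    return re.sub(r"-{2,}", "-", value)


# The strings unidecode returned in the session above.
transliterated = ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']
print([collapse_dashes(s) for s in transliterated])
```

This would also collapse a double hyphen that was already present in the input, which seems acceptable for page ranges and artids but may not be for free-text fields.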
This issue was moved to inspirehep/inspire-schemas#212
Expected Behavior
split_page_artid should handle unicode dashes like \u2013 and \u2010. See: https://github.com/inspirehep/inspire-next/pull/2410#issuecomment-306735664
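To illustrate the expected behavior, here is a rough sketch of splitting a page range on either an ASCII hyphen or one of the unicode dashes U+2010 to U+2015. It assumes Python 3 and is not the actual `split_page_artid` implementation: `split_page_range`, its return shape and its handling of non-range input are assumptions made for this example.

```python
import re

# ASCII hyphen-minus plus the range U+2010..U+2015 (HYPHEN .. HORIZONTAL BAR).
_DASH_CLASS = "[-\u2010-\u2015]"
_RANGE_RE = re.compile(r"^\s*(\w+)\s*" + _DASH_CLASS + r"+\s*(\w+)\s*$")


def split_page_range(value):
    """Hypothetical helper: return (start, end) for a dash-separated range.

    Input without a recognized dash is returned as (value, None); that
    fallback is an assumption of this sketch.
    """
    match = _RANGE_RE.match(value)
    if match is None:
        return value.strip(), None
    return match.group(1), match.group(2)


print(split_page_range("489\u2013510"))
print(split_page_range("489\u2010510"))
```

Matching one or more dash characters between the two parts means that a transliterated `--` is treated the same way as a single dash.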