inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

utils: split_page_artid should handle unicode dashes #2412

Closed jacquerie closed 7 years ago

jacquerie commented 7 years ago

Expected Behavior

split_page_artid should handle Unicode dashes such as \u2013 (EN DASH) and \u2010 (HYPHEN).

See: https://github.com/inspirehep/inspire-next/pull/2410#issuecomment-306735664

kaplun commented 7 years ago

@tsgit at some point produced an exhaustive list of Unicode dashes, which I guess we should support in general.

tsgit commented 7 years ago

So here is the old email:

There are multiple forms of unicode hyphens, e.g.

U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2014 EM DASH
U+2015 HORIZONTAL BAR

and more obscure things like

U+058A ARMENIAN HYPHEN
U+05BE HEBREW PUNCTUATION MAQAF

However, for many practical concerns, like reference linking and citation counting, it is important that only the ASCII hyphen-minus is used. I fixed some 440 records today which had an EN DASH in the page range in 773__c, e.g.

Changed field 773__c from '489–510' to '489-510'

Congratulations if you can spot the difference on your display with your choice of font.

I created a bibcheck rule that replaces the most frequent offender, the en dash, in page ranges; see https://github.com/inspirehep/inspire/pull/174. However, this problem goes beyond page ranges. There are other fields in 773 with an en dash in them

https://inspirehep.net/search?p=773%3A*%E2%80%93*

and many other MARC tags where the same applies.

What's labs doing about either normalizing such fields or defining character equivalence classes in lookups?
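A minimal sketch of the normalization option, assuming the dash code points listed in the email above are the ones to fold into the ASCII hyphen-minus. The names `DASHES`, `DASH_TABLE` and `normalize_dashes` are hypothetical and are not part of inspire-next or the bibcheck rule.

```python
# Code points taken from the list in the email above. The set is an
# assumption for illustration, not an exhaustive list of Unicode dashes.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u058a\u05be"

# Translation table mapping each listed dash to U+002D HYPHEN-MINUS.
DASH_TABLE = str.maketrans({dash: "-" for dash in DASHES})


def normalize_dashes(value):
    """Return value with the listed Unicode dashes replaced by '-'."""
    return value.translate(DASH_TABLE)


# The 773__c example from the email: EN DASH becomes hyphen-minus.
print(normalize_dashes("489\u2013510"))  # 489-510
```

Normalizing on write keeps lookups simple, at the cost of losing the original character; an equivalence class at lookup time would preserve it.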

tsgit commented 7 years ago

The unicode tables themselves are useful, and so is the link you dug out. I particularly like the "See Also" feature at fileformat.info, e.g.

http://www.fileformat.info/info/unicode/char/2d/index.htm
similarly for apostrophe: http://www.fileformat.info/info/unicode/char/0027/index.htm
and space: http://www.fileformat.info/info/unicode/char/0020/index.htm

There are categories: http://www.fileformat.info/info/unicode/category/index.htm
e.g.
http://www.fileformat.info/info/unicode/category/Pd/list.htm
http://www.fileformat.info/info/unicode/category/Zs/list.htm

tsgit commented 7 years ago

interestingly the "Hyphen Bullet"

http://www.fileformat.info/info/unicode/char/2043/index.htm

is in category Punctuation Other, not in Punctuation Dash

http://www.fileformat.info/info/unicode/category/Po/list.htm
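The category observation can be checked with Python's standard `unicodedata` module. This is only a sketch: `is_dash` is a hypothetical helper, and the reported categories depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_dash(char):
    """Return True if char is in the Unicode category Pd (Punctuation, Dash)."""
    return unicodedata.category(char) == "Pd"


# A few of the characters mentioned in this thread.
candidates = ["\u002d", "\u2010", "\u2013", "\u058a", "\u05be", "\u2043"]

for char in candidates:
    # Print code point, official name, and general category.
    print("U+%04X" % ord(char), unicodedata.name(char), unicodedata.category(char))
```

A filter based on category Pd alone would therefore miss the Hyphen Bullet, which is reported as Po.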

michamos commented 7 years ago

In French, lists are traditionally written with dashes instead of bullets, so I guess that's the proper Unicode character for it.

> What's labs doing about either normalizing such fields or defining character equivalence classes in lookups?

On labs, no field should contain dashes as a range separator. Instead, fields have been split into start and end of range (e.g. https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.yml#L1145-L1154 for the publication note). So the handling of dashes has to happen when writing into the record.
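A rough sketch of what handling the dash at write time could look like, assuming a range is two word-like tokens separated by one or more dashes. `split_range`, `DASH_CLASS` and `RANGE_RE` are hypothetical names; this is not the actual split_page_artid implementation.

```python
import re

# Dash characters accepted as a range separator (an assumed subset of the
# list given earlier in the thread).
DASH_CLASS = "[-\u2010\u2011\u2012\u2013\u2014\u2015]"

# Two word-like tokens separated by one or more dashes, optional whitespace.
RANGE_RE = re.compile(r"^\s*(\w+)\s*" + DASH_CLASS + r"+\s*(\w+)\s*$")


def split_range(value):
    """Split a range such as '489\u2013510' into (start, end).

    Returns (value, None) when no dash-separated range is found.
    """
    match = RANGE_RE.match(value)
    if match is None:
        return value.strip(), None
    return match.group(1), match.group(2)


print(split_range("489\u2013510"))  # ('489', '510')
print(split_range("123"))           # ('123', None)
```

With the range stored as separate start and end values, the record itself never needs to contain a dash.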

michamos commented 7 years ago

What about using unidecode here, plus post-processing to strip repeated dashes? artid should be ASCII, AFAIK.

In [1]: from unidecode import unidecode

In [2]: dashes = (u'\u002d', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014', u'\u2015', u'\u058a', u'\u05be', u'\u2043')

In [3]: [unidecode(dash) for dash in dashes]
Out[3]: ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']

The empty result for U+05BE looks like a bug in unidecode. I sent a PR in https://github.com/avian2/unidecode/pull/12.
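The post-processing step could be sketched as follows, assuming the text has already been passed through unidecode. `collapse_dashes` is a hypothetical helper, and the input list is simply the Out[3] result copied from the session above, so the sketch does not import unidecode itself.

```python
import re


def collapse_dashes(text):
    """Replace any run of two or more '-' with a single '-'."""
    return re.sub(r"-{2,}", "-", text)


# Transliterations copied from Out[3] above.
transliterated = ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']

print([collapse_dashes(item) for item in transliterated])
# ['-', '-', '-', '-', '-', '-', '-', '-', '', '-']
```

Collapsing every run would also alter values that legitimately contain a double hyphen, so this only makes sense for fields where that cannot occur.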

jacquerie commented 7 years ago

This issue was moved to inspirehep/inspire-schemas#212