relaton / relaton-iso

RelatonIso: ISO Standards metadata using the BibliographicItem model
BSD 2-Clause "Simplified" License

Process `--` in dates #96

opoudjis closed 3 years ago

opoudjis commented 3 years ago

Follow-on from https://github.com/metanorma/metanorma.com/issues/395

In relaton data scraped from ISO, `--` is interpreted as an em dash in much of the text, but as an en dash in the context of dates. Until now that interpretation has been done in metanorma, but we are streamlining the processing of text substitution, so it should instead be done upstream: it is an idiosyncrasy peculiar to ISO, not generic behaviour.

So, when you have an instance of `--` or ` -- ` (that is, with optional surrounding spaces) in dates, and only in dates, in ISO data, please replace it in the output with – (with no surrounding spaces).
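A minimal sketch of the requested substitution, assuming it is applied only to strings already known to be dates. The helper name `normalize_date_range` and the constant `EN_DASH` are hypothetical and not part of relaton-iso:

```ruby
# Hypothetical helper, not part of relaton-iso.
EN_DASH = "\u2013"

# Replace a doubled hyphen, with optional surrounding whitespace,
# by a single en dash. Single hyphens are left untouched.
# Assumption: the caller passes date strings only.
def normalize_date_range(text)
  text.gsub(/\s*--\s*/, EN_DASH)
end

puts normalize_date_range("2016--2017")   # range, no spaces
puts normalize_date_range("2016 -- 2017") # range, surrounding spaces
puts normalize_date_range("2012-02")      # single hyphen delimiter, unchanged
```

Because the pattern matches only two consecutive hyphens, the `\u002d` delimiter inside a single date such as `2012-02` passes through unchanged.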

andrew2net commented 3 years ago

@opoudjis relaton uses the Date class to format dates before storing them in a bibitem. For example, a scraped date could be the string "February 2012", so relaton converts it to the string "2012-02" using Date#strptime and Date#strftime, and stores it in a bibitem. There are other methods that use Date to parse and format strings stored in a bibitem. Date#strptime, Date#strftime, and Date#parse use the dash \u002d as a delimiter, so dates in relaton's bibitem are stored with a single dash \u002d without surrounding spaces. As I understand it, we need to render XML by replacing the doubled dashes `--` and ` -- ` with \u2013 in dates, right? We don't need to replace the single dash \u002d, do we? Why don't we use \u002d instead of \u2013?
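The conversion described here can be sketched with Ruby's standard `date` library. The format strings `"%B %Y"` and `"%Y-%m"` are assumptions chosen to match the "February 2012" example; the formats relaton actually uses are not stated in this thread:

```ruby
require "date"

# A scraped date string, as in the example above.
scraped = "February 2012"

# Parse month name and year; the day defaults to the 1st.
date = Date.strptime(scraped, "%B %Y")

# Format as year-month with a single hyphen (\u002d) delimiter.
formatted = date.strftime("%Y-%m")

puts formatted
```

The stored value contains only the single hyphen, which is why a doubled `--` would not come out of this path.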

opoudjis commented 3 years ago

There are two kinds of dash involved here.

The delimiter dash should indeed remain a hyphen. But the delimiter dash is not -- in ISO input anyway.

The -- is used in ISO input to indicate a range, as in "2016--2017". That dash needs to be converted into \u2013 for conventional typography, instead of me doing it inconsistently (because ISO -- means \u2014 in all other contexts.)

So yes, we replace `--` with \u2013, and no, we are not using \u002d, because that is already correctly rendered in ISO output as a single -.

andrew2net commented 3 years ago

I don't see how relaton-iso can encounter ' -- ' or '--'. All date inputs go through Date.parse and are output with a single hyphen. Date ranges like "2016--2017" are stored in separate fields (from: 2016, to: 2017).
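A rough illustration of that storage model, assuming a plain struct with `from` and `to` fields. `DateRange` and `to_display` are hypothetical names for this sketch, not relaton classes or methods:

```ruby
# Hypothetical stand-in for a stored date range; not a relaton class.
DateRange = Struct.new(:from, :to) do
  # Join the two fields with an en dash only when rendering.
  def to_display
    "#{from}\u2013#{to}"
  end
end

range = DateRange.new("2016", "2017")

puts range.from       # stored start of the range
puts range.to         # stored end of the range
puts range.to_display # the dash appears only in rendered output
```

With the range held as two fields, neither stored value contains `--`, so any dash between them would be chosen at render time.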