sul-dlss / dlme-transform

Transforms raw DLME metadata to DLME intermediate representation
Apache License 2.0

date parsing macros for getting gregorian and hijri date strings when both are present #289

Closed ndushay closed 5 years ago

ndushay commented 5 years ago

OPenn

Some OPenn collections have date strings like the following (actual strings illustrating all the patterns present at this time):

A.H. 986 (1578)
A.H. 899 (1493-1494)
A.H. 901-904 (1496-1499)
A.H. 1240 (1824)
A.H. 1258? (1842)
A.H. 1224, 1259 (1809, 1843)
A.H. 1123?-1225 (1711?-1810)
ca. 1670 (A.H. 1081)
1269 A.H. (1852)

We need a macro to get the Hijri string (without "A.H.") and a separate macro (or an argument passed to a single macro) to get the Gregorian string.
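A minimal Python sketch of the split described above, for illustration only; the project's actual macros are not shown in this issue. It assumes every dual-calendar string has the shape "outer (inner)" with "A.H." marking the Hijri half. The function names `split_hijri_gregorian`, `hijri_string`, and `gregorian_string` are hypothetical.

```python
import re

# Assumption: one parenthesized group at the end, no nested parentheses.
_OUTER_INNER = re.compile(r"^\s*(?P<outer>[^()]*?)\s*\((?P<inner>[^()]*)\)\s*$")
_AH = "A.H."


def split_hijri_gregorian(value):
    """Return (hijri, gregorian) strings, or None if the shape is not recognized."""
    m = _OUTER_INNER.match(value)
    if m is None:
        return None
    outer, inner = m.group("outer"), m.group("inner")
    # Whichever half carries the "A.H." marker is the Hijri half.
    if _AH in outer and _AH not in inner:
        hijri, gregorian = outer, inner
    elif _AH in inner and _AH not in outer:
        hijri, gregorian = inner, outer
    else:
        return None
    return hijri.replace(_AH, "").strip(), gregorian.strip()


def hijri_string(value):
    """Hijri half without the 'A.H.' marker, or None."""
    pair = split_hijri_gregorian(value)
    return pair[0] if pair else None


def gregorian_string(value):
    """Gregorian half, or None."""
    pair = split_hijri_gregorian(value)
    return pair[1] if pair else None


print(split_hijri_gregorian("A.H. 1123?-1225 (1711?-1810)"))  # ('1123?-1225', '1711?-1810')
print(split_hijri_gregorian("ca. 1670 (A.H. 1081)"))          # ('1081', 'ca. 1670')
```

The sketch leaves qualifiers such as "?" and "ca." in the returned strings, so parsing them into years would be a separate step.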

ndushay commented 5 years ago

AUC

only Gregorian dates provided (oai_dc)

Bodleian

only Gregorian dates provided (#246) (json)

Stanford

only Gregorian dates provided (#247) (mods)

ndushay commented 5 years ago

Cambridge Islamic

has some Hijri dates (see issue #182)

islamic-49.xml  <origDate calendar="Hijri-qamari" when="1231-01-01" instant="false">628 A.H. / 1231 C.E.</origDate>
islamic-52.xml  <origDate calendar="Gregorian" when="1592" unit="mm">1000 A.H. / 1592 C.E.</origDate>
islamic-53.xml  <origDate calendar="Hijri-qamari" when="1359" instant="false">760</origDate>
islamic-55.xml  <origDate calendar="Hijri-qamari">undated</origDate>|<origDate calendar="Gregorian"/>
islamic-57.xml  <origDate calendar="Hijri-qamari" from="1000-01-01" to="1610-12-31">  
<origDate calendar="Hijri-qamari" when="1231-01-01" instant="false">628 A.H. / 1231 C.E.</origDate>
islamic-59.xml  <origDate calendar="Hijri-qamari" when="1566" instant="false">974 AH / 1566 CE</origDate>
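As a rough illustration of how the "A.H. / C.E." text in these elements could be split, here is a Python sketch. It assumes the element is available as a standalone snippet without XML namespaces (real TEI files will likely need namespace handling), and the names `split_ah_ce` and `dates_from_origdate` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches "628 A.H. / 1231 C.E." as well as "974 AH / 1566 CE".
_AH_CE = re.compile(
    r"(?P<hijri>\d{1,4})\s*A\.?H\.?\s*/\s*(?P<gregorian>\d{1,4})\s*C\.?E\.?"
)


def split_ah_ce(text):
    """Return (hijri, gregorian) year strings, or None if the text has another shape."""
    m = _AH_CE.search(text or "")
    if m is None:
        return None
    return m.group("hijri"), m.group("gregorian")


def dates_from_origdate(xml_snippet):
    """Collect the calendar and when attributes plus any A.H./C.E. pair in the text."""
    elem = ET.fromstring(xml_snippet)
    return {
        "calendar": elem.get("calendar"),
        "when": elem.get("when"),
        "pair": split_ah_ce(elem.text),
    }


snippet = (
    '<origDate calendar="Hijri-qamari" when="1566" instant="false">'
    "974 AH / 1566 CE</origDate>"
)
print(dates_from_origdate(snippet))
# {'calendar': 'Hijri-qamari', 'when': '1566', 'pair': ('974', '1566')}
```

In the samples above, the `when` value matches the C.E. year even where `calendar` is "Hijri-qamari", so the sketch reports both attributes rather than trusting `calendar` alone.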

ndushay commented 5 years ago

Harvard IHP

See #183 (oai_dc). When a Hijri date is supplied, the Gregorian date is within square brackets, e.g.:

    1322 [1904]
    1317 [1899 or 1900]
    1288 [1871-72]
    1254 [1838 or 39]
    1894.
    1890-
    1886-1887
    1890?].
    1890?]

BUT: there are multiple dc:date fields, so it may be easier to get the two different calendar values that way.
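If the bracketed strings do need parsing, the pattern could be sketched in Python as below. This assumes a leading three- or four-digit Hijri year followed by a complete bracketed Gregorian expression; anything else, including the fragments with a stray closing bracket, is left unrecognized. The name `split_bracketed` is hypothetical.

```python
import re

# Assumption: "<hijri year> [<gregorian expression>]" with both brackets present.
_BRACKETED = re.compile(r"^\s*(?P<hijri>\d{3,4})\s*\[(?P<gregorian>[^\[\]]+)\]")


def split_bracketed(value):
    """Return (hijri, gregorian) strings, or None when there is no bracketed pair."""
    m = _BRACKETED.match(value)
    if m is None:
        return None
    return m.group("hijri"), m.group("gregorian").strip()


print(split_bracketed("1317 [1899 or 1900]"))  # ('1317', '1899 or 1900')
print(split_bracketed("1890?]."))              # None
```

The Gregorian half is returned verbatim ("1871-72", "1838 or 39"), since expanding abbreviated years is a separate question.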

ndushay commented 5 years ago

sakip-sabanci/KitapVehat:

This is the only one of the four sakip-sabanci collections with any Hijri date strings already in the data.

see #258 (oai_dc). Multiple dc:date fields in most (all?) records

998 H (1590 M)
1101 H (1689-1690 M)
1269, 1272, 1273 H (1853, 1855, 1856 M)
1194 H (1780 M)
887 H (1482 M)
1319 H (1901-1902 M)
1240, 1248 H (1825, 1832 M)
1076 H (1665-1666)
1080 H (1669-1670 M)

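    1335 civarı (meaning circa 1335)
    17. yüzyıl başı (meaning early 17th century)
    16. - 17. yüzyıl (meaning 16th - 17th century)
    18. yüzyılın ortaları (meaning mid-18th century)
    19. yüzyılın ikinci yarısı (meaning second half of the 19th century)
    muhtemelen 1790 civarı (meaning probably circa 1790)
    1958
    Tarihsiz (meaning undated)
    muhtemelen 1809-1810 sonrası (meaning probably after 1809-1810)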

Because the Turkish-language date strings vary so widely, I would recommend ignoring them for now. One strategy could be to:

  1. Look for four digits followed by a space and a capital H, then grab the four digits and pass them to the hijri_to_gregorian method.
  2. If there is no H, pass the four digits as a Gregorian year.
  3. Handle multi-year strings such as '1269, 1272, 1273 H'.
  4. Ignore the rest for now.
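The steps above can be sketched in Python as a classifier that only decides which calendar a string belongs to and which years it names. Two choices here are assumptions rather than part of the proposal: the sketch accepts three-digit Hijri years as well, because the samples include "998 H" and "887 H", and it treats a year as Gregorian only when the string is a bare four-digit number. The conversion step (the hijri_to_gregorian method mentioned above) is not shown, and `classify_date` is a hypothetical name.

```python
import re

# Steps 1 and 3: one or more comma-separated years followed by a space and "H".
_HIJRI_YEARS = re.compile(r"^\s*(\d{3,4}(?:\s*,\s*\d{3,4})*)\s+H\b")
# Step 2: a bare four-digit year with nothing else in the string.
_BARE_YEAR = re.compile(r"^\s*(\d{4})\s*$")


def classify_date(value):
    """Return ('hijri', years), ('gregorian', years), or None (step 4: ignore)."""
    m = _HIJRI_YEARS.match(value)
    if m:
        years = [int(y) for y in re.split(r"\s*,\s*", m.group(1))]
        return ("hijri", years)
    m = _BARE_YEAR.match(value)
    if m:
        return ("gregorian", [int(m.group(1))])
    return None


print(classify_date("1269, 1272, 1273 H (1853, 1855, 1856 M)"))  # ('hijri', [1269, 1272, 1273])
print(classify_date("1958"))                                     # ('gregorian', [1958])
print(classify_date("17. yüzyıl başı"))                          # None
```

Each Hijri year returned this way could then be passed to the conversion method, while strings that return None are the Turkish-language cases set aside for now.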

ndushay commented 5 years ago

Harvard variant to be handled with separate macro. See comment on #183