sul-dlss / dlme-transform

Transforms raw DLME metadata to DLME intermediate representation
Apache License 2.0

re-analyze AU Cairo collections for date range parsing #326

Closed ndushay closed 5 years ago

ndushay commented 5 years ago

AUC Collections

single dc:date field

  1. coll 1 - all same value: 1960; 1961; ... ; 1989 - parse_range macro will work
  2. coll 2 - parse_range macro will work
  3. coll 5 - yyyy, undated, yyyy; yyyy; yyyy - existing auc macro will work
  4. coll 9 - all same value: 1882 - parse_range macro will work
  5. coll 11 - yyyy, yyyy; yyyy; yyyy - existing auc macro will work
  6. coll 12 - all same value: 1980 - parse_range macro will work
  7. coll 15 - yyyy, yyyy-mm, [n.d. 1914?] - parse_range macro will work
  8. coll 16 - all same value: 1870s - parse_range macro will work eventually
  9. coll 19 - yyyy-mm - parse_range macro will work
  10. coll 24 - 2016-mm-dd (yes, valid per Jacob) - parse_range macro will work
  11. coll 26 - yyyy-mm-dd, yyyy-mm - parse_range macro will work
  12. coll 29 - yyyy-mm-dd - parse_range macro will work
  13. coll 30 - existing auc macro will work (one semicolon, rest ok for parse_range; yes Gregorian per Jacob)
  14. coll 31 - all same value: 1964-1967? - parse_range macro will work
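
A survey like the list above can be reproduced by reducing each raw dc:date value to its "shape" and counting the shapes per collection. The following is only a rough sketch in plain Ruby: `date_shape` and `shape_counts` are hypothetical helper names, not part of dlme-transform, and the sample values are illustrative rather than taken from the spreadsheet.

```ruby
# Hypothetical helper: replace the digits in a raw dc:date string with a
# placeholder pattern so that values of the same form collapse together.
def date_shape(value)
  value.strip
       .gsub(/\b\d{4}-\d{2}-\d{2}\b/, 'yyyy-mm-dd')
       .gsub(/\b\d{4}-\d{2}\b/, 'yyyy-mm')
       .gsub(/\b\d{4}s\b/, 'yyyys')
       .gsub(/\b\d{4}\b/, 'yyyy')
end

# Hypothetical helper: count how often each shape occurs in a list of values.
def shape_counts(values)
  values.each_with_object(Hash.new(0)) do |value, counts|
    counts[date_shape(value)] += 1
  end
end

# Illustrative values only, modelled on the forms listed in this issue.
sample = ['1914', '1914-05', '[n.d. 1914?]', '1960; 1961; 1962', '1870s', '1964-1967?']
shape_counts(sample).each { |shape, count| puts "#{count}  #{shape}" }
```

Any shape containing a semicolon would then point to the existing auc macro, and the rest to parse_range, matching the split in the list.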

multiple dc:date fields

ndushay commented 5 years ago

Given above, my suggestion is:

We have auc_common_config.

We have auc_date_range_from_all_config (or some better name) and use it to get the date range fields for all collections except coll 8. It uses the existing auc date parsing macro for the date range fields, utilizing all occurrences of dc:date.

We have auc_date_range_from_first_config (or some better name) and use it for getting date range fields only for coll 8. It uses the existing auc date parsing macro, but it only passes in the first dc:date field, using traject first to get that single value.
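
The difference between the two proposed configs can be sketched in plain Ruby. This is not the traject configuration itself: `years_in` is a hypothetical stand-in for the existing auc date parsing macro (it only pulls four-digit years out of a string and does not expand ranges), and the method names simply mirror the config names suggested above.

```ruby
# Hypothetical stand-in for the auc date parsing macro:
# extracts every standalone four-digit year from one dc:date value.
def years_in(value)
  value.scan(/\b\d{4}\b/).map(&:to_i)
end

# All collections except coll 8: use every occurrence of dc:date.
def date_range_from_all(dc_dates)
  dc_dates.flat_map { |value| years_in(value) }.uniq.sort
end

# coll 8 only: use the first dc:date value and ignore the rest.
def date_range_from_first(dc_dates)
  first = dc_dates.first
  first ? years_in(first).uniq.sort : []
end

# Sample values shaped like a coll 8 record with three dc:date fields.
coll8_dates = ['1899', '2012-09-12', '2016-04-06']
p date_range_from_all(coll8_dates)   # => [1899, 2012, 2016]
p date_range_from_first(coll8_dates) # => [1899]
```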

I have all the raw data loaded into a spreadsheet - one sheet for each collection. We can go over it together if you like.

ndushay commented 5 years ago

Jacob says, given draft PR #327 in branch "auc-redux":

There seem to be several problems with the AUC data. As far as I can tell, all date values are getting duplicated in the cho_date_range_norm field; the range is being generated twice. It might be that each dc:date value is generating its own range and the ranges aren’t getting merged; I'm not sure exactly. Also, I don’t think the coll8 problem got fixed: it still seems to be grabbing all dc:date values. Also, the pattern ‘1920-1990’ seems to be dropping the last year in the range.

oai:cdm15795.contentdm.oclc.org:p15795coll32/188
    ['1940s-1990', '1940; 1941; 1942; 1943; 1944; 1945; 1946; 1947; 1948; 1949; 1950; 1951; 1952; 1953; 1954; 1955; 1956; 1957; 1958; 1959; 1960; 1961; 1962; 1963; 1964; 1965; 1966; 1967; 1968; 1969; 1970; 1971; 1972; 1973; 1974; 1975; 1976; 1977; 1978; 1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986; 1987; 1988; 1989; 1990'] ========> [1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989]
    problem: ending year should be 1990
    other examples without ids:
        ['1920-1990', '1920; 1921; 1922; 1923; 1924; 1925; 1926; 1927; 1928; 1929; 1930; 1931; 1932; 1933; 1934; 1935; 1936; 1937; 1938; 1939; 1940; 1941; 1942; 1943; 1944; 1945; 1946; 1947; 1948; 1949; 1950; 1951; 1952; 1953; 1954; 1955; 1956; 1957; 1958; 1959; 1960; 1961; 1962; 1963; 1964; 1965; 1966; 1967; 1968; 1969; 1970; 1971; 1972; 1973; 1974; 1975; 1976; 1977; 1978; 1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986; 1987; 1988; 1989; 1990'] ========> [1920, 1990, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1920, 1990, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989]
        ['1883-1891', '1883; 1884; 1885; 1886; 1887; 1888; 1889; 1890; 1891'] ========> [1883, 1891, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1883, 1891, 1884, 1885, 1886, 1887, 1888, 1889, 1890]
oai:cdm15795.contentdm.oclc.org:p15795coll30/727
    ['1300-1301'] ========> [1300, 1301, 1300, 1301]
    problem: years are duplicated. There are many examples of this; it looks like all records (or almost all records), e.g.
        ['1973-05-16'] ========> [1973, 1973]
        ['1992-03-22'] ========> [1992, 1992]
        ['1974-09-30'] ========> [1974, 1974]
        ['1925-07-10'] ========> [1925, 1925]
        ['1949-04-01'] ========> [1949, 1949]
coll8
    ['1899', '2012-09-12', '2016-04-06'] ========> [1899, 2012, 2016, 1899]
    problem: it looks like all dates are still being parsed for coll8 instead of only the first dc:date value.
oai:cdm15795.contentdm.oclc.org:p15795coll6/219
    ['1960s?', '1960-1969'] ========> [1960, 1969, 1960, 1969]
    problem: not capturing the whole range and duplicating start and end years
ndushay commented 5 years ago

Problem 1: coll 8 shouldn't have values from anything but first date value.

Confirmed.

dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll8

got this for a record:

"cho_date":["1896","2012-10-15","2016-04-06"],
"cho_date_range_norm":[1896,2012,2016,1896],
"cho_date_range_hijri":[1313,1314,1315,1316,1317,1318,1319,1320,1321,1322,1323,1324,1325,1326,1327,1328,1329,1330,1331,1332,1333,1334,1335,1336,1337,1338,1339,1340,1341,1342,1343,1344,1345,1346,1347,1348,1349,1350,1351,1352,1353,1354,1355,1356,1357,1358,1359,1360,1361,1362,1363,1364,1365,1366,1367,1368,1369,1370,1371,1372,1373,1374,1375,1376,1377,1378,1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389,1390,1391,1392,1393,1394,1395,1396,1397,1398,1399,1400,1401,1402,1403,1404,1405,1406,1407,1408,1409,1410,1411,1412,1413,1414,1415,1416,1417,1418,1419,1420,1421,1422,1423,1424,1425,1426,1427,1428,1429,1430,1431,1432,1433,1434,1435,1436,1437,1438,1313,1314],

FIXED

Solution: don't define the cho_date_range_norm field in auc_common_config; leave it to the date range configs.

dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll8/data/0.xml

got this for a record:

"cho_date":["1897","2012-09-12","2016-04-06"],
"cho_date_range_norm":[1897],
"cho_date_range_hijri":[1314,1315],
ndushay commented 5 years ago

Problem 2: duplication of date values in cho_date_range_norm

Confirmed. Running only coll 9:

dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll9
"cho_date":["1882"],
"cho_date_range_norm":[1882,1882],
"cho_date_range_hijri":[1299,1300,1299,1300],

FIXED

Solution: don't define the cho_date_range_norm field in auc_common_config; leave it to the date range configs.
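
One plausible reading of why this fix works (the thread does not spell out the mechanism) is that values written to the same output field by more than one config step accumulate rather than replace each other. The sketch below models that in plain Ruby; `YEARS_STEP`, `transform`, and the two config arrays are hypothetical names for illustration, not the real traject configuration.

```ruby
# Hypothetical mapping step: appends the leading four-digit year of each
# dc:date value to the cho_date_range_norm output field.
YEARS_STEP = lambda do |dc_dates, output|
  output['cho_date_range_norm'].concat(dc_dates.map { |value| value[0, 4].to_i })
end

# Runs every step against one record; steps append to a shared output hash.
def transform(dc_dates, steps)
  output = Hash.new { |hash, field| hash[field] = [] }
  steps.each { |step| step.call(dc_dates, output) }
  output
end

common_config = [YEARS_STEP]     # assumption: the common config defined the field
collection_config = [YEARS_STEP] # assumption: so did the date range config

# Both configs define the field: every year shows up twice.
p transform(['1882'], common_config + collection_config)['cho_date_range_norm'] # => [1882, 1882]

# Field defined in one place only: each year appears once.
p transform(['1882'], collection_config)['cho_date_range_norm'] # => [1882]
```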

dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll9/data/0.xml

got this for a record:

"cho_date":["1882"],
"cho_date_range_norm":[1882],
"cho_date_range_hijri":[1299,1300],
ndushay commented 5 years ago

Problem 3: missing last year from hyphenated years

NOT confirmed. The last year is present in Jacob's output values (e.g. 1990 is the second element of the result for '1940s-1990'); the values just aren't sorted properly, which PR #324 addresses.

docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll6/data/37.xml
"cho_date": ['1940s-1990', '1940; 1941; 1942; 1943; 1944; 1945; 1946; 1947; 1948;   1949; 1950; 1951; 1952; 1953; 1954; 1955; 1956; 1957; 1958; 1959;   1960; 1961; 1962; 1963; 1964; 1965; 1966; 1967; 1968; 1969; 1970;   1971; 1972; 1973; 1974; 1975; 1976; 1977; 1978; 1979; 1980; 1981;   1982; 1983; 1984; 1985; 1986; 1987; 1988; 1989; 1990'] 
"cho_date_range_norm": [1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949,  1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,   1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,   1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,   1983, 1984, 1985, 1986, 1987, 1988, 1989, 1940, 1990, 1941, 1942,   1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953,   1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964,   1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975,   1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986,   1987, 1988, 1989]

FIXED

Solution: master branch now sorts values in date range arrays
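
The effect of sorting can be checked against the '1883-1891' output Jacob reported: the final year is in the array, just not in the final position. A minimal sketch in plain Ruby follows; `normalize_years` is a hypothetical helper, and the `uniq` call is an addition for illustration (the source only says the values are now sorted).

```ruby
# Output reported for ['1883-1891', '1883; 1884; ...; 1891'] before the fix.
reported = [1883, 1891, 1884, 1885, 1886, 1887, 1888, 1889, 1890,
            1883, 1891, 1884, 1885, 1886, 1887, 1888, 1889, 1890]

# Hypothetical helper: de-duplicate and sort a date range array.
def normalize_years(years)
  years.uniq.sort
end

p reported.last             # => 1890 (looks as if 1891 were missing)
p reported.max              # => 1891 (it is present, just out of order)
p normalize_years(reported) # => [1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891]
```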

docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll6/data/37.xml
"cho_date":["1960s?","1960-1969"],
"cho_date_range_norm":[1960,1961,1962,1963,1964,1965,1966,1967,1968,1969],
"cho_date_range_hijri":[1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389]
ndushay commented 5 years ago

Problem 4: missing years from range value:

problem: not capturing the whole range

"cho_date": ['1960s?', '1960-1969'],
"cho_date_range_norm": [1960, 1969, 1960, 1969]

FIXED
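
Capturing the whole range means expanding a value such as '1960-1969' into every year it covers, then merging the results from all dc:date values. The following is a rough sketch in plain Ruby under stated assumptions: `expand_years` and `merged_years` are hypothetical helpers, not the parse_range or auc macros, and they handle only the few forms shown here.

```ruby
# Hypothetical helper: expand one date string into the years it covers.
def expand_years(value)
  case value.strip
  when /\A(\d{4})\s*-\s*(\d{4})\??\z/ # e.g. '1960-1969', '1964-1967?'
    (Regexp.last_match(1).to_i..Regexp.last_match(2).to_i).to_a
  when /\A(\d{3})0s\??\z/             # e.g. '1960s', '1960s?'
    decade_start = Regexp.last_match(1).to_i * 10
    (decade_start..decade_start + 9).to_a
  when /\A(\d{4})\??\z/               # e.g. '1882'
    [Regexp.last_match(1).to_i]
  else
    []                                # unrecognised forms yield no years
  end
end

# Hypothetical helper: merge the expansions of all values into one sorted,
# de-duplicated range.
def merged_years(values)
  values.flat_map { |value| expand_years(value) }.uniq.sort
end

p merged_years(['1960s?', '1960-1969'])
# => [1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969]
```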

docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll6/data/37.xml
"cho_date":["1960s?","1960-1969"],
"cho_date_range_norm":[1960,1961,1962,1963,1964,1965,1966,1967,1968,1969],
"cho_date_range_hijri":[1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389]
ndushay commented 5 years ago

closing this per Jacob closing #363