Closed ndushay closed 5 years ago
Given the above, my suggestion is:
We have auc_common_config.
We have auc_date_range_from_all_config (or some better name) and use it to get date range fields for all colls except coll 8. It uses the existing auc date parsing macro for the date range fields, utilizing all occurrences of dc:date.
We have auc_date_range_from_first_config (or some better name) and use it to get date range fields only for coll 8. It uses the existing auc date parsing macro, but passes in only the first dc:date field, using traject first to get that single value.
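To make the proposed split concrete, here is a minimal Python sketch of the selection logic only (the real configs are traject configs, not Python). The function name `dates_for_range` and the set `FIRST_DATE_ONLY_COLLECTIONS` are hypothetical; the sketch assumes coll 8 is identified as `p15795coll8`, as in the docker commands in this thread.

```python
from typing import List

# Hypothetical: collections whose date range comes from the first dc:date only.
FIRST_DATE_ONLY_COLLECTIONS = {"p15795coll8"}


def dates_for_range(collection: str, dc_dates: List[str]) -> List[str]:
    """Return the dc:date values that should feed the date range fields."""
    if collection in FIRST_DATE_ONLY_COLLECTIONS:
        # Only the first dc:date occurrence; an empty list stays empty.
        return dc_dates[:1]
    # Every other collection uses all dc:date occurrences.
    return list(dc_dates)


print(dates_for_range("p15795coll8", ["1899", "2012-09-12", "2016-04-06"]))  # ['1899']
print(dates_for_range("p15795coll9", ["1882"]))  # ['1882']
```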
I have all the raw data loaded into a spreadsheet - one sheet for each collection. We can go over it together if you like.
Jacob says, given draft PR #327 in branch "auc-redux":
There seem to be several problems with the AUC data. As far as I can tell, all date values are getting duplicated in the cho_date_range_norm field; the range is being generated twice. It might be that each dc:date value is generating its own range and the ranges aren't getting merged; I'm not sure exactly. Also, I don't think the coll8 problem got fixed: it still seems to be grabbing all dc:date values. Also, the pattern '1920-1990' seems to be dropping the last year in the range.
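If the hypothesis about unmerged per-value ranges holds, the fix amounts to taking the union of the years from every dc:date value. The following Python sketch illustrates that idea only; `merge_year_ranges` is a hypothetical helper, not part of dlme-transform, and it assumes each dc:date value has already been parsed into a list of years.

```python
from typing import Iterable, List


def merge_year_ranges(ranges: Iterable[Iterable[int]]) -> List[int]:
    """Union the years from every per-value range, deduplicated and sorted."""
    merged = set()
    for years in ranges:
        merged.update(years)
    return sorted(merged)


# Two dc:date values describing the same span, e.g. '1883-1891' and
# '1883; 1884; ...; 1891', each parsed into its own list of years.
per_value = [list(range(1883, 1892)), list(range(1883, 1892))]
print(merge_year_ranges(per_value))
# [1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891]
```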
oai:cdm15795.contentdm.oclc.org:p15795coll32/188
['1940s-1990', '1940; 1941; 1942; 1943; 1944; 1945; 1946; 1947; 1948; 1949; 1950; 1951; 1952; 1953; 1954; 1955; 1956; 1957; 1958; 1959; 1960; 1961; 1962; 1963; 1964; 1965; 1966; 1967; 1968; 1969; 1970; 1971; 1972; 1973; 1974; 1975; 1976; 1977; 1978; 1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986; 1987; 1988; 1989; 1990'] ========> [1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989]
problem: ending year should be 1990
other examples without ids:
['1920-1990', '1920; 1921; 1922; 1923; 1924; 1925; 1926; 1927; 1928; 1929; 1930; 1931; 1932; 1933; 1934; 1935; 1936; 1937; 1938; 1939; 1940; 1941; 1942; 1943; 1944; 1945; 1946; 1947; 1948; 1949; 1950; 1951; 1952; 1953; 1954; 1955; 1956; 1957; 1958; 1959; 1960; 1961; 1962; 1963; 1964; 1965; 1966; 1967; 1968; 1969; 1970; 1971; 1972; 1973; 1974; 1975; 1976; 1977; 1978; 1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986; 1987; 1988; 1989; 1990'] ========> [1920, 1990, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1920, 1990, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989]
['1883-1891', '1883; 1884; 1885; 1886; 1887; 1888; 1889; 1890; 1891'] ========> [1883, 1891, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1883, 1891, 1884, 1885, 1886, 1887, 1888, 1889, 1890]
oai:cdm15795.contentdm.oclc.org:p15795coll30/727
['1300-1301'] ========> [1300, 1301, 1300, 1301]
problem: years are duplicated. There are many examples of this; it looks like all (or almost all) records are affected, e.g.
['1973-05-16'] ========> [1973, 1973]
['1992-03-22'] ========> [1992, 1992]
['1974-09-30'] ========> [1974, 1974]
['1925-07-10'] ========> [1925, 1925]
['1949-04-01'] ========> [1949, 1949]
coll8
['1899', '2012-09-12', '2016-04-06'] ========> [1899, 2012, 2016, 1899]
problem: it looks like all dates are still being parsed for coll8 instead of only the first dc:date value.
oai:cdm15795.contentdm.oclc.org:p15795coll6/219
['1960s?', '1960-1969'] ========> [1960, 1969, 1960, 1969]
problem: not capturing the whole range and duplicating start and end years
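For reference, the expansion expected for values like '1960-1969' and '1960s?' can be sketched in Python as below. This is an illustration under assumptions, not the parse_range macro or the existing auc macro: `expand_years` and both regular expressions are hypothetical, and a trailing '?' is simply ignored.

```python
import re
from typing import List

# Hypothetical patterns: 'yyyy-yyyy' and decades such as '1960s', optional '?'.
RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})\??$")
DECADE = re.compile(r"^(\d{3})0s\??$")


def expand_years(value: str) -> List[int]:
    """Expand a year range or decade into every year it covers, inclusive."""
    value = value.strip()
    match = RANGE.match(value)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        return list(range(start, end + 1))  # +1 keeps the final year
    match = DECADE.match(value)
    if match:
        start = int(match.group(1)) * 10
        return list(range(start, start + 10))
    return []  # anything else is left to other parsing rules


print(expand_years("1960-1969"))  # [1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969]
print(expand_years("1300-1301"))  # [1300, 1301]
```

Both '1960-1969' and '1960s?' expand to the same ten years here, so merging them afterwards yields the full range once.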
Confirmed.
dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll8
got this for a record:
"cho_date":["1896","2012-10-15","2016-04-06"],
"cho_date_range_norm":[1896,2012,2016,1896],
"cho_date_range_hijri":[1313,1314,1315,1316,1317,1318,1319,1320,1321,1322,1323,1324,1325,1326,1327,1328,1329,1330,1331,1332,1333,1334,1335,1336,1337,1338,1339,1340,1341,1342,1343,1344,1345,1346,1347,1348,1349,1350,1351,1352,1353,1354,1355,1356,1357,1358,1359,1360,1361,1362,1363,1364,1365,1366,1367,1368,1369,1370,1371,1372,1373,1374,1375,1376,1377,1378,1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389,1390,1391,1392,1393,1394,1395,1396,1397,1398,1399,1400,1401,1402,1403,1404,1405,1406,1407,1408,1409,1410,1411,1412,1413,1414,1415,1416,1417,1418,1419,1420,1421,1422,1423,1424,1425,1426,1427,1428,1429,1430,1431,1432,1433,1434,1435,1436,1437,1438,1313,1314],
Solution: don't have the cho_date_range_norm field in auc_common_config
dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll8/data/0.xml
got this for a record:
"cho_date":["1897","2012-09-12","2016-04-06"],
"cho_date_range_norm":[1897],
"cho_date_range_hijri":[1314,1315],
cho_date_range_norm
Confirmed. Running only coll 9:
dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll9
"cho_date":["1882"],
"cho_date_range_norm":[1882,1882],
"cho_date_range_hijri":[1299,1300,1299,1300],
Solution: don't have the cho_date_range_norm field in auc_common_config
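The duplication is consistent with the same field being populated by two configs whose values accumulate. The Python sketch below mimics that behaviour with a toy pipeline; it is not traject code, and `run_pipeline` and `years_from_dates` are hypothetical stand-ins for the config loading and the date parsing macro.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[dict], List[int]]]


def years_from_dates(record: dict) -> List[int]:
    # Stand-in for the date parsing macro: take the leading four-digit year.
    return [int(date[:4]) for date in record["dc_date"]]


def run_pipeline(steps: List[Step], record: dict) -> Dict[str, List[int]]:
    """Apply each step in order; values for a repeated field name accumulate."""
    output = defaultdict(list)
    for field, extractor in steps:
        output[field].extend(extractor(record))
    return dict(output)


common = [("cho_date_range_norm", years_from_dates)]
collection_specific = [("cho_date_range_norm", years_from_dates)]
record = {"dc_date": ["1882"]}

# Field defined in both configs: every year appears twice.
print(run_pipeline(common + collection_specific, record))
# {'cho_date_range_norm': [1882, 1882]}

# Field defined only in the collection-specific config: no duplicates.
print(run_pipeline(collection_specific, record))
# {'cho_date_range_norm': [1882]}
```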
dlme-transform (auc-redux)]$ docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll9/data/0.xml
got this for a record:
"cho_date":["1882"],
"cho_date_range_norm":[1882],
"cho_date_range_hijri":[1299,1300],
NOT confirmed; I even see the last year in Jacob's date values; it's just that the values aren't sorted properly (but with PR #324 that is addressed)
docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll6/data/37.xml
"cho_date": ['1940s-1990', '1940; 1941; 1942; 1943; 1944; 1945; 1946; 1947; 1948; 1949; 1950; 1951; 1952; 1953; 1954; 1955; 1956; 1957; 1958; 1959; 1960; 1961; 1962; 1963; 1964; 1965; 1966; 1967; 1968; 1969; 1970; 1971; 1972; 1973; 1974; 1975; 1976; 1977; 1978; 1979; 1980; 1981; 1982; 1983; 1984; 1985; 1986; 1987; 1988; 1989; 1990']
"cho_date_range_norm": [1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1940, 1990, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989]
Solution: master branch now sorts values in date range arrays
docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll6/data/37.xml
"cho_date":["1960s?","1960-1969"],
"cho_date_range_norm":[1960,1961,1962,1963,1964,1965,1966,1967,1968,1969],
"cho_date_range_hijri":[1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389]
problem: not capturing the whole range
"cho_date": ['1960s?', '1960-1969'],
"cho_date_range_norm": [1960, 1969, 1960, 1969]
docker run --rm -e SKIP_FETCH_DATA=true -v $(pwd)/.:/opt/traject -v $(pwd)/../dlme-metadata:/opt/traject/data -v $(pwd)/output:/opt/traject/output suldlss/dlme-transform:latest auc/p15795coll6/data/37.xml
"cho_date":["1960s?","1960-1969"],
"cho_date_range_norm":[1960,1961,1962,1963,1964,1965,1966,1967,1968,1969],
"cho_date_range_hijri":[1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389]
closing this per Jacob closing #363
AUC Collections
single dc:date field
1960; 1961; ... ; 1989 - parse_range macro will work
parse_range macro will work
yyyy, undated, yyyy; yyyy; yyyy - existing auc macro will work
1882 - parse_range macro will work
yyyy, yyyy; yyyy; yyyy - existing auc macro will work
1980 - parse_range macro will work
yyyy, yyyy-mm, [n.d. 1914?] - parse_range macro will work
1870s, parse_range macro will work eventually
yyyy-mm - parse_range macro will work
2016-mm-dd (yes, valid per Jacob) - parse_range macro will work
yyyy-mm-dd, yyyy-mm - parse_range macro will work
yyyy-mm-dd - parse_range macro will work
parse_range; yes Gregorian per Jacob)
1964-1967? - parse_range macro will work
multiple dc:date fields
dc:date occurrences
dc:date occurrences
dc:date field
dc:date occurrences
ca. late 19th century or early 20th century + yyyy; yyyy; ... - existing auc macro (take range if provided) should work
yyyy values thrown in, too
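One way to check which of the patterns listed above a given dc:date value falls under is a small classifier. The Python sketch below is only an aid for auditing the raw data; the labels, the regular expressions, and the `classify` function are assumptions made for illustration and are not taken from either macro.

```python
import re
from typing import Optional

# Hypothetical labels and patterns, checked in order.
PATTERNS = [
    ("yyyy-mm-dd", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("yyyy-mm", re.compile(r"^\d{4}-\d{2}$")),
    ("yyyy-yyyy", re.compile(r"^\d{4}-\d{4}\??$")),
    ("yyyy; yyyy; ...", re.compile(r"^\d{4}(\s*;\s*\d{4})+$")),
    ("decade", re.compile(r"^\d{3}0s\??$")),
    ("yyyy", re.compile(r"^\d{4}$")),
    ("undated", re.compile(r"^undated$", re.IGNORECASE)),
]


def classify(value: str) -> Optional[str]:
    """Return the label of the first matching pattern, or None if nothing matches."""
    value = value.strip()
    for label, pattern in PATTERNS:
        if pattern.match(value):
            return label
    return None


for sample in ["1882", "2016-04-06", "1960; 1961; 1962", "1870s", "[n.d. 1914?]"]:
    print(sample, "->", classify(sample))
```

Values that return None (such as '[n.d. 1914?]' here) are the ones worth reviewing by hand in the spreadsheet.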