Open NupurBharadwaj opened 5 years ago
@joeltg Do you have any sense of where in the pipeline the breakdown here is likely happening?
Hmm... Tika gives us a JSON metadata object for each document that may have several dates in it. There's no guarantee that any particular metadata key will be detected for any particular document, and there are several different date keys that will occasionally show up, including:
created
date
Creation-Date
Last-Modified
Last-Save-Date
The way we set publication dates now is just by checking for created, and if it's not there, leaving it null. Tika can be weird about what ends up getting detected, so it's possible that nothing's wrong and the documents just didn't have a date that Tika identified as created. But that feels a little unlikely, and it probably means we have to a) develop a more robust way of identifying publication dates (understanding when which keys are returned as what), b) develop a way of re-parsing metadata out of some/all documents, and c) run that as we iterate on a).
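To make step a) concrete, here is a rough sketch of a fallback over the date keys listed above. It is only an illustration: the priority order in CANDIDATE_KEYS is a guess (the thread lists the keys but does not rank them), the function names are hypothetical, and it assumes the values arrive as ISO 8601 strings or lists of such strings, which would need checking against real Tika output.

```python
from datetime import datetime
from typing import Any, Optional

# Hypothetical priority order. The keys come from the list above,
# but the ranking is an assumption, not something Tika defines.
CANDIDATE_KEYS = ["created", "Creation-Date", "date", "Last-Save-Date", "Last-Modified"]


def parse_date(value: Any) -> Optional[datetime]:
    """Parse one metadata value, returning None if it is not a usable date."""
    # Assumption: a key may hold a list of values; take the first that parses.
    if isinstance(value, list):
        for item in value:
            parsed = parse_date(item)
            if parsed is not None:
                return parsed
        return None
    if not isinstance(value, str) or not value.strip():
        return None
    text = value.strip()
    # Older Python versions reject a trailing "Z", so normalize it to an offset.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def pick_publication_date(metadata: dict) -> Optional[datetime]:
    """Return the first parseable date among the candidate keys, else None."""
    for key in CANDIDATE_KEYS:
        parsed = parse_date(metadata.get(key))
        if parsed is not None:
            return parsed
    return None
```

With something like this, a document that only has Last-Modified would still get a date instead of null, and logging which key was used would help with the "understanding when which keys are returned as what" part.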
Many documents don't have a publication date clearly delineated; at best they have the creation (= scanning) date. Manuals often do state one on one of the first few pages, and uploaders should be able to specify it explicitly.
For the second part of this bug: filters and extensions should fall back to working on the upload date if that is all that is available. (The publication date is certainly no later than that, and it's the date we can most strongly confirm.)
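A minimal sketch of that fallback, assuming each document is a dict with an optional publication_date and a required upload_date. Both field names and both function names are made up for illustration and are not the actual schema.

```python
from datetime import date
from typing import List, Optional


def effective_date(publication_date: Optional[date], upload_date: date) -> date:
    """Use the publication date when known, otherwise fall back to the upload date."""
    return publication_date if publication_date is not None else upload_date


def published_on_or_before(docs: List[dict], cutoff: date) -> List[dict]:
    """Filter documents by effective date; field names here are hypothetical."""
    return [
        doc
        for doc in docs
        if effective_date(doc.get("publication_date"), doc["upload_date"]) <= cutoff
    ]
```

Since the upload date is an upper bound on the publication date, a document that passes a "published on or before" filter on its upload date would also pass on its true publication date, so the fallback cannot produce false matches for that kind of filter, though it can miss documents.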
Possible steps to add publication dates (worth a separate issue):
None of the Cisco documents appear to have publication dates. Additionally, any filters or extensions attached to the publication date are not working as expected either.