prior-art-archive / priorartarchive.org

Prior Art Archive Site
https://priorartarchive.org
GNU General Public License v2.0

Documents are missing a publication date #14

Open NupurBharadwaj opened 5 years ago

NupurBharadwaj commented 5 years ago

None of the Cisco documents appear to have publication dates. Additionally, any filters or extensions attached to the publication date are not working as expected either.

slifty commented 5 years ago

@joeltg Do you have any sense of where in the pipeline the breakdown here is likely happening?

joeltg commented 5 years ago

Hmm... Tika gives us a JSON metadata object for each document that may have several dates in it. There's no guarantee that any particular metadata key will be detected for any particular document, and there are several different date keys that will occasionally show up, including

The way we set publication dates now is just by checking for created and, if it's not there, leaving the date null. Tika can be weird about what ends up getting detected, so it's possible that nothing's wrong and the documents just didn't have a date that Tika identified as created. But that feels a little unlikely, and it probably means we have to a) develop a more robust way of identifying publication dates (understanding which keys are returned, when, and in what form), b) develop a way of re-parsing metadata out of some/all documents, and c) run that as we iterate on a).
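A minimal sketch of what (a) might look like, assuming the Tika metadata arrives as a plain dict and that dates are ISO 8601 strings. The key names in CANDIDATE_KEYS and their priority order are assumptions for illustration, not the keys the pipeline actually sees, and pick_publication_date is a hypothetical helper. It also returns which key matched, so the result can be audited when re-parsing.

```python
from datetime import datetime
from typing import Optional, Tuple

# Assumed priority order of Tika metadata keys (illustrative only;
# the real set of keys would need to be confirmed against actual output).
CANDIDATE_KEYS = [
    "created",
    "dcterms:created",
    "meta:creation-date",
    "Creation-Date",
    "date",
]


def parse_iso(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None if it does not parse."""
    try:
        # Older Python versions do not accept a trailing 'Z', so normalize it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def pick_publication_date(metadata: dict) -> Tuple[Optional[datetime], Optional[str]]:
    """Return (date, key) for the first candidate key that parses, else (None, None)."""
    for key in CANDIDATE_KEYS:
        raw = metadata.get(key)
        # Some metadata values may be lists; take the first entry if so.
        if isinstance(raw, list):
            raw = raw[0] if raw else None
        if not isinstance(raw, str):
            continue
        parsed = parse_iso(raw.strip())
        if parsed is not None:
            return parsed, key
    return None, None
```

With this shape, a document that lacks created but carries another recognized key still gets a date, and documents with nothing usable stay null as they do now.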

metasj commented 5 years ago

Many documents don't have a publication date clearly delineated; at best they have the creation (= scanning) date. Manuals often do state one on one of the first few pages, and uploaders should be able to specify it explicitly.

For the second part of this bug: filters and extensions should fall back to working on the upload date if that is all that is available. (The publication date is certainly no later than that, and it's the date we can most strongly confirm.)
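The fallback could be sketched roughly like this. Document, effective_date, and filter_on_or_before are hypothetical names for illustration, not the site's actual schema or filter code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Document:
    # Hypothetical record shape, for illustration only.
    title: str
    upload_date: date
    publication_date: Optional[date] = None


def effective_date(doc: Document) -> date:
    """Use the publication date when known, otherwise fall back to the upload date."""
    return doc.publication_date or doc.upload_date


def filter_on_or_before(docs: List[Document], cutoff: date) -> List[Document]:
    """Keep documents whose effective date is on or before the cutoff."""
    return [d for d in docs if effective_date(d) <= cutoff]
```

Because the upload date is an upper bound on the publication date, a document kept by this filter on its upload date alone would also be kept if its true publication date were known.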

Possible steps to add publication dates (worth a separate issue):

  1. Give uploaders a way to specify publication date
  2. Find a library for automatic extraction from a PDF (ideally one that notes which of its tools/patterns was used to return its estimate).
  3. Have a qualifier by 'Publication Date' indicating how it was estimated (self report; contributing site/library; extraction tool; &c)
  4. Where a publication date exists, show 'Upload Date' next to it.
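As a rough sketch of steps 3 and 4, a publication date could be stored together with a qualifier for how it was estimated, and displayed next to the upload date. All names here (DateSource, PublicationDate, describe_dates) and the display format are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class DateSource(Enum):
    # How the publication date was estimated (categories from the list above).
    SELF_REPORT = "self report"
    CONTRIBUTOR = "contributing site/library"
    EXTRACTION_TOOL = "extraction tool"


@dataclass
class PublicationDate:
    value: date
    source: DateSource
    detail: Optional[str] = None  # e.g. which tool or pattern produced the estimate


def describe_dates(upload_date: date, publication: Optional[PublicationDate] = None) -> str:
    """Build a display string; the upload date is always shown."""
    upload = f"Upload Date: {upload_date.isoformat()}"
    if publication is None:
        return upload
    qualifier = publication.source.value
    if publication.detail:
        qualifier += f", {publication.detail}"
    return f"Publication Date: {publication.value.isoformat()} ({qualifier}) | {upload}"
```

For example, describe_dates(date(2019, 5, 1), PublicationDate(date(2001, 1, 2), DateSource.SELF_REPORT)) would give "Publication Date: 2001-01-02 (self report) | Upload Date: 2019-05-01".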