prior-art-archive / priorartarchive.org

Prior Art Archive Site
https://priorartarchive.org
GNU General Public License v2.0

Generate a more informative title when no title-metadata is found #24

Open metasj opened 5 years ago

metasj commented 5 years ago

Currently the most common doc title is 'Title', generated when no title metadata is found.

We should try to generate a better default title, looking for related metadata fields or perhaps the opening string of the text. If nothing seems to work, fall back to current behavior.
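A minimal sketch of that fallback chain, not the project's actual parser code. The metadata keys in `CANDIDATE_FIELDS`, the 80-character cap, and the function name `infer_title` are all assumptions made for illustration. It returns `None` when nothing usable is found, so the caller can keep the current behavior.

```python
import re
from typing import Mapping, Optional

# Hypothetical metadata keys, checked in order; the real parser's field names may differ.
CANDIDATE_FIELDS = ("title", "dc:title", "subject")
# Assumed cap on the length of a title taken from the body text.
MAX_TITLE_LENGTH = 80


def infer_title(metadata: Mapping[str, str], text: str) -> Optional[str]:
    """Return a guessed title, or None so the caller keeps its current fallback."""
    # First preference: any non-empty related metadata field.
    for field in CANDIDATE_FIELDS:
        value = (metadata.get(field) or "").strip()
        if value:
            return value

    # Otherwise use the first non-empty line of the body, with whitespace collapsed.
    for line in text.splitlines():
        candidate = re.sub(r"\s+", " ", line).strip()
        if candidate:
            if len(candidate) > MAX_TITLE_LENGTH:
                # Cut at a word boundary where possible and mark the truncation.
                candidate = candidate[:MAX_TITLE_LENGTH].rsplit(" ", 1)[0] + "..."
            return candidate

    # Nothing usable was found.
    return None
```

For example, `infer_title({}, "\n  Hello   world \nsecond line")` skips the blank first line and returns the collapsed opening line.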

slifty commented 5 years ago

This is related to / came out of the discussion around issue #33

reefdog commented 5 years ago

> Currently the most common doc title is 'Title', generated when no title metadata is found.

Are we actually setting the document's title value to "Title" somewhere? I can't see where this happens. I do see some title negotiation in the file-parser, but it falls back to null rather than "Title". (Postgres allows Documents.title to be null, which I just verified by uploading a PDF with no Title metadata.)

So do we want to:

  1. Actually generate and store a title in the database, or
  2. Keep the title field null unless explicitly found or edited, but generate a guessed title (using metadata, first few words, whatever) wherever we display the file's name?

In other words, should this generation happen while creating the document or only while displaying it?

reefdog commented 5 years ago

To put it another way, we generally have three content scenarios with distinct storage/display needs:

  1. We are very confident in the title: It came from explicit title metadata fields or user-added/edited titles (once we do #18). It is stored in Documents.title and displayed directly.
  2. We found something we think will work as a title: There was no explicit title, but we inferred one from filename, metadata, first few words of body, or… ? Open questions:
     a. What is the actual order of inference?
     b. Do we calculate these on display, or while creating the document?
     c. Do we need to alert the user that this was an inferred value?
  3. We can't find anything that will work as a title. Display a placeholder (e.g., "Untitled Document") and clearly indicate to the user that this is a placeholder.
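
The three scenarios could be sketched as a display-time helper like the one below. This is only an illustration: `DisplayTitle`, `resolve_display_title`, and the `source` labels are hypothetical names, and it assumes the inferred value (if any) is computed elsewhere and passed in. Nothing here writes back to Documents.title.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder text for scenario 3.
PLACEHOLDER = "Untitled Document"


@dataclass(frozen=True)
class DisplayTitle:
    text: str
    source: str  # "explicit", "inferred", or "placeholder" (hypothetical labels)


def resolve_display_title(stored_title: Optional[str],
                          inferred_title: Optional[str] = None) -> DisplayTitle:
    """Pick what to show for a document without modifying stored data."""
    # Scenario 1: a title from explicit metadata or a user edit.
    if stored_title and stored_title.strip():
        return DisplayTitle(stored_title.strip(), "explicit")
    # Scenario 2: no explicit title, but an inferred candidate exists.
    if inferred_title and inferred_title.strip():
        return DisplayTitle(inferred_title.strip(), "inferred")
    # Scenario 3: nothing usable, so show a clearly marked placeholder.
    return DisplayTitle(PLACEHOLDER, "placeholder")
```

Returning the `source` alongside the text would let the display layer decide whether to flag an inferred value or style the placeholder differently.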

metasj commented 5 years ago

Thoughts:

1: yes, currently works
2: resolve later. (I'd say: first resolve how uploaders or admins can update/correct a title; then have a script that does this based on {order of inference}.)
3: yes, display "Untitled Document", say in italics. don't update any metadata field

reefdog commented 5 years ago

Excellent! So scenario 1 works, scenario 3 has been added in #5, the first part of scenario 2 (manual fixing) is in process in #6, and so this issue is specifically about the second part of scenario 2 (inference). 👍