Open metasj opened 5 years ago
This is related to / came out of the discussion around issue #33
Currently the most common doc title is 'Title', generated when no title metadata is found.
Are we actually setting the document's title value to "Title" somewhere? I can't see where this happens. I do see some title negotiation in the file-parser, but it falls back to null rather than "Title". (Postgres allows Documents.title to be null, which I just verified by uploading a PDF with no Title metadata.)
So do we want to:
title field null unless explicitly found or edited, but generate a guessed title (using metadata, first few words, whatever) wherever we display the file's name?
In other words, should this generation happen while creating the document or only while displaying it?
To put it another way, we generally have three content scenarios with distinct storage/display needs:
Documents.title and displays directly.
Thoughts:
1: yes, currently works
2: resolve later. (I'd say: first resolve how uploaders or admins can update/correct a title; then have a script that does this based on {order of inference}.)
3: yes, display "Untitled Document", say in italics. Don't update any metadata field.
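For illustration, scenario 3 could look roughly like the sketch below: the stored title stays null and the placeholder exists only at display time. This is a minimal Python sketch, not the project's actual code; `display_title` and `UNTITLED_PLACEHOLDER` are hypothetical names, and the returned flag is just one assumed way to let the UI decide on italics.

```python
from typing import Optional, Tuple

# Hypothetical constant: the text shown when a document has no stored title.
UNTITLED_PLACEHOLDER = "Untitled Document"


def display_title(stored_title: Optional[str]) -> Tuple[str, bool]:
    """Return (text_to_show, is_placeholder) for a document listing.

    The stored value is never modified here; a null or blank title
    only changes what is rendered, not what is saved.
    """
    if stored_title is not None and stored_title.strip():
        return stored_title, False
    # No usable title: show the placeholder and flag it so the caller
    # can style it differently (for example, in italics).
    return UNTITLED_PLACEHOLDER, True
```

A caller would render the text in italics when the second element is true, and leave Documents.title untouched either way.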
Excellent! So scenario 1 works, scenario 3 has been added in #5, the first part of scenario 2 (manual fixing) is in process in #6, and so this issue is specifically about the second part of scenario 2 (inference). 👍
Currently the most common doc title is 'Title', generated when no title metadata is found.
We should try to generate a better default title, looking for related metadata fields or perhaps the opening string of the text. If nothing seems to work, fall back to current behavior.
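One possible shape for that inference order, as a hedged Python sketch rather than the project's implementation: try a few metadata fields first, then the opening words of the text, and return nothing if neither works so the caller can keep the current behavior. The function name `infer_title`, the metadata keys, the list of placeholder values, and the eight-word cutoff are all assumptions made for the example.

```python
from typing import Any, Mapping, Optional

# Assumed metadata keys to try, in priority order (hypothetical names).
METADATA_KEYS = ("title", "subject")

# Assumed junk values that should count as "no title found".
PLACEHOLDER_VALUES = {"title", "untitled"}


def _clean(value: Any) -> Optional[str]:
    """Collapse whitespace; return None for empty or placeholder values."""
    if not isinstance(value, str):
        return None
    collapsed = " ".join(value.split())
    if not collapsed or collapsed.lower() in PLACEHOLDER_VALUES:
        return None
    return collapsed


def infer_title(
    metadata: Mapping[str, Any], text: str, max_words: int = 8
) -> Optional[str]:
    """Guess a title from metadata, then from the opening words of the text.

    Returns None when nothing usable is found, leaving the fallback
    (whatever the application currently does) to the caller.
    """
    for key in METADATA_KEYS:
        candidate = _clean(metadata.get(key))
        if candidate:
            return candidate
    words = text.split()
    if words:
        return " ".join(words[:max_words])
    return None
```

Whether the result is written to Documents.title at upload time or computed only for display is the open question from earlier in the thread; the function works for either choice because it does no storage itself.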