Many document titles are non-descriptive

judell commented 4 years ago

Ideally we would have better titles to show than "PDF.js viewer", "via Hypothesis", "Untitled", etc.

Here are the stats:

There are good titles as you get farther down the list, but about 7k of 17k distinct titles are useless. There will be multiple causes of this. Identifying and characterizing them would be a next step.

https://metabase.hypothes.is/question/776
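As a starting point for characterizing the causes, here is a minimal sketch that buckets titles by why they look useless. The category names, the placeholder list, and the regexes are my own guesses based on the titles mentioned in this thread, not anything the dashboards currently do, and the sample list is illustrative rather than real query output.

```python
import re
from collections import Counter

# Assumption: these are the generic titles worth flagging; extend as needed.
PLACEHOLDERS = {"pdf.js viewer", "via hypothesis", "untitled"}


def classify_title(title: str) -> str:
    """Return a rough label for why a title is (or is not) descriptive."""
    t = title.strip()
    if not t:
        return "empty"
    if t.lower() in PLACEHOLDERS:
        return "placeholder"
    # Ignore a trailing .pdf so "557895.pdf" is treated like "557895".
    stem = re.sub(r"\.pdf$", "", t, flags=re.IGNORECASE)
    if stem.isdigit():
        return "numeric_id"
    if re.fullmatch(r"[0-9a-fA-F]{16,}", stem):
        return "hash"
    return "descriptive"


# Illustrative sample only.
sample_titles = [
    "PDF.js viewer",
    "via Hypothesis",
    "Untitled",
    "557895",
    "83573dea827e76e31d15f1cee3d1b982.pdf",
    "Reading Games",
]
counts = Counter(classify_title(t) for t in sample_titles)
```

Running something like this over the distinct titles would give a per-cause count, which is probably more actionable than a single "useless" total.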

judell commented 4 years ago

This matters particularly in LMS land because the target_uri is almost always obfuscated. It could be mined for readable strings but ideally there will be ways to capture better titles.

https://metabase.hypothes.is/question/778 shows target_uri for one school. A handful are readable and meaningful, most are not.
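For the "mined for readable strings" idea, a rough sketch of what that could look like. The function name, the minimum word length, and the example URLs are all hypothetical, and this only looks at the URL path, so anything hidden in query parameters is ignored.

```python
import re
from typing import List
from urllib.parse import unquote, urlsplit


def readable_tokens(uri: str, min_len: int = 4) -> List[str]:
    """Extract word-like tokens from a URI path, skipping ID-like segments."""
    path = unquote(urlsplit(uri).path)
    tokens = []
    for segment in path.split("/"):
        # Drop a short file extension such as ".pdf".
        stem = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", segment)
        # Skip segments that look like hex hashes or long numeric IDs.
        if re.fullmatch(r"[0-9a-fA-F]{8,}", stem):
            continue
        # Split on anything that is not a letter and keep longer words.
        for word in re.split(r"[^A-Za-z]+", stem):
            if len(word) >= min_len:
                tokens.append(word.lower())
    return tokens


# Hypothetical URLs, not taken from the Metabase question.
readable = readable_tokens(
    "https://example.edu/courses/1234/files/rosenberg--reading-games.pdf"
)
obfuscated = readable_tokens(
    "https://example.edu/api/v1/83573dea827e76e31d15f1cee3d1b982.pdf"
)
```

Generic path words such as "courses" and "files" come through as noise, so a stop list would likely be needed before these tokens are shown to anyone.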

judell commented 4 years ago

I'm reworking some schoolwide dashboard views that illustrate patterns of document use.

We have two user-relevant identifiers for documents: the document title, and the URL.

The URL varies a lot for a given document id, because Google and other CMSs form many access URLs for the same underlying doc. So I've been providing a complementary view by doctitle.

But it turns out those vary a lot too. For example:

-- document_id exists in both tables, so it is qualified in the where clause
select distinct dt.title from annotation a
inner join doctitle dt on a.document_id = dt.document_id
where a.document_id = '136212'

Here are all the titles we get for that one document:

2091130
557895
83573dea827e76e31d15f1cee3d1b982.pdf
Karen Rosenberg's Reading Games(1).pdf
PDF.js viewer
Rosenberg_Reading Games_Strategies for Reading Scholarly Sources.pdf
rosenberg--reading-games.pdf
via Hypothesis

I'm going to provide two complementary views, one unique by doctitle and one unique by URL. Neither will be entirely satisfactory, though, because both URLs and document titles are fairly uncontrolled namespaces.

klemay commented 4 years ago

As an update, @judell: we're planning a spike for this, slotted after pagefit for PDFs (so within the next couple of sprints).

judell commented 4 years ago

More on this here: https://hypothes-is.slack.com/archives/CLQUUEVMY/p1590596504010900

klemay commented 4 years ago

Recent discussion of possible solutions: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1594062728411100

judell commented 4 years ago

Suggestion for mitigating the problem:

The client grabs the first 200 characters of the doc's visible text.
It sends that text along with the other document metadata.
Dashboards that report doctitles fall back to it when the title is useless.
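On the dashboard side, the fallback could be sketched roughly as below. This is only an illustration under assumptions: `make_preview` stands in for whatever the client would actually send, the whitespace collapsing is my addition, and the placeholder list and the `"(no title)"` label are hypothetical.

```python
import re
from typing import Optional

# Assumption: generic titles that should trigger the fallback.
PLACEHOLDER_TITLES = {"pdf.js viewer", "via hypothesis", "untitled"}
PREVIEW_LENGTH = 200  # the 200-character limit suggested above


def make_preview(visible_text: str, length: int = PREVIEW_LENGTH) -> str:
    """Collapse runs of whitespace and keep the first `length` characters."""
    return re.sub(r"\s+", " ", visible_text).strip()[:length]


def display_title(title: Optional[str], preview: Optional[str]) -> str:
    """Prefer a real title; otherwise fall back to the preview text."""
    cleaned = (title or "").strip()
    if cleaned and cleaned.lower() not in PLACEHOLDER_TITLES:
        return cleaned
    # Useless or missing title: use the preview, then the title, then a label.
    return (preview or "").strip() or cleaned or "(no title)"
```

For example, `display_title("PDF.js viewer", make_preview(page_text))` would show the opening text of the document, while a descriptive title is returned unchanged.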

judell commented 4 years ago

Here is the above-mentioned intervention, @klemay. For PDFs only, it adds the doc's initial text (as dc.preview) to the metadata sent from the annotator to the sidebar and thence to the backend. If we deem this to be low-risk, I think it would be helpful to do it sooner rather than later. The previews will be immediately useful in dashboards, and by accumulating a corpus of them we'll have additional fodder for the spike when we get there.

diff --git a/src/annotator/guest.coffee b/src/annotator/guest.coffee
index 06139f0f..696981c2 100644
--- a/src/annotator/guest.coffee
+++ b/src/annotator/guest.coffee
@@ -144,6 +144,12 @@ module.exports = class Guest extends Delegator
     })

     return Promise.all([metadataPromise, uriPromise]).then ([metadata, href]) =>
+
+      if metadata.documentFingerprint
+        metadata.dc = {
+          preview: document.querySelector('div.page').textContent.slice(0,200)
+        }
+
       return {
         uri: normalizeURI(href),
         metadata,

klemay commented 4 years ago

@judell thanks for this - I know this is something @jon-betts is keen to look into - I don't think we have room in the sprint starting tomorrow (Blackboard file integration is highest priority for our backend efforts at the moment), but it'll be up for discussion next sprint

judell commented 4 years ago

FWIW this is a client-side intervention, and basically just a one-liner.

robertknight commented 10 months ago

In the LMS context, the appropriate solution in most cases is to group annotations by assignment rather than by document, and to use the assignment title instead of the document title. See also https://github.com/hypothesis/lms/issues/5810.

There are situations where a document is re-used across assignments and we may want to present document-level search and statistics. One common case for that is when the assignment is an ebook. For those we usually do have good title metadata. The main case where metadata is poor is for PDFs.

hypothesis / lms

Many document titles are non-descriptive #1371