Migrate VitalSource annotations to be associated with book rather than chapter

robertknight commented 1 year ago

As part of the "treat VitalSource books as one document" project, we are changing the URL and selectors that the Hypothesis client captures. These changes are currently behind a feature flag. In order to roll this change out to all users, we will need to migrate the existing annotations to use the same URL and selector format.

The current thinking is that this will be done via a task in the h admin panel that can be run multiple times during the transition, with optional filters to control which users or groups are processed on each run.

The existing annotations have data that looks like this:

EPUB ("reflowable") book example:

{
  target: {
    source: "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml",
    selector: [...]
  }
  // Other fields
}

PDF ("fixed layout") book example:

{
  target: {
    source: "https://jigsaw.vitalsource.com/books/9781938168130/pages/391876767/content",
    selector: [...]
  }
  // Other fields
}

The migrated annotations will look like this:

{
  target: {
    source: "https://bookshelf.vitalsource.com/reader/books/L-999-70049",
    selector: [{
      type: "EPUBContentSelector",
      url: "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml",

      // Fields not available in previous data. Need to be either omitted or found via lookup
      cfi: "/4",
      title: "Chapter 2"
    },
    // Other selectors...
    ]
  }
  // Other fields
}

Note that some of the information that is needed in the new format is not available in the existing data. We will either need to make everything work without it, or look the information up via requests to the VitalSource metadata API.
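As a rough illustration of the change in shape shown above, the following Python sketch builds a new-format target from an old-format EPUB target. The function name `migrate_target` and the use of plain dicts are assumptions made for this example, not code from h. The `cfi` and `title` fields are omitted when no value is supplied, since they are not present in the old data.

```python
from urllib.parse import urlsplit


def migrate_target(old_target, cfi=None, title=None):
    """Return a new-format target dict for an old-format chapter target.

    Hypothetical helper: `old_target` is a dict with "source" and "selector"
    keys, mirroring the examples in this issue.
    """
    old_url = old_target["source"]

    # The path looks like "/books/{book_id}/epub/OPS/loc_002.xhtml".
    parts = urlsplit(old_url).path.split("/")
    if len(parts) < 3 or parts[1] != "books":
        raise ValueError(f"Not a VitalSource chapter URL: {old_url}")
    book_id = parts[2]

    selector = {"type": "EPUBContentSelector", "url": old_url}
    # Only include fields we actually have values for.
    if cfi is not None:
        selector["cfi"] = cfi
    if title is not None:
        selector["title"] = title

    return {
        "source": f"https://bookshelf.vitalsource.com/reader/books/{book_id}",
        # The new selector goes first, as in the example above; existing
        # selectors are kept unchanged.
        "selector": [selector, *old_target.get("selector", [])],
    }
```

Calling `migrate_target(old_target)` without `cfi` or `title` corresponds to the "make everything work without it" option; passing values corresponds to the lookup option.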

robertknight commented 1 year ago

The format of the new selectors is described at https://github.com/hypothesis/client/blob/7267c198adbf31bcd0bf0065aa376b3a4bf2702e/src/types/api.ts#L75. Only the "url" field is currently marked as required.

For PDF/fixed-layout books, there is also a PageSelector selector. None of the information in that selector is available in the old API, so it would need to be looked up via the VS API.

robertknight commented 1 year ago

Some notes on how many VitalSource annotations will need to be migrated and the books and Hypothesis groups they are associated with: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1664901048607019.

seanh commented 1 year ago

Are new VitalSource annotations with the old format still being created? If not then I wonder about doing a one-off DB migration to migrate all annotations to the new format in the DB. That's what we'd normally do if we wanted to migrate a bunch of data in the DB.

You'd then use the admin pages to reindex those annotations. There are already various admin pages in h to reindex all annotations of a user/group/etc. You may be able to use one of those, or you may need to add a new one.

I think it should also be possible to write a "migrate to the new VitalSource format" admin page if you want to do it that way. But this will be the first time we've written a Celery task to do a bulk migration on the DB; those have always been done using DB migrations in the past. The task would also schedule each annotation for reindexing after the annotation has been changed in the DB. And I suppose once you're finished, you'll delete the admin page?

Do we know what volume of annotations we're talking about here?

robertknight commented 1 year ago

> Do we know what volume of annotations we're talking about here?

Number of annotations: 12,656. (2023-01-24 update: 15,360) h DB Query:

select count(*) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

Number of groups: 75 (2023-01-24 update: 101) Query:

select count(distinct(annotation.groupid)) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

Number of users: 432 (2023-01-24 update: 726) Query:

select count(distinct(annotation.userid)) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

Number of URLs (== number of distinct chapters/pages): (2023-01-24 update: 655)

select count(distinct(uri_normalized)) from document_uri where uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

robertknight commented 1 year ago

Given that there are only a small number of groups, we could use the existing "Reindex all annotations in a group" facility in the search index management page at http://localhost:5000/admin/search to handle reindexing. It would be more convenient if we modified that form to support supplying a list of groups (eg. as a comma-separated list) to reindex.

robertknight commented 1 year ago

As noted in the issue description, the migrated annotations should include some data which is not present in the original annotation:

// Fields not available in previous data. Need to be either omitted or found via lookup
cfi: "/4",
title: "Chapter 2"

The cfi field is used to sort annotations by chapter in the sidebar. Then annotations within each chapter are sorted by text position. The title field is used to display chapter headings in the sidebar.

We could omit these fields and make the client dynamically look up the CFI and title that correspond to the path value, by querying the VitalSource reader. However this would mean that we'd be missing this information when presenting annotations outside of the reader.

To add this data during the migration, we have a couple of options:

1. Generate a data set (eg. as a JSON file) of all the CFIs and chapter titles for all books annotated so far, add that to the h repo and use it locally during a migration.
2. Make HTTP requests to VitalSource's API in order to fetch the information during a migration. The LMS app has code that already uses the same API for use in the assignment picker.

A total of 74 different books have been annotated so far.

Query:

select distinct(substring(uri_normalized, '/books/[0-9A-Z-]+')) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

robertknight commented 1 year ago

I think it might be helpful to do this migration in several stages:

1. Deploy https://github.com/hypothesis/client/pull/5072, so we capture the new selectors for all new annotations.
2. Run a migration to backfill the new selectors for annotations created prior to (1). This can be done prior to enabling the book_as_single_document feature flag for everyone. This will require additional data beyond what is in the DB, per my previous comment.
3. Enable the book_as_single_document feature flag for everyone.
4. Run a final migration to update the annotation URLs from https://jigsaw.vitalsource.com/books/{id}/{suffix} to https://bookshelf.vitalsource.com/reader/books/{id}. This will not require any data beyond what is stored in the annotations.
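The URL rewrite in the final stage needs only the old URL, so it can be sketched as a pure function. This is a minimal sketch, not the migration code itself: the name `new_book_url` is hypothetical, the host in the pattern is the `jigsaw.vitalsource.com` host seen in the example URLs earlier in this issue, and the book ID character class is borrowed from the `substring` query above.

```python
import re

# Matches "https://jigsaw.vitalsource.com/books/{id}/{suffix}" and captures {id}.
OLD_URL_RE = re.compile(
    r"^https://jigsaw\.vitalsource\.com/books/(?P<book_id>[0-9A-Z-]+)/"
)


def new_book_url(old_url):
    """Return the book-level URL for an old chapter/page URL, or None."""
    match = OLD_URL_RE.match(old_url)
    if match is None:
        # Not an old-format VitalSource URL; leave it for the caller to skip.
        return None
    return f"https://bookshelf.vitalsource.com/reader/books/{match['book_id']}"
```

Returning `None` for non-matching URLs lets a caller skip annotations that have already been migrated, which matters if the task is run more than once.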

robertknight commented 1 year ago

Some notes on step (2) of the migration:

For each existing annotated VitalSource URL (example: "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml") we need to:

1. Extract the book ID ("L-999-70049" here)
2. Extract the content path ("epub/OPS/loc_002.xhtml" here)
3. Look up the table of contents data ("TOC data") for the book via the VitalSource API. See LMS app for code that does this.
4. Find the first entry with a matching path in the TOC data
5. Generate an "EPUBContentSelector" selector in JSON format (see PR description)
6. Find all annotations with the old URL
7. Read the target_selectors data (this should be a JSON array) and append the "EPUBContentSelector" JSON from step (5)
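The per-URL steps above (excluding the API call and the DB query) could look roughly like the sketch below. The TOC entry shape used here, a list of dicts with `path`, `cfi` and `title` keys, is an assumption for illustration; the real VitalSource API response may be structured differently and may need its paths normalized before comparison. All function names are hypothetical.

```python
from urllib.parse import urlsplit


def split_vitalsource_url(url):
    """Extract (book_id, content_path) from an old annotation URL."""
    # "/books/L-999-70049/epub/OPS/loc_002.xhtml" splits into
    # ["", "books", "L-999-70049", "epub/OPS/loc_002.xhtml"].
    parts = urlsplit(url).path.split("/", 3)
    if len(parts) != 4 or parts[1] != "books":
        raise ValueError(f"Unexpected VitalSource URL: {url}")
    return parts[2], parts[3]


def make_epub_selector(url, toc):
    """Build a selector from the first TOC entry whose path matches.

    `toc` is assumed to be a list of {"path", "cfi", "title"} dicts that was
    already fetched for the book. Returns None if no entry matches.
    """
    _, content_path = split_vitalsource_url(url)
    for entry in toc:
        if entry["path"] == content_path:
            return {
                "type": "EPUBContentSelector",
                "url": url,
                "cfi": entry["cfi"],
                "title": entry["title"],
            }
    return None


def append_selector(target_selectors, selector):
    """Return a new selectors list with the selector appended."""
    return [*target_selectors, selector]
```

Returning `None` when no TOC entry matches keeps the "could not find data" case explicit, so the caller can log and skip that URL.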

robertknight commented 1 year ago

I'm currently working on a script to gather the data needed for the backfilled EPUBContentSelector selectors. I encountered an issue with PDF-based books, as not all pages have an entry in the table of contents. See https://vitalsource.slack.com/archives/C01208U1A2F/p1671548778110049.

robertknight commented 1 year ago

Using the above APIs I got a dump of the TOC and pages data for all the VS books annotated so far. See https://drive.google.com/file/d/16FMKv2VmKDnpZEzdA-3MTc4W22c1pPHB/view?usp=share_link (H internal only). This covers steps 1-3.

robertknight commented 1 year ago

I have a first pass of a JSON file containing the data for the updates we'll need to apply: https://gist.github.com/robertknight/96a438e4869930d3e4fc285ca711d989 contains a mapping from the current URL of an annotation, to an object with url and selectors fields. The annotation's URL needs to be changed to the value in the "url" field, and the entries in selectors need to be added to the target_selectors field of the annotation, but only if there is not already an entry in that list with a matching type property.

The JSON output here was generated from an input list of current annotation URLs using this script.

This data is not final because there were some URLs in the input list for which I could not find the necessary entries in the VitalSource data, and I need to check some issues relating to the "title" field for some entries. These issues won't affect the structure of the data though.
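The update rule described above (change the URL, and add each selector only if no selector of the same type is already present) might be applied as in this sketch. The annotation is represented as a plain dict with `target_uri` and `target_selectors` keys, which is an assumption for the example rather than the actual h model, and `apply_update` is a hypothetical name.

```python
def apply_update(annotation, updates):
    """Apply the URL/selector update for one annotation, in place.

    `updates` maps an annotation's current URL to a dict with "url" and
    "selectors" fields. Returns True if the annotation was changed.
    """
    update = updates.get(annotation["target_uri"])
    if update is None:
        # No entry for this URL (e.g. already migrated or skipped).
        return False

    existing_types = {s.get("type") for s in annotation["target_selectors"]}
    for selector in update["selectors"]:
        # Only add a selector if there is no entry with a matching type.
        if selector["type"] not in existing_types:
            annotation["target_selectors"].append(selector)
            existing_types.add(selector["type"])

    annotation["target_uri"] = update["url"]
    return True
```

Because the lookup is keyed on the old URL, calling this a second time on an already-updated annotation finds no entry and does nothing.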

robertknight commented 1 year ago

I have updated the data at https://gist.github.com/robertknight/96a438e4869930d3e4fc285ca711d989 with document titles. When we migrate annotation URLs, we'll need to make sure document entries get created for the new URLs and have at least the titles set. The data now looks like:

robertknight commented 1 year ago

There were a small number of annotated PDF page URLs which no longer appear in the page index for the book. I suspect what has happened is that the book has been updated or re-processed since it was originally annotated. We didn't record page numbers or CFIs at the time when these annotations were created, so we can't easily locate the correct page in the book. Fortunately for all new annotations that are created, we are capturing the CFI and page number.

Log output from https://github.com/hypothesis/vitalsource-url-migration/blob/main/gen_epub_selectors.py:

Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077445/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077446/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077447/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077448/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077449/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077450/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077451/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077452/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/9780133599145/pages/584498507/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/9780133599145/pages/584498510/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/9780133599145/pages/584499351/content: Could not find CFI or title data for chapter

robertknight commented 1 year ago

The latest version of the data that we'll need for the migration is now at https://github.com/hypothesis/vitalsource-url-migration/blob/main/vs-selectors.json. It has updated URLs and document (book) titles for all books. A small number of chapter/page URLs, mentioned in the previous comment, still had to be skipped.

robertknight commented 1 year ago

Looking through a list of all the document titles that were fetched, I see there are some HTML entities and character references (&quot;, &#39;) in there, which we'll need to convert to Unicode. I'll do that as part of an update to the script data.

&quot;T. rex&quot; and the Crater of Doom
2019 MyLab Management with Pearson eText for Fundamentals of Human Resource Management plus Third Party eText
A Christmas Carol
A Citizen’s Guide to the Political Psychology of Voting
A Tale of Two Cities
American Government
Anatomy and Physiology
Automating Inequality
Behavioral Neuroscience
Biology: A Global Approach, Enhanced eBook, Global Edition
Bookshelf Tutorial
Build and Program Your Own LEGO Mindstorms EV3 Robots
Cardiología
Chemical Process Safety
Children&#39;s Play
College Algebra Essentials
Concise Text of Neuroscience
Deep Learning with Python, Second Edition
Discovering Psychology
Diversity in America
Doing Visual Ethnography
EBOOK: Economics, 12e
Engineering Fluid Mechanics, Enhanced eText
Essentials of Marketing Research
Everything's An Argument with Readings
Everything's an Argument with Readings
Fundamentals of General, Organic, and Biological Chemistry (Subscription)
Give Me Liberty!: An American History (Seagull Sixth Edition)  (Vol. 2)
Great Expectations
Head First Mobile Web
Heart of Darkness (Fifth Edition)  (Norton Critical Editions)
How Music Works
International Economics (Subscription)
International Relations: A Very Short Introduction
Introducing Relativity
Kant: Groundwork of the Metaphysics of Morals
Listening Well
Macroeconomics
Marketing
Media/Society
Methods for Teaching Students with Autism Spectrum Disorders
Operations Management: Processes and Supply Chains
Paradise Lost
Personal Connections in the Digital Age
Personality Psychology: Domains of Knowledge About Human Nature
Philosophy in the United States
Physics for Engineers and Scientists (Third Edition)  (Vol. 2)
Politics and International Law
Popular Culture, Geopolitics, and Identity
Principles of Economics
Qualitative Research Design
Reason
Salt, Fat, Acid, Heat
Selling School: The Marketing of Public Education
Service Management: Operations, Strategy, Information Technology
Silencing the Past (20th anniversary edition)
Sustainability: A Comprehensive Foundation
Teaching through Text
Testing Hypothesis in Bookshelf Online
The Greek Plays
The Language of Confession, Interrogation, and Deception
The Learner-Centered Curriculum: Design and Implementation
The Life of Sir Thomas More
The Lost Boys of Zeta Psi: A Historical Archaeology of Masculinity at a University Fraternity
The Political Philosophy of AI
The Pragmatic Programmer
The Routledge Handbook of Social Work and Addictive Behaviors
The Shattering: America in the 1960s
The Soul of A New Machine
The Spirit of Laws
U.S. History
US: A Narrative History Volume 1: To 1877
Understanding Cisco Networking Technologies, Volume 1
Understanding World Regional Geography
University Physics for the Life Sciences (Subscription)
Vladimir Putin: Life Coach
Welcoming Young Children into the Museum
Writing about Writing
Wuthering Heights
Wyllie's Treatment of Epilepsy
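One way to do the entity conversion mentioned above is Python's standard-library `html.unescape`, which handles both named entities and numeric character references. This is a small sketch using two titles from the list; whether the migration script does it this way is not stated here.

```python
import html

titles = [
    "&quot;T. rex&quot; and the Crater of Doom",
    "Children&#39;s Play",
    "A Christmas Carol",
]

# Titles without entities pass through unchanged.
cleaned = [html.unescape(title) for title in titles]
```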

robertknight commented 1 year ago

The migration has been initiated and is expected to complete in the next 20 minutes or so. Slack thread with operations analysis here: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1674638705104229.

robertknight commented 1 year ago

The bulk of the migration is complete. There were a total of 24 out of ~15,400 annotations that could not be migrated. See notes at https://hypothes-is.slack.com/archives/C4K6M7P5E/p1674643433767469?thread_ts=1674638705.104229&cid=C4K6M7P5E.

hypothesis / h

Migrate VitalSource annotations to be associated with book rather than chapter #7709