Revisit storage of document identifiers

The details of which document an assignment uses are currently stored in the LMS DB as URLs. This was fine for public PDFs and web pages, which are naturally identified by HTTP URLs. However, we subsequently added support for a variety of other content that doesn't have natural HTTP URLs, such as files in the LMS's file storage or content from third-party providers. To handle these, we decided to invent and use custom URL schemes.

Some examples:

vitalsource://book/bookID/{book_id}/cfi/{cfi}
blackboard://content-resource/{file_id}
canvas://file/{file_id}
canvas://page/{page_id}
d2l://file/course/{course_id}/file_id/{file_id}/
jstor://{article_id}

Using custom URL schemes was convenient in some respects, as they work with existing code that expects documents to be identified by URLs. However, it causes various issues:

The formats are not specified in a schema anywhere. This makes it harder to write new code that works with them, because it isn't clear which values that code might have to handle. We also don't, in general, validate what we generate thoroughly or consistently, so errors can creep in.
Serializing and deserializing URLs is more complex and error-prone than, e.g., JSON. See VSBookLocation for an example: it parses VS URLs with regexes and doesn't escape URL components. We could use proper URL parsing and serialization functions here, but that is still more fiddly than it needs to be.
Certain reporting queries, specifically any that involve parsing the URL, are more complex than they would be if the information were stored in separate DB columns or JSON fields.
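
To make the escaping problem concrete, here is a minimal Python sketch. It is not the actual LMS code: the regex, the function names and the D2LFile class are hypothetical, loosely modelled on the d2l:// example above, and the ID containing a slash is contrived purely to show the failure mode.

```python
import json
import re
from dataclasses import asdict, dataclass
from urllib.parse import quote, unquote

# Hypothetical pattern modelled on the d2l:// example; not taken from the LMS codebase.
NAIVE_D2L_PATTERN = re.compile(
    r"d2l://file/course/(?P<course_id>[^/]+)/file_id/(?P<file_id>[^/]+)/"
)


def naive_build(course_id: str, file_id: str) -> str:
    """Build the URL by plain string interpolation, with no escaping."""
    return f"d2l://file/course/{course_id}/file_id/{file_id}/"


def escaped_build(course_id: str, file_id: str) -> str:
    """Build the URL with each component percent-encoded."""
    return (
        f"d2l://file/course/{quote(course_id, safe='')}"
        f"/file_id/{quote(file_id, safe='')}/"
    )


def parse(url: str):
    """Return the decoded components, or None if the URL does not match."""
    match = NAIVE_D2L_PATTERN.fullmatch(url)
    if match is None:
        return None
    return {key: unquote(value) for key, value in match.groupdict().items()}


@dataclass(frozen=True)
class D2LFile:
    """Hypothetical structured alternative, stored as JSON instead of a URL."""

    course_id: str
    file_id: str

    def to_json(self) -> str:
        return json.dumps({"type": "d2l_file", **asdict(self)}, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "D2LFile":
        data = json.loads(raw)
        if data.get("type") != "d2l_file":
            raise ValueError("not a d2l_file document")
        course_id, file_id = data.get("course_id"), data.get("file_id")
        if not isinstance(course_id, str) or not isinstance(file_id, str):
            raise ValueError("course_id and file_id must be strings")
        return cls(course_id=course_id, file_id=file_id)


# A contrived ID containing "/" breaks the unescaped round trip.
print(parse(naive_build("123", "a/b")))
# Escaping fixes it, but every builder and parser has to remember to do so.
print(parse(escaped_build("123", "a/b")))
# The JSON form round-trips without any escaping logic of our own.
print(D2LFile.from_json(D2LFile("123", "a/b").to_json()))
```

With the JSON form, the field names and types are written down in one place, and a reporting query could read course_id or file_id directly rather than parsing a string.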

Slack thread: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1700495733366949

hypothesis / lms

Revisit storage of document identifiers #5840