harvard-lil / h2o

H2O is a web app for creating and reading open educational resources, primarily in the legal field
https://opencasebook.org
GNU Affero General Public License v3.0

Allow PDF exports to be driven by server-side annotation code #1925

Closed lizadaly closed 1 year ago

lizadaly commented 1 year ago

This leverages the changes from #1920 to allow passing an option to the model -> HTML -> DOCX pipeline to say "please insert annotations and notes adjacent to their insertion points instead of at the end of the chapter." This is necessary for PagedJS to lay out footnotes from the correct position.
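To illustrate the difference between the two placements, here is a minimal sketch, not H2O's actual code. The function name `render_notes`, the `adjacent` flag, and the `footnote` / `endnotes` class names are all hypothetical; the only assumption about PagedJS is that it can turn an inline element into a footnote only if that element sits at its insertion point in the HTML.

```python
from html import escape


def render_notes(text, notes, adjacent=False):
    """Return HTML for `text` with `notes` applied.

    `notes` is a list of (offset, note_text) tuples (a hypothetical shape).
    adjacent=False: numbered markers in the text, note bodies collected at the end.
    adjacent=True: each note body is emitted at its insertion point.
    """
    out, endnotes, cursor = [], [], 0
    for number, (offset, note) in enumerate(sorted(notes), start=1):
        # Copy the text between the previous note and this one.
        out.append(escape(text[cursor:offset]))
        cursor = offset
        if adjacent:
            # The note body lives right where it is referenced.
            out.append(f'<span class="footnote">{escape(note)}</span>')
        else:
            # Only a marker stays in the text; the body goes to the end.
            out.append(f"<sup>{number}</sup>")
            endnotes.append(f"<li>{escape(note)}</li>")
    out.append(escape(text[cursor:]))
    if endnotes:
        out.append('<ol class="endnotes">' + "".join(endnotes) + "</ol>")
    return "".join(out)
```

With `adjacent=True` the paged layout step can pick up each `footnote` span on the page where it occurs; with the default, the same notes end up in a single list after the text.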

This also finishes earlier work to fork the PDF-centric layout away from reading mode:

The actual Django view and output templates are the same in reading mode and PDF view, though I think the templates should probably also be split eventually, as they don't share much code anymore.

Performance

For small casebooks (under 500 pages), PDFs render on my laptop in under 15 seconds. For very long casebooks the story... isn't great, but rendering at least finishes, which it effectively didn't with client-side annotation rendering.

For a very long (1,200 page) casebook, the DOCX pipeline takes 15 seconds on my laptop and the PDF one takes a bit over a minute. I don't love it! It may be worse in staging/prod, since the process looks mostly CPU-bound.

Almost all of that time is in PagedJS itself segmenting the pages and producing nice artifacts like real footnotes:

image

CSS Paged Media does not support footnotes as a first-class feature, and with potentially multiple footnotes per page I think it'll be difficult to get something more lightweight that is also resilient to lots of different markup. I'll probably timebox some experiments though! If that doesn't work out, I think we could do a lot with eagerly caching PDF exports, since in the vast majority of cases they won't need to be generated on demand.
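As a rough sketch of the caching idea, assuming a content-hash cache key and a file-based store (neither is specified in this PR), something like the following would skip the slow render whenever the exported HTML has not changed. `cached_pdf`, `cache_dir`, and the `render` callable are hypothetical names; `render` stands in for whatever actually drives PagedJS.

```python
import hashlib
from pathlib import Path


def cached_pdf(cache_dir, casebook_id, html, render):
    """Return the path to a PDF for `html`, rendering only on a cache miss.

    `render` is any callable taking the HTML string and returning PDF bytes.
    """
    # Key on the content so an edited casebook gets a fresh export.
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()[:16]
    path = Path(cache_dir) / f"{casebook_id}-{digest}.pdf"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(render(html))
    return path
```

Calling this from a background job when a casebook is saved, rather than from the request that serves the download, would make the cache eager instead of on-demand.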

Example

image

lizadaly commented 1 year ago

Wow sure looks like I broke something.