harvard-lil / h2o

H2O is a web app for creating and reading open educational resources, primarily in the legal field

GNU Affero General Public License v3.0

36 stars 30 forks source link

Background

For PDF export of very long books, handling annotations in the browser is prohibitively time-expensive. In earlier work I optimized this for screen display by deferring expensive transformations, but in the PDF export, all transformations need to be done up-front before the document is produced.

Luckily, we already have a server-side mechanism for introducing annotations that's known to work—the transformations that are called by the docx export pipeline that produces HTML for consumption by pandoc.

My current plan is to use the browser-friendly and accessible markup only for reading mode, and to reuse the markup generated for docx export for PDF export. I'll use PagedJS + Playwright for this because that's known to work, even if it isn't extremely fast. (I think the absolute best option for headless PDF generation in 2023 is probably https://weasyprint.org/, which unlike pandoc's PDF export is self-contained, and unlike most other open source PDF tools is still in active development.)

This actual PR

In order to support footnotes in PagedJS, the export flow will have to take an option to insert notes and links adjacent to their source rather than at the bottom of the document. But as written this export code is a little hard to work with—it's a method on a model containing an inner class containing long methods, with some complicated docstrings for testing.

Before I make any changes to it I wanted to have it in a state that was a little easier to work with, so this moves the code out to its own module and and refactors the assertions to individual pytests with smaller scopes more appropriate for unit tests. I also added type annotations to much of the code, and wrote a template tag to call the new function since it's no longer a callable method. Actual test coverage and functionality should be identical, and I verified it with a few large casebooks. I'll also test on staging once merged.

Codecov Report

Merging #1920 (ee17b31) into develop (aa052b5) will increase coverage by 0.20%. The diff coverage is 90.05%.

@@             Coverage Diff             @@
##           develop    #1920      +/-   ##
===========================================
+ Coverage    76.82%   77.02%   +0.20%     
===========================================
  Files           56       58       +2     
  Lines         6731     6773      +42     
===========================================
+ Hits          5171     5217      +46     
+ Misses        1560     1556       -4

Impacted Files	Coverage Δ
web/main/celery_tasks.py	`47.36% <ø> (ø)`
web/tasks.py	`23.54% <64.28%> (ø)`
web/main/export.py	`92.56% <92.56%> (ø)`
web/config/celery.py	`100.00% <100.00%> (ø)`
web/main/models.py	`74.01% <100.00%> (-0.96%)`	:arrow_down:
web/main/templatetags/export_node_html.py	`100.00% <100.00%> (ø)`
web/main/test/functional/test_platform.py	`57.81% <100.00%> (ø)`
web/main/test/test_export.py	`93.25% <100.00%> (+3.42%)`	:arrow_up:
web/main/views.py	`71.49% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

harvard-lil / h2o

Refactor docx html export code to its own module #1920

Background

This actual PR

Codecov Report