gutenbergtools / ebookmaker

The Project Gutenberg tool to generate EPUBs and other ebook formats.
GNU General Public License v3.0

Input HTML needs to be validated, not just generated HTML #192

Status: Closed (gbnewby closed this issue 1 year ago)

gbnewby commented 1 year ago

PG has had several recent uploads via https://upload.pglaf.org where the HTML had validity errors. The errors in the uploaded text were not reported in the output.txt of https://ebookmaker.pglaf.org

Upon investigation & discussion, it appears this is because ebm runs the validator (vnu.jar) against the generated HTML5, but not against the uploaded HTML.

Since the uploaded HTML is posted to the 1/2/3 filesystem, it needs to be validated.

Is this something to add to ebm? Otherwise, we could add a call to the validator before calling ebm in https://ebookmaker.pglaf.org
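A minimal sketch of the second option, calling the validator on the uploaded file before ebm runs. The function names, the `vnu.jar` path, and the injectable `runner` parameter are hypothetical and not part of ebookmaker or the upload site. It assumes `java` is on the PATH and relies on vnu.jar's `--format json` option and its habit of writing messages to stderr.

```python
import subprocess
from typing import Callable, List, Tuple


def build_vnu_command(jar_path: str, html_path: str) -> List[str]:
    """Build a command line that checks one HTML file with vnu.jar."""
    # --format json asks the checker for machine-readable messages.
    return ["java", "-jar", jar_path, "--format", "json", html_path]


def prevalidate(html_path: str, jar_path: str = "vnu.jar",
                runner: Callable = subprocess.run) -> Tuple[bool, str]:
    """Validate the uploaded file; return (is_valid, raw_report).

    `runner` defaults to subprocess.run and can be swapped out in tests.
    """
    result = runner(build_vnu_command(jar_path, html_path),
                    capture_output=True, text=True)
    # Assumption: vnu.jar reports on stderr and exits non-zero
    # when it finds errors in the checked document.
    return result.returncode == 0, result.stderr
```

A caller would run `prevalidate()` on the uploaded HTML first and only hand the file to ebm (or accept the upload) when the first element of the result is true.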

Here is a simple example with a one-line HTML file that has a validation error. Online ebm reports no errors (see https://ebookmaker.pglaf.org/cache/20230715224619/output.txt [will be automatically purged after 3 days]).

Running validator.w3.org directly spots the error, of course: (screenshot: Screenshot 2023-07-15 at 11 49 28 PM)

Here's the simple file: test0715.zip

eshellman commented 1 year ago

Other than running time, I don't see any downside to running the validator on the input file before running ebm. In this case the error gets fixed by BeautifulSoup before Ebookmaker starts working on it, and warnings are not available.

windymilla commented 1 year ago

Would that mean that the validation errors would get reported to the user when they upload and/or in the output.txt file? Since we now insist that the output.txt file doesn't have errors, that would be a good way of ensuring that files with validation errors don't get uploaded.
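To make the idea concrete, here is a rough sketch of how validator messages could be turned into report lines and an error count that gates the upload. It assumes the validator was run with JSON output, with a top-level `messages` list whose entries carry `type`, `lastLine`, and `message` fields. The function name and the line format are hypothetical and do not reflect what output.txt actually contains.

```python
import json
from typing import List, Tuple


def summarize_messages(report_json: str) -> Tuple[int, List[str]]:
    """Return (error_count, report_lines) for a JSON validator report."""
    data = json.loads(report_json)
    lines: List[str] = []
    errors = 0
    for msg in data.get("messages", []):
        kind = msg.get("type", "unknown")
        # Design choice for this sketch: only error-type messages block
        # an upload; informational messages are listed but not counted.
        if kind in ("error", "non-document-error"):
            errors += 1
        where = msg.get("lastLine", "?")
        lines.append(f"{kind.upper()} line {where}: {msg.get('message', '')}")
    return errors, lines
```

The returned lines could be appended to the report shown to the uploader, and a non-zero error count used to reject the upload.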

jj2017 commented 1 year ago

In my naivety I thought ebookmaker was running directly on the submitted file and that the current upload form was saying it was okay to submit the file because no problems in the submitted file were reported. If we can't have it running on the submitted file, the upload process needs again to flag validator errors, which, I think, would at this stage be a step backwards.

eshellman commented 1 year ago

If it would help, I think I could add a 'prevalidate' option to ebookmaker.

gbnewby commented 1 year ago

No need - it's now been implemented for the online ebookmaker and is undergoing testing before going to production. It's easy enough to add a validation check on the uploaded HTML.

Thanks!
