sasansom / sedes

Metrical position in Greek hexameter.

Make `tei.TEI` parser raise an error when XML is not well-formed #89

Open whoopsedesy opened 10 months ago

whoopsedesy commented 10 months ago

Commit 99fea693629801ea07c588543ec5214bff117142 caused homerichymns.xml not to be well-formed (#88), but processing the file with, e.g., tei2csv does not raise an error. XML that is not well-formed should be a fatal error in any program that takes TEI as input.

whoopsedesy commented 10 months ago

We currently parse TEI with BeautifulSoup, like so:

bs4.BeautifulSoup(f, "xml")

This is how the documentation says to parse XML. It's surprising to me that it does not raise an exception for well-formedness errors and instead tolerates them, as if the input were HTML. It doesn't look like any of the other parser options ("html.parser", "lxml", "html5lib") will raise errors either.
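
One possible workaround, sketched here with the standard library only, is to run a strict well-formedness check on the input before handing it to BeautifulSoup. The names `check_well_formed` and `NotWellFormedError` are hypothetical and not part of sedes or bs4. The sketch assumes the TEI files declare no custom entities, since `xml.etree.ElementTree` also rejects undefined entities.

```python
import xml.etree.ElementTree as ET


class NotWellFormedError(ValueError):
    """Hypothetical exception type; not part of sedes or BeautifulSoup."""


def check_well_formed(f):
    """Read all of f and return its contents unchanged.

    Raises NotWellFormedError if the contents are not well-formed XML.
    """
    data = f.read()
    try:
        # ElementTree uses expat, a strict parser: it raises ParseError
        # on mismatched tags, unclosed elements, and similar errors.
        ET.fromstring(data)
    except ET.ParseError as e:
        raise NotWellFormedError(str(e)) from e
    return data
```

The returned data could then be passed to `bs4.BeautifulSoup(data, "xml")` as before, so the rest of the parser would not need to change. The cost is that each file is parsed twice.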