salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Clean HTML prior to being loaded into DOMDocument/Crawler #106

Open derklempner opened 4 years ago

derklempner commented 4 years ago

Description: Malformed HTML can cause PHP's DOMDocument/libxml to choke, or to build a DOM representation that differs from what the HTML source leads you to expect. When that happens, selectors written against the expected structure can fail to match.

Proposed solution: Optionally clean the HTML before it is loaded into DOMDocument/Crawler, using something like HTML Tidy or a similar tool.
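A rough sketch of what such an optional cleaning step might look like, assuming PHP's tidy extension is available. The helper names `cleanHtml` and `loadDom`, the `$clean` flag, and the chosen tidy options are illustrative assumptions, not existing Merlin APIs:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Merlin): optionally repair markup
 * with the tidy extension before it reaches DOMDocument.
 */
function cleanHtml(string $html, bool $clean = true): string
{
    // Skip cleaning when it is disabled or ext-tidy is not installed.
    if (!$clean || !function_exists('tidy_repair_string')) {
        return $html;
    }

    // Assumed options; a real implementation would likely make these configurable.
    $config = [
        'output-html' => true, // emit HTML rather than XHTML/XML
        'wrap'        => 0,    // do not hard-wrap long lines
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    // Fall back to the original markup if tidy returned nothing usable.
    return is_string($repaired) && $repaired !== '' ? $repaired : $html;
}

/**
 * Hypothetical helper: load (optionally cleaned) HTML into a DOMDocument
 * while keeping libxml parse warnings out of the output.
 */
function loadDom(string $html, bool $clean = true): DOMDocument
{
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML(cleanHtml($html, $clean));

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Usage: malformed input with an unclosed <p> and an unclosed outer <div>.
$dom = loadDom('<div><p>Unclosed paragraph<div>Nested</div>');
$xpath = new DOMXPath($dom);
echo $xpath->query('//div')->length, PHP_EOL;
```

Keeping the step behind a flag means existing migrations whose selectors already match the uncleaned DOM would not change behaviour unless cleaning is switched on.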