matiskay / html-similarity

Compare html similarity using structural and style metrics
BSD 3-Clause "New" or "Revised" License

Support load html from bytes #107

Open beppler opened 1 year ago

beppler commented 1 year ago

Add support for loading documents from byte arrays.

We have to deal with some XHTML documents that have processing instructions at the beginning, like the following one:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
    <body>
    </body>
</html>

These cannot be loaded from strings because xml.html.parse does not support the encoding declaration when the source is a string.
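
As a rough illustration of the kind of input handling this request implies, here is a minimal sketch that normalises a document to bytes before it reaches the parser. The helper name `ensure_bytes` and its `default_encoding` parameter are hypothetical and are not part of html-similarity; the sketch assumes that a string carrying an XML declaration should be encoded with the encoding that declaration names.

```python
import re

# Hypothetical helper, not part of html-similarity: it only illustrates
# turning str input into bytes so that a bytes-based parsing path can be used.
_XML_DECL_ENCODING = re.compile(
    r'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']'
)


def ensure_bytes(document, default_encoding="utf-8"):
    """Return the document as bytes, honouring a declared XML encoding."""
    if isinstance(document, (bytes, bytearray)):
        # Already raw bytes: hand them over unchanged.
        return bytes(document)
    # Use the encoding named in the XML declaration, if there is one,
    # so the bytes agree with what the declaration claims.
    match = _XML_DECL_ENCODING.match(document)
    encoding = match.group(1) if match else default_encoding
    return document.encode(encoding)


xhtml = '<?xml version="1.0" encoding="UTF-8" ?>\n<html><body></body></html>'
raw = ensure_bytes(xhtml)
# `raw` could then be passed to a parser that accepts bytes,
# for example lxml.html.fromstring(raw).
```

An encoding name that Python does not recognise would raise `LookupError` here; a real implementation would need to decide how to handle that case.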