markbeaton / TidyManaged

Managed .NET & Mono wrapper of the HTML Tidy library

Save fixed html issue #1

Closed rpaczkow closed 13 years ago

rpaczkow commented 13 years ago

Hi! I am trying to fix the HTML from http://stooq.com/notowania/?kat=g2. After saving the page to disk, I use the code below to repair the errors and save the result, but I get an empty fixed.html file.

using (TidyManaged.Document tdoc = TidyManaged.Document.FromFile(@"my.html"))
{
    tdoc.ShowWarnings = true;
    tdoc.Quiet = true;
    tdoc.OutputXhtml = true;

    tdoc.CleanAndRepair();
    String s = tdoc.Save();

    if (File.Exists(@"fixed.html"))
        File.Delete(@"fixed.html");

    File.WriteAllText(@"fixed.html", s);
}

markbeaton commented 13 years ago

Apologies for taking so long to look into this...

The problem is due to the number of errors Tidy encounters while parsing your HTML. By default, if any errors are found (as opposed to warnings), Tidy will not produce any output, and it will give up parsing after 6 errors.

To override these defaults, try this:

using (TidyManaged.Document tdoc = TidyManaged.Document.FromFile(@"my.html"))
{
    tdoc.ShowWarnings = true;
    tdoc.Quiet = true;
    tdoc.MaximumErrors = int.MaxValue;
    tdoc.ForceOutput = true;
    tdoc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    tdoc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    tdoc.OutputXhtml = true;

    tdoc.CleanAndRepair();
    String s = tdoc.Save();

    if (File.Exists(@"fixed.html"))
        File.Delete(@"fixed.html");

    File.WriteAllText(@"fixed.html", s);
}