onizet / html2openxml

Html2OpenXml is a small .NET library that converts simple or advanced HTML to plain OpenXml components. The project started in 2009, initially to convert users' comments from SharePoint to Word.
MIT License

Table of Contents (Hrefs) do not work. #150

Closed paulius-petkus closed 1 day ago

paulius-petkus commented 1 week ago

Describe the bug

The HTML page has a "table of contents": hrefs that point to other parts of the same HTML document. After the HTML-to-DOCX conversion, the table of contents links navigate not to the target section but to the very beginning of the document.

Expected behavior

Hrefs should navigate to their target sections.

Repro

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using HtmlToOpenXml;
.
.
.

    static void ConvertHtmlToDocx(string html, string filePath)
    {
        using var wordDoc = WordprocessingDocument.Create(filePath, WordprocessingDocumentType.Document);
        // Add a main document part.
        MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();

        // Create the document structure and add some text.
        mainPart.Document = new Document();
        var body = new Body();
        mainPart.Document.Append(body);

        var converter = new HtmlConverter(mainPart);
        converter.ParseHtml(html);

        mainPart.Document.Save();
    }

Edit: Removed attachments

onizet commented 6 days ago

Hello, nothing better than working on a real use case! The current implementation looks for the link target using the id attribute. According to the W3C, the name attribute is obsolete, but it is still supported by modern browsers. I will amend the code to support it too.
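For readers on a version that only resolves targets by id, one possible stopgap is to copy each anchor's name value into an id attribute before handing the HTML to the converter. The sketch below is an assumption-laden illustration, not part of Html2OpenXml: AnchorFixup and CopyNameToId are hypothetical names, and the regex-based matching covers only simple, well-formed <a> tags (a real HTML parser would be more robust).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Html2OpenXml): adds id="..." to <a> tags
// that only carry the obsolete name="..." attribute.
static class AnchorFixup
{
    // Matches an opening <a ...> tag containing a quoted name attribute.
    // The lookbehind keeps attributes such as data-name from matching.
    private static readonly Regex NamedAnchor = new Regex(
        @"(?<open><a)\b(?<attrs>[^>]*?(?<![\w-])name\s*=\s*(?<q>[""'])(?<name>[^""']+)\k<q>[^>]*?)(?<close>\s*/?>)",
        RegexOptions.IgnoreCase);

    // Detects an existing id attribute. Limitation of this sketch: "id=" inside
    // another attribute's value (e.g. a query string) also counts as a match.
    private static readonly Regex HasId = new Regex(
        @"(?<![\w-])id\s*=", RegexOptions.IgnoreCase);

    public static string CopyNameToId(string html)
    {
        return NamedAnchor.Replace(html, m =>
        {
            string attrs = m.Groups["attrs"].Value;

            // Leave the tag untouched if it already declares an id.
            if (HasId.IsMatch(attrs))
                return m.Value;

            string open = m.Groups["open"].Value;
            string name = m.Groups["name"].Value;
            string close = m.Groups["close"].Value;
            return open + attrs + " id=\"" + name + "\"" + close;
        });
    }
}
```

In the repro above, this would be applied as converter.ParseHtml(AnchorFixup.CopyNameToId(html)). Tags that already have an id, and links that only have an href, pass through unchanged.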

onizet commented 1 day ago

Thank you again for your real sample; it helped me troubleshoot a lot of issues with whitespace and headings. You can delete the attachment, as it may contain confidential data.