sergey-tihon / Clippit

Fresh PowerTools for OpenXml
https://sergey-tihon.github.io/Clippit/
MIT License

HtmlConverter: gaps between letters #22

Closed by ashahabov 3 years ago

ashahabov commented 3 years ago

Here is a test.docx that was created from a PDF with the PDF Focus .NET library:

image

The HTML output from HtmlConverter (test.html.zip) is:

image

I think the gaps between letters appear because PDF Focus .NET puts almost every letter into its own run:

image

Either way, I think this is a bug on our HtmlConverter side. Or is there an option that avoids this result?

sergey-tihon commented 3 years ago

Sorry, I do not personally use HtmlConverter. I believe it is perfectly fine for a Word document to contain a ton of runs, and it is OK for the converter to transform each run into a span.

This looks like an HTML/CSS issue: https://stackoverflow.com/questions/5078239/how-do-i-remove-the-space-between-inline-inline-block-elements

As a workaround, we can generate the HTML without formatting or newlines.

Instead of

      <p
        dir="ltr"
        class="pt-000000">
        <span
          class="pt-000001">ART</span>
        <span
          class="pt-000001">I</span>
        <span
          class="pt-000001">CLE</span>
      </p>

generate

      <p dir="ltr" class="pt-000000"><span class="pt-000001">ART</span><span class="pt-000001">I</span><span class="pt-000001">CLE</span></p>
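The difference between the two forms can be reproduced with plain `System.Xml.Linq`, without any converter involved. The following is only a minimal sketch: the class `SpanWhitespaceDemo` and the helper `BuildParagraph` are hypothetical names made up for illustration, and the paragraph is rebuilt by hand from the sample above rather than produced by HtmlConverter. It serializes the same element once with the default indented formatting and once with `SaveOptions.DisableFormatting`.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical demo class, not part of Clippit or Open-Xml-PowerTools.
public static class SpanWhitespaceDemo
{
    // Rebuilds the sample paragraph by hand: three adjacent spans.
    public static XElement BuildParagraph() =>
        new XElement("p",
            new XAttribute("dir", "ltr"),
            new XAttribute("class", "pt-000000"),
            new XElement("span", new XAttribute("class", "pt-000001"), "ART"),
            new XElement("span", new XAttribute("class", "pt-000001"), "I"),
            new XElement("span", new XAttribute("class", "pt-000001"), "CLE"));

    public static void Main()
    {
        XElement p = BuildParagraph();

        // Default serialization indents child elements, which puts
        // line breaks and spaces between the <span> elements.
        string indented = p.ToString();

        // DisableFormatting writes the elements back to back,
        // with no whitespace between them.
        string compact = p.ToString(SaveOptions.DisableFormatting);

        Console.WriteLine(indented);
        Console.WriteLine(compact);
    }
}
```

A browser collapses the whitespace between inline elements in the indented form into a single space, which is where the visible gap comes from; the compact form leaves nothing between the spans to render.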

P.S. I also see a lot of PRs related to HtmlConverter at https://github.com/EricWhiteDev/Open-Xml-PowerTools/pulls, so there is a chance that a fix already exists there.

ashahabov commented 3 years ago

Thank you for the quick answer!

After some investigation, I reused the code from the WmlToHtmlConverter01 project and it works :)

Here is the code that produces the correct output:

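      string docxPath = @"test.docx";
      string htmlPath = @"test.html";

      // Open the document and convert it with WmlToHtmlConverter.
      using WordprocessingDocument wDoc = WordprocessingDocument.Open(docxPath, true);
      WmlToHtmlConverterSettings wmlToHtmlSetting = new()
      {
          CssClassPrefix = "pt-",
      };
      XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, wmlToHtmlSetting);

      // Wrap the converted element in a document with an html doctype.
      var html = new XDocument(
          new XDocumentType("html", null, null, null),
          htmlElement);

      // DisableFormatting writes the markup without indentation or line breaks,
      // so no whitespace ends up between adjacent <span> elements.
      var htmlString = html.ToString(SaveOptions.DisableFormatting);
      File.WriteAllText(htmlPath, htmlString, Encoding.UTF8);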

Output:

image


Here is the code I used before, which produces the gaps:

      string docxPath = @"test.docx";
      string htmlPath = @"test.html";

      // Open the document and convert it with HtmlConverter.
      using WordprocessingDocument wDoc = WordprocessingDocument.Open(docxPath, true);
      HtmlConverterSettings htmlConverterSetting = new()
      {
          CssClassPrefix = "pt-"
      };
      XElement htmlElement = HtmlConverter.ConvertToHtml(wDoc, htmlConverterSetting);

      // Unlike SaveOptions.DisableFormatting, this call writes formatted markup.
      File.WriteAllText(htmlPath, htmlElement.ToStringNewLineOnAttributes(), Encoding.UTF8);

It looks like WmlToHtmlConverter should have been used instead of HtmlConverter. I do not actually know the difference between them yet; I will look into it later.