ashahabov closed this issue 3 years ago
Sorry, I do not personally use HtmlConverter.
I believe it is perfectly fine that a Word document contains a ton of runs, and it is OK that the converter transforms each run into a span.
This looks like an HTML/CSS issue, where whitespace between inline elements is rendered as a gap: https://stackoverflow.com/questions/5078239/how-do-i-remove-the-space-between-inline-inline-block-elements
As a workaround, we can generate the HTML without formatting or newlines. Instead of
<p
  dir="ltr"
  class="pt-000000">
  <span
    class="pt-000001">ART</span>
  <span
    class="pt-000001">I</span>
  <span
    class="pt-000001">CLE</span>
</p>
generate
<p dir="ltr" class="pt-000000"><span class="pt-000001">ART</span><span class="pt-000001">I</span><span class="pt-000001">CLE</span></p>
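The difference between the two forms can be reproduced with System.Xml.Linq alone. This is only a minimal sketch: the paragraph is built by hand as a stand-in for the converter output, and the SpanDemo class and Build method are hypothetical names used for illustration, not part of Open-Xml-PowerTools.

```csharp
using System;
using System.Xml.Linq;

public static class SpanDemo
{
    // Hand-built stand-in for the <p> element that the converter would return.
    public static XElement Build() =>
        new XElement("p",
            new XAttribute("dir", "ltr"),
            new XAttribute("class", "pt-000000"),
            new XElement("span", new XAttribute("class", "pt-000001"), "ART"),
            new XElement("span", new XAttribute("class", "pt-000001"), "I"),
            new XElement("span", new XAttribute("class", "pt-000001"), "CLE"));

    public static void Main()
    {
        XElement p = Build();

        // Default serialization indents child elements, so each <span> ends up
        // on its own line and a browser renders that whitespace as a gap.
        Console.WriteLine(p.ToString());

        // DisableFormatting writes the element on one line with nothing
        // between the spans.
        Console.WriteLine(p.ToString(SaveOptions.DisableFormatting));
    }
}
```

Whether the gap actually appears still depends on how the browser collapses whitespace between inline elements, so this only shows where the extra newlines come from.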
P.S. I also see a lot of PRs related to HtmlConverter: https://github.com/EricWhiteDev/Open-Xml-PowerTools/pulls (there is a chance the fix is already there).
Thank you for the quick answer!
After some investigation, I just reused the code from the WmlToHtmlConverter01 project and it works :)
Here is the code with correct output:
using System.IO;
using System.Text;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;
string docxPath = @"test.docx";
string htmlPath = @"test.html";
using WordprocessingDocument wDoc = WordprocessingDocument.Open(docxPath, true);
WmlToHtmlConverterSettings wmlToHtmlSetting = new()
{
    CssClassPrefix = "pt-",
};
XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, wmlToHtmlSetting);
var html = new XDocument(
    new XDocumentType("html", null, null, null),
    htmlElement);
// DisableFormatting writes the markup without indentation or newlines between elements.
var htmlString = html.ToString(SaveOptions.DisableFormatting);
File.WriteAllText(htmlPath, htmlString, Encoding.UTF8);
OUTPUT
Here is the code that I used before and that generates spaces:
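using System.IO;
using System.Text;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;
string docxPath = @"test.docx";
string htmlPath = @"test.html";
using WordprocessingDocument wDoc = WordprocessingDocument.Open(docxPath, true);
HtmlConverterSettings htmlConverterSetting = new()
{
    CssClassPrefix = "pt-"
};
XElement htmlElement = HtmlConverter.ConvertToHtml(wDoc, htmlConverterSetting);
// ToStringNewLineOnAttributes pretty-prints the markup, which leaves whitespace between the spans.
File.WriteAllText(htmlPath, htmlElement.ToStringNewLineOnAttributes(), Encoding.UTF8);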
It looks like WmlToHtmlConverter had to be used instead of HtmlConverter. I do not actually know the difference between them yet; I will look into it later.
Here is a test.docx created from a .pdf via the PDF Focus .NET library:
The HTML output from HtmlConverter (test.html.zip) is:
I think the reason for the gaps between letters is that PDF Focus .NET puts almost every letter into its own run:
Anyway, I think it is a bug on our HtmlConverter side. Or is there an option to avoid this result?